Efficient computational methods for generating differentially private synthetic tabular data - ON-660

Project type: Research
Desired discipline(s): Engineering - computer / electrical, Engineering, Computer science, Mathematical Sciences, Mathematics
Company: Mastercard (AI Garage)
Project Length: 6 months to 1 year
Preferred start date: 05/01/2022
Language requirement: English
Location(s): Toronto, ON, Canada; Vancouver, BC, Canada; Canada
No. of positions: 1
Desired education level: Master's, PhD, Postdoctoral fellow, Recent graduate
Open to applicants registered at an institution outside of Canada: No

About the company: 

We work to connect and power an inclusive, digital economy that benefits everyone, everywhere by making transactions safe, simple, smart, and accessible. Using secure data and networks, partnerships, and passion, our innovations and solutions help individuals, financial institutions, governments, and businesses realize their greatest potential. Our decency quotient, or DQ, drives our culture and everything we do inside and outside of our company. We cultivate a culture of inclusion for all employees that respects their individual strengths, views, and experiences. We believe that our differences enable us to be a better team – one that makes better decisions, drives innovation, and delivers better business results. At AI Garage, we use state-of-the-art AI techniques to solve some of the most important problems in the financial world.

Describe the project:

Generating synthetic data is important for a number of machine learning problems at Mastercard, particularly for generating additional data for imbalanced problems and for data sharing. The data is mostly tabular, and a number of techniques for generating tabular data exist in the literature. However, most of these techniques either do not work on large datasets or fail to generate differentially private datasets. We have already done some work in this area (see https://link.springer.com/chapter/10.1007/978-3-030-92310-5_60 ). However, the problem is not “solved” yet: it is difficult to generate differentially private datasets from large training sets, and metrics such as machine learning efficacy can be abysmally low. The intern will be asked to work on improving the algorithms currently available in the literature, from both a privacy and an accuracy standpoint.

Required expertise/skills: 

Programming skills in Python and a basic understanding of machine learning, especially GANs.