A Principled Approach to Developing Machine Learning Models for the Synthesis of Structured Health Data
During the current COVID-19 pandemic, sharing health record data offers tremendous benefits for controlling the spread of infection and saving lives globally. In medical research, electronic medical records (EMRs) play an essential role in two categories of study: 1) cross-sectional and 2) longitudinal. A cross-sectional study compares different population groups at a single point in time, while in a longitudinal study researchers make several observations of the same subjects over a period of time. However, sharing EMRs of either kind across medical institutes on a wide scale puts patient privacy at risk. Recent research has developed methods to mitigate this risk, including record simulation via advanced neural networks. While showing promise in certain applications, these models have limitations in handling heterogeneous cross-sectional data and have not been applied to longitudinal EMRs. This proposal aims to develop a principled approach with rigorous methodology to derive (a) machine learning models that synthesize EMRs and (b) a utility analysis of the data synthesis.