Robust identification of protected health information in unstructured data

A large amount of health-related data is available only in unstructured form (“free-form text”). Before this data can be shared for secondary purposes, it must be de-identified to protect against inappropriate disclosure of personal health information (PHI). PARAT Text is Privacy Analytics’ de-identification software for unstructured data. It automatically discovers and marks PHI in a variety of document formats using gazetteers and a set of rules. The main limitation of this tool is that its coverage depends on the knowledge of human experts and the completeness of its gazetteer lists, and that it lacks contextual knowledge. I plan to explore unsupervised and semi-supervised machine learning approaches to make PHI discovery more robust. More robust methods for handling text data might in turn broaden the partner organization’s customer base.
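To make the gazetteer-and-rules idea concrete, here is a minimal sketch in Python. It is not PARAT Text's implementation: the name list, the date pattern, and the function names `find_phi` and `mask_phi` are all assumptions made up for illustration. It combines a lookup list (a gazetteer) with one regular-expression rule and shows how a name missing from the list goes undetected.

```python
import re

# Hypothetical gazetteer: a tiny list of known first names (illustrative only).
NAME_GAZETTEER = {"alice", "bob", "maria"}

# Hypothetical rule: ISO-style dates such as 2014-03-07.
DATE_RULE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
WORD = re.compile(r"[A-Za-z]+")


def find_phi(text):
    """Return sorted (start, end, label) spans for gazetteer and rule matches."""
    spans = []
    for m in WORD.finditer(text):
        if m.group().lower() in NAME_GAZETTEER:
            spans.append((m.start(), m.end(), "NAME"))
    for m in DATE_RULE.finditer(text):
        spans.append((m.start(), m.end(), "DATE"))
    return sorted(spans)


def mask_phi(text):
    """Replace each detected span with its label in square brackets."""
    out, last = [], 0
    for start, end, label in find_phi(text):
        out.append(text[last:start])
        out.append(f"[{label}]")
        last = end
    out.append(text[last:])
    return "".join(out)


print(mask_phi("Alice saw Dr. Chen on 2014-03-07."))
# prints: [NAME] saw Dr. Chen on [DATE].
```

In this sketch "Chen" is left unmasked because it is absent from the list, which is the kind of gap that a context-aware learned model would aim to close.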

Faculty Supervisor:

Diana Inkpen


Varada Kolhatkar


Privacy Analytics


Engineering - computer / electrical


Information and communications technologies


University of Ottawa


