Robust identification of protected health information in unstructured data

A large amount of health-related data is available only in unstructured form (free-form text). To share this data for secondary purposes, it must first be de-identified to protect against inappropriate disclosure of personal health information (PHI). PARAT Text is Privacy Analytics' de-identification software for unstructured data. It automatically discovers and marks PHI in a variety of document formats using gazetteers and a set of rules. The tool's primary limitation is that it can only find what its human experts and gazetteer lists already anticipate, and it lacks contextual knowledge. I plan to explore unsupervised and semi-supervised machine learning approaches to make PHI discovery more robust. The aim is to provide elegant and robust methods for handling text data, which might broaden the partner organization's consumer base.
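To make the gazetteer-and-rules approach concrete, here is a minimal sketch of how such a detector might look. It is not PARAT Text's implementation: the gazetteer entries, the two regular-expression rules, and the function names `find_phi` and `mask_phi` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical gazetteer of known names; a real list would be far larger.
NAME_GAZETTEER = ["John Smith", "Mary Jones"]

# Hypothetical rules: regular expressions for structured identifiers.
RULES = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_phi(text):
    """Return sorted (start, end, label) spans for gazetteer and rule matches."""
    spans = []
    # Gazetteer lookup: case-insensitive whole-phrase match.
    for name in NAME_GAZETTEER:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), "NAME"))
    # Rule lookup: each regular expression marks one PHI category.
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def mask_phi(text):
    """Replace each detected span with its label in square brackets."""
    pieces = []
    cursor = 0
    for start, end, label in find_phi(text):
        if start < cursor:
            continue  # skip a span that overlaps one already masked
        pieces.append(text[cursor:start])
        pieces.append("[" + label + "]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

In this sketch, a name that is absent from the gazetteer (for example "Jane Doe") passes through unmasked, which illustrates the limitation described above: coverage depends entirely on what the lists and rules anticipate.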

Faculty Supervisor:

Diana Inkpen

Student:

Varada Kolhatkar

Partner:

Privacy Analytics

Discipline:

Engineering - computer / electrical

Sector:

Information and communications technologies

University:

University of Ottawa