Automated Data Labeller

Data-driven solutions start with the ingestion of data, and that data is typically messy and unlabelled, even though downstream consumers benefit from well-labelled data. Data labelling (assigning categories, data types, privacy and sensitivity tags, source characteristics, etc.) is usually a manual effort that is error-prone and time-consuming, and no off-the-shelf tools perform it reliably today. This project aims to design and build a configurable, scalable, automated tool for classifying the data fields of a given data source. The tool will be a software product that generates one or more labels for each input data source and data field, covering at least entity type, entity context, and privacy/sensitivity tags. It will do so using natural language processing (NLP) in conjunction with a heuristic-based expert rule system. The tool will give the partner organisation a competitive edge in the data science/ML/AI market by attracting companies and organisations that lack the technical skills to build their own data processing systems, which should grow its customer base and in turn increase its revenue.
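
As a rough illustration of the heuristic rule side of the approach described above, the sketch below labels a single data field from its name and a few sample values. It is not the project's implementation: the rule table, the label names, the `label_field` function, and the thresholds are all hypothetical, and the NLP component is left out entirely.

```python
import re
from dataclasses import dataclass

# Hypothetical rule table: (entity type, sensitivity tag, field-name pattern, value pattern).
# Rules are tried in order, so earlier rules take priority over later ones.
RULES = [
    ("email", "pii", re.compile(r"e-?mail", re.I),
     re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
    ("date", "non-sensitive", re.compile(r"date|_at$", re.I),
     re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("phone", "pii", re.compile(r"phone|mobile", re.I),
     re.compile(r"\+?[\d\s().-]{7,20}")),
]


@dataclass
class Label:
    entity_type: str
    sensitivity: str
    confidence: float  # share of sample values that matched the rule


def label_field(name, samples, threshold=0.8):
    """Return a Label for one field, given its name and sample values."""
    # Ignore missing and blank values when computing the match share.
    values = [str(v).strip() for v in samples if v is not None and str(v).strip()]
    for entity_type, sensitivity, name_re, value_re in RULES:
        matched = sum(1 for v in values if value_re.fullmatch(v))
        share = matched / len(values) if values else 0.0
        name_hit = bool(name_re.search(name))
        # Accept on strong value evidence alone, or on weaker value
        # evidence when the field name also points to the same rule.
        if share >= threshold or (name_hit and share >= 0.5):
            return Label(entity_type, sensitivity, share)
    return Label("unknown", "unreviewed", 0.0)


if __name__ == "__main__":
    print(label_field("contact_email", ["a@x.com", "b@y.org", None]))
    print(label_field("col7", ["2021-01-05", "2022-11-30"]))
    print(label_field("notes", ["hello"]))
```

In a fuller system, an NLP model could supply labels for fields that fall through to `unknown`, with the rule table acting as the configurable, expert-maintained layer.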

Faculty Supervisor:

Mark Chignell

Student:

Partner:

Scribble Data Inc.

Discipline:

Computer science

Sector:

Information and cultural industries

University:

University of Toronto

Program: