Big Data Cleaning

Poor data quality is a barrier to effective, high-quality decision-making based on data. Declarative data cleaning has emerged as an effective tool for both assessing and improving the quality of data. In this work, we will address some important challenges in applying declarative data cleaning to big data, challenges that arise from the scale, complexity, and massive heterogeneity of such data. First, we will investigate the use of domain ontologies to enhance declarative data cleaning. Second, given the dynamic nature of big data, we will develop new continuous data cleaning methods. Third, because big data is massively heterogeneous and automation will rarely achieve 100% accuracy, we will develop new techniques to explain the provenance of data cleaning solutions. Provenance lets users understand why (and how) a cleaning decision was derived, enabling them to debug and correct automated solutions.
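To make the idea of declarative data cleaning concrete: quality rules are stated as constraints over the data (for example, functional dependencies), and tuples that violate a rule are flagged as candidates for repair. The Python sketch below is a minimal, hypothetical illustration of this style of rule checking; the check_fd helper and the sample records are illustrative assumptions, not part of the project.

```python
# A minimal sketch of declarative data cleaning: detect violations of a
# functional dependency (FD) lhs -> rhs. Hypothetical helper and toy data.
from collections import defaultdict

def check_fd(rows, lhs, rhs):
    """Return (key, group) pairs where rows sharing the same LHS values
    disagree on the RHS values, i.e. violate the FD lhs -> rhs."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)
    violations = []
    for key, group in groups.items():
        rhs_values = {tuple(r[a] for a in rhs) for r in group}
        if len(rhs_values) > 1:  # same LHS, conflicting RHS: dirty data
            violations.append((key, group))
    return violations

# Toy data: postal code should determine city, but one record disagrees.
rows = [
    {"zip": "M5S 2E4", "city": "Toronto"},
    {"zip": "M5S 2E4", "city": "Torotno"},  # likely typo to repair
    {"zip": "L1H 7K4", "city": "Oshawa"},
]
for key, group in check_fd(rows, lhs=["zip"], rhs=["city"]):
    print(f"FD violation for zip={key[0]}: {[r['city'] for r in group]}")
```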

Intern: Jaroslaw Szlichta
Faculty Supervisor: Dr. Renee Miller
Project Year: 2014
Province: Ontario