Character n-gram based clustering of noisy documents with interactive supervision


Mining aviation safety reports is a critical task in improving the safety of civilian aviation. NASA maintains a publicly available database of aviation safety-related incident reports, the Aviation Safety Reporting System (ASRS). It is of great practical interest to provide automated tools that assist aviation safety analysts in identifying and analyzing reports about particular safety aspects, such as turbulence injuries, air traffic management miscommunication, airplane load mismanagement, and crew incapacitation. The text in ASRS reports is often noisy, contains ad hoc abbreviations, and is grammatically poorly formed. As a result, standard natural language processing techniques, which assume well-formed and grammatically correct narrative text, are not applicable to these reports. Because the safety aspects of interest are not fixed categories, the task is document clustering rather than classification. We will explore the feasibility of character n-grams as text features, as opposed to the commonly used word or term features, since we expect character n-grams to be more robust to poorly formed text. While this project will focus mostly on text mining, it will also consider problems at the intersection of text mining, text visualization, and human-computer interaction. Clustering will be approached interactively, giving the user access points through which to provide limited supervision to the clustering algorithm.
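
To illustrate why character n-grams might tolerate abbreviated text better than word features, here is a minimal sketch in Python. It is not the project's method. The function names, the choice of trigrams (n = 3), the cosine similarity measure, and both example sentences are assumptions made for illustration, and the sentences are invented rather than taken from ASRS.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text.

    Runs of whitespace are collapsed to a single space first.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_profile(text):
    """Count whole lowercased words, the conventional feature type."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented example: a well-formed sentence and an abbreviated variant of it.
clean = "aircraft encountered severe turbulence during descent"
noisy = "acft encountrd svr turb during dscnt"

word_sim = cosine(word_profile(clean), word_profile(noisy))
char_sim = cosine(char_ngrams(clean), char_ngrams(noisy))
print(f"word-level similarity:      {word_sim:.2f}")
print(f"character trigram similarity: {char_sim:.2f}")
```

In this toy pair only one whole word survives the abbreviation, while many trigrams such as "tur" and "enc" are shared, so the character-level similarity comes out higher. Whether that advantage carries over to real ASRS narratives is the question the project sets out to explore.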

Faculty Supervisor:

Dr. Vlado Keselj, Dr. Evangelos Milios, Dr. Kirstie Hawkey


Morteza Zihayat Kermani


AeroInfo Systems - A Boeing Company


Computer science


Aerospace and defense


Dalhousie University


