Character n-gram based clustering of noisy documents with interactive supervision

 

Mining aviation safety reports is a critical task in improving the safety of civilian aviation. NASA maintains a publicly available database of aviation safety-related incident reports, the Aviation Safety Reporting System (ASRS). It is of great practical interest to provide automated tools that assist aviation safety analysts in identifying and analyzing reports about particular safety aspects, such as turbulence injuries, air traffic management miscommunication, airplane load mismanagement and crew incapacitation. The text in ASRS reports is often noisy, contains ad hoc abbreviations, and is frequently ungrammatical. As a result, standard natural language processing techniques, which assume well-formed and grammatically correct narrative text, are not applicable to these reports. Because the safety aspects of interest are not fixed categories, the task is document clustering rather than classification. We will explore the feasibility of character n-grams as text features, as opposed to the commonly used word or term features. We expect character n-grams to be more robust to poorly formed text. While this project will mostly focus on text mining aspects, it will also consider problems at the intersection of text mining, text visualization and human-computer interaction. Clustering will be approached from an interactive perspective, giving the user access points through which to supply limited supervision to the clustering algorithm.
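To illustrate why character n-grams may tolerate abbreviated or misspelled text better than whole-word features, here is a minimal sketch. It is not the project's method: the trigram size, the cosine similarity measure, the function names `char_ngrams` and `cosine`, and the three example strings are all assumptions made up for illustration, and the strings are not taken from ASRS.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example texts (assumptions, not real ASRS report content).
clean = "turbulence encountered during descent"
noisy = "turb encountered durng dscnt"          # abbreviated and misspelled variant
unrelated = "cargo load sheet was incorrect"    # different safety topic

sim_noisy = cosine(char_ngrams(clean), char_ngrams(noisy))
sim_unrelated = cosine(char_ngrams(clean), char_ngrams(unrelated))

print(f"clean vs noisy:     {sim_noisy:.3f}")
print(f"clean vs unrelated: {sim_unrelated:.3f}")
```

In this toy case the abbreviated text still shares many trigrams with its clean counterpart, even though exact word matching would treat "turb", "durng" and "dscnt" as unrelated to the original words. A pairwise similarity of this kind is one possible input to a clustering algorithm.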

Faculty Supervisors:

Dr. Vlado Keselj, Dr. Evangelos Milios, Dr. Kirstie Hawkey

Student:

Morteza Zihayat Kermani

Partner:

AeroInfo Systems - A Boeing Company

Discipline:

Computer science

Sector:

Aerospace and defense

University:

Dalhousie University

Program:

Accelerate
