An automated system to identify and extract key structural components in academic written texts or genres

Streamlining knowledge acquisition, indexing, dissemination, and synthesis, which is especially important to the future of libraries, requires a fundamental understanding of how knowledge is stored and communicated. In a textual body of knowledge, the relevant qualities include layout and structure; headings, chapters, sections, and paragraphs; figures, tables, lists, captions, and illustrations; authorship information and references; and, most importantly, the relationships between these semantic components. We propose research to develop a series of software model pipelines that take a document as input and produce a JSON file describing its relevant semantic qualities. This interdisciplinary project combines computer vision, natural language processing (NLP), and computational linguistics to investigate optical character recognition, document classification, document object detection and segmentation, document layout recognition and classification, and semantic labelling. Deep learning, statistical machine learning, and traditional computer-vision methods for each of these tasks will be researched and evaluated.
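
To make the intended output more concrete, the sketch below shows one possible shape for the JSON file such a pipeline might produce. It is illustrative only: the class names, field names (`kind`, `bbox`, `relations`, and so on), and the sample values are all assumptions introduced here, not a schema defined by the project.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Component:
    """One semantic component of a document (hypothetical schema)."""
    id: str
    kind: str                           # e.g. "heading", "paragraph", "figure", "caption"
    text: str = ""
    page: int = 1
    bbox: Optional[List[float]] = None  # assumed [x0, y0, x1, y1] layout box, if known


@dataclass
class Relation:
    """A directed link between two components, referenced by id."""
    source: str
    target: str
    kind: str                           # e.g. "contains", "describes"


@dataclass
class DocumentStructure:
    """Top-level record that would be serialised to the output JSON file."""
    title: str
    authors: List[str] = field(default_factory=list)
    components: List[Component] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def to_json(self, indent: int = 2) -> str:
        # asdict() converts nested dataclasses to plain dicts and lists.
        return json.dumps(asdict(self), indent=indent)


def build_example() -> DocumentStructure:
    """Build a small hand-written example; no real document is analysed."""
    components = [
        Component(id="h1", kind="heading", text="1 Introduction"),
        Component(id="p1", kind="paragraph", text="Libraries need ..."),
        Component(id="f1", kind="figure", bbox=[72.0, 300.0, 540.0, 520.0]),
        Component(id="c1", kind="caption", text="Figure 1: Pipeline overview."),
    ]
    relations = [
        Relation(source="h1", target="p1", kind="contains"),
        Relation(source="h1", target="f1", kind="contains"),
        Relation(source="c1", target="f1", kind="describes"),
    ]
    return DocumentStructure(
        title="Example paper",
        authors=["A. Author"],
        components=components,
        relations=relations,
    )


if __name__ == "__main__":
    print(build_example().to_json())
```

Keeping relations as a separate list of id pairs, rather than nesting components inside one another, is one way to represent the relationships between components that the abstract singles out as most important; a nested tree would be an equally plausible design.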

Logan Markewich; Yubin Xing; Hao Zhang
Faculty Supervisor: 
Seok-bum Ko; Zhi Li; Roy Ka-Wei Lee
Partner University: