Learning PDF Document Structures using Recursive Neural Networks

Portable Document Format (PDF) is the de facto standard for presenting textual-visual content. In this project, we aim to develop a machine learning framework for PDF document understanding. Despite the recent proliferation of deep learning-based methods for the analysis and processing of natural images, there have been considerably fewer efforts to design similar approaches for highly structured data such as documents. Our project will explore two novel ideas. First, we will develop a structured, organizational representation of PDF documents built on labeled content blocks (e.g., heading, figure, list, caption). Second, we will investigate how recursive neural networks (RvNNs), a type of deep neural network that has been applied to language parsing, can be adapted and formulated for learning PDF document structures.

Faculty Supervisor:

Richard Hao Zhang

Student:

Partner:

Apryse

Discipline:

Computer science

Sector:

Information and Communications Technology; New and Digital Media

University:

Simon Fraser University

Program: