Learning PDF Document Structures using Recursive Neural Networks

Portable Document Format (PDF) is the de facto standard for presenting textual-visual content. In this project, we aim to develop a machine learning framework for PDF document understanding. Despite the recent proliferation of deep learning-based methods for analyzing and processing natural images, considerably less effort has gone into designing similar approaches for highly structured data such as documents. Our project will explore two novel ideas. First, we will develop a structured, organizational representation of PDF documents built on labeled content blocks (e.g., heading, figure, list, caption). Second, we will investigate how recursive neural networks (RvNNs), a type of deep neural network that has been used for language parsing, can be adopted and formulated for learning PDF document structures.
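To make the two ideas concrete, the following is a minimal sketch, not the project's actual method. It assumes a document can be modeled as a tree of labeled content blocks and that a recursive network encodes the tree bottom-up by repeatedly merging child codes into a parent code. The names `Block` and `RecursiveEncoder`, the extra labels `paragraph` and `section`, the code size `DIM`, and the untrained random weights are all hypothetical choices made for illustration.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List

# Block labels: the first four come from the project description;
# "paragraph" and "section" are assumptions added for this sketch.
LABELS = ["heading", "figure", "list", "caption", "paragraph", "section"]
DIM = 4  # size of each node's code vector (arbitrary)


@dataclass
class Block:
    """A labeled content block; children make the representation a tree."""
    label: str
    children: List["Block"] = field(default_factory=list)


def one_hot(label: str) -> List[float]:
    return [1.0 if label == name else 0.0 for name in LABELS]


def make_matrix(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear_tanh(matrix: List[List[float]], vector: List[float]) -> List[float]:
    """Apply tanh(W x) with plain Python lists (no bias term)."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in matrix]


class RecursiveEncoder:
    """Untrained recursive encoder: weights are random, not learned."""

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.leaf = make_matrix(DIM, len(LABELS), rng)
        self.merge = make_matrix(DIM, 2 * DIM + len(LABELS), rng)

    def encode(self, block: Block) -> List[float]:
        # Leaf blocks are encoded from their label alone.
        if not block.children:
            return linear_tanh(self.leaf, one_hot(block.label))
        # Internal blocks fold their children's codes left to right,
        # conditioning each merge on the parent's label.
        code = [0.0] * DIM
        for child in block.children:
            child_code = self.encode(child)
            code = linear_tanh(self.merge, code + child_code + one_hot(block.label))
        return code


# A toy page: a section holding a heading, a captioned figure, and a list.
doc = Block("section", [
    Block("heading"),
    Block("figure", [Block("caption")]),
    Block("list"),
])
root_code = RecursiveEncoder(seed=0).encode(doc)
print(len(root_code))
```

In a trained model the merge weights would be learned, for example by pairing the encoder with a decoder that reconstructs the tree; this sketch shows only the recursive, bottom-up traversal over the block hierarchy.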

Faculty Supervisor:

Richard Zhang

Student:

Chenyang Zhu

Partner:

PDFTron Systems

Discipline:

Computer science

Sector:

Information and communications technologies

University:

Program:

Accelerate
