Improved Model Compression Techniques and Processes for Large Scale Pretrained Language Models

Over the past few years, the capabilities and performance of deep-learning models for natural language processing (NLP) have improved dramatically. A main reason for these improvements is the scaling of model size to tens or hundreds of billions of parameters. However, deploying such models in production becomes impractical, because the computation cost is prohibitive for the vast majority of applications. This project aims to develop a structured compression process that incorporates state-of-the-art techniques (and potentially new ones) and allows a quick move from large-scale models to smaller, faster, production-ready ones. Additionally, we will investigate the characteristics of the devices used in industry, and their impact on compression, to further optimize the techniques. The ultimate goal is to develop general-purpose compression techniques that can be applied to the whole range of Turing language models. This project can benefit several NLP-related Microsoft services on various devices.

Faculty Supervisor:

Eyal de Lara

Student:

Partner:

Microsoft Canada

Discipline:

Computer science

Sector:

Information and cultural industries

University:

University of Toronto

Program: