Efficient Learning Methods for Multimodal Understanding

This research focuses on developing representation-learning architectures and algorithms for multimodal understanding tasks (e.g., dialogue-based image/video retrieval) while reducing the need for human supervision in the form of costly annotations. To achieve this goal, a learning system must be able to: (1) learn new tasks or concepts from only a few examples; (2) effectively reuse knowledge it has already acquired; (3) rely on one modality (e.g., text/audio) to fill in gaps in another (e.g., vision); and (4) retain high performance on all previously learned tasks and concepts. This work will focus on answering two key questions: (1) How can problem structure be leveraged to design learning algorithms that use limited annotations more effectively? (2) What mechanisms enable efficient learning from few examples?

Faculty Supervisor:

Animesh Garg

Nikita Dvornik

Samsung Electronics Canada

Computer Science

University of Toronto