Exploring a Multi-modal Audio Foundational Model

Boson AI is an early-stage startup of 30 scientists building large language tools for interaction and entertainment. The company is seeking interns from the University of Toronto to join its Toronto office. Interns will work on modeling and training LLMs, understanding and interpreting model behavior, and aligning models to human values. The company will provide comprehensive guidance and resource support during the internship.
In this project, the partner wants to build an AI model that understands speech and speaks in a natural and emotionally competent manner. Humans often prefer voice communication over text because of the immediacy of the response and the amount of emotional nuance and subtext that can be conveyed beyond the actual words. Many of these details (tone, laughter, pauses, volume, etc.) are available only in the audio, and humans are very good at picking them up. Computers are typically much worse at it.
The project is expected to bring significant social and economic benefits by enhancing AI’s ability to understand and express emotions naturally in speech. This advancement will improve human-AI interactions, making virtual assistants, customer service bots, and social robots more engaging and intuitive.

Faculty Supervisor:

Paul Gries

Student:

Partner:

Boson AI

Discipline:

Computer science

Sector:

Education; Information and cultural industries

University:

University of Toronto

Program: