Conditional Flow-based Speech-Driven Gesture Synthesis

Speech-driven gesture synthesis is the process of automatically generating relatable and realistic gestures from speech and high-level attributes such as the speaker’s style. It is an active research area with applications in video games, animated movies, communicative agents, and human-computer interaction. Commonly, a database of gestures is created manually, and the gestures are then triggered at different times by markup in the dialog. This is a time-consuming and tedious step in animation pipelines. Recently, machine learning approaches have substantially advanced character animation. Yet modelling speech-driven gesture synthesis with machine learning architectures has proven difficult because of the particular characteristics of human gesture. To this end, this project aims to advance the state of the art in speech-driven gesture synthesis by: (1) proposing a novel generative machine learning approach that can model the natural variation of human motion and modulate the generated gestures according to the speaker’s style, (2) capturing a new data set containing a large variation of gestures and styles, and (3) qualitatively evaluating the proposed approach by comparing it with baselines.

Saeed Ghorbani
Faculty Supervisor: 
Nikolaus F Troje
Partner University: