Multi-Channel Audio-Visual Speech Separation

Speech separation, the task of isolating individual speech sources from a mixed input signal, is a fundamental task in speech processing and a key step in many applications, such as automatic speech recognition and speaker recognition. It is also among the most challenging tasks in the field. Recent studies that apply deep models to monaural (single-channel) speech separation report improved results over traditional methods. However, monaural methods still perform poorly under acoustic variations caused by reverberation. Speech separation can benefit from multi-channel processing because it makes directional information available. Other recent studies have reported that visual cues can also improve monaural speech separation. In this project, we plan to investigate the combined use of visual features and multi-channel processing in speech separation. The proposed approach consists of the following model-based steps. First, an audio-only speech separation model (single- or multi-channel) will perform the separation. The separated signals will then be matched with their corresponding visual features by an audio-visual match model. Finally, the predicted spectral magnitudes will be refined by an audio-visual magnitude refinement model… To be continued.

Faculty Supervisor:

Ivan Bajic

Student:

Partner:

Singular Software

Discipline:

Engineering

Sector:

Information and cultural industries

University:

Simon Fraser University

Program: