Auditory Inspired Methods for Multiple Speaker Localization and Tracking Using a Circular Microphone Array

PhD Student 
Research Area

In today's world, hands-free communication has become an essential part of day-to-day activities. It serves as the acoustic front end of telephony and speech dialog systems, to name a few. In practice, these systems operate in adverse acoustic environments with ambient noise. Moreover, as the distance between the speaker and the microphones grows, the power level of the recorded speech signal drops, resulting in poor-quality signal acquisition. Array signal processing techniques offer improved performance for multiple-input systems. A multi-channel system makes it possible to solve problems, such as source localization and tracking, that are difficult with single-channel systems.

Accurate detection, localization, and tracking of speakers are also essential for media processing tasks: steering a video camera towards an active speaker, conference telephony, enhancing the active speech stream with microphone array beamforming for distant speech recognition, and providing accumulated information for speaker identification.

This thesis deals with localization and tracking tasks in meeting rooms equipped with multiple sensors. A uniform circular microphone array is used to record the various events taking place in such an environment. The meeting room environment, however, poses a number of challenges, such as multiple concurrent speakers, short utterances, and background noise sources. Different techniques originating from computational auditory scene analysis and statistical modeling have been investigated and combined to develop algorithms that can localize and track active speech sources in such scenarios.

Illustration of the speech localization problem in a reverberant environment. With the given setup, the aim is to localize and track one or more active sources using only the data acquired from the microphone array.
This thesis is supervised by Harald Romsdorfer and conducted within the COMET K-Project Advanced Audio Processing program.