Multichannel Voice Activity Detection for ASR

Status

Finished

Type

Master Thesis

Announcement date

12 Mar 2014

Student

Florian Iglisch

Mentors

Martin Hagmüller
Juan Andrés Morales Cordovilla

Research Areas

Speech Communication

The European project ‘Distant-speech Interaction for Robust Home Applications’ (DIRHA) aims to create a system that controls a house with German voice commands captured by a network of microphones. Three important parts of this project are the automatic speech recognizer (ASR), the voice activity detector (VAD), and the speech localizer (SLOC). The Deep-Belief-Network VAD (DBN-VAD) [Zhang2013] is a state-of-the-art VAD that can fuse many features (pitch, spectrogram, AMS, etc.) to decide whether a frame is speech or non-speech. These features can be improved by adding a noise reduction or enhancement stage. At the moment, we work only with a single-channel signal. The purpose of this master’s thesis is to improve the enhancement stage, and thereby the VAD, by using spatial information derived from the multi-channel signal [Habib2013]. The final aim is to fuse the voice activity detector with the source localizer using a psychoacoustic approach [Ma2012].

Your Tasks

Study the literature
Obtain baseline results with the current VAD and ASR on a simulated database
Improve the VAD using multi-channel information
Implement a real-time system in the kitchen-cocktail party room of the SPSC

Your Profile

Speech Communication 1 & 2
Interest in speech recognition and source separation inspired by psychoacoustic approaches
Matlab
Unix shell and HTK are a plus, but not necessary

References

[Zhang2013] X. Zhang and J. Wu. Deep Belief Networks Based Voice Activity Detection. IEEE Transactions on Audio, Speech, and Language Processing, 2013.

[Habib2013] T. Habib and H. Romsdorfer. Auditory inspired methods for localization of multiple concurrent speakers. Computer Speech & Language, 2013.

[Ma2012] N. Ma, J. Barker, H. Christensen and P. Green. Combining speech fragment decoding and adaptive noise floor modelling. IEEE Transactions on Audio, Speech, and Language Processing, 2012.