Signal Processing and Speech Communication Laboratory

Multichannel Voice Activity Detection for ASR

Master Thesis
Announcement date: 12 Mar 2014
Florian Iglisch

The European project ‘Distant-speech Interaction for Robust Home Applications’ (DIRHA) aims to create a system that controls a house through German voice commands captured by a network of microphones. Three important parts of this project are the automatic speech recognizer (ASR), the voice activity detector (VAD), and the speech localizer (SLOC). The Deep-Belief-Network VAD (DBN-VAD) [Zhang2013] is a state-of-the-art VAD that can fuse many features (pitch, spectrogram, AMS, etc.) to decide whether a frame is speech or non-speech. These features can be improved by adding a noise reduction or enhancement stage. For the moment, we are only working with a single-channel signal. The purpose of this master’s thesis is to improve the enhancement stage, and in turn the VAD, by using spatial information derived from the multi-channel signal [Habib2013]. The final aim is to fuse the voice activity detector with the source localizer using a psychoacoustic approach [Ma2012].
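
To make the idea of frame-wise feature fusion concrete, here is a minimal sketch in Python. It is not the DBN-VAD of [Zhang2013]: it assumes just two simple features (log energy and zero-crossing rate) and combines them with a hand-picked linear score. The function names, frame sizes, weights, and bias are all hypothetical and chosen only for illustration.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """Log of the mean squared amplitude (a small floor avoids log(0))."""
    return math.log(sum(x * x for x in frame) / len(frame) + 1e-12)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def fused_vad(samples, frame_len=400, hop=160, weights=(1.0, -2.0), bias=6.0):
    """Label each frame True (speech) or False (non-speech).

    The weights and bias are hypothetical, hand-picked values; a real
    system would learn the fusion from labelled data.
    """
    decisions = []
    for frame in frame_signal(samples, frame_len, hop):
        score = (weights[0] * log_energy(frame)
                 + weights[1] * zero_crossing_rate(frame)
                 + bias)
        decisions.append(score > 0)
    return decisions
```

With a 16 kHz signal, the default frame length of 400 samples and hop of 160 samples correspond to 25 ms frames every 10 ms. A learned model such as a DBN would replace the fixed linear score with a trained classifier over a much richer feature set.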

Your Tasks

  • Study literature
  • Obtain baseline results with current VAD and ASR on a simulated database
  • Improve the VAD using multi-channel information
  • Implement a real-time system in the kitchen-cocktail party room of the SPSC
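
As an illustration of what multi-channel information can mean, one simple spatial cue is the time delay between two microphone signals. The sketch below estimates it with a brute-force cross-correlation in Python. This is an assumption made for illustration only, not the auditory-inspired method of [Habib2013], and the function name and parameters are hypothetical.

```python
def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means `other` is a delayed copy of `ref`.
    Brute-force cross-correlation over lags in [-max_lag, max_lag].
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(ref)):
            m = n + lag
            if 0 <= m < len(other):
                score += ref[n] * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Given the microphone spacing and the speed of sound, such a delay can be mapped to a direction of arrival, which a VAD could then use as an additional feature.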

Your Profile

  • Speech Communication 1 & 2
  • Interest in speech recognition and source separation inspired by psychoacoustic approaches
  • Matlab
  • Unix shell and HTK experience are a plus, but not necessary


[Zhang2013] X. Zhang and J. Wu. Deep Belief Networks Based Voice Activity Detection. IEEE Transactions on Audio, Speech, and Language Processing, 2013.

[Habib2013] T. Habib and H. Romsdorfer. Auditory inspired methods for localization of multiple concurrent speakers. Computer Speech & Language, 2013.

[Ma2012] N. Ma, J. Barker, H. Christensen and P. Green. Combining speech fragment decoding and adaptive noise floor modelling. IEEE Transactions on Audio, Speech, and Language Processing, 2012.