Signal Processing and Speech Communication Laboratory

Multichannel Voice Activity Detection for ASR

Status
Finished
Type
Master Thesis
Announcement date
12 Mar 2014
Student
Florian Iglisch
Mentors
Research Areas

The European project "Distant-speech Interaction for Robust Home Applications" (DIRHA) aims to build a system that controls a house through German voice commands captured by a network of microphones. Three important components of this project are the automatic speech recognizer (ASR), the voice activity detector (VAD) and the speech localizer (SLOC). The Deep-Belief-Network VAD (DBN-VAD) [Zhang2013] is a state-of-the-art VAD that can fuse many features (pitch, spectrogram, AMS, etc.) to decide whether a frame is speech or non-speech. These features can be improved by adding a noise reduction or enhancement stage. At the moment, we are working only with a single-channel signal. The purpose of this master’s thesis is to improve the enhancement stage, and thereby the VAD, by using spatial information derived from the multi-channel signal [Habib2013]. The final aim is to fuse the voice activity detector with the source localizer using a psychoacoustic approach [Ma2012].
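
To make the idea of frame-level feature fusion with a spatial cue concrete, the following is a minimal sketch in Python. It is not the DBN-VAD of [Zhang2013] and not the localization method of [Habib2013]: it swaps in two simple hand-made features (log energy and zero-lag inter-channel correlation) and a logistic fusion with hand-picked, untrained weights. All function names, weights, the frame length and the synthetic two-channel test signal are assumptions made for illustration only. The zero-lag correlation further assumes that the source reaches both microphones with negligible delay.

```python
import math
import random


def frame_features(ch1, ch2, frame_len=256):
    """Return one (log_energy, correlation) pair per non-overlapping frame."""
    feats = []
    n = min(len(ch1), len(ch2))
    for start in range(0, n - frame_len + 1, frame_len):
        a = ch1[start:start + frame_len]
        b = ch2[start:start + frame_len]
        e_a = sum(x * x for x in a)
        e_b = sum(x * x for x in b)
        # Feature 1: log of the mean power, averaged over both channels.
        log_energy = math.log10((e_a + e_b) / (2 * frame_len) + 1e-12)
        # Feature 2: zero-lag normalized cross-correlation, a crude spatial cue.
        # A source common to both channels pushes it towards 1, while
        # independent sensor noise keeps it near 0.
        corr = sum(x * y for x, y in zip(a, b)) / (math.sqrt(e_a * e_b) + 1e-12)
        feats.append((log_energy, corr))
    return feats


def fuse_and_decide(feats, w_energy=1.0, w_corr=2.0, bias=1.0):
    """Fuse both features with a logistic unit (hypothetical, untrained weights)."""
    decisions = []
    for log_energy, corr in feats:
        score = w_energy * log_energy + w_corr * corr + bias
        prob = 1.0 / (1.0 + math.exp(-score))
        decisions.append(prob > 0.5)  # True = frame labelled as speech
    return decisions


def make_toy_signal(n_noise=8192, n_tone=8192, fs=16000, seed=0):
    """Two channels: independent noise only, then a shared 200 Hz tone plus noise."""
    rng = random.Random(seed)
    ch1, ch2 = [], []
    for n in range(n_noise + n_tone):
        tone = 0.5 * math.sin(2 * math.pi * 200 * n / fs) if n >= n_noise else 0.0
        ch1.append(tone + rng.gauss(0.0, 0.01))
        ch2.append(tone + rng.gauss(0.0, 0.01))
    return ch1, ch2


ch1, ch2 = make_toy_signal()
decisions = fuse_and_decide(frame_features(ch1, ch2))
print("".join("1" if d else "0" for d in decisions))
```

On this toy signal the noise-only frames in the first half are labelled non-speech and the frames containing the shared tone are labelled speech. A real system would replace the hand-picked weights with a trained model and the zero-lag correlation with a delay-aware spatial feature.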

Your Tasks

  • Study literature
  • Obtain baseline results with current VAD and ASR on a simulated database
  • Improve the VAD using multi-channel information
  • Implement a real-time system in the kitchen-cocktail party room of the SPSC

Your Profile

  • Speech Communication 1 & 2
  • Interest in speech recognition and source separation inspired by psychoacoustic approaches
  • Matlab
  • Unix-Shell, HTK is a plus, but not necessary

References

[Zhang2013] X. Zhang and J. Wu. Deep Belief Networks Based Voice Activity Detection. IEEE Transactions on Audio, Speech, and Language Processing, 2013.

[Habib2013] T. Habib and H. Romsdorfer. Auditory inspired methods for localization of multiple concurrent speakers. Computer Speech & Language, 2013.

[Ma2012] N. Ma, J. Barker, H. Christensen and P. Green. Combining speech fragment decoding and adaptive noise floor modelling. IEEE Transactions on Audio, Speech, and Language Processing, 2012.