Auditory inspired methods for localization of multiple concurrent speakers

TitleAuditory inspired methods for localization of multiple concurrent speakers
Publication TypeJournal Article
Year of Publication2012
AuthorsHabib, T., & Romsdorfer H.
JournalComputer Speech & Language
Month9/2012
Abstract

The use of microphone arrays offers enhancements of speech signals recorded in meeting rooms and office spaces. A common solution for speech enhancement in realistic environments with ambient noise and multi-path propagation is the application of so-called beamforming techniques. Such beamforming algorithms enhance signals at the desired angle using constructive interference while attenuating signals coming from other directions by destructive interference. However, these techniques require as a priori the time difference of arrival information of the source. Therefore, the source localization and tracking algorithms are an integral part of such a system. The conventional localization algorithms deteriorate in realistic scenarios with multiple concurrent speakers. In contrast to conventional methods, the techniques presented in this paper make use of pitch information of speech signals in addition to the location information. This “position–pitch”-based algorithm pre-processes the speech signals by a multiband gammatone filterbank that is inspired from the auditory model of the human inner ear. The role of this gammatone filterbank is analyzed and discussed in details. For a robust localization of multiple concurrent speakers, a frequency-selective criterion is explored that is based on a study of the human neural system's use of correlations between adjacent sub-band frequencies. This frequency-selective criterion leads to improved localization performance. To further improve localization accuracy, an algorithm based on grouping of spectro-temporal regions formed by pitch cues is presented. All proposed speaker localization algorithms are tested using a multichannel database where multiple concurrent speakers are active. The real-world recordings were made with a 24-channel uniform circular microphone array using loudspeakers and human speakers under various acoustic environments including moving concurrent speaker scenarios. The proposed techniques produced a localization performance that was significantly better than the state-of-the-art baseline in the scenarios tested.

DOI10.1016/j.csl.2012.09.003
Short TitleComputer Speech & Language
Citation Key2543
SPSC cross-references
Research Area: