Localization, Characterization, and Tracking of Harmonic Sources: With Applications to Speech Signal Processing
A major goal in distant-speech recognition is to transform speech signals of a target speaker into symbols in order to trigger a dialog manager. Spatio-temporal filters, so-called beamformers, usually enhance the target speaker’s speech signals in a noisy and reverberant environment. However, a beamformer requires information on the target speaker’s position. A source localizer provides this information, which facilitates steering a beam toward the target speaker. Unfortunately, the beamformer also captures noise and reverberation, especially from the target speaker’s direction. To further reduce these artifacts, one can employ bandpass filters that emphasize the target speaker’s harmonic components. These bandpass filters, however, require information on the target speaker’s fundamental frequency. The problem becomes more challenging in the case of two or more target speakers; then a joint estimator is required.
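The enhancement chain described above, i.e., steering a beam toward the speaker and then emphasizing the harmonics at multiples of the fundamental frequency, can be illustrated with a minimal sketch. The function names (`delay_and_sum`, `harmonic_mask`), the frequency-domain fractional-delay implementation, and the rectangular comb mask are illustrative assumptions, not the implementation developed in this work:

```python
import numpy as np

def delay_and_sum(signals, fs, delays):
    """Delay-and-sum beamformer: time-align each channel by its
    steering delay (in seconds) and average the aligned channels.
    Fractional delays are applied in the frequency domain via a
    linear phase shift (circular, i.e., assumes framed signals)."""
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for x, tau in zip(signals, delays):
        X = np.fft.rfft(x)
        # advance channel by tau so that all channels line up
        out += np.fft.irfft(X * np.exp(2j * np.pi * freqs * tau), n=n)
    return out / signals.shape[0]

def harmonic_mask(signal, fs, f0, num_harmonics=10, bw=20.0):
    """Crude comb filter: keep only narrow bands (bandwidth bw in Hz)
    around the harmonics h*f0 and zero out all other content."""
    n = len(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    X = np.fft.rfft(signal)
    mask = np.zeros_like(freqs)
    for h in range(1, num_harmonics + 1):
        mask[np.abs(freqs - h * f0) < bw / 2] = 1.0
    return np.fft.irfft(X * mask, n=n)
```

For a uniform linear array, the steering delays would follow from the direction of arrival as `tau_m = m * d * cos(theta) / c` with microphone spacing `d` and speed of sound `c`; both parameters are exactly what the source localizer and fundamental-frequency estimator must supply.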
Two new and intuitive algorithms robustly localize and characterize simultaneously active acoustic harmonic sources that overlap in the spatial and frequency domains. They jointly determine the sources’ fundamental frequencies, their respective amplitudes, and their directions of arrival based on a non-parametric signal representation. Variable-scale sampling of unbiased cross-correlation functions facilitates the representation of these three parameters in a joint parameter space. An even better solution is to employ the chirp z-transform, compute the cross-spectrum between pairs of microphone signals, and weight the cross-spectrum’s magnitudes with a relative phase-delay mask. In both cases, a multidimensional maxima detector sparsifies the joint parameter space. Compared with alternative approaches based on cross-correlation functions and model-based dictionaries, the new algorithms solve the problem of pitch-period doubling, cope with one or more harmonic sources, and associate the determined parameters with their corresponding sources in a multidimensional sparse joint parameter space. State-of-the-art multiple-target trackers, e.g., trackers based on the probability hypothesis density recursion and the multi-Bernoulli recursion, track these parameters over time. Experiments based on synthetically generated harmonic signals, synthetically filtered speech signals under varying reverberant and noisy conditions, and real recordings yield promising results. A unique, comprehensive multi-sensor Austrian German speech corpus with moving and non-moving speakers provides recordings labeled with spatial and temporal information. This corpus facilitates the evaluation of estimators that jointly determine a speaker’s spatial and temporal parameters, including fundamental frequencies.
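The core idea of the cross-spectrum variant, i.e., weighting cross-spectrum magnitudes at harmonic frequencies with a relative phase-delay mask over a joint (fundamental frequency, inter-microphone delay) grid, might be sketched as follows. This toy sketch omits the chirp z-transform, the maxima detection that sparsifies the parameter space, and the safeguards against pitch-period doubling; the half-wave-rectified cosine mask and the function name `joint_f0_tdoa_score` are assumptions made here for illustration:

```python
import numpy as np

def joint_f0_tdoa_score(x1, x2, fs, f0_grid, tau_grid, num_harmonics=8):
    """Score every (f0, tau) pair: sum the cross-spectrum magnitude
    at the harmonics h*f0, weighted by the agreement between the
    observed cross-phase and the phase 2*pi*h*f0*tau expected for an
    inter-microphone delay tau (a relative phase-delay mask)."""
    n = len(x1)
    X12 = np.fft.rfft(x1) * np.conj(np.fft.rfft(x2))  # cross-spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    tau_grid = np.asarray(tau_grid)
    score = np.zeros((len(f0_grid), len(tau_grid)))
    for i, f0 in enumerate(f0_grid):
        for h in range(1, num_harmonics + 1):
            if h * f0 > fs / 2:          # skip harmonics above Nyquist
                break
            k = np.argmin(np.abs(freqs - h * f0))   # nearest DFT bin
            mag, phase = np.abs(X12[k]), np.angle(X12[k])
            expected = 2 * np.pi * h * f0 * tau_grid
            # half-wave-rectified cosine similarity as the phase mask
            score[i] += mag * np.maximum(np.cos(phase - expected), 0.0)
    return score
```

A source whose fundamental frequency and delay both match a grid point accumulates energy from all of its harmonics, so simultaneously active sources appear as separate peaks in the joint score, which is the property the multidimensional maxima detector then exploits.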
The joint recall measure, the root-mean-square error, and the cumulative distribution function of fundamental frequencies and/or directions of arrival serve as performance measures for the estimators. The optimal subpattern assignment distance and its components, e.g., the localization error and the labeling error, serve as performance measures for the multiple-target trackers. The evaluations show promising results: On average, both algorithms solve problems that cannot be solved by their predecessors and other algorithms. The two algorithms outperform existing algorithms in terms of the joint recall measure and the root-mean-square error, achieving root-mean-square errors of one hertz or one degree and smaller, which facilitates, e.g., distant-speech enhancement or source separation for automatic speech recognition. The optimal subpattern assignment distance as well as visualized tracks show that the sparse joint parameter space can be fed directly into a multiple-target tracker, yielding smooth tracks.
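For reference, the basic (unlabeled) optimal subpattern assignment distance between an estimated and a ground-truth parameter set can be computed as below. This is a minimal sketch assuming Euclidean point sets; the labeled OSPA variant used to score tracker output additionally penalizes label mismatches and is not shown:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa(X, Y, c=20.0, p=2):
    """OSPA distance between finite point sets X (m x d) and Y (n x d)
    with cutoff c and order p.  Returns (total, localization component,
    cardinality component), where total = (loc_p + card_p)**(1/p) on
    the p-th-power scale."""
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0, 0.0, 0.0
    if m > n:                       # by symmetry, let X be the smaller set
        X, Y, m, n = Y, X, n, m
    if m == 0:                      # one set empty: pure cardinality error
        return c, 0.0, c
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # pairwise Euclidean distances, clipped at the cutoff c
    D = np.minimum(np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1), c)
    row, col = linear_sum_assignment(D ** p)    # optimal assignment
    loc_p = (D[row, col] ** p).sum() / n        # localization term
    card_p = (c ** p) * (n - m) / n             # cardinality penalty
    total = (loc_p + card_p) ** (1.0 / p)
    return total, loc_p ** (1.0 / p), card_p ** (1.0 / p)
```

The cutoff `c` bounds the penalty for a missed or spurious estimate, so the measure remains meaningful when the estimated and true numbers of sources differ, which is exactly the multi-speaker situation evaluated here.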