Speech Enhancement Using Deep Neural Beamformers
Speech is the most important means of human communication, so the ability to separate speech from background interference is crucial. The human auditory system has a remarkable ability to separate one sound source from others: we can effortlessly follow one speaker in the presence of other speakers and background noise. Machines cannot do this as easily. Automatic speech recognition (ASR) does not work reliably in noisy environments, and even today's most advanced AI technology has severe problems understanding and transcribing speech in the presence of background noise and overlapping speech. Background noise occurs frequently in everyday situations: in households, in cafes, on public transport, and in other busy places. People in industrial areas are exposed to noise especially often. In addition to being dangerous, noise exposure leads to fatigue and an increased number of accidents, and it makes communication difficult. Noise protection systems, noise cancellation, and noise-robust ASR systems help to overcome these problems. However, effective noise cancellation and noise-robust ASR are ill-posed problems that are difficult to solve.
In this thesis, we present methods to enhance speech corrupted by noise. In particular, we evaluate state-of-the-art AI systems, i.e. deep neural networks (DNNs), for enhancing single-channel audio recordings using time-frequency gain masks. As in the human auditory system, spatial information is beneficial for noise cancellation. Therefore, we also use DNNs to estimate a spectral gain mask from noisy, multi-microphone speech signals. These novel data-driven beamforming methods improve the perceptual audio quality and speech intelligibility significantly. However, mask-based beamforming methods use real-valued DNNs and require an entire block of audio data at a time. Within this block, the signal statistics are assumed to be constant, which limits the capability to track moving sound sources. Moreover, real-valued DNNs do not exploit the full potential of data-driven, phase-aware beamforming. Therefore, we propose fully complex-valued DNN (cDNN) beamformers, which use complex-valued long short-term memory (LSTM) networks, complex-valued feed-forward layers, and complex-valued activation functions. With this approach, we no longer need to rely on mask-based beamforming methods: cDNNs are able to predict complex-valued beamforming weights directly from complex-valued microphone signals. Unlike a classical beamformer, the model estimates a set of optimal beamforming weights for each time-frequency bin. This leads to statistically significant improvements in perceptual audio quality and speech intelligibility, and to lower computational complexity, compared to state-of-the-art systems.
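To make the last step concrete, the following is a minimal sketch of how per-bin complex-valued beamforming weights could be applied to multi-microphone signals in the short-time Fourier transform domain. It assumes the common filter-and-sum convention y(t, f) = w(t, f)^H x(t, f). The function name, the array layout (microphones, frames, frequency bins), and the toy data are illustrative assumptions, not details taken from the thesis, and the weights here are fixed by hand rather than predicted by a cDNN.

```python
import numpy as np


def apply_beamforming_weights(weights, mics):
    """Filter-and-sum per time-frequency bin.

    Computes y[t, f] = sum_m conj(w[m, t, f]) * x[m, t, f].
    Both inputs are complex arrays of shape (M, T, F); this layout is
    an assumption made for the sketch.
    """
    if weights.shape != mics.shape:
        raise ValueError("weights and mics must share shape (M, T, F)")
    return np.einsum("mtf,mtf->tf", weights.conj(), mics)


# Toy data (hypothetical): 4 microphones, 10 frames, 257 frequency bins.
rng = np.random.default_rng(0)
M, T, F = 4, 10, 257
source = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))

# Every microphone observes the same source, without noise or delay.
mics = np.broadcast_to(source, (M, T, F))

# Hand-set uniform weights stand in for the weights a network would predict.
weights = np.full((M, T, F), 1.0 / M, dtype=complex)

enhanced = apply_beamforming_weights(weights, mics)
```

In this toy setting the uniform weights simply average identical channels, so the output reproduces the source. A learned beamformer would instead supply a different weight vector for every time-frequency bin, which is what allows it to follow changing signal statistics.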