Towards the Evolution of Neural Acoustic Beamformers
- Status: Finished
- Student: Lukas Pfeifenberger
- Mentor: Franz Pernkopf
- Research Areas
Neural beamforming merges two scientific disciplines: acoustic beamforming and artificial neural networks. While the former uses statistical signal processing to spatially separate signals such as human speech, the latter uses non-linear function approximators to perform signal classification or regression tasks. Classical beamforming is used in unsupervised tasks such as denoising or isolating sources with a known position; in these applications, the beam is steered towards the desired source. For tasks such as speaker tracking or blind source separation, the locations of the individual speakers are unknown, rendering the problem ill-posed. Neural networks help to solve this class of problems by inferring the missing information from the underlying distribution of the multi-channel audio data. This symbiosis between beamforming and neural networks allows us to tackle hard problems such as the cocktail party scenario.
This thesis explores the evolution of neural beamforming from modest post-filters to complete blind speaker separation systems, covering four distinct topics:
- Mask-based beamforming, which extracts a single speaker from background noise. A neural network estimates a speech mask in the frequency domain, which is then used to obtain a classical beamformer. Here, we present our Eigennet structure, which exploits the spatial information embedded in the dominant eigenvector of the spatial power spectral density matrix of the noisy microphone inputs.
- Complex-valued neural beamforming, where complex-valued neural networks predict beamforming weights in the frequency domain. This enables the beamformer to react quickly to location changes such as speaker movement. The concept outperforms classical beamformers, as the neural network directly optimizes the max-SNR objective of the beamformer. We present our CNBF architecture, which uses Wirtinger calculus to derive the complex-valued recurrent network layers and non-holomorphic functions required for beamforming.
- Time-domain neural beamforming, which introduces the concept of cross-domain learning. It formulates the beamforming principle in a latent space learned by a neural network, and synthesizes the enhanced signal directly in the time domain. This approach is completely detached from any physical representation of sound waves and from classical beamforming algorithms. Our TDNBF formulation provides solutions for problems such as low-latency beamforming, dereverberation, and non-linear residual echo cancellation.
- Blind source separation, where we propose a monolithic, all-in-one solution that performs multi-speaker separation, dereverberation, and speaker diarization with a single neural network, termed the BSSD architecture. This approach is capable of solving the cocktail party problem with an unknown number of speakers. It uses an analytic or statistical adaptation layer, which virtually moves each identified speech source to the coordinate origin of the microphone array, from where it is extracted and dereverberated by a time-domain neural network. The system was developed with application-driven constraints in mind: a reverberant environment, an unknown number of speakers, low latency, and real-time processing of small blocks of audio at a time.
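As a minimal illustration of the mask-based approach, the sketch below computes mask-weighted speech and noise power spectral density matrices for a single frequency bin and derives max-SNR beamforming weights from the dominant generalized eigenvector. The oracle mask, toy steering vector, and all shapes are placeholder assumptions for illustration; this is not the Eigennet architecture itself, which learns the mask from data.

```python
import numpy as np

def psd(X, weights):
    """Weighted spatial power spectral density matrix (channels x channels)."""
    w = weights / (weights.sum() + 1e-8)
    return (X * w) @ X.conj().T

def gev_weights(X, mask):
    """Beamforming weights: dominant eigenvector of the generalized
    eigenvalue problem Phi_n^-1 Phi_s v = lambda v (max-SNR criterion)."""
    phi_s = psd(X, mask)        # speech PSD, weighted by the speech mask
    phi_n = psd(X, 1.0 - mask)  # noise PSD, weighted by the inverted mask
    vals, vecs = np.linalg.eig(np.linalg.solve(phi_n, phi_s))
    return vecs[:, np.argmax(vals.real)]

# Toy data for one frequency bin: 4 microphones, 100 STFT frames.
rng = np.random.default_rng(0)
d = rng.standard_normal(4) + 1j * rng.standard_normal(4)  # steering vector
s = np.zeros(100, dtype=complex)                          # source signal,
s[:50] = rng.standard_normal(50) + 1j * rng.standard_normal(50)  # active in frames 0-49
noise = 0.1 * (rng.standard_normal((4, 100)) + 1j * rng.standard_normal((4, 100)))
X = np.outer(d, s) + noise

mask = (np.abs(s) > 0).astype(float)  # oracle mask; a network would estimate this
w = gev_weights(X, mask)
y = w.conj() @ X                      # enhanced single-channel output
```

The eigenvector is only defined up to a complex scale, so a practical system adds a normalization or post-filter stage before synthesis.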
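Complex-valued beamforming relies on Wirtinger calculus to differentiate real-valued, non-holomorphic objectives with respect to complex weights. The sketch below uses a hypothetical output-power objective f(w) = w^H Phi w (chosen here for illustration, not taken from the CNBF layers) and verifies numerically that the Wirtinger derivative df/dw* = Phi w predicts the change of f under a small complex step.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
Phi = A @ A.conj().T                  # Hermitian PSD matrix (toy data)

def f(w):
    """Real-valued output power: not complex-differentiable in w."""
    return (w.conj() @ Phi @ w).real

def wirtinger_grad(w):
    """Wirtinger derivative df/dw* for f(w) = w^H Phi w."""
    return Phi @ w

w = rng.standard_normal(3) + 1j * rng.standard_normal(3)
g = wirtinger_grad(w)

# First-order expansion: df = 2 Re(g^H dw) for a small complex step dw.
dw = 1e-6 * (rng.standard_normal(3) + 1j * rng.standard_normal(3))
df_numeric = f(w + dw) - f(w)
df_wirtinger = 2.0 * (g.conj() @ dw).real
```

Gradient descent along -df/dw* is exactly the update rule that automatic differentiation frameworks apply when training complex-valued layers on such objectives.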
Throughout this thesis, all methods are evaluated experimentally on multi-channel recordings from a variety of acoustic environments. We demonstrate their respective performance using metrics such as the word error rate (WER) and the signal-to-distortion ratio (SDR).
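As a note on the distortion metric, one widely used scale-invariant variant (SI-SDR) can be computed as below. This particular definition is an assumption for illustration; the thesis may report other SDR variants such as the BSS Eval SDR.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between time-domain estimate and reference."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference       # projection of the estimate onto the reference
    residual = estimate - target     # everything not explained by the reference
    return 10.0 * np.log10(np.dot(target, target) / np.dot(residual, residual))

# Usage: a clean reference degraded by 10% additive noise scores near 20 dB.
rng = np.random.default_rng(2)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)
score = si_sdr(est, ref)
```

The scale invariance comes from the projection step: multiplying the estimate by any non-zero gain leaves the score unchanged.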