Source-Filter Model Based Single Channel Speech Separation

In a natural acoustic environment, multiple sources are usually active at the same time. The task of source separation is to estimate the individual source signals from this complex mixture. The challenge of single channel source separation (SCSS) is to recover more than one source from a single observation. Broadly, SCSS approaches can be divided into methods that try to mimic the human auditory system and model-based methods, which find a probabilistic representation of the individual sources and employ this prior knowledge for inference. This thesis presents several strategies for the separation of two speech utterances mixed into a single channel and is structured in four parts. The first part reviews factorial models in model-based SCSS and introduces the soft-binary mask for signal reconstruction. This mask shows improved performance compared to the soft and the binary masks in automatic speech recognition (ASR) experiments. The second part addresses the computational complexity of factorial models, which limits their application to online processing. We introduce the fast beam search and the iterated conditional modes (ICM) approximation techniques. They reduce the computational complexity of factorial models by up to two orders of magnitude while maintaining the separation performance. Moreover, there is strong evidence that the ICM algorithm breaks the factorial structure entirely, so that the complexity grows linearly in the number of hidden states rather than factorially. The third part deals with arbitrary mixing levels in factorial models by explicitly modeling the gain for each speech segment, which results in a shape-gain model. Several strategies for parallel estimation of gain and shape are successfully evaluated. Finally, the last part integrates a speech production model into model-based systems. This results in a source-filter representation, where the source signal can be linked to the excitation signal of the vocal folds and the filter accounts for the vocal-tract shaping. Our final separation algorithm combines the shape-gain model with the source-filter model, reflecting the complete standard speech production model. All presented algorithms are compared to state-of-the-art algorithms and evaluated in terms of both the target-to-masker ratio and the word error rate of an ASR system, and they show improvements beyond the state of the art.
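The complexity argument for ICM can be illustrated with a toy example. The following is a minimal sketch and not the thesis implementation: it assumes two small codebooks of log-power spectra (`cb1`, `cb2`, random here), a log-max (mixture-maximization) interaction between the sources, and a plain squared-error cost. All function names are hypothetical. Exhaustive search scores every state pair, whereas ICM updates one speaker's state while the other is held fixed, so each sweep costs only the sum of the codebook sizes. The two mask functions use generic textbook definitions; the soft-binary mask of the thesis is not reproduced here.

```python
import numpy as np

def pair_cost(y, a, b):
    """Squared error between the observed log-spectrum y and the
    log-max combination of two candidate source log-spectra."""
    return float(np.sum((y - np.maximum(a, b)) ** 2))

def exhaustive_search(y, cb1, cb2):
    """Score every state pair: K1 * K2 cost evaluations."""
    costs = np.array([[pair_cost(y, a, b) for b in cb2] for a in cb1])
    i, j = np.unravel_index(np.argmin(costs), costs.shape)
    return int(i), int(j), float(costs[i, j])

def icm_search(y, cb1, cb2, i=0, j=0, max_iter=20):
    """Iterated conditional modes: optimize one state with the other fixed.
    Each sweep costs K1 + K2 evaluations; the result may be a local optimum."""
    for _ in range(max_iter):
        i_new = int(np.argmin([pair_cost(y, a, cb2[j]) for a in cb1]))
        j_new = int(np.argmin([pair_cost(y, cb1[i_new], b) for b in cb2]))
        if (i_new, j_new) == (i, j):
            break  # no state changed, so a fixed point is reached
        i, j = i_new, j_new
    return i, j, pair_cost(y, cb1[i], cb2[j])

def binary_mask(a, b):
    """1 where source a dominates source b in a frequency bin, else 0."""
    return (a >= b).astype(float)

def soft_mask(a, b):
    """Ratio-style soft mask from log-power spectra (an assumed definition)."""
    pa, pb = np.exp(a), np.exp(b)
    return pa / (pa + pb)

# Toy data: 8 states per speaker, 16 frequency bins (arbitrary sizes).
rng = np.random.default_rng(0)
cb1 = rng.normal(size=(8, 16))
cb2 = rng.normal(size=(8, 16))
y = np.maximum(cb1[3], cb2[5])  # mixture built from a known state pair

i_ex, j_ex, c_ex = exhaustive_search(y, cb1, cb2)
i_icm, j_icm, c_icm = icm_search(y, cb1, cb2)
m_bin = binary_mask(cb1[i_ex], cb2[j_ex])
m_soft = soft_mask(cb1[i_ex], cb2[j_ex])
```

In this sketch the exhaustive search recovers the generating pair by construction, while ICM is only guaranteed not to increase the cost from its starting point; how close it gets to the global optimum depends on the initialization and on the data.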

 

This thesis is supervised by Gernot Kubin and Franz Pernkopf.