Source-Filter Model Based Single Channel Speech Separation
- Status: Finished
- Student: Michael Stark
- Mentor: Franz Pernkopf
- Research Areas:
In a natural acoustic environment, multiple sources are usually active at the same time. The task of source separation is to estimate the individual source signals from this complex mixture. The challenge of single channel source separation (SCSS) is to recover more than one source from a single observation. Broadly, SCSS methods can be divided into those that try to mimic the human auditory system and model-based methods, which find a probabilistic representation of the individual sources and employ this prior knowledge for inference. This thesis presents several strategies for separating two speech utterances mixed into a single channel and is structured in four parts.

The first part reviews factorial models in model-based SCSS and introduces the soft-binary mask for signal reconstruction. This mask shows improved performance compared to the soft and the binary masks in automatic speech recognition (ASR) experiments.

The second part addresses the computational complexity of factorial models, which limits their use for online processing. We introduce the fast beam search and the iterated conditional modes (ICM) approximation techniques. They reduce the computational complexity of factorial models by up to two orders of magnitude while maintaining separation performance. Moreover, there is strong evidence that the ICM algorithm breaks the factorial structure entirely, leading to complexity that is linear rather than factorial in the number of hidden states.

The third part deals with arbitrary mixing levels in factorial models by explicitly modeling the gain for each speech segment, which results in a shape-gain model. Several strategies for the parallel estimation of gain and shape are evaluated successfully.

The last part integrates a speech production model into the model-based system. This results in a source-filter representation, where the source signal can be linked to the excitation signal of the vocal folds and the filter accounts for the vocal-tract shaping. Our final separation algorithm combines the shape-gain model with the source-filter model, reflecting the complete standard speech production model.

All presented algorithms are compared to state-of-the-art methods and evaluated in terms of both the target-to-masker ratio and the word error rate of an ASR system, showing improvements over the state of the art.
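To make the mask-based reconstruction of the first part concrete, the sketch below contrasts the binary mask, the Wiener-style soft mask, and one plausible way to blend them into a soft-binary mask. The exact soft-binary definition from the thesis is not reproduced here; the blending rule, the `uncertainty` parameter, and all function names are assumptions for illustration only.

```python
import numpy as np

def separation_masks(s1_mag, s2_mag, uncertainty=0.2):
    """Compute binary, soft, and an illustrative soft-binary mask
    from magnitude-spectrogram estimates of two sources.

    uncertainty : half-width of the band around 0.5 in which soft
    values are kept; outside it the mask is binarized (assumed rule).
    """
    eps = 1e-12
    # Wiener-style soft mask: relative energy of source 1 per TF bin.
    soft = s1_mag**2 / (s1_mag**2 + s2_mag**2 + eps)
    # Binary mask: winner-takes-all per time-frequency bin.
    binary = (soft > 0.5).astype(float)
    # Soft-binary (assumed form): binarize confident bins, keep soft
    # values where the decision is ambiguous.
    ambiguous = np.abs(soft - 0.5) <= uncertainty
    soft_binary = np.where(ambiguous, soft, binary)
    return binary, soft, soft_binary

# Toy usage on random stand-ins for magnitude spectrograms.
rng = np.random.default_rng(0)
b, s, sb = separation_masks(rng.random((257, 100)), rng.random((257, 100)))
```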
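The ICM approximation of the second part can be illustrated as coordinate-wise maximization over the factorial state space: instead of scoring all N² state pairs per frame, one speaker's state is optimized while the other is held fixed, and the two updates alternate. A minimal sketch, assuming a per-frame log-likelihood callable `score(i, j)` and ignoring temporal dynamics; all names are illustrative.

```python
import numpy as np

def icm_state_pair(score, n_states, n_iters=5, init=(0, 0)):
    """Iterated conditional modes over a factorial state space.

    Each sweep evaluates only O(n_states) pairs per speaker instead
    of the full n_states**2 grid.
    """
    i, j = init
    for _ in range(n_iters):
        # Update speaker 1's state with speaker 2 fixed.
        i_new = max(range(n_states), key=lambda a: score(a, j))
        # Update speaker 2's state with speaker 1 fixed.
        j_new = max(range(n_states), key=lambda b: score(i_new, b))
        if (i_new, j_new) == (i, j):  # converged to a local optimum
            break
        i, j = i_new, j_new
    return i, j

# Toy example: a random table standing in for the factorial
# observation likelihood of one mixture frame.
rng = np.random.default_rng(1)
table = rng.standard_normal((64, 64))
i_hat, j_hat = icm_state_pair(lambda a, b: table[a, b], n_states=64)
```

Each sweep touches 2N pairs instead of N², which is where the speed-up over the exhaustive factorial search comes from; like any ICM scheme it converges to a local optimum, so initialization matters.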
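One simple ingredient behind the shape-gain model of the third part is that, for a fixed spectral shape s, the least-squares gain has the closed form g = ⟨x, s⟩ / ⟨s, s⟩, so the gain can be evaluated for every codebook shape at once. The sketch below shows only this generic building block, not the thesis's estimation strategy; the codebook and names are hypothetical.

```python
import numpy as np

def best_shape_and_gain(x, shapes):
    """Pick the codebook shape and per-segment gain that best
    explain a speech segment x in the least-squares sense.

    x      : (d,)   segment feature vector (e.g., a spectral frame)
    shapes : (K, d) codebook of spectral shapes
    """
    # Closed-form optimal gain for every shape simultaneously.
    gains = shapes @ x / np.einsum('kd,kd->k', shapes, shapes)
    # Residual energy of each gain-scaled shape.
    residuals = np.sum((x - gains[:, None] * shapes) ** 2, axis=1)
    k = int(np.argmin(residuals))
    return k, gains[k]

rng = np.random.default_rng(2)
codebook = rng.random((256, 129)) + 0.1     # hypothetical shape codebook
segment = 3.0 * codebook[42] + 0.05 * rng.random(129)
k, g = best_shape_and_gain(segment, codebook)  # recovers shape 42, g ~ 3
```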
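The source-filter representation of the last part follows the standard speech production model: a vocal-fold excitation driving a vocal-tract filter. Linear prediction is one common way to perform such a split, and the sketch below is an assumption of that kind, not the thesis's algorithm: it estimates LPC coefficients from the autocorrelation sequence and inverse-filters the frame to obtain the excitation residual.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def source_filter_split(frame, order=12):
    """Split a speech frame into filter (vocal tract) and source
    (excitation) via linear prediction.

    Solves the Toeplitz normal equations R a = r, then inverse-filters
    with A(z) = 1 - sum_k a_k z^-k to obtain the residual.
    """
    n = len(frame)
    # Biased autocorrelation up to the LPC order.
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    r[0] += 1e-8                           # regularize against silence
    a = solve_toeplitz(r[:order], r[1:])   # LPC coefficients a_1..a_p
    # Inverse filter: excitation = A(z) applied to the frame.
    excitation = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
    return a, excitation

# Toy usage: an AR(2) process whose coefficients LPC should recover.
rng = np.random.default_rng(3)
e = rng.standard_normal(4096)
x = lfilter([1.0], [1.0, -1.3, 0.7], e)   # synthetic "vocal tract"
a, res = source_filter_split(x, order=2)  # a approx. [1.3, -0.7]
```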