Signal Processing and Speech Communication Laboratory

PhD defense Michael Wohlmayr

Start date/time
Tue Jun 5 08:00:00 2012
End date/time
Tue Jun 5 08:00:00 2012
CGV seminar room, Inffeldgasse 16c, 2nd floor

Chairman: Ao.Univ.-Prof.Dr. E. BRENNER
Examiner: Assoc.-Prof.Dr. F. PERNKOPF
Examiner: Adjunct-Prof.Dr. T. VIRTANEN (Tampere University of Technology)

Probabilistic Model-Based Multiple Pitch Tracking of Speech
Multiple pitch tracking of speech is an important task for the segregation of multiple speakers in a single-channel recording. In this thesis, a probabilistic model-based approach for estimation and tracking of multiple pitch trajectories is proposed. A probabilistic model that captures pitch-dependent characteristics of the single-speech short-time spectrum is obtained a priori from training speech. The resulting speaker model, which is based on Gaussian mixture models, can be trained in either a speaker-independent (SI) or a speaker-dependent (SD) fashion. Speaker models are then combined using a speaker interaction model to obtain a probabilistic description of the observed speech mixture. A factorial hidden Markov model is applied for tracking the pitch trajectories of parallel speakers over time.
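To illustrate the tracking step, the following is a minimal sketch of joint Viterbi decoding in a factorial hidden Markov model with two independent pitch chains. It is not the thesis implementation. The function name `factorial_viterbi`, the discretization into K pitch states per speaker, and the precomputed table `loglik[t][i][j]` (standing in for the output of the speaker and interaction models) are all assumptions made for this example.

```python
def factorial_viterbi(loglik, log_a1, log_a2, log_pi1, log_pi2):
    """Most likely joint pitch-state path for two speakers.

    loglik[t][i][j]: assumed log-likelihood of frame t given pitch state i
                     (speaker 1) and pitch state j (speaker 2).
    log_a1, log_a2:  per-speaker log transition matrices, [previous][next].
    log_pi1, log_pi2: per-speaker log initial state probabilities.
    """
    T, K = len(loglik), len(log_pi1)
    states = range(K)
    delta = [[log_pi1[i] + log_pi2[j] + loglik[0][i][j] for j in states]
             for i in states]
    back = []
    for t in range(1, T):
        # Step 1: for each (new i, previous j), maximize over previous i.
        best1 = [[0] * K for _ in states]
        tmp = [[0.0] * K for _ in states]
        for i in states:
            for jp in states:
                ip = max(states, key=lambda p: delta[p][jp] + log_a1[p][i])
                best1[i][jp] = ip
                tmp[i][jp] = delta[ip][jp] + log_a1[ip][i]
        # Step 2: for each (new i, new j), maximize over previous j.
        new_delta = [[0.0] * K for _ in states]
        ptr = [[None] * K for _ in states]
        for i in states:
            for j in states:
                jp = max(states, key=lambda q: tmp[i][q] + log_a2[q][j])
                new_delta[i][j] = tmp[i][jp] + log_a2[jp][j] + loglik[t][i][j]
                ptr[i][j] = (best1[i][jp], jp)
        delta = new_delta
        back.append(ptr)
    # Backtrack from the best final joint state.
    i, j = max(((a, b) for a in states for b in states),
               key=lambda s: delta[s[0]][s[1]])
    path = [(i, j)]
    for ptr in reversed(back):
        i, j = ptr[i][j]
        path.append((i, j))
    path.reverse()
    return path
```

Because the two chains evolve independently, the maximization over the previous joint state can be split into two steps, one per speaker. Each frame still requires evaluating all K × K pitch combinations, and this cost grows exponentially with the number of speakers.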
The probabilistic model-based approach can explicitly incorporate timbral information and all associated uncertainties of spectral structure into the model. While SI models allow ad hoc use in situations where the speakers in a recording are unknown, SD models have the great advantage that pitch trajectories can be assigned to their corresponding speakers. The accuracy of the proposed method is evaluated on two speech databases and compared to a state-of-the-art algorithm for multi-pitch tracking of speech. Two problems related to the proposed approach are addressed:
(i) Exact inference is computationally demanding, mainly because the solution is obtained by considering all possible pitch combinations across speakers. A novel method for approximate inference based on likelihood pruning is proposed. The method is based on novel, computationally efficient upper and lower bounds on the likelihood of pitch combinations. The approximate method is experimentally evaluated in terms of accuracy and time requirements, and results for tracking the pitch of three parallel speakers are demonstrated.
(ii) Any mismatch between training and testing conditions (such as different channel conditions or gain mismatches) deteriorates the accuracy of multi-pitch tracking. It is desirable to adapt speaker models to novel environmental conditions during multi-pitch tracking, i.e. in situations where only a mixture of parallel speakers is available. We propose a modification of maximum likelihood linear regression (MLLR) in which the adaptation of model parameters is constrained to modifications of the spectral envelope. This constraint is beneficial when little adaptation data is available. Based on this, we propose a novel EM algorithm for adaptation of speaker models from speech mixtures, and demonstrate tracking results obtained for real-room recordings of two parallel speakers.
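The general idea behind the bound-based pruning in item (i) can be sketched as follows. The abstract does not state the actual bounds or the pruning rule used in the thesis, so this example assumes a generic and conservative rule: a pitch combination is discarded when its upper bound falls below the best lower bound found among all candidates. The names `prune_and_maximize`, `lower`, `upper` and `exact` are hypothetical.

```python
def prune_and_maximize(candidates, lower, upper, exact):
    """Find the best candidate while skipping provably inferior ones.

    lower(c) and upper(c) are assumed to be cheap bounds that satisfy
    lower(c) <= exact(c) <= upper(c); exact(c) is the costly likelihood.
    """
    # The best lower bound is a score that some candidate is sure to reach.
    best_lower = max(lower(c) for c in candidates)
    # A candidate whose upper bound is below that score cannot be the maximum.
    survivors = [c for c in candidates if upper(c) >= best_lower]
    # Only the survivors need the expensive exact evaluation.
    best = max(survivors, key=exact)
    return best, survivors
```

With valid bounds this rule never discards the true maximizer. An approximate scheme would trade some of that safety for speed, for example by pruning more aggressively.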