PhD defense Michael Wohlmayr
- Start date/time
- Tue Jun 5 08:00:00 2012
- End date/time
- Tue Jun 5 08:00:00 2012
- CGV seminar room, Inffeldgasse 16c, 2nd floor
Chairman: Ao.Univ.-Prof.Dr. E. BRENNER
Examiner: Assoc.-Prof.Dr. F. PERNKOPF
Examiner: Adjunct-Prof.Dr. T. VIRTANEN (Tampere University of Technology)
Probabilistic Model-Based Multiple Pitch Tracking of Speech
Multiple pitch tracking of speech is an important task for the segregation of multiple speakers in a single-channel recording. In this thesis, a probabilistic model-based approach for estimation and tracking of multiple pitch trajectories is proposed. A probabilistic model that captures pitch-dependent characteristics of the single-speech short-time spectrum is obtained a priori from training speech. The resulting speaker model, which is based on Gaussian mixture models, can be trained in either a speaker-independent (SI) or a speaker-dependent (SD) fashion. Speaker models are then combined using a speaker interaction model to obtain a probabilistic description of the observed speech mixture. A factorial hidden Markov model is applied to track the pitch trajectories of parallel speakers over time.
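As a rough illustration of the tracking step, the following minimal sketch runs exact Viterbi decoding in a factorial hidden Markov model with two chains, one per speaker. It is not the implementation from the thesis: the function name `factorial_viterbi`, the array layout, and the choice of a single transition matrix shared by both speakers are assumptions made for this example. The per-frame joint log-likelihoods `log_obs`, which in the thesis would come from the speaker models and the speaker interaction model, are simply taken as given.

```python
import numpy as np

def factorial_viterbi(log_obs, log_trans, log_prior):
    """Most likely joint pitch-state sequence for two speakers.

    log_obs:   (T, K, K) log-likelihood of frame t given pitch states (i, j)
    log_trans: (K, K) log transition matrix, assumed shared by both speakers
    log_prior: (K,) log prior over pitch states, assumed shared as well
    Returns an integer array of shape (T, 2).
    """
    T, K, _ = log_obs.shape
    # delta[i, j]: best log score of any path ending in joint state (i, j)
    delta = log_prior[:, None] + log_prior[None, :] + log_obs[0]
    back = np.zeros((T, K, K, 2), dtype=int)
    for t in range(1, T):
        # The joint transition factorises, so maximise over the previous
        # state of speaker 1 first, then over that of speaker 2.
        s1 = delta[:, None, :] + log_trans[:, :, None]  # (i_prev, i_new, j_prev)
        arg1 = s1.argmax(axis=0)                        # (i_new, j_prev)
        m1 = s1.max(axis=0)                             # (i_new, j_prev)
        s2 = m1[:, :, None] + log_trans[None, :, :]     # (i_new, j_prev, j_new)
        arg2 = s2.argmax(axis=1)                        # best j_prev per (i_new, j_new)
        delta = s2.max(axis=1) + log_obs[t]
        back[t, :, :, 1] = arg2
        # Best i_prev, looked up at the chosen j_prev
        back[t, :, :, 0] = np.take_along_axis(arg1, arg2, axis=1)
    # Backtrace from the best final joint state
    i, j = np.unravel_index(delta.argmax(), delta.shape)
    path = [(i, j)]
    for t in range(T - 1, 0, -1):
        i, j = back[t, i, j]
        path.append((i, j))
    path.reverse()
    return np.array(path)
```

Even with the factorised maximisation, each frame costs on the order of K^3 operations for two speakers, and the joint state space grows exponentially with the number of speakers.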
The probabilistic model-based approach can explicitly incorporate timbral information and all associated uncertainties of spectral structure into the model. While SI models allow ad hoc use in situations where the speakers in a recording are unknown, SD models have the great advantage that pitch trajectories can be assigned to their corresponding speakers. The accuracy of the proposed method is evaluated on two speech databases and compared to a state-of-the-art algorithm for multi-pitch tracking of speech. Two problems related to the proposed approach are addressed:
(i) Exact inference has a high computational demand, mainly because the solution is obtained by considering all possible pitch combinations across speakers. A novel method for approximate inference based on likelihood pruning is proposed. The method relies on novel, computationally efficient upper and lower bounds on the likelihood of pitch combinations. The approximate method is experimentally evaluated in terms of accuracy and time requirements, and results for tracking the pitch of three parallel speakers are demonstrated.
(ii) Any mismatch between training and testing conditions (such as different channel conditions or gain mismatches) degrades the accuracy of multi-pitch tracking. It is desirable to adapt speaker models to novel environmental conditions during multi-pitch tracking, i.e., in situations where only a mixture of parallel speakers is available. We propose a modification of maximum likelihood linear regression (MLLR) in which the adaptation of model parameters is constrained to modifications of the spectral envelope. This constraint is beneficial when little adaptation data is available. Based on this, we propose a novel EM algorithm for adaptation of speaker models from speech mixtures, and demonstrate tracking results obtained for real-room recordings of two parallel speakers.
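The pruning idea in (i) can be illustrated with a generic bound-based scheme for a single frame. The abstract specifies neither the bounds nor the pruning rule used in the thesis, so everything below is an assumption: the function name `prune_and_score`, the `margin` parameter, and the callback `exact_fn` are hypothetical, and the bounds are taken as precomputed arrays.

```python
import numpy as np

def prune_and_score(lower, upper, exact_fn, margin=0.0):
    """Evaluate the exact log-likelihood only for promising pitch pairs.

    lower, upper: (K, K) arrays that bound the exact log-likelihood of
                  each pitch combination (i, j) from below and above
    exact_fn:     callable (i, j) -> exact log-likelihood (the costly step)
    margin:       hypothetical slack; larger values keep more candidates
    Returns (scores, keep), where pruned entries of scores are -inf.
    """
    # A combination whose upper bound is below the best lower bound
    # cannot be the most likely one, so it is discarded.
    threshold = lower.max() - margin
    keep = upper >= threshold
    scores = np.full(lower.shape, -np.inf)
    for i, j in zip(*np.nonzero(keep)):
        scores[i, j] = exact_fn(i, j)
    return scores, keep
```

With valid bounds and `margin=0.0`, the single most likely combination always survives. The result is still an approximation when used for tracking, because combinations that are not the per-frame maximum may be removed even though they could lie on the best path over time.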