Signal Processing and Speech Communication Laboratory

Probabilistic Graphical Models for Time-Series Signal Mixtures

Period
2013 — 2016
Funding
Austrian Science Fund

    Robustness against reverberation, noise, and interfering audio signals is one of the grand challenges in speech recognition, speech understanding, and audio analysis technology. One avenue for approaching this challenge is single-channel audio separation. Recently, factorial hidden Markov models won the single-channel speech separation and recognition challenge. These models can represent acoustic scenes with multiple sources interacting over time. While they reach super-human performance on specific tasks, serious limitations still restrict their applicability in many areas.

    We aim to generalize these models and enhance their applicability in several respects: (i) Introduction of discriminative large-margin learning techniques. This focuses the model specification on the most salient differences, i.e. the discriminating information, between interfering sources. (ii) Development of efficient inference approaches. Efficient inference is needed because the computational demands of exact inference in factorial hidden Markov models scale exponentially with the number of sources, i.e. inference is intractable in tasks with many interacting sources. (iii) Adaptation of the model parameters to the specific situation (e.g. the actual speakers, gain, etc.) during separation, using only speech mixture data. Currently, source-specific monaural data is required to learn the models. We therefore propose an expectation-maximization-like iterative adaptation framework initialized with universal models, e.g. speaker-independent models, which greatly increases the utility of these models.
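To make the cost mentioned in point (ii) concrete, the following minimal sketch (an illustration only, not the project's implementation) counts the joint states of a factorial hidden Markov model: with K hidden states per source and M sources, exact inference operates on a product chain with K^M states, so the forward pass scales with the square of that number per time step.

```python
def joint_states(num_states_per_source, num_sources):
    """Size of the product state space of a factorial HMM:
    K states per source and M sources give K**M joint states."""
    return num_states_per_source ** num_sources

def forward_cost(num_frames, num_states_per_source, num_sources):
    """Rough operation count of the exact forward algorithm run on the
    product chain: O(T * (K**M)**2), i.e. exponential in M."""
    return num_frames * joint_states(num_states_per_source, num_sources) ** 2

# Two speakers with 256 acoustic states each already yield 65536 joint
# states; every additional source multiplies this by another factor of 256.
print(joint_states(256, 2))  # 65536
```

This exponential blow-up is why approximate or structured inference schemes are needed once more than two or three sources interact.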

    The derived models and methods are applied to single-channel speech separation, tracking of the fundamental frequency of concurrent speakers, and benchmark classification scenarios. The overall goal is to devise next-generation time-series models well suited to monaural audio data generated by multiple interacting sources. Such models are also appealing to related fields that require signal separation, for example resolving interactions in brain-scan images or seismic data.