Unsupervised SCSS for conversational speech
- Status
- in progress
- Type
- Master Project
- Announcement date
- 01 Oct 2025
- Student
- Mentors
- Research Areas
Modern deep learning approaches to Single Channel Source Separation (SCSS) achieve remarkable results, yet they are predominantly trained on artificial data mixtures. Our previous work revealed that these models suffer a severe performance degradation of up to 6 dB when confronted with “in the wild” conversational speech, as captured in our realistic GRASS corpus [1].
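Degradation figures of this kind are expressed in dB via objective separation metrics; a common choice in SCSS work is the scale-invariant signal-to-distortion ratio (SI-SDR). The minimal NumPy sketch below illustrates how such a figure is computed; the function name and the choice of SI-SDR are illustrative assumptions and not necessarily the metric used in [1].

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```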
The fundamental challenge in applying supervised learning to this real-world problem is its requirement for clean, isolated target signals during training. In authentic acoustic environments, such “clean” sources are practically unobtainable; every microphone captures a mixture. Consequently, training against realistic but noisy target signals undermines effective model development, yielding distorted separation results and unreliable evaluations.
This is where an unsupervised approach becomes not just a preference, but a necessity. By learning separation criteria directly from the mixtures themselves, unsupervised models circumvent the need for unavailable clean reference signals.
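One widely used way to learn such criteria directly from mixtures is mixture invariant training (MixIT), in which a separator is trained on the sum of two recorded mixtures and must re-assign its estimated sources back to the individual mixtures. The PyTorch sketch below illustrates the MixIT loss; the tensor shapes, function names, and the negative-SNR objective are illustrative assumptions, not a prescribed implementation for this project.

```python
import itertools
import torch

def neg_snr(ref, est, eps=1e-8):
    """Negative SNR (dB) between reference and estimate, shapes (N, T)."""
    noise = ref - est
    snr = 10 * torch.log10((ref.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -snr  # (N,)

def mixit_loss(mix1, mix2, est_sources):
    """Mixture invariant training (MixIT) loss.

    mix1, mix2:   (B, T) two separately recorded real mixtures
    est_sources:  (B, M, T) separator outputs for the mixture of mixtures mix1 + mix2
    Each estimated source is assigned to exactly one of the two input mixtures;
    the assignment with the lowest loss is selected per training example.
    """
    B, M, T = est_sources.shape
    refs = torch.stack([mix1, mix2], dim=1)  # (B, 2, T)
    best = None
    # Enumerate all 2**M ways of assigning M estimates to the 2 mixtures.
    for assign in itertools.product([0, 1], repeat=M):
        a = torch.zeros(2, M, device=est_sources.device)
        a[list(assign), list(range(M))] = 1.0                  # one-hot columns
        remix = torch.einsum('km,bmt->bkt', a, est_sources)    # (B, 2, T)
        loss = neg_snr(refs.reshape(B * 2, T), remix.reshape(B * 2, T))
        loss = loss.view(B, 2).mean(dim=1)                     # (B,)
        best = loss if best is None else torch.minimum(best, loss)
    return best.mean()
```

In practice the number of estimated sources M is kept small (e.g. 4 or 8), so exhaustively enumerating the 2**M assignments remains tractable.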
This Master’s project builds directly upon these findings. The primary goal is to investigate, implement, and evaluate state-of-the-art unsupervised SCSS architectures, specifically tailored for the complexities of spontaneous conversational speech. The aim is to develop robust models that effectively bridge the performance gap between artificial benchmarks and real-world application.
[1] E. Berger, B. Schuppler, M. Hagmueller, and F. Pernkopf, “Single channel source separation in the wild – conversational speech in realistic environments,” in Speech Communication; 15th ITG Conference, 2023, pp. 96–100. DOI: 10.30420/456164018.