What's so complex about conversational speech? Prosodic prominence and speech recognition challenges
- Status: Finished
- Date: 2025-03-06
- Student: Julian Linke
- Mentors
- DOI: 10.3217/7yxe5-jpg41
- Research Areas
This thesis analyzes and evaluates acoustic representations and models for conversational speech in two tasks: prosodic prominence classification and automatic speech recognition (ASR). Conversational speech poses unique challenges compared to read or prepared speech due to characteristics such as lively turn-taking, incomplete utterances, disfluencies, and a high degree of pronunciation variation. Because of these characteristics, both prosodic annotation tools and ASR systems trained on typical benchmark datasets perform significantly worse on conversational speech. The thesis therefore pursues two aims: (1) to analyze acoustic representations for conversational speech using explainable machine learning (ML) methods, and (2) to improve the performance of prosodic prominence classification and ASR systems, as measured with standard performance measures. Our experiments on prosodic prominence classification revealed that durational features were the main acoustic cues for perceived prominence. We introduce novel entropy-based prosodic features, which were shown to encode the necessary durational information along with information on pitch and loudness, leading to detection performance in line with inter-annotator agreement for the different prominence levels. We then used these entropy-based prosodic representations to examine their impact on utterance-level word error rates (WERs) of HMM- and transformer-based ASR systems. Our results reveal significant effects of durational and prosodic features on WER, and show how these features interact with pronunciation variation and utterance-level complexity measures. Finally, we developed prominence detectors and prominence-aware ASR systems and explored how prosodic information is encoded through fine-tuning of self-supervised speech representations, indicating the feasibility of integrating prosodic information into ASR.
Because our experiments were based on conversational Austrian German, we had to handle high variation from two sources: the data represent a (low-resourced) regional variety of a (well-resourced) language, and the casual speaking style leads to high variation between speakers and between different speaker pairs. Using clustering methods on shared discrete speech representations, we demonstrated their effectiveness in differentiating language and variety aspects and in capturing speaker differences across styles. The distances between quantized latent speech representations were shown to meaningfully capture fine-grained differences between speakers when producing different speaking styles. Overall, this thesis provides insights into the complexities of conversational speech and demonstrates how the analysis and evaluation of acoustic representations and models deepen our understanding of conversational speech. The findings have implications for various applications such as human-machine interaction, conversation transcription, and hearing aid technology.
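For readers unfamiliar with the utterance-level word error rate used in the abstract, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the thesis may use a different toolkit, text normalization, or scoring setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenization by whitespace is an assumption,
    not the scoring procedure of the thesis.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("das ist ein test", "das ist test")` returns 0.25, since one of the four reference words is deleted. Because insertions count as errors, the value can exceed 1.0 for short utterances, which is one reason per-utterance WERs in conversational speech vary so widely.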
Links: PhD thesis.
