Integration of prosodic features to ASR systems

Status

Finished

Type

Master Thesis

Announcement date

18 Oct 2023

Student

Pablo Melendez Abarca

Mentors

Research Areas

Speech Communication

Abstract

Current ASR systems can reach human-like word error rates (WER) for certain types of speech (read and command-like), but that is not the case for conversational speech. Even with recent approaches that make use of pretrained models and fine-tuning (such as Whisper and wav2vec), there is still room for improvement. In this work, the impact that prosodic features have on the performance of an ASR system for Austrian German conversational speech is analysed, with the aim of answering whether they improve ASR for conversational speech. For this purpose, a Kaldi baseline ASR system is first untangled and then adapted to include prosodic features obtained with Python packages, while conversations from the Graz Corpus of Read And Spontaneous Speech are used as test data. It is found that speed perturbation improves ASR performance by around 5.5 to 5.9% and that F0 and chroma-based features improve performance only minimally, by around 0.5% and 0.4% respectively. The results show that the durational aspect of speech is especially important when dealing with conversational speech, and that other prosodic features might need other ways of modeling to get the most out of them.
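
For readers unfamiliar with the metric used in the abstract, the following is a minimal sketch of how a word error rate can be computed for a single utterance. It is not the scoring code used in the thesis (Kaldi ships its own scoring tools); the function name `word_error_rate` and the example sentences are assumptions made purely for illustration. WER is taken here as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name), not taken from the thesis.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # aligning i reference words with nothing costs i deletions
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr

    return prev[-1] / len(ref_words)


# Made-up example: one of four reference words is missing in the hypothesis.
example_wer = word_error_rate("das ist ein test", "das ist test")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one reason corpus-level scores are usually computed by summing errors over all utterances before dividing.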

Contact:

Barbara Schuppler (b.schuppler@tugraz.at)