How prosody affects ASR performance in conversational Austrian German
- Fri, Jul 01, 2022
The performance of Automatic Speech Recognition (ASR) systems varies with the speaking style of the data to be recognised. Whereas read speech, voice commands and broadcast news are nowadays recognised well by standard ASR systems, conversational speech remains challenging for multiple reasons.
We compared the recognition performance of two Language Models (LMs): 1) an ordinary 4-gram “LMnormal” and 2) an oracle 4-gram “LMoracle” that was trained on all the utterances of a corpus of conversational speech (GRASS), including the data of the evaluation set. We analysed specific (mis-)recognised word tokens for their prosodic characteristics. In general, high-frequency words are easy to recognise, since they are well known to both the acoustic models and the language model in various contexts. For both LMs, however, we found that short, high-frequency words are misrecognised more often than longer words that occur less frequently in our data. Short, high-frequency tokens are often function words that are deaccented in fluent speech. As a result, they are so strongly reduced that they are barely segmentable from their surrounding words, and some even become homophonous in conversational speech.
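The kind of token-level comparison described above can be illustrated with a minimal sketch. It is not the analysis code used in the study: the function name `misrecognition_rates`, both thresholds, the use of character count as a crude stand-in for word length, and the toy tokens are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def misrecognition_rates(tokens, max_short_len=3, min_high_freq=3):
    """Group word tokens by length and frequency and return error rates.

    tokens: list of (word, was_recognised) pairs, e.g. taken from an
    alignment of reference transcripts with ASR output.
    Both thresholds are illustrative assumptions, not values from the study.
    """
    # Frequency of each word type within the analysed data.
    freq = Counter(word for word, _ in tokens)

    totals = defaultdict(int)
    errors = defaultdict(int)
    for word, was_recognised in tokens:
        # Character count is only a rough proxy for word length.
        length = "short" if len(word) <= max_short_len else "long"
        frequency = "high-freq" if freq[word] >= min_high_freq else "low-freq"
        group = f"{length}/{frequency}"
        totals[group] += 1
        if not was_recognised:
            errors[group] += 1

    # Share of misrecognised tokens per group.
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical toy data, invented for illustration only (not GRASS results).
toy_tokens = [
    ("und", False), ("und", True), ("und", False), ("und", True),
    ("Wissenschaft", True), ("Gespräch", True), ("Gespräch", False),
    ("ja", True),
]
rates = misrecognition_rates(toy_tokens)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

Running the same grouping once on the output obtained with each language model would allow the per-group rates to be compared side by side. A real analysis would more plausibly measure length in phones or duration rather than characters.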
This work was presented at Speech Prosody 2022 and is published in the [conference proceedings](https://www.isca-speech.org/archive/pdfs/speechprosody_2022/wepner22_speechprosody.pdf).