Signal Processing and Speech Communication Laboratory

(When) Does it Harm to Be Incomplete? Comparing Human and Automatic Speech Recognition of Syntactically Disfluent Utterances

Status
Finished
Student
Saskia Wepner
Mentors
Research Areas

This thesis presents a corpus-based, comparative analysis of error patterns in human and automatic speech recognition (ASR), based on utterances taken from spontaneous, unscripted face-to-face conversations. The utterances reflect patterns that are characteristic of this speaking style: they are disfluent through a pause, a filler particle (FP), a break in the syntax, or a combination of these. Utterances that originally contained FPs were generally easier to recognise for both humans and ASR than disfluent utterances without FPs, regardless of whether the FP was cut out of or left in the presented stimuli. In the easier utterances, the best ASR system still had an average word error rate (WER) about 4.45% higher than that of the average human listener, who, with an average WER of 8.82%, was far from perfect either. In the generally more difficult utterances, the best system's WER was about 6.73% higher than the average human's (17.98%).

A detailed analysis at the utterance level revealed that off-the-shelf transformer-based ASR systems appear robust against disfluencies consisting of FPs, while they are still affected by disfluencies of a syntactic nature. Whereas humans compensate for syntactic disfluencies when these are accompanied by pauses, ASR systems show a tendency to benefit from omitting prior disfluent context. Utterances with syntactic disfluencies were misrecognised more often by both humans and ASR, but for ASR the WERs were generally higher. Since the WER averages over all errors in an utterance and thus blurs the variety of errors within it, a word-based analysis gave insight into transcription errors beyond WERs, revealing the main characteristics that correctly and incorrectly recognised words had in common. The thesis further presents a novel visualisation technique suitable for a combined qualitative and quantitative analysis of utterances: it encodes quantitative word-level features while maintaining the qualitative contextual information of a word within its utterance. The qualitative analysis suggests that, even in powerful architectures such as transformers, a certain floor of errors appears to be persistent and systematic.

In summary, this thesis emphasises the need to focus more on training data that is rich in disfluencies beyond filler particles. It highlights one of the areas where we can still learn from human performance and linguistic knowledge and encourages further research on speech from unscripted, spontaneous conversations.
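
The comparisons above are stated in terms of the word error rate. As a generic illustration only (not the evaluation code used in the thesis, and with illustrative function and variable names), the sketch below computes WER as the word-level Levenshtein distance between a reference and a hypothesis transcript, normalised by the reference length.

from typing import List

def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Generic WER: (substitutions + deletions + insertions) / reference length."""
    # Dynamic-programming edit distance over words (Levenshtein).
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # i deletions
    for j in range(m + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[n][m] / max(n, 1)

# Hypothetical example: a disfluent reference containing a filler particle ("uh")
# and a restart, recognised without the disfluent material.
ref = "i uh i mean we could go there".split()
hyp = "i mean we could go there".split()
print(f"WER: {word_error_rate(ref, hyp):.2%}")  # two deletions out of eight words -> 25.00%

Because this measure averages all edits over the utterance, two utterances with the same WER can contain very different kinds of errors, which is why the thesis complements it with a word-based analysis.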