Comparing automatic and human speech recognition of disfluent structures in spontaneous conversations
- Status: In work
- Student: Saskia Wepner
- Mentors:
- Research Areas:
When speaking spontaneously, we often reduce articulatory precision, put less effort into producing flawless sentences, or utter disfluent structures. As humans, we are usually still able to decode (understand) such imperfect utterances. One reason is that we have been learning to process spoken language throughout our lives, which equips us with powerful speech processing models. An automatic speech recognition (ASR) system, in contrast, is far more limited to the (finite amount of) data it was trained on. Another reason is that humans can fall back on context and the history of a conversation, which helps them evaluate the plausibility of (sequences of) words in a given environment and thus resolve likely ambiguities.

ASR systems achieve accuracies close to human performance for read speech. When it comes to recognising spontaneous speech, however, they perform much worse. The most recent notable advancement in ASR has been the introduction of transformer-based systems, which deliver impressive results on established conversational speech databases. These systems were trained on huge amounts of data and make use of a wider context; one might therefore expect them to solve the remaining challenges in conversational speech recognition. On spontaneous, unscripted conversations, however, they still yield surprisingly high word error rates.

This thesis analyses the similarities and differences between transcription errors made by ASR systems and those made by human transcribers. It aims to identify where we can still learn from human performance and linguistic knowledge, and it encourages further research on speech from unscripted, spontaneous conversations.
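For context, the word error rates mentioned above are typically computed as the word-level Levenshtein (edit) distance between a hypothesis transcript and a reference transcript, normalised by the reference length. The following minimal Python sketch illustrates this metric; it is not taken from the thesis, and the example utterance (with a disfluent repetition) is invented for illustration.

```python
# Illustrative sketch: word error rate (WER) between a reference transcript
# and a hypothesis transcript, via dynamic-programming edit distance on words.
# Not part of the thesis; shown only to clarify the metric being discussed.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum number of edits (substitutions, deletions, insertions)
    # to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented example: the hypothesis drops the disfluent repetition and the filler
print(word_error_rate("i i mean we could uh go there",
                      "i mean we could go there"))  # 2 edits / 8 words = 0.25
```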