Cross-layer pronunciation modeling for conversational speech (FWF Hertha Firnberg Program T572)

The Problem   Automatic speech recognition (ASR) systems were originally designed to cope with carefully pronounced speech. Most real-world applications of ASR, however, require the recognition of spontaneous, conversational speech (e.g., dialogue systems, voice input aids for people with physical disabilities, and medical dictation systems). Compared to prepared or read speech, conversational speech contains utterances that might be considered 'ungrammatical' and that contain disfluencies, such as “...oh, well, I think ahhm exactly …”. The pronunciation of words may depend, for instance, on the regional background of the speakers, the formality of the situation, or the frequency of the word. A highly frequent word like “yesterday” may sound like “yeshay”, and the German word “haben” (“to have”) may sound like “ham”. This project investigated interdisciplinary methods (drawing on linguistics, phonetics, and speech technology) to model the factors on which pronunciation variation depends in everyday speech.


The Methods   In this project, we collected and annotated the first large-scale speech database of Austrian German (GRASS). It is a rich resource on pronunciation variation in Austrian German, containing approximately 1900 minutes of speech from 38 speakers from 5 provinces in 3 different speaking styles (read speech, spontaneous commands, and conversational speech). Moreover, it is one of the largest German speech databases with completely unconstrained, casual conversations, and is thus also relevant to speech scientists outside Austria. We have also developed transcription tools for the corpus and have made both the speech material and the tools available to other researchers.


The Findings   Based on Dutch, German, and the collected Austrian German speech material, we found that pronunciation variation depends not only on well-known factors such as the regional background of the speaker and the speaking style, but also on, for example, the grammatical and morphological properties of the words. For instance, in spontaneous speech the German word “der” is pronounced differently depending on whether it is an article, a demonstrative pronoun, or a relative pronoun, whereas in read speech it is always pronounced the same way. These linguistic findings on pronunciation variation were used to develop methods to improve ASR systems. Most importantly, our work not only demonstrates novel methods for ASR but also introduces a new perspective: whereas the high degree of pronunciation variation in spontaneous speech was previously seen primarily as a problem for ASR, we view it as an additional resource that is not present in read speech. This change in perspective will guide our future research plans.


Schuppler Barbara
Institut für Signalverarbeitung und Sprachkommunikation
Funding Program: 
Fonds zur Förderung der wissenschaftlichen Forschung, FWF (Austria)
Research Area: 
2012 - 2017