Signal Processing and Speech Communication Laboratory

Cross-layer pronunciation modeling for conversational speech (FWF Hertha Firnberg Program T572)

Period
2012–2017
Funding
Fonds zur Förderung der wissenschaftlichen Forschung, FWF (Österreich)
Partners
  • Martine Adda-Decker
  • Mirjam Ernestus

The Problem
Automatic speech recognition (ASR) systems were originally designed to cope with carefully pronounced speech. Most real-world applications of ASR, however, require the recognition of spontaneous, conversational speech (e.g., dialogue systems, voice input aids for people with physical disabilities, and medical dictation systems). Compared to prepared or read speech, conversational speech contains utterances that might be considered ‘ungrammatical’, as well as disfluencies such as “…oh, well, I think ahhm exactly …”. The pronunciation of a word may depend, for instance, on the regional background of the speaker, the formality of the situation, or the frequency of the word. A highly frequent word like “yesterday” may sound like yeshay, and the German word “haben” (“to have”) may sound like ham. This project investigated interdisciplinary methods (drawing on linguistics, phonetics, and speech technology) to model the factors on which pronunciation variation depends in everyday speech.

The Methods
In this project, we collected and annotated GRASS (the Graz Corpus of Read and Spontaneous Speech), the first large-scale speech database of Austrian German. It is a rich resource on pronunciation variation in Austrian German, containing approximately 1900 minutes of speech from 38 speakers from 5 provinces in 3 speaking styles (read speech, spontaneous commands, and conversational speech). It is also one of the largest German speech databases of completely unconstrained, casual conversations, and is thus relevant to speech scientists outside Austria as well. We also developed transcription tools for the corpus and have made both the speech material and the tools available to other researchers.

The Findings
Based on Dutch, German, and the newly collected Austrian German speech material, we found that pronunciation variation depends not only on well-known factors such as the regional background of the speaker and the speaking style, but also on, for example, the grammatical and morphological properties of the words. For instance, in spontaneous speech the German word der is pronounced differently depending on whether it is an article, a demonstrative pronoun, or a relative pronoun, whereas in read speech it is always pronounced the same way. We used these linguistic findings on pronunciation variation to develop methods for improving ASR systems. Most importantly, our work not only demonstrates novel methods for ASR but also introduces a new perspective: whereas the high degree of pronunciation variation in spontaneous speech was previously seen primarily as a problem for ASR, we view it as an additional resource that is not present in read speech. This change in perspective will guide our future research plans.

Related publications
  • Article Schuppler B. & Schrank T. (2018) On the use of acoustic features for automatic disambiguation of homophones in spontaneous German. In Computer Speech and Language, 52, p. 209-224.
  • Article Schuppler B. (2017) Rethinking classification results based on read speech, or. In International Journal of Speech Technology, 20(3), p. 699-713.
  • Article Schuppler B., Hagmüller M. & Zahrer A. (2017) A corpus of read and conversational Austrian German. In Speech Communication, 94, p. 62-74.
  • Poster Schuppler B. & Schrank T. (2016) Automatic disambiguation of homophones in spontaneous speech.
  • Conference contribution Schuppler B., Hagmüller M., Cordovilla J. & Pessentheiner H. (2014) GRASS: The Graz Corpus of Read and Spontaneous Speech. In 9th edition of the Language Resources and Evaluation Conference (pp. 1465-1470).
  • Conference contribution Zarka D. & Schuppler B. (2014) Spectral balance and spectral emphasis in accented, stressed and unstressed syllables in standard Austrian German read speech. In Online Proceedings of Leiden Conference on Word Stress and Accent.
  • Conference contribution Schuppler B., Adda-Decker M. & Cordovilla J. (2014) Pronunciation variation in read and conversational Austrian German. In Proceedings of Interspeech 2014 (pp. 1453-1457).
  • Conference contribution Schuppler B., Grill S., Menrath A. & Cordovilla J. (2014) Automatic phonetic transcription in two steps: forced alignment and burst detection. In Proceedings of the International Conference on Statistical Language and Speech Processing (SLSP) (pp. 132-143).
  • Conference contribution Jackschina A., Schuppler B. & Muhr R. (2014) Where /aR/ the /R/s in Standard Austrian German? In Proceedings of Interspeech 2014 (pp. 1698-1702).
  • Lecture or Presentation Adda-Decker M., Schuppler B., Lamel L., Morales-Cordovilla J. & Adda G. (2013) What we can learn from ASR errors about low-resourced languages: A case-study of Luxembourgish and Austrian.
  • Article Hanique I., Ernestus M. & Schuppler B. (2013) Informal speech processes can be categorical in nature, even if they affect many different words. In The Journal of the Acoustical Society of America, 133(3), p. 1644-1655.