Signal Processing and Speech Communication Laboratory

Cross-layer prosody models for conversational speech (FWF Elise Richter Grant V638-N33)

Period
2018 — 2021
Funding
FWF Elise Richter
Partners
  • Dr. Philip N. Garner (IDIAP, Switzerland)
  • Prof. Dina El Zarka (Department of Linguistics, University of Graz)
  • Prof. Dr. Margaret Zellers (Kiel University)

With currently available Automatic Speech Recognition (ASR) systems, very good recognition performance can be obtained for read speech (word accuracies of 90–100%), but not for conversational speech (60–80%). Highly accurate ASR systems for conversational speech are especially relevant for conversational dialogue systems, as these are expected to become more conversational, interactional and social rather than transactional. Thus, in recent decades, an increasing number of studies have investigated the differences between these speaking styles in order to find ways to improve ASR performance for conversational speech. One difference between read and conversational speech is that the degree of pronunciation variation is much higher in conversational speech. In spontaneous speech, a word like “yesterday” may sound like “yeshay”, and the German word “haben” (“to have”) may sound like “ham”. The pronunciation of a word depends on well-known factors such as the regional background of the speaker and the formality of the situation. Highly influential but less well-studied factors are those reflecting the prosodic characteristics of the word in the utterance. In order to untangle these potentially correlated effects of linguistic, extra-linguistic and prosodic structure, elaborate modeling techniques are needed.

Linguistic studies have indicated that the perceptual system accesses meaning from speech by using the most salient sensory information from any combination of levels (or layers) of formal linguistic analysis. This model is reminiscent of the cross-layer optimization principle in wireless communications, which was introduced as an alternative to the strictly layered Open Systems Interconnection (OSI) model, in which each layer provides services only to the layer above it and receives services only from the layer below. The term cross-layer refers both to this view of how humans access meaning and to the system architecture of the envisioned ASR system.

The main research question of this Elise Richter project is how prosodic factors interact with other linguistic and extra-linguistic factors with respect to phonetic detail and pronunciation variation in different German speaking styles (from read speech to conversational speech). The aim is to create models that advance linguistic knowledge of the mechanisms underlying variation in natural conversation and that improve prosody-dependent ASR systems for conversational speech applications. The investigations will be based on speech material from both German and Austrian speakers. Finally, the project will deliver the first prosodically annotated database for conversational Austrian German as well as tools for the automatic production of prosodic annotations.

Related publications
  • Conference paper Schuppler B., Berger E., Kogler X. & Pernkopf F. (2022) Homophone Disambiguation Profits from Durational Information. in 23rd Annual Conference of the International Speech Communication Association (pp. 3198-3202).
  • Journal article Ludusan B. & Schuppler B. (2022) An analysis of prosodic boundaries across speaking styles in two varieties of German. in Speech Communication, 141, p. 93-106.
  • Conference paper Zarka D. & Schuppler B. (2022) A configurational approach to the prosody of topic and focus in Egyptian Arabic. Testing the importance of accent-based and utterance-based acoustic cues. in 1st International Conference on Tone and Intonation (pp. 21-25).
  • Conference paper Linke J., Garner P., Kubin G. & Schuppler B. (2022) Conversational Speech Recognition Needs Data? Experiments with Austrian German. (pp. 4684-4691).
  • Conference paper Wepner S., Schuppler B. & Kubin G. (2022) How prosody affects ASR performance in conversational Austrian German. in Speech Prosody 2022 (pp. 195-199).
  • Journal article Žgank A. & Schuppler B. (2020) Towards Building a Cross-Lingual Speech Recognition System for Slovenian and Austrian German. in The Phonetician, 117(Spec. Iss.), p. 19-33.
  • Conference paper Zarka D., Kelterer A. & Schuppler B. (2020) An analysis of prosodic prominence cues to information structure in Egyptian Arabic. in 21st Annual Conference of the International Speech Communication Association (pp. 1883-1887).
  • Conference paper Zellers M. & Schuppler B. (2020) Microprosodic variability in plosives in German and Austrian German. in 21st Annual Conference of the International Speech Communication Association (pp. 656-660).
  • Conference paper Schuppler B. & Ludusan B. (2020) An analysis of prosodic boundary detection in German and Austrian German read speech. (pp. 990-994).
  • Conference paper Linke J., Kelterer A., Dabrowski M., Zarka D. & Schuppler B. (2020) Towards automatic annotation of prosodic prominence levels in Austrian German. in 10th International Conference on Speech Prosody (pp. 1000-1004).
  • Abstract Ludusan B. & Schuppler B. (2019) Automatic detection of prosodic boundaries in two varieties of German.
  • Conference paper Zarka D., Schuppler B. & Cangemi F. (2019) Acoustic Cues to Topic and Narrow Focus in Egyptian Arabic. in 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language (pp. 1771-1775).
  • Conference paper Schuppler B. & Zellers M. (2019) Prosodic Effects on Plosive Duration in German and Austrian German. in 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language (pp. 1736-1740).