Cross-layer prosody models for conversational speech (FWF Elise Richter Grant V638-N33)
- 2018–2021
- FWF Elise Richter
- Dr. Philip N. Garner
- Prof. Dina El Zarka
- Prof. Dr. Margaret Zellers
Currently available Automatic Speech Recognition (ASR) systems achieve very good recognition performance for read speech (word accuracies of 90–100%), but not for conversational speech (60–80%). Highly accurate ASR for conversational speech is especially relevant for dialogue systems, which are expected to become more conversational, interactional and social rather than purely transactional. Thus, in recent decades, an increasing number of studies have investigated the differences between these speaking styles in order to find ways to improve ASR performance for conversational speech.
One such difference is that the degree of pronunciation variation is much higher in conversational speech than in read speech. In spontaneous speech, a word like “yesterday” may sound like “yeshay”, and the German word “haben” (“to have”) may sound like “ham”. The pronunciation of a word depends on well-known factors such as the regional background of the speakers and the formality of the situation. Highly influential but less well-studied factors are those reflecting the prosodic characteristics of the word within the utterance. Untangling these potentially correlated effects of linguistic, extra-linguistic and prosodic structure requires elaborate modeling techniques.
Linguistic studies have indicated that the perceptual system accesses meaning from speech by using the most salient sensory information from any combination of levels (or layers) of formal linguistic analysis. This model is reminiscent of the cross-layer optimization principle in wireless communications, which was introduced as an alternative to the Open Systems Interconnection (OSI) model, in which each layer provides services only to the layer above it and receives services only from the layer below it. In this project, the term “cross-layer” refers both to this view of how humans access meaning and to the system architecture of the envisioned ASR system.
The main research question of this Elise Richter project is how prosodic factors interact with other linguistic and extra-linguistic factors in shaping phonetic detail and pronunciation variation across different speaking styles of German (from read speech to conversational speech). The aim is to create models that advance linguistic knowledge of the mechanisms underlying variation in natural conversation and that improve prosody-dependent ASR systems for conversational speech applications. The investigations will be based on speech material from both German and Austrian speakers. Finally, the project will deliver the first prosodically annotated database of conversational Austrian German, as well as tools for the automatic production of prosodic annotations.