Predicting human and ASR classification of plosives by their sub-phonemic properties
- Fri, Feb 01, 2013
In conversational speech, words are often realized in a reduced way compared to their citation forms. One frequent process in Germanic languages is the deletion of word-final /t/. The German word _und_, for instance, is often pronounced as _un_. In a series of studies, we investigated the role of reduced plosives in human perception compared to their role in automatic speech processing.
In a corpus of Dutch spontaneous conversations, we found that 25% of all word-final /t/ tokens are completely absent acoustically and that 11.5% of the tokens are produced canonically. This means that most of the tokens (63.5%) are realized as something in between: not completely absent, but also not fully present. We defined a set of sub-phonemic features for analyzing these realizations of /t/, some of which are shown in the figures above (cl = closure, fr = alveolar friction, mb = multiple burst). Even though these examples of /t/ have very different acoustic characteristics, both were classified as perceptually present by a human listener, but not by an ASR-based classification system. Our mixed-effects logistic regression models showed that, in general, humans and the ASR system use the same cues for classification (presence of a constriction, one or multiple bursts, and alveolar friction), but that the ASR system is less sensitive to fine cues (weak bursts, smoothly starting friction) than human listeners and is misled by the presence of glottal vibration. Our data inform the further development of models of human and automatic speech processing.
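As a rough illustration of the kind of mixed-effects logistic regression mentioned above, the general form of such a model can be sketched as follows. The exact specification is not given here, so everything in this sketch is an assumption: the binary coding of the predictors, the choice of predictors (taken loosely from the cues named above), and the single random intercept $u_j$ for a grouping factor $j$ such as word or speaker.

```latex
% Hypothetical sketch, not the fitted model from the studies.
% y_{ij} = 1 if token i (in group j) is classified as containing /t/, else 0.
% Predictor names and the random-effects structure are assumptions.
\[
\operatorname{logit} P(y_{ij} = 1)
  = \beta_0
  + \beta_1\,\mathrm{constriction}_{ij}
  + \beta_2\,\mathrm{burst}_{ij}
  + \beta_3\,\mathrm{friction}_{ij}
  + \beta_4\,\mathrm{voicing}_{ij}
  + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
\]
```

Under this reading, fitting one such model to the human classifications and one to the ASR classifications allows the estimated coefficients to be compared: a cue that both classifiers rely on would show an effect in the same direction in both models, while an effect of glottal vibration (here $\mathrm{voicing}$) only in the ASR model would correspond to the ASR system being misled by it.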
For more information see “How linguistic and probabilistic properties of a word affect the realization of its final /t/: Studies at the phonemic and sub-phonemic level” and “Predicting human perception and ASR classification of word-final [t] by its acoustic sub-segmental properties.”