Developing an Annotation System for Communicative Functions for a Cross-Layer ASR System

Fri, Oct 01, 2021

The investigation of conversational speech requires close collaboration between linguists and speech technologists to develop new modeling techniques that can incorporate various knowledge sources. This paper presents a progress report on the ongoing interdisciplinary project “Cross-layer language models for conversational speech”, with a focus on the development of an annotation system for communicative functions. We discuss the requirements of such a system for application in automatic speech recognition (ASR) as well as for use in phonetic studies of talk-in-interaction, and illustrate emerging issues with the example of turn management.

Our annotation system at the level of communicative functions has two independent tiers: the IPU tier (“Inter Pausal Units”) and the PCOMP tier (“Points of potential syntactic COMPletion”). The figure shows an example of how PCOMP and IPU annotations are mapped onto each other. In this example, Speaker 2 holds his turn by pausing at a point of “maximum grammatical control” (labelled as “Incomplete-Hold” on tier b) after the introduction of a new sentence by , and completes his turn after the pause. There are two PCOMPs leading up to the pause (labelled as “Hold” on tier a), neither of which gives the impression of being complete based on prosody (i.e., slightly rising pitch in and ‘rush-through’ in ). Even though a pause is produced after , the next PCOMP is reached only after . Thus, the whole sentence starting with is grouped into one PCOMP chunk, regardless of any pauses. Speaker 1 times his backchannel (labelled as “Hearer Response Token”) with the pause rather than with the PCOMP just before . It is predominantly short hearer response tokens that are aligned with pauses at syntactically incomplete positions, while participants almost never self-select to produce a longer turn in these positions.

Currently, 90 minutes in 15 conversations have been annotated at the IPU level, and the last revision of these labels is in progress. At the PCOMP level, 60 minutes in 12 conversations are being annotated. These annotations are useful for the goals described above, i.e., for application in ASR and for phonetic studies, as well as for the investigation of various hypotheses about the time alignment of hearer response tokens and self-selection.
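The mapping between the two tiers, and the alignment of a hearer response token with a pause, can be illustrated with a minimal sketch. It is not the project's own tooling: the `Interval` class, the helper functions, all timings, and the `placeholder-label` entries are assumptions made for illustration. Only the labels “Hold”, “Incomplete-Hold”, and “Hearer Response Token” come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One labelled stretch of time on an annotation tier (times in seconds)."""
    start: float
    end: float
    label: str


def overlapping(target, tier):
    """Return the intervals on `tier` that overlap `target` in time."""
    return [iv for iv in tier if iv.start < target.end and iv.end > target.start]


def pauses(ipu_tier):
    """Return (start, end) gaps between consecutive IPUs of one speaker."""
    ordered = sorted(ipu_tier, key=lambda iv: iv.start)
    return [(a.end, b.start) for a, b in zip(ordered, ordered[1:]) if b.start > a.end]


def within_pause(token, gaps):
    """True if `token` lies entirely inside one of the pauses."""
    return any(start <= token.start and token.end <= end for start, end in gaps)


# Hypothetical timings for Speaker 2; only the quoted labels come from the text.
ipu_tier = [
    Interval(0.0, 2.0, "Incomplete-Hold"),
    Interval(2.6, 4.0, "placeholder-label"),
]
pcomp_tier = [
    Interval(0.0, 0.8, "Hold"),
    Interval(0.8, 1.5, "Hold"),
    Interval(1.5, 4.0, "placeholder-label"),  # one chunk spanning the pause
]
# Hypothetical backchannel produced by Speaker 1.
response = Interval(2.1, 2.4, "Hearer Response Token")

gaps = pauses(ipu_tier)
for chunk in pcomp_tier:
    ipus = overlapping(chunk, ipu_tier)
    print(chunk.label, [iv.label for iv in ipus])
print("token in pause:", within_pause(response, gaps))
```

In this toy data the third PCOMP chunk overlaps both IPUs, which mirrors the point that a PCOMP chunk is delimited by syntactic completion and not by pauses. Keeping the tiers as separate interval lists reflects their independence; the mapping is computed from the time stamps alone.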