Signal Processing and Speech Communication Laboratory

Modelling Backchannels for Human-Robot Interaction

Status
finished
Type
Master Thesis
Announcement date
18 Oct 2023
Student
Mentors
Research Areas

Abstract

This thesis aims to deepen our understanding of backchannels and their role in turn-taking within both human-human conversations and human-robot interactions. It provides an overview of the contexts in which backchannels occur and focuses on identifying the prosodic features that influence backchannel behavior and thus contribute to the grounding process and naturalness in spontaneous speech. To achieve these aims, the thesis consists of two main parts.

The first part involves a quantitative analysis using the Graz Corpus of Read and Spontaneous Speech (GRASS), where backchannels were manually annotated. We extracted contextual, durational, and acoustic features of both the backchannels themselves and the speech of the interlocutor’s previous turn, based on Points of Potential Completion (PCOMP). The analysis highlighted the relationship between the durational features of backchannels and the communicative functions of the interlocutor’s speech: certain communicative functions affect the timing and occurrence of backchannels more than others. Additionally, the analysis indicated that prosodic cues in the interlocutor’s speech, such as articulation rate, influence the timing and occurrence of backchannels. These findings support existing research and intuitive expectations regarding the occurrence and timing of backchannels.
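As an illustration of one of the prosodic cues named above, the following minimal sketch computes articulation rate in its usual sense: syllables per second of speaking time, with pauses excluded. The `Segment` structure, its field names, and the example values are hypothetical and are not taken from the thesis or from the GRASS annotation format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of a turn (hypothetical structure, for illustration only)."""
    start: float            # start time in seconds
    end: float              # end time in seconds
    syllables: int = 0      # syllable count; 0 for a silent pause
    is_pause: bool = False  # True if the segment is a pause


def articulation_rate(segments):
    """Return syllables per second of speaking time, ignoring pauses."""
    speech = [s for s in segments if not s.is_pause]
    speaking_time = sum(s.end - s.start for s in speech)
    if speaking_time <= 0:
        raise ValueError("turn contains no speech")
    return sum(s.syllables for s in speech) / speaking_time


# Made-up turn: 1.0 s of speech, a 0.5 s pause, then 1.0 s of speech.
turn = [
    Segment(0.0, 1.0, syllables=5),
    Segment(1.0, 1.5, is_pause=True),
    Segment(1.5, 2.5, syllables=4),
]
rate = articulation_rate(turn)  # 9 syllables over 2.0 s of speech
```

Excluding pauses is what separates articulation rate from speaking rate, which divides by the full duration of the turn.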

The second part involves an experiment aimed at predicting the exact timing of a backchannel based solely on the speech of the interlocutor’s previous turn. For this task, we employed two random forest regression models and one gradient boosting regression model. The model with the lowest Root Mean Square Error (RMSE) and the fastest prediction time was selected for further analysis. Prediction time was a key criterion, as the model is intended for use in a real-time scenario. To explore the importance and impact of the features, we used SHAP (SHapley Additive exPlanations) values. The analysis focused on investigating the role of acoustic features, especially those proposed in the literature, such as articulation rate, and on identifying additional features that are practical to compute in real time. The results of the prediction experiment show that backchannel timing can be predicted with sufficient precision, with an absolute error of about 110 ms. Furthermore, they highlight the importance of features like articulation rate, fundamental frequency, and intensity.
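The selection criterion described above can be sketched as follows. This is an illustrative sketch only: the two predictors are trivial stand-ins rather than the random forest and gradient boosting models of the thesis, the data are made up, and ranking by RMSE first with prediction time as a tie-breaker is an assumption, since the abstract does not say how the two criteria were combined.

```python
import math
import time


def rmse(y_true, y_pred):
    """Root Mean Square Error between observed and predicted timings."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def evaluate(predict, features, targets):
    """Return RMSE and mean wall-clock prediction time per sample."""
    start = time.perf_counter()
    predictions = [predict(x) for x in features]
    elapsed = time.perf_counter() - start
    return {
        "rmse": rmse(targets, predictions),
        "seconds_per_sample": elapsed / len(features),
    }


def select_model(results):
    """Pick the model with the lowest RMSE; ties go to the faster model (assumed rule)."""
    return min(
        results,
        key=lambda name: (results[name]["rmse"], results[name]["seconds_per_sample"]),
    )


# Made-up data: one feature per turn, target is backchannel timing in seconds.
features = [[0.1], [0.2], [0.3]]
targets = [0.1, 0.2, 0.3]

# Hypothetical stand-in predictors, not the models used in the thesis.
candidates = {
    "constant_baseline": lambda x: 0.2,
    "identity_toy": lambda x: x[0],
}

results = {name: evaluate(fn, features, targets) for name, fn in candidates.items()}
best = select_model(results)
```

Measuring prediction time per sample rather than per batch reflects the real-time setting, where each backchannel decision has to be made on a single incoming turn.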

Overall, the results demonstrate the importance of the prosodic features articulation rate, fundamental frequency and intensity in influencing backchannel behavior during conversation. These findings provide new insights into human communication patterns and offer potential applications for improving human-robot interaction, laying the groundwork for future advancements in the development of conversational agents.

DOI