CNN-based Homophone Disambiguation for conversational speech

Status

Finished

Type

Bachelor Project

Announcement date

01 Jan 2022

Student

Emil Berger

Mentors

Research Areas

**Abstract**

Homophones are words that are pronounced the same way but differ in meaning; they are a common source of errors in automatic speech recognition (ASR). This thesis used machine learning methods to disambiguate the word tokens ’ach’, ’ah’, ’auch’, ’eine’ and ’er’, all of which were reduced to the sound /a/ in conversational Austrian German. Two approaches were taken. (a) 128 acoustic and prosodic features were extracted from the tokens and used to train several multilayer perceptrons (MLPs) and a logistic regression model. The data came from the GRASS corpus [1]. This approach builds on previous work by Xenia Kogler [2], who used a random forest. (b) Spectrograms were extracted both with and without the silence preceding the token of interest. These spectrograms were used to train different convolutional neural network (CNN) models: a greyscale model, an RGB model with the fundamental frequency (F0) represented in the green channel, and a combined model that took a greyscale spectrogram and F0 as a one-dimensional signal as input.
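To make the spectrogram inputs of approach (b) more concrete, the following is a minimal sketch of how a greyscale spectrogram and an RGB variant carrying F0 in the green channel could be built. It is not the thesis implementation: the function names `log_spectrogram` and `add_f0_channel`, the frame length, hop size, 80 dB dynamic range, and the choice to mark F0 as one bright pixel per frame are all assumptions made for illustration. The F0 track is assumed to come from a separate pitch tracker, with 0 marking unvoiced frames.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=64):
    """Log-magnitude spectrogram scaled to [0, 1], shape (freq_bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, frames)
    log_mag = 20.0 * np.log10(magnitude + 1e-10)
    # Assumption: keep an 80 dB dynamic range below the peak.
    log_mag = np.clip(log_mag, log_mag.max() - 80.0, None)
    return (log_mag - log_mag.min()) / (log_mag.max() - log_mag.min())

def add_f0_channel(grey, f0_hz, sample_rate, frame_len=256):
    """Build an RGB image: spectrogram in R and B, F0 contour in G.

    Assumed encoding: one bright pixel per voiced frame, placed at the
    frequency bin closest to F0. The thesis may encode F0 differently.
    """
    n_bins, n_frames = grey.shape
    green = np.zeros_like(grey)
    bin_width = sample_rate / frame_len  # Hz per frequency bin
    for t, f0 in enumerate(f0_hz[:n_frames]):
        if f0 > 0:  # 0 marks unvoiced frames
            row = min(int(round(f0 / bin_width)), n_bins - 1)
            green[row, t] = 1.0
    return np.stack([grey, green, grey], axis=-1)

# Synthetic stand-in for a token: 0.2 s of a 125 Hz tone at 16 kHz.
sample_rate = 16000
time = np.arange(int(0.2 * sample_rate)) / sample_rate
token = np.sin(2 * np.pi * 125.0 * time)

grey = log_spectrogram(token)
f0_track = np.full(grey.shape[1], 125.0)  # hypothetical pitch-tracker output
rgb = add_f0_channel(grey, f0_track, sample_rate)
```

Including the preceding silence would, under these assumptions, only change which samples are passed to `log_spectrogram`; the image construction stays the same.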

The feature-based models achieved results similar to those of the random forest: the best MLP model reached an accuracy of 57%, and the logistic regression model reached 54%.

The spectrogram-based models showed slightly lower accuracy: the best model (greyscale spectrogram including the preceding silence) reached 55%. Including the preceding silence also increased accuracy; for the greyscale models, accuracy rose by 9%.

Across all experiments, the more complex models did not perform as well as the simpler ones, which is attributed to the small amount of data (4174 tokens for the feature-based models, 4221 tokens for the CNN models).

The results of this thesis show that a spectrogram approach, which requires little computational effort compared with feature extraction, leads to a reasonable classification of homophones. This method is a promising approach for reducing the word error rate in real-life ASR applications.

[1] B. Schuppler, M. Hagmüller, and A. Zahrer, “A corpus of read and conversational Austrian German,” Speech Communication, vol. 94, pp. 62–74, 2017, ISSN: 0167-6393. DOI: https://doi.org/10.1016/j.specom.2017.09.003. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167639317300535.

[2] X. Kogler, “Classification of homophones in conversational Austrian German via random forest,” Bachelor’s thesis, Graz University of Technology, Oct. 2021.