Signal Processing and Speech Communication Laboratory

Creak Detection In Pathological Voices

Status
Finished
Type
Bachelor Project
Announcement date
08 Jul 2020
Student
Anna Viehhauser

Abstract

In speech production, three phonation types are typically distinguished: modal, breathy and creaky voice. Creaky voice is a voice quality that may be relevant to phonological and linguistic analyses, or that may carry paralinguistic information such as intentions or emotions. The aim of this project is to investigate how a creak detection tool trained on healthy speech performs on pathological voices. For this purpose, we use creapy, a tool for automatic detection and labelling of creak in conversational speech. The audio files contain readings of the German standard text “Der Nordwind und die Sonne”. The pathological dataset consists of 23 speakers divided into three subgroups (non-diplophonic, infrequently diplophonic and frequently diplophonic) according to their rate of diplophonia while reading the text. The healthy dataset consists of 15 speakers. Since these datasets differ from the original creapy training data, it is necessary to adjust the parameter settings of creapy to our data. To evaluate creapy, we look at the impact of different parameter settings on the F1-score.
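Since the evaluation relies on the F1-score, a minimal sketch of how such a score can be computed is given here. It assumes a frame-wise comparison of binary labels (1 = creak, 0 = no creak); the abstract does not state whether the project scored frames or intervals, and the function name `f1_score` and the example labels are purely illustrative.

```python
def f1_score(reference, predicted):
    """F1-score for two equally long sequences of binary labels.

    Assumption: 1 marks a creaky frame, 0 a non-creaky frame.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have equal length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # creak found correctly
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false alarm
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed creak

    if tp == 0:
        return 0.0  # avoids division by zero when nothing was detected correctly

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight frames (not data from the project).
reference = [0, 1, 1, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 0, 0, 1, 1, 0]
score = f1_score(reference, predicted)
```

Because the F1-score is the harmonic mean of precision and recall, it penalises both missed creak and false alarms, which makes it suitable for comparing parameter settings that trade one against the other.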

Since the original creapy training data consisted of conversational speech, whereas our data consist of read speech, there is a mismatch in speech rate. First, we adjusted the parameter for the minimum creak length. Second, since creak appears less frequently in read speech, we increased the threshold for the creak probability. Third, pathological speakers have more noise in their voices than healthy speakers, so we adapted the threshold for the short-time energy. When the parameter settings are changed, the pathological data show patterns very similar to those of the healthy dataset, and the F1-score for the pathological data is about 5-10% lower than for the healthy data. The maximum F1-score is 47.19% for the pathological dataset and 59.81% for the healthy dataset. Both maxima are achieved with the short-time energy threshold at the 30% quantile, a minimum creak length of 60 ms and a creak probability threshold of 80%. The F1-scores of individual speakers range widely, from 22% to 88%, which results in a high standard deviation.

creapy offers two additional gender-specific models, one trained only on male voices and one trained only on female voices. For healthy male speakers, we achieve the best F1-score (63.97%) with the male model; for healthy female speakers, the model trained on all voices achieves the highest value (63.36%). For healthy female speakers, the other two models also obtain good results; the one trained only on male voices achieves an F1-score of 54.85%. For the pathological dataset, the model trained on all voices yields the highest F1-score for both female (48.29%) and male (44.43%) voices.
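To illustrate how the three adjusted parameters could interact, the following sketch applies them to frame-wise values. It is not creapy's implementation or API: the function `detect_creak`, the 10 ms hop size, the rule that frames at or below the energy quantile are discarded, and the example arrays are all assumptions made for illustration. Only the default values (80% creak probability, 30% energy quantile, 60 ms minimum creak length) are taken from the best setting reported above.

```python
import math

import numpy as np


def detect_creak(creak_prob, energy, hop_ms=10,
                 prob_threshold=0.80, energy_quantile=0.30, min_creak_ms=60):
    """Return a boolean mask of frames labelled as creak (hypothetical sketch)."""
    creak_prob = np.asarray(creak_prob, dtype=float)
    energy = np.asarray(energy, dtype=float)

    # Assumption: frames whose short-time energy does not exceed the
    # chosen quantile are treated as too quiet to be labelled.
    energy_floor = np.quantile(energy, energy_quantile)
    mask = (creak_prob >= prob_threshold) & (energy > energy_floor)

    # Find contiguous runs of True frames as [start, end) index pairs.
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[::2], changes[1::2]

    # Discard runs shorter than the minimum creak length.
    min_frames = math.ceil(min_creak_ms / hop_ms)
    for start, end in zip(starts, ends):
        if end - start < min_frames:
            mask[start:end] = False
    return mask


# Hypothetical values for twelve 10 ms frames (not data from the project).
creak_prob = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1, 0.95]
energy = np.arange(1, 13)
mask = detect_creak(creak_prob, energy)
```

In this example the first four frames are rejected because their energy is too low, and the single high-probability frame at the end is rejected because it is shorter than 60 ms, so only the six-frame run in the middle remains. Sweeping the three keyword arguments over a grid and scoring each resulting mask against reference labels would be one way to reproduce the kind of parameter study described above.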

This study shows that, with respect to the parameter settings, the pathological dataset follows patterns similar to those of the healthy dataset, with F1-scores for the pathological data about 5-10% lower than for the healthy data. F1-scores vary greatly among individual speakers. To enhance the performance of creapy on such data, a future step would be to train the tool on pathological speech.