Intonation Prediction for Speech Synthesis Using the Fujisaki Model and Machine Learning

Project Type: Master/Diploma Thesis
Student: Wimmer Klaus
Mentor: Gernot Kubin


Prosody is a key area in which the intelligibility and naturalness of synthetic speech can be improved. This thesis deals with the automatic analysis and synthesis of intonation, with the aim of developing an intonation model for the UPC text-to-speech system. Fujisaki's model of the F0 production process is used to obtain a quantitative representation of the F0 contours in a speech corpus. A relationship is established between the parameters of the Fujisaki model and linguistic intonation units (accent groups and intonation groups). Several methods for the automatic extraction of Fujisaki model parameters are tested. The best-performing method is an algorithm that uses constraints derived from the intonation units to limit the number and position of Fujisaki commands. These Fujisaki commands are then mapped onto linguistic features obtained automatically from text. This information is used to train three machine learning algorithms (decision trees, neural networks, and vector clustering using CART) in order to obtain a compact representation of the intonation patterns of the corpus. The three learning algorithms show no significant difference in performance. The results of the prediction experiments are comparable to those of other approaches using the same corpus. The developed intonation model is evaluated in a perceptual test. Compared to the previous, rule-based intonation model, the ratings show a significant improvement.
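
For readers unfamiliar with the Fujisaki model, the following minimal sketch shows how an F0 contour is generated from phrase and accent commands. It follows the standard published formulation, in which log F0 is the sum of a base value, phrase components (impulse responses) and accent components (differences of step responses). The function names, the default constants (alpha = 2.0/s, beta = 20.0/s, gamma = 0.9) and the example commands are illustrative assumptions, not values taken from the thesis.

```python
import math


def phrase_response(t, alpha=2.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t), with ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds,
                alpha=2.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    fb          -- base frequency in Hz
    phrase_cmds -- list of (onset_time, amplitude) pairs
    accent_cmds -- list of (onset_time, offset_time, amplitude) triples
    alpha, beta, gamma -- illustrative defaults (assumed, not from the thesis)
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)


# Hypothetical example: one phrase command and one accent command.
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.6, 0.4)]
contour = [fujisaki_f0(i * 0.01, 80.0, phrase_cmds, accent_cmds)
           for i in range(150)]
```

In this picture, parameter extraction amounts to finding the command times and amplitudes that best reproduce a measured contour, and prediction amounts to estimating those same values from linguistic features of the text.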