Using frequency-domain features for speech recognition with variable sampling frequencies

Project Type: Master/Diploma Thesis
Student: Bauerecker Hermann
Mentor: Gernot Kubin


When a speech recognition system has to deal with signals at different sampling frequencies, multiple acoustic models may have to be maintained. To avoid this drawback, the system can be trained at the highest expected sampling frequency and the acoustic models subsequently converted to a new sampling frequency. However, the usual mel-frequency cepstral coefficients are not well suited to this approach, since they are not defined in the frequency domain. In this project, we therefore tackle the problem using features obtained by frequency-filtering the logarithmic band energies. Experimental results are given for SpeechDatCar databases at 16 kHz, 11 kHz, and 8 kHz sampling rates. They show no degradation in recognition performance for 11 kHz and 8 kHz test signals when the system is trained at 16 kHz and then inexpensively converted to 11 or 8 kHz, compared with training the system directly at 11 or 8 kHz. When a voice activity detector and mean and variance normalisation are added to the system, the word error rate at 16 kHz is reduced by 78 %, while the behavior of the sampling rate conversion remains the same. Additional experiments with a one-stage Wiener filter were performed only at a 16 kHz sampling rate for both training and testing; they showed no meaningful improvement in recognition performance.
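
The following minimal sketch illustrates why frequency-filtered features stay tied to individual frequency bands. It assumes the commonly used filter z - z^-1 applied along the band index with zero padding at both ends; the abstract does not state which filter the thesis uses, and the function name and energy values are hypothetical. It shows that features computed from only the lower bands match those computed from the full band set, except at the upper edge.

```python
import numpy as np


def frequency_filter(log_energies):
    """Filter log band energies along the band index with z - z^-1.

    Assumption: zero padding at both ends, so that
    F(k) = S(k+1) - S(k-1) with S(0) = S(K+1) = 0.
    """
    s = np.asarray(log_energies, dtype=float)
    padded = np.pad(s, 1)  # one zero before the first and after the last band
    return padded[2:] - padded[:-2]


# Hypothetical log band energies for 8 bands (wideband signal).
full = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 2.0, 1.0])

# Features from all 8 bands, and from the lowest 5 bands only
# (standing in for a lower sampling frequency).
ff_full = frequency_filter(full)
ff_low = frequency_filter(full[:5])

# All features except the one at the upper band edge agree.
print(np.allclose(ff_full[:4], ff_low[:4]))
```

Cepstral coefficients do not have this property, because the cosine transform that produces them mixes all bands into every coefficient.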