Speech Recognition - A Transfer Learning Approach

Status

Finished

Type

Master Thesis

Announcement date

01 Jan 2020

Student

Raphael Schlüsselbauer

Mentors

Franz Pernkopf

Research Areas

Intelligent Systems

Abstract

Automated speech recognition (ASR) is of major importance as a hands-free human-computer interface. Possible applications are voice-controlled systems, dialog systems, and documentation from dictation. Systems for the English language already achieve very low word error rates (WERs) because large corpora are freely available. For the German language, there seems to be too little free data available to train an ASR system with comparable accuracy. We suspect that German models with superior accuracy can be trained by leveraging English training data. This work evaluates that hypothesis. To this end, we use several ASR models in a transfer learning ASR setup: (i) a particular type of hidden Markov model hybrid, i.e. an HMM with a factorized time delay neural network (HMM/TDNN-F), (ii) a transformer network, (iii) the Wav2Letter architecture, and (iv) the DeepSpeech 2 architecture. Open source frameworks are used to compare the proposed architectures on the English speech dataset Librispeech and the German Mozilla Common Voice dataset. In particular, transfer learning models are initialized with parameters obtained with the English speech corpus, which are mapped to the German ASR models. In order to align subword representations, we adapt the network's output layer to the vocabulary size and subword units of the German speech corpus. We measure the network's accuracy in terms of WER and character error rate (CER). The transformer architecture trained without a language model (LM) achieves the best WER on the Librispeech dataset, namely 4.9% on Librispeech test-clean. The same model trained on the German Mozilla Common Voice dataset reached a WER of 39.9%. Using a transfer learning setup including English speech, this result could be improved by 16% relative. We conclude that the performance of German ASR models is improved significantly by using an English model as weight initialization in a transfer learning setup.
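The error metrics named above can be illustrated with a minimal sketch. It assumes the common definitions of WER and CER as the Levenshtein distance between reference and hypothesis, computed over words or characters and divided by the reference length. The function names and the example sentences and numbers are hypothetical; they are not taken from the thesis or from the toolkits it uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_improvement(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Hypothetical example: one substitution and one deletion in six words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
# Hypothetical error rates, chosen only to illustrate the formula.
example_gain = relative_improvement(40.0, 30.0)
```

A relative improvement is measured against the baseline error rate, so it is not the same as an absolute reduction in percentage points.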
The benefit of transfer learning is stronger when little training data is available. ASR models using only connectionist temporal classification (CTC) reached WERs of 13.43% (Wav2Letter) and 29.39% (DeepSpeech 2) without an LM on the Librispeech corpus. This indicates that the attention mechanism of the transformer architecture increases accuracy and reduces the need for an LM. This paves the way for edge implementations that do without memory-demanding LMs. When analyzing the computational performance of the proposed architectures in terms of network inference, we observe that GPUs or CPUs are still required for fast processing. In particular, we compared the network inference time of ESPnet and Kaldi on an Intel Xeon E5-2697v3 CPU @ 2.60GHz. For the evaluation of training and inference performance on GPUs, an NVIDIA Tesla K40c GPU with 12GB VRAM was used. The Wav2Letter model had the fastest training time with 4.25 hours per training epoch on Librispeech. DeepSpeech 2 had the fastest greedy decoding time on the GPU with 1.33 minutes for 1 hour of audio. In comparison, the HMM/TDNN-F offered the fastest greedy decoding time on the CPU with 4.72 minutes for 1 hour of audio. Inference time depends strongly on the LM size and the beam size chosen for decoding the phoneme and character representations obtained by the acoustic models. The evaluation of inference time shows that all evaluated models can decode audio faster than real time if the beam size of the decoder is sufficiently small. However, the evaluated models have to be scaled down significantly in terms of memory and computational complexity to run on edge devices in real time.
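The "faster than real time" statement can be made concrete with the real-time factor (RTF), i.e. decoding time divided by audio duration. The sketch below applies this common definition to the two greedy decoding times quoted above. The function and variable names are hypothetical, and the RTF itself is not a figure reported in the abstract.

```python
def real_time_factor(decode_minutes, audio_minutes):
    """Decoding time divided by audio duration; below 1.0 means faster than real time."""
    if audio_minutes <= 0:
        raise ValueError("audio duration must be positive")
    return decode_minutes / audio_minutes


AUDIO_MINUTES = 60.0  # both quoted timings refer to 1 hour of audio

# Greedy decoding times quoted in the abstract.
rtf_deepspeech2_gpu = real_time_factor(1.33, AUDIO_MINUTES)  # DeepSpeech 2 on the GPU
rtf_hmm_tdnnf_cpu = real_time_factor(4.72, AUDIO_MINUTES)    # HMM/TDNN-F on the CPU

for name, rtf in [("DeepSpeech 2 (GPU)", rtf_deepspeech2_gpu),
                  ("HMM/TDNN-F (CPU)", rtf_hmm_tdnnf_cpu)]:
    print(f"{name}: RTF = {rtf:.4f}, about {1 / rtf:.1f}x faster than real time")
```

Both values are well below 1.0 for greedy decoding. As the abstract notes, larger beam sizes and LMs increase decoding time and therefore the RTF.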