Phonetic Similarity Matching of Non-Literal Transcripts in Automatic Speech Recognition




Large vocabulary continuous speech recognition (LVCSR) systems require large amounts of labelled audio data for training. Literal transcriptions of audio recordings, i.e., highly accurate textual reproductions of the utterances, are expensive and therefore available only in limited amounts, whereas non-literal field data from commercial automatic dictation systems can be collected on a large scale, albeit with quality limitations: the automatic draft transcriptions from the dictation system contain misrecognitions, and the manual corrections of these drafts, produced by professional transcriptionists, are reformulated to comply with stylistic guidelines. In this work, phonetic similarity matching is utilised to bridge the gap between literal and non-literal text resources, so that large amounts of non-literal transcripts can be employed to improve LVCSR systems. The work gives the first detailed analysis, at both the orthographic and the phonetic level, of the deviations between manual reference transcripts, automatically recognised transcripts, and final corrected documents in a medical transcription environment. Based on these insights, a novel method was developed for aligning recognised transcripts and final corrected documents on multiple levels of segmentation. The alignment is calculated from the similarity of two phone strings, determined with a stochastic string edit distance function trained on task-specific data. The proposed methods are applied to two exemplary, application-driven problems.
First, quasi-literal transcripts of medical dictations are reconstructed from the non-literal automatically recognised transcripts and the final corrected medical reports. Semantic and phonetic similarity measures are defined for classifying aligned text chunks as either recognition errors or reformulations introduced by the medical transcriptionist. Language model retraining with a corpus of 50 million reconstructed words resulted in a relative word error rate reduction of 7.8% for a commercial medical transcription system. Second, speaker-specific pronunciation models for non-native speakers are generated from small amounts of available adaptation data. Phonetic similarity matching is utilised to measure both the lexical confusability and the accuracy gain of a proposed pronunciation variant, so that the two effects can be balanced for a given lexicon. Recognition tests with speaker-specifically adapted lexica resulted in an average relative word error rate reduction of 1% per speaker for the same commercial medical dictation system.
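
The phone-string similarity that underlies the alignment can be illustrated with a small sketch. It is not the thesis implementation: it computes only the best-path (Viterbi) variant of an edit distance whose operation costs are negative log probabilities, and every probability below (`SUB_PROB`, `MATCH_PROB`, `DEFAULT_SUB_PROB`, `INS_DEL_PROB`) is a hand-set, hypothetical value standing in for parameters that would be trained on task-specific data. The function name `phone_distance` and the phone symbols are likewise assumptions made for illustration.

```python
import math

# Hypothetical edit-operation probabilities. In a trained stochastic edit
# distance these would be estimated from data; here they are hand-set.
SUB_PROB = {("t", "d"): 0.05, ("d", "t"): 0.05, ("s", "z"): 0.04, ("z", "s"): 0.04}
MATCH_PROB = 0.8          # probability that a phone is kept unchanged
DEFAULT_SUB_PROB = 0.001  # probability of any substitution not listed above
INS_DEL_PROB = 0.01       # probability of inserting or deleting one phone


def op_cost(prob):
    """Convert an edit-operation probability into an additive cost."""
    return -math.log(prob)


def sub_cost(a, b):
    """Cost of aligning phone a with phone b (match or substitution)."""
    if a == b:
        return op_cost(MATCH_PROB)
    return op_cost(SUB_PROB.get((a, b), DEFAULT_SUB_PROB))


def phone_distance(x, y):
    """Best-path edit cost between two phone sequences (lists of symbols)."""
    gap = op_cost(INS_DEL_PROB)
    # prev[j] holds the cost of aligning the processed prefix of x with y[:j].
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [i * gap]
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + gap,                 # delete a
                curr[j - 1] + gap,             # insert b
                prev[j - 1] + sub_cost(a, b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# A voicing confusion (t/d) should cost less than an unlisted substitution (t/k).
close = phone_distance(["b", "a", "t"], ["b", "a", "d"])
far = phone_distance(["b", "a", "t"], ["b", "a", "k"])
print(close < far)
```

Because matches also carry a probability below one, identical non-empty sequences receive a small positive cost in this sketch; only two empty sequences have a cost of zero. A lower cost means the two phone strings are more likely to be renderings of the same utterance under the assumed probabilities.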


This thesis is supervised by Gernot Kubin.