Speech Communication 2

Instructor: 

General Information

This lecture course complements Speech Communication 1 by addressing man-machine communication by means of spoken language. This extends signal processing into symbolic processing: for example, speech units are recognized using statistical detection theory (hidden Markov models) together with language models built from large corpora of (symbolic) text. The course is highly interdisciplinary. It reviews methods from speech recognition and synthesis; from computational linguistics, including the development of databases for speech corpora, lexica, and grammars at various levels (syntax, semantics, pragmatics); and from user interface design and dialogue management. A central goal is to teach how these components contribute to the design of conversational dialogue systems that allow spoken-language access to the information infrastructure.

Contents 

  • Automatic speech recognition (ASR)
      • Introduction
      • Feature extraction
      • Classification
      • Markov models
      • Hidden Markov models (HMMs)
      • Phonetic elements
      • Grammar models
      • Decoding (Viterbi decoder)
  • Selected topics from speech synthesis
      • Harmonic-plus-noise model
      • Oscillator-plus-noise model
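
As an illustration of the decoding topic listed above, the following is a minimal sketch of Viterbi decoding for a discrete-observation HMM, written in plain Python in the log domain. It is not taken from the lecture notes: the function name `viterbi`, the state labels `s1`/`s2`, the observation symbols `a`/`b`/`c`, and all probabilities are hypothetical toy values chosen only to make the example runnable. A real recognizer would use continuous feature vectors and much larger models.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete HMM."""
    # delta[s]: best log-probability of any state path ending in s
    delta = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []  # one dict per time step t >= 1: state -> best predecessor

    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            best_prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = delta[best_prev] + log_trans[best_prev][s] + log_emit[s][obs]
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return delta[last], path


def to_log(table):
    """Convert a (possibly nested) dict of nonzero probabilities to logs."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical toy model (all values are assumptions for illustration).
states = ["s1", "s2"]
log_start = to_log({"s1": 0.6, "s2": 0.4})
log_trans = to_log({"s1": {"s1": 0.7, "s2": 0.3},
                    "s2": {"s1": 0.4, "s2": 0.6}})
log_emit = to_log({"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
                   "s2": {"a": 0.1, "b": 0.3, "c": 0.6}})

best_log_prob, best_path = viterbi(["a", "b", "c"], states,
                                   log_start, log_trans, log_emit)
print(best_path, math.exp(best_log_prob))
```

Working in the log domain turns products of probabilities into sums, which avoids numerical underflow on long observation sequences. The sketch assumes every probability is nonzero; zero entries would need to be mapped to negative infinity before taking logs.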

Lecture notes

References

 

Speech recognition in general

  • L. Rabiner, B. H. Juang: "Fundamentals of Speech Recognition", Prentice Hall, Englewood Cliffs, NJ, 1993.
  • E.G. Schukat-Talamazzini: "Automatische Spracherkennung", Vieweg Verlag, Braunschweig, 1995.
  • R.A. Cole et al.: Survey of the State of the Art in Human Language Technology, 1996.
  • F. Jelinek: Statistical Methods for Speech Recognition (Language, Speech, and Communication). MIT Press 1999.
  • D. Jurafsky et al.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice-Hall 2000.
  • X. Huang, A. Acero, H.-W. Hon: Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall PTR 2001.

Classification

Gaussian Mixtures / k-means / EM algorithm

  • Tutorial "Mixtures of Gaussians" for the course Computational Intelligence, TU Graz.
  • Jeff A. Bilmes: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, International Computer Science Institute, TR-97-021.
  • S. Bengio: An Introduction to Statistical Machine Learning - EM for GMMs, Dalle Molle Institute for Perceptual Artificial Intelligence.

Markov Models / Hidden Markov Models

  • L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, vol. 77, no. 2, Feb. 1989.
  • Tutorial "Hidden Markov Models" for the course Computational Intelligence, TU Graz.
  • Eric Fosler-Lussier: Markov Models and Hidden Markov Models - A Brief Tutorial, International Computer Science Institute Technical Report TR-98-041.
  • Hervé Bourlard, Sacha Krstulovic, and Mathew Magimai-Doss: EPFL lab notes "Introduction to Hidden Markov Models".

Sinusoidal Modeling/Harmonic-plus-Noise Model

  • Thomas F. Quatieri: Discrete-Time Speech Signal Processing, Prentice Hall, 2002.
  • J. Laroche, Y. Stylianou, and E. Moulines: HNS: Speech Modification Based on a Harmonic+Noise Model, Proc. of ICASSP 1993, vol.2, pp.550-553.
  • J. Laroche, Y. Stylianou, and E. Moulines: HNM: A Simple, Efficient Harmonic + Noise Model for Speech, Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 1993, pp.169-172.
  • Y. Stylianou: Applying the harmonic plus noise model in concatenative speech synthesis, IEEE Trans. on Speech and Audio Processing, vol. 9, no.1, pp.21-29, Jan. 2001.
  • G. Bailly: A parametric harmonic + noise model, in Keller et al.: Improvements in Speech Synthesis, Wiley, 2002.
  • E. R. Banga et al.: Concatenative text-to-speech synthesis based on sinusoidal modelling, in Keller et al.: Improvements in Speech Synthesis, Wiley, 2002.
  • D. O'Brien and A. Monaghan: Shape invariant pitch and time-scale modification of speech based on a harmonic model, in Keller et al.: Improvements in Speech Synthesis, Wiley, 2002.

Term: Summer
Education Level: Master Level