Speech Segmentation of Audio Books

Master Project
Announcement date: 01 Oct 2011
Philipp Salletmayr
  • Harald Romsdorfer

An important task in speech processing is the segmentation of speech utterances into the corresponding sequence of phones. This segmentation is traditionally performed with a phoneme-based forced alignment algorithm. However, segmenting long speech utterances, so-called monologues, is in general non-trivial, cf. [1].

Audio books offer a rich resource of high-quality speech material with accompanying text. Unfortunately, this speech material is provided as a set of very long speech files. Recently, different approaches to the segmentation of monologues have been proposed, e.g. in [2].

This thesis aims to investigate an approach to the phone segmentation of long speech monologues using, e.g., a combination of a grapheme-based forced alignment procedure for an initial sentence- and/or word-level segmentation, followed by a phone-based forced alignment procedure for the final phone-level segmentation.
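The coarse-to-fine idea can be illustrated with a small toy sketch. It is not the method to be developed in the thesis: real systems score frames with acoustic models, whereas here each frame is a single number and every word or phone has a hypothetical scalar "template" (the names forced_align, sq_score, two_stage_segment, coarse_tpl and fine_tpl are all assumptions made for this illustration). A simple dynamic-programming alignment first places the word boundaries, and the same routine is then run inside each word span to place the phone boundaries.

```python
import math

def forced_align(scores):
    """Monotonic alignment of N units to T frames (requires T >= N).

    scores[t][n] is a log-like score of frame t belonging to unit n.
    Returns one (start, end) frame span per unit, end exclusive.
    """
    T, N = len(scores), len(scores[0])
    if T < N:
        raise ValueError("need at least one frame per unit")
    best = [[-math.inf] * N for _ in range(T)]
    entered = [[False] * N for _ in range(T)]  # True: unit n starts at frame t
    best[0][0] = scores[0][0]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = best[t - 1][n]
            move = best[t - 1][n - 1] if n > 0 else -math.inf
            if move > stay:
                best[t][n] = move + scores[t][n]
                entered[t][n] = True
            else:
                best[t][n] = stay + scores[t][n]
    # Backtrace from the last frame of the last unit to recover boundaries.
    bounds, n = [T], N - 1
    for t in range(T - 1, 0, -1):
        if entered[t][n]:
            bounds.append(t)
            n -= 1
    bounds.append(0)
    bounds.reverse()
    return [(bounds[i], bounds[i + 1]) for i in range(N)]

def sq_score(frames, templates):
    """Toy stand-in for an acoustic model: negative squared distance."""
    return [[-(x - m) ** 2 for m in templates] for x in frames]

def two_stage_segment(frames, words, coarse_tpl, fine_tpl):
    """words is a list of (word, [phones]); returns (phone, start, end) triples.

    Stage 1 aligns whole words; stage 2 aligns phones inside each word span.
    Each word span must contain at least as many frames as the word has phones.
    """
    word_spans = forced_align(sq_score(frames, [coarse_tpl[w] for w, _ in words]))
    result = []
    for (_, phones), (ws, we) in zip(words, word_spans):
        local = forced_align(sq_score(frames[ws:we], [fine_tpl[p] for p in phones]))
        result += [(p, ws + s, ws + e) for p, (s, e) in zip(phones, local)]
    return result

# Toy data (hypothetical): ten frames, two words with two phones each.
frames = [1.0, 1.0, 1.0, 3.0, 3.0, 7.0, 7.0, 7.0, 9.0, 9.0]
words = [("ab", ["a", "b"]), ("cd", ["c", "d"])]
coarse_tpl = {"ab": 2.0, "cd": 8.0}
fine_tpl = {"a": 1.0, "b": 3.0, "c": 7.0, "d": 9.0}
print(two_stage_segment(frames, words, coarse_tpl, fine_tpl))
# [('a', 0, 3), ('b', 3, 5), ('c', 5, 8), ('d', 8, 10)]
```

The point of the two stages is that the second alignment only ever sees the frames of one word, so an error in one part of a long recording cannot propagate through the whole file, and the search space of each fine alignment stays small.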


[1] P. J. Moreno and C. Alberti: A factor automaton approach for the forced alignment of long speech recordings. In Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pages 4869–4872, Taipei, Taiwan, 2009.

[2] K. Prahallad: Automatic Building of Synthetic Voices from Audio Books. PhD Thesis, CMU, Pittsburgh, 2010.