Version control for speech corpora

In work
Master Thesis
Announcement date
08 Jan 2021
Vlad Dumitru

A speech corpus consists of a set of audio recordings and, for each of the recordings, a set of annotation layers representing various features of the audio signal.
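The structure described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names (`Annotation`, `Recording`, `Corpus`), the layer names, and the file name `rec001.wav` are hypothetical, and annotations are assumed to be labelled time intervals, which is one common choice and not necessarily the representation used in the thesis.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A labelled time span of the audio signal (assumed representation)."""
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    label: str


@dataclass
class Recording:
    """One audio recording together with its annotation layers."""
    audio_path: str
    # Maps a layer name (e.g. "words") to the annotations on that layer.
    layers: dict[str, list[Annotation]] = field(default_factory=dict)


@dataclass
class Corpus:
    """A set of recordings, keyed by a recording identifier."""
    recordings: dict[str, Recording] = field(default_factory=dict)


# Hypothetical example data: one recording with two annotation layers.
corpus = Corpus()
rec = Recording(audio_path="rec001.wav")
rec.layers["words"] = [
    Annotation(0.0, 0.4, "hello"),
    Annotation(0.4, 0.9, "world"),
]
rec.layers["phones"] = [Annotation(0.0, 0.1, "h")]
corpus.recordings["rec001"] = rec
```

Keeping each layer as an independent list means a layer can be added or replaced without touching the others, which is convenient when different annotators are responsible for different features.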

Developing such a corpus is labor-intensive, so a common approach is to have multiple annotators work in parallel on the same recordings.

As the size and complexity of such a corpus increase, two problems need to be addressed:

  1. Storing the contents of the corpus efficiently, so that it can be served as a static package while still supporting analysis of its contents, for example through queries over the various annotation layers;
  2. Supporting collaborative work, so that annotators can work in parallel and have their partial results merged.
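To make the two problems concrete, here is a minimal sketch of a layer query and a merge of two annotators' partial results. Everything in it is an assumption made for illustration: the function names `query` and `merge`, the tuple representation `(start, end, label)`, and the rule that two annotations conflict only when they cover exactly the same time span with different labels. A real system would likely need tolerance for slightly different boundaries and a way to resolve conflicts, which this sketch only reports.

```python
from typing import Dict, List, Tuple

# Assumed representation of one annotation: (start, end, label), times in seconds.
Annotation = Tuple[float, float, str]


def query(layer: List[Annotation], start: float, end: float) -> List[Annotation]:
    """Return the annotations that overlap the half-open interval [start, end)."""
    return [a for a in layer if a[0] < end and a[1] > start]


def merge(
    ours: List[Annotation], theirs: List[Annotation]
) -> Tuple[List[Annotation], List[Tuple[float, float, str, str]]]:
    """Union two partial layers.

    Annotations on identical spans with different labels are reported as
    conflicts; the label seen first (from `ours`) is kept in the result.
    """
    merged: Dict[Tuple[float, float], str] = {}
    conflicts: List[Tuple[float, float, str, str]] = []
    for s, e, label in ours + theirs:
        key = (s, e)
        if key in merged and merged[key] != label:
            conflicts.append((s, e, merged[key], label))
        else:
            merged[key] = label
    result = sorted((s, e, label) for (s, e), label in merged.items())
    return result, conflicts


# Hypothetical partial results from two annotators on the same recording.
alice = [(0.0, 0.4, "hello"), (0.4, 0.9, "world")]
bob = [(0.4, 0.9, "word"), (0.9, 1.3, "again")]

merged_layer, conflicts = merge(alice, bob)
overlapping = query(merged_layer, 0.3, 0.5)
```

Here `merged_layer` contains three annotations, `conflicts` records the disagreement on the span from 0.4 to 0.9 seconds, and `overlapping` holds the two annotations that intersect the window from 0.3 to 0.5 seconds.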

In addition, easy interoperability with external tools is desirable.