Version control for speech corpora
- In progress
- Master Thesis
- Announcement date: 08 Jan 2021
- Vlad Dumitru
A speech corpus consists of a set of audio recordings and, for each of the recordings, a set of annotation layers representing various features of the audio signal.
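To make this structure concrete, the following is a minimal sketch of one possible in-memory representation. All names here (`Interval`, `Recording`, `labels_at`, the file name and the layer names) are hypothetical and chosen for illustration only; the announcement does not prescribe a data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Interval:
    """One labelled time span on an annotation layer (times in seconds)."""
    start: float
    end: float
    label: str


@dataclass
class Recording:
    """An audio file plus its annotation layers, keyed by layer name."""
    audio_path: str
    layers: Dict[str, List[Interval]] = field(default_factory=dict)


def labels_at(recording: Recording, layer: str, time: float) -> List[str]:
    """Return the labels on `layer` whose span [start, end) contains `time`."""
    return [
        iv.label
        for iv in recording.layers.get(layer, [])
        if iv.start <= time < iv.end
    ]


# Hypothetical example data: one recording with a word and a phone layer.
rec = Recording(
    audio_path="rec001.wav",
    layers={
        "words": [Interval(0.0, 0.4, "hello"), Interval(0.4, 0.9, "world")],
        "phones": [Interval(0.0, 0.1, "h"), Interval(0.1, 0.2, "e")],
    },
)

print(labels_at(rec, "words", 0.5))
```

A simple point query such as `labels_at` is the kind of operation that "performing queries on the annotation layers" could involve; a real corpus store would likely need indexing to do this efficiently at scale.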
Developing such a corpus requires a large amount of manual work, so a common approach is to have multiple annotators work in parallel on the same recordings.
As the size and complexity of such a corpus increase, two problems need to be addressed:
- Storing the contents of the corpus efficiently, so that it can both be served as a static package and support analysis of its contents, for example through queries over the various annotation layers;
- Supporting collaborative work, so that annotators can work in parallel and have their partial results merged.
In addition, easy interoperability with external tools is desirable.
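As an illustration of what merging partial results could look like, here is a minimal sketch of a three-way merge at the granularity of whole annotation layers, in the style of conventional version control systems. This is an assumption made for illustration, not the method proposed by the thesis: the function `merge_layers`, the representation of a layer as a list of `(start, end, label)` tuples, and the example data are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a layer is a list of (start, end, label) tuples.
Layer = List[Tuple[float, float, str]]
Layers = Dict[str, Layer]


def merge_layers(base: Layers, ours: Layers, theirs: Layers):
    """Three-way merge of annotation layers, one whole layer at a time.

    A layer changed by only one side takes that side's version. A layer
    changed differently by both sides is reported as a conflict, and our
    version is kept as a placeholder.
    """
    merged: Layers = {}
    conflicts: List[str] = []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:        # both sides agree (including both deleting)
            result = o
        elif o == b:      # only their side changed this layer
            result = t
        elif t == b:      # only our side changed this layer
            result = o
        else:             # both sides changed it differently
            conflicts.append(name)
            result = o
        if result is not None:
            merged[name] = result
    return merged, conflicts


# Hypothetical example: one annotator adds a phone layer while another
# adjusts a word boundary.
base = {"words": [(0.0, 0.4, "hello")]}
ours = {"words": [(0.0, 0.4, "hello")], "phones": [(0.0, 0.1, "h")]}
theirs = {"words": [(0.0, 0.45, "hello")]}

merged, conflicts = merge_layers(base, ours, theirs)
print(merged, conflicts)
```

Merging whole layers keeps the sketch short; a practical system would probably need to merge at the level of individual intervals, so that two annotators editing different parts of the same layer do not conflict.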