Signal Processing and Speech Communication Laboratory

Deep Neural Networks for Multi-Instrument Recognition and Timbre Characterization

Master Thesis
Announcement date
01 Oct 2021
Hannes Bradl
Since music databases and private digital music collections have grown rapidly in recent years, managing such large amounts of data has become a challenging task. Many applications require retrieving recordings with a specific instrumentation, and information about the timbre of each instrument in a mix can be helpful as well. This thesis proposes a system that automatically identifies the instruments in a recording, determines their respective loudness, and characterizes the timbre of each instrument. The system consists of two main parts: a classifier and a set of timbre estimators. The classifier identifies 15 classes of instruments in a polyphonic mixture; the respective timbre estimators then predict several timbre descriptors and the loudness of each instrument present in the mix. All models are realized as convolutional neural networks (CNNs) operating on mel-spectrogram representations of short audio chunks.

To make use of underrepresented classes as well, we constructed a two-level hierarchical taxonomy comprising eight instrument families and seven additional specific instruments. Training of all models was split into two phases: after pre-training on a music tagging dataset, the neural networks were retrained on a combination of three multi-track datasets. To this end, we investigated different transfer learning methods and found that training the fully connected layers from scratch while fine-tuning the convolutional layers yields the best results. Training examples were produced on the fly by mixing single-instrument tracks from the multi-track datasets; we examined two distinct mixing strategies and found that a combination of both works best.

Finally, the classifier and the timbre estimators were evaluated in separate experiments. The classifier achieves state-of-the-art F1-scores for classes with sufficient training data.
For the timbre estimators we also obtained decent performance, although the potential of these models is hard to assess, as there is no ground truth or baseline for this task.
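To illustrate the input representation the models operate on, the following is a minimal NumPy sketch of a log-mel-spectrogram for a short audio chunk. The thesis does not specify its parameters; the sample rate, FFT size, hop size, and number of mel bands below are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def log_mel_spectrogram(chunk, sr=22050, n_fft=1024, hop=512, n_mels=64):
    # frame the chunk, window, FFT, power spectrum, mel-warp, log-compress
    n_frames = 1 + (len(chunk) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([chunk[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (frames, n_fft//2+1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T     # (frames, n_mels)
    return np.log1p(mel)

# a one-second 440 Hz tone as a stand-in audio chunk
sr = 22050
chunk = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
S = log_mel_spectrogram(chunk, sr=sr)
```

In practice, a library such as librosa would compute this; the explicit version above only shows what the CNN input looks like (a time-by-mel-band matrix per chunk).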
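The transfer-learning scheme that worked best (fully connected layers trained from scratch, convolutional layers fine-tuned) amounts to re-initializing the classifier head and giving the pretrained layers a smaller learning rate. A framework-agnostic sketch of that idea, with made-up parameter names, layer shapes, and learning rates:

```python
import numpy as np

def prepare_transfer(params, rng, fc_prefix="fc", lr_conv=1e-4, lr_fc=1e-3):
    """Re-initialize the fully connected head from scratch and assign a
    smaller learning rate to the (pretrained) convolutional layers."""
    groups = []
    for name, w in params.items():
        if name.startswith(fc_prefix):
            params[name] = rng.normal(scale=0.01, size=w.shape)  # fresh init
            groups.append({"name": name, "lr": lr_fc})
        else:
            groups.append({"name": name, "lr": lr_conv})  # keep pretrained weights
    return groups

rng = np.random.default_rng(0)
pretrained = {
    "conv1.w": rng.normal(size=(8, 1, 3, 3)),   # stand-in pretrained conv kernels
    "conv2.w": rng.normal(size=(16, 8, 3, 3)),
    "fc.w":    rng.normal(size=(15, 16)),       # head for 15 instrument classes
}
groups = prepare_transfer(pretrained, rng)
```

In a real training setup this corresponds to per-layer learning-rate groups in the optimizer (e.g. PyTorch parameter groups), with the head's weights freshly initialized rather than loaded from the pre-trained checkpoint.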
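The on-the-fly generation of training examples by mixing single-instrument tracks can be sketched as below. The abstract does not describe the two mixing strategies, so the random per-stem gains and the gain range here are purely hypothetical; only the overall idea (sum gain-scaled stems, avoid clipping) is taken from the text.

```python
import numpy as np

def mix_on_the_fly(stems, rng, gain_db_range=(-12.0, 0.0)):
    """Mix single-instrument stems into one training example.
    gain_db_range is an assumed, illustrative choice."""
    gains_db = rng.uniform(*gain_db_range, size=len(stems))
    gains = 10.0 ** (gains_db / 20.0)          # dB -> linear amplitude
    mix = sum(g * s for g, s in zip(gains, stems))
    peak = np.max(np.abs(mix))
    if peak > 1.0:                             # normalize only if clipping
        mix = mix / peak
    return mix, gains

rng = np.random.default_rng(0)
stems = [0.1 * rng.standard_normal(22050) for _ in range(3)]  # toy stems
mix, gains = mix_on_the_fly(stems, rng)
```

Because the gains are known at mixing time, such a scheme also yields per-instrument loudness targets for free, which is presumably what makes loudness estimation trainable without manual annotation.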