Signal Processing and Speech Communication Laboratory

Efficient Single-Channel Music Source Separation with Deep Neural Networks

Status
Finished
Type
Master Thesis
Announcement date
01 Jan 2020
Student
Markus Huber
Mentors
Research Areas

Abstract

The trend of applying deep learning techniques to difficult signal processing problems has not spared single-channel source separation, and modern systems based on neural networks have indeed reached unprecedented levels of separation quality. However, harnessing the power of these large-scale models in typical audio production environments, which frequently offer only limited computing resources while demanding quasi real-time processing, remains challenging. To utilize deep neural networks on resource-constrained infrastructure, strategies and architectures must be considered that ensure low computational requirements and a small memory footprint, while at the same time preserving (or even improving) accuracy.

This thesis sets out to examine viable solutions to both aspects of the problem within the context of musical audio mixtures, with a particular focus on singing-voice extraction. Various approaches to improve the performance of a state-of-the-art baseline system, the multi-scale multi-band DenseNet, are presented and discussed. These include architectural refinements of the multi-band structure; different training objectives such as mask approximation, multi-task learning, and deep clustering; and the exploitation and estimation of phase information, which allows for optimization in the time domain. Specifically, experiments show that approximating spectral masks with a deep clustering loss, rather than estimating spectrograms directly, yields a considerable performance increase over the baseline implementation. Subsequently, the resource efficiency of this system is addressed. It is shown that a significant reduction of the model size and its computational requirements can be achieved through an effective use of bottleneck layers and the inference of Mel-scaled masks. In addition, applying parameterized structured pruning of convolutional weights results in a further increase in efficiency.
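To illustrate the Mel-scaled-mask idea mentioned above, the sketch below (NumPy, with a simplified triangular filterbank; all sizes and names are illustrative, not taken from the thesis) lets a network predict a mask on a small number of Mel bands and lifts it back to linear frequency with the filterbank pseudo-inverse before applying it to the mixture spectrogram:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft_bins, sr=44100):
    # Simplified triangular Mel filters; librosa.filters.mel is the usual choice.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor(2 * (n_fft_bins - 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft_bins))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                 # rising slope
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

# Predicting masks on 64 Mel bands instead of 1025 linear STFT bins shrinks
# the network's output layer by roughly 16x; the mask is projected back to
# linear frequency before being applied to the mixture magnitudes.
n_bins, n_mels, n_frames = 1025, 64, 100
fb = mel_filterbank(n_mels, n_bins)                   # (64, 1025)
mel_mask = np.random.rand(n_mels, n_frames)           # stand-in for network output
lin_mask = np.clip(np.linalg.pinv(fb) @ mel_mask, 0.0, 1.0)   # (1025, 100)
mixture_mag = np.abs(np.random.randn(n_bins, n_frames))       # stand-in spectrogram
vocals_mag = lin_mask * mixture_mag                   # masked source estimate
```

The saving comes from operating most of the network on the coarser Mel resolution; only the final projection touches the full frequency grid.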

Based on these findings, a high-quality source separation system is obtained that is roughly 1.6 times smaller and 7.3 times more efficient than the state-of-the-art baseline while maintaining its separation performance. Moreover, on a 2018 Mac mini (Intel Core i7), CPU inference times as low as 4 milliseconds were measured. Since this is well within the range of typical audio buffer block sizes, the real-time capability of the approach is confirmed.
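The real-time margin can be checked with simple arithmetic: processing is real-time capable when inference finishes before the next audio buffer arrives, i.e. when the inference time is below the buffer duration block_size / sample_rate. Assuming a 44.1 kHz sample rate (the block sizes below are illustrative, not from the thesis):

```python
def buffer_ms(block_size, sr=44100):
    """Duration of one audio buffer in milliseconds."""
    return 1000.0 * block_size / sr

inference_ms = 4.0  # measured CPU inference time reported above
for block in (256, 512, 1024):
    print(f"{block:5d} samples -> {buffer_ms(block):6.2f} ms, "
          f"real-time: {buffer_ms(block) > inference_ms}")
```

Even a small 256-sample buffer lasts about 5.8 ms at 44.1 kHz, so a 4 ms inference time leaves headroom at all common block sizes.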