Signal Processing and Speech Communication Laboratory

Low-Complexity Convolutional Neural Networks for Acoustic Scene Classification

Master Thesis
Announcement date: 01 Oct 2021
Lukas Maier

Acoustic Scene Classification (ASC) is the task of assigning an acoustic scene class to a given audio recording. Modern ASC systems often rely on Convolutional Neural Networks (CNNs) to solve this task. However, on devices with limited computing capabilities, such as smartphones, large CNNs may demand more resources than are available. Low-complexity CNNs address this issue.

This thesis investigates different methods of designing low-complexity CNNs. Our goal is to construct a CNN that predicts acoustic scenes with high classification accuracy while requiring only a small memory footprint. To achieve this goal, we first define a new network structure called ASCMobConvNet, which relies on Mobile Inverted Bottleneck Convolutions (MobConvs) as its main building block. We then search for the best-performing features to serve as input to our network. The spectrum of the selected features is corrected so that each feature tensor is normalized with respect to the recording device. After that, we search for the best data augmentation scheme for our approach.
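To make the memory argument for MobConv blocks concrete, the following minimal sketch counts convolution weights. It assumes a MobileNetV2-style inverted bottleneck (1x1 expansion, depthwise k x k convolution, 1x1 projection); the exact block layout of ASCMobConvNet is not given here, and the channel counts and expansion factors are hypothetical values chosen for illustration. Biases and normalization parameters are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a plain k x k convolution (no bias)."""
    return c_in * c_out * k * k


def mobconv_params(c_in, c_out, expansion=2, k=3):
    """Weights of an assumed inverted bottleneck block (no bias, no norm)."""
    c_mid = c_in * expansion
    expand = c_in * c_mid        # 1x1 pointwise expansion
    depthwise = c_mid * k * k    # one k x k filter per expanded channel
    project = c_mid * c_out      # 1x1 pointwise projection
    return expand + depthwise + project


# Hypothetical layer sizes, not taken from the thesis.
plain = standard_conv_params(32, 32)
small = mobconv_params(32, 32, expansion=2)
large = mobconv_params(32, 32, expansion=6)
print(plain, small, large)
```

Whether the block saves parameters depends on the expansion factor: with these illustrative sizes, a small expansion needs fewer weights than the plain convolution, while a large expansion needs more.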

We determine the best type of normalization layer for our network by replacing the default batch normalization layers in the MobConv blocks with each of a selected set of normalization layer types. Furthermore, we apply Wasserstein correction to the convolutional layers in the network to reduce the covariate shift between test-time and training-time activations. We then apply adaptive quantization to our convolutional layers, using a straight-through estimator, to further reduce the memory requirement of our network. In the final step, we increase the size of our network as far as possible without exceeding the given memory requirement and evaluate its performance. All experiments are conducted using the dataset and the requirements of the DCASE 2021 challenge, task 1.A.
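The idea behind training with a straight-through estimator can be sketched on a single weight. This is an illustration only: the uniform, fixed-range quantizer below is an assumption and is not the adaptive quantization scheme used in the thesis, and all names and values are hypothetical. The forward pass uses the quantized weight; the backward pass treats the rounding step as the identity, so the gradient is applied to a latent full-precision weight.

```python
def quantize(w, num_bits=4, w_max=1.0):
    """Round w to a uniform signed grid on [-w_max, w_max] (assumed scheme)."""
    levels = 2 ** (num_bits - 1) - 1          # e.g. 7 positive levels for 4 bits
    step = w_max / levels
    clipped = max(-w_max, min(w_max, w))
    return round(clipped / step) * step


def train_quantized_weight(x, y, num_bits=4, lr=0.1, steps=200):
    """Fit y ~ q(w) * x for one scalar weight with squared error."""
    w = 0.0                                   # latent full-precision weight
    for _ in range(steps):
        q = quantize(w, num_bits)             # forward pass uses quantized weight
        grad_q = 2.0 * (q * x - y) * x        # dLoss/dq
        w -= lr * grad_q                      # STE: use dLoss/dq as dLoss/dw
    return w, quantize(w, num_bits)


w_latent, w_quant = train_quantized_weight(x=1.0, y=0.5)
print(w_latent, w_quant)
```

Only the quantized value would need to be stored after training; the latent weight exists so that small gradient steps can accumulate until the rounded value changes.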

Our results show that it is possible to train accurate low-complexity CNNs for the prediction of acoustic scenes. The final version of our ASCMobConvNet reaches a classification accuracy of 69.33% while requiring only 123.45 kB of memory to store its weights.
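The relation between weight count, bit width, and memory footprint is simple arithmetic, sketched below. The sketch assumes that 1 kB means 1024 bytes and that every weight is stored with the same bit width; neither assumption is confirmed by the text above, and the function names are hypothetical.

```python
def footprint_kb(num_params, bits_per_weight, bytes_per_kb=1024):
    """Memory needed to store num_params weights, in kB."""
    return num_params * bits_per_weight / 8 / bytes_per_kb


def max_params(budget_kb, bits_per_weight, bytes_per_kb=1024):
    """Largest number of weights that fits into budget_kb."""
    return int(budget_kb * bytes_per_kb * 8 // bits_per_weight)


# How many weights would fit into the reported footprint at several bit widths.
for bits in (32, 16, 8, 4):
    print(bits, max_params(123.45, bits))
```

Under these assumptions, halving the bit width doubles the number of weights that fit into a fixed budget, which is why quantization leaves room to enlarge the network.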