Signal Processing and Speech Communication Laboratory

Acoustic COVID-19 Detection Using Multiple Instance Learning

Status
Finished
Type
Master Thesis
Announcement date
28 Sep 2023
Student
Michael Reiter
Mentors
Research Areas
During the global COVID-19 pandemic, a key factor in combating its advance was a rigorous testing scheme that slowed the spread of the disease and informed decisions on further measures such as self-isolation or country-wide lockdowns. However, tests are expensive, take time, and compliance with social distancing guidelines during testing cannot be guaranteed. Furthermore, in many regions test availability was very limited. A machine learning-based diagnostic tool that reliably predicts infections from audio recordings could enable widespread, low-cost testing without causing significant delays in diagnosis or violating social distancing guidelines. To achieve comparability between such algorithms, the DiCOVA challenge was created, based on a crowdsourced dataset called Coswara. Three types of sound recordings, namely cough, speech and breath, as well as a fusion of these categories, were provided. This thesis attempts to surpass the challenge-winning results using the same blind test set and categories.

Recording durations vary greatly between modalities and participants, ranging from one second to over a minute. To get the most out of the entire dataset, a multiple instance learning (MIL) approach is used. For this purpose, a base model is first pre-trained on random, short time intervals of the audio recordings. Subsequently, a MIL model is added and fine-tuned to make collective predictions for any number of time segments within an audio recording. A ResNet-based convolutional neural network is selected and enhanced with additional normalization and dropout layers. A Mel-spectrogram serves as the base feature set for each recording, with optimal frequency and time resolutions selected separately for each type of sound. Various augmentation and sampling techniques are applied to address the rather small size and the bias of the dataset. An extensive hyperparameter search is performed, evaluating the impact of input normalization, learning rate schedulers, loss modifications, outlier exclusion and more. To compete in the fusion category of the DiCOVA challenge, we use a linear regression approach to combine the predictions of the most successful models for each sound modality.

The MIL approach significantly improves generalizability, decreasing the gap between validation and test performance while improving both. The resulting models compete in the upper regions of the challenge leaderboard. In the fusion category, our best model secures second place with a score of 88.1%, trailing the first-placed team by only 0.3%. The DiCOVA challenge did not utilize the entire dataset that Coswara provides. By incorporating this unused data, including the sound modality 'sustained vowel phonation' and metadata from a questionnaire, we were able to significantly improve on our previous results. On the blind test set, averaged over 5-fold cross-validation, we achieve AUC ROC values ranging from 87.1% to 89.8% for the individual modalities and 92.2% on the fusion track; our best model even reaches an AUC ROC score of 93.1% on the fusion track.

Finally, an analysis of the results reveals disparities and potential challenges in both the dataset and the evaluation, including biases across different parts of the dataset depending on gender, the duration of the audio recordings, and how the health status of participants was assessed.
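As a rough illustration of the feature extraction step described above, the following Python sketch computes log-Mel spectrograms with modality-specific settings using librosa. The window, hop and Mel-band values shown are placeholders, not the resolutions actually tuned in the thesis.

```python
# Minimal sketch of per-modality Mel-spectrogram extraction, assuming librosa.
# The STFT/Mel settings below are illustrative; the thesis tunes frequency and
# time resolution separately per sound type, and those values are not given here.
import librosa
import numpy as np

MODALITY_PARAMS = {
    "cough":  {"n_fft": 1024, "hop_length": 256,  "n_mels": 64},
    "speech": {"n_fft": 2048, "hop_length": 512,  "n_mels": 80},
    "breath": {"n_fft": 4096, "hop_length": 1024, "n_mels": 64},
}

def mel_spectrogram(path: str, modality: str, sr: int = 16000) -> np.ndarray:
    """Load a recording and return a log-Mel spectrogram (n_mels x frames)."""
    params = MODALITY_PARAMS[modality]
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=params["n_fft"],
        hop_length=params["hop_length"],
        n_mels=params["n_mels"],
    )
    return librosa.power_to_db(mel, ref=np.max)
```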
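The MIL fine-tuning stage could look roughly like the following PyTorch sketch, in which a pre-trained segment encoder (the "base model") is combined with a pooling head that aggregates an arbitrary number of time segments into a single per-recording prediction. The attention-based pooling, embedding size and layer names are illustrative assumptions; the abstract does not specify the aggregation mechanism used.

```python
# Minimal MIL sketch in PyTorch: one recording forms a "bag" of spectrogram
# segments, the pre-trained encoder embeds each segment, and an attention
# pooling head (an assumption here) collapses them into one prediction.
import torch
import torch.nn as nn

class MILClassifier(nn.Module):
    def __init__(self, segment_encoder: nn.Module, embed_dim: int = 512):
        super().__init__()
        self.encoder = segment_encoder          # e.g. a ResNet pre-trained on short intervals
        self.attention = nn.Sequential(          # per-instance attention weights
            nn.Linear(embed_dim, 128), nn.Tanh(), nn.Linear(128, 1)
        )
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (num_segments, channels, mels, frames) for one recording
        embeddings = self.encoder(segments)                          # (num_segments, embed_dim)
        weights = torch.softmax(self.attention(embeddings), dim=0)   # (num_segments, 1)
        bag_embedding = (weights * embeddings).sum(dim=0)            # pool over instances
        return self.classifier(bag_embedding)                        # one COVID-19 logit per recording
```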
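Score-level fusion via linear regression, as mentioned for the fusion track, might be sketched as follows with scikit-learn. The stand-in data, array shapes and variable names are hypothetical and only illustrate fitting a linear combiner on per-modality scores and evaluating it with AUC ROC.

```python
# Sketch of score-level fusion, assuming scikit-learn and pre-computed
# per-modality scores; the random stand-in data below is purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score

def fit_fusion(scores: np.ndarray, labels: np.ndarray) -> LinearRegression:
    """Fit a linear fusion model on per-modality scores (n_samples x n_modalities)."""
    return LinearRegression().fit(scores, labels)

rng = np.random.default_rng(0)
train_scores = rng.random((100, 3))        # columns: cough, speech, breath scores
train_labels = rng.integers(0, 2, 100)     # 0 = negative, 1 = COVID-19 positive
fusion = fit_fusion(train_scores, train_labels)

val_scores = rng.random((40, 3))
val_labels = rng.integers(0, 2, 40)
print("fusion AUC ROC:", roc_auc_score(val_labels, fusion.predict(val_scores)))
```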