Diplophonic Voice: Definitions, models, and detection

Status

Finished

Date

2014-12-16

Student

Philipp Aichinger

Mentor

Gernot Kubin

Research Areas

Speech Communication

Voice disorders need to be better understood because they may lead to reduced employment prospects and social isolation. Tackling these problems requires correct treatment indication and measurement of treatment effects, both of which must rely on robust outcome measures for clinical intervention studies. Diplophonia is a severe and often misunderstood sign of voice disorders. Depending on the underlying etiology, diplophonic patients typically receive treatment such as logopedic therapy or phonosurgery. In current clinical practice, diplophonia is determined auditorily by the physician, which is problematic from the viewpoints of evidence-based medicine and scientific methodology. The aim of this thesis is to work towards objective (i.e., automatic) detection of diplophonia.

A database of 40 euphonic, 40 diplophonic and 40 dysphonic subjects has been acquired. The collected material consists of laryngeal high-speed videos and high-quality audio recordings. All material has been annotated for data quality, and a non-destructive data pre-selection has been applied.

Diplophonic vocal fold vibration patterns (i.e., glottal diplophonia) are identified, and procedures for their automated detection from laryngeal high-speed videos are proposed. Frequency Image Bimodality is based on frequency analysis of pixel intensity time series. It is obtained fully automatically and yields classification accuracies of 78 % when the euphonic group serves as the negative group and 75 % when the dysphonic group does. Frequency Plot Bimodality is based on frequency analysis of glottal edge trajectories. It processes spatially segmented videos, which are obtained via manual intervention. Frequency Plot Bimodality obtains slightly higher classification accuracies of 82.9 % when the euphonic group serves as the negative group and 77.5 % when the dysphonic group does.

A two-oscillator waveform model for analyzing diplophonic acoustic and glottal area waveforms is proposed and evaluated.
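
As a rough illustration of what frequency analysis of pixel intensity time series can look like, the sketch below finds the dominant frequency of every pixel in a synthetic video. It is not the thesis's Frequency Image Bimodality measure, which is defined in the full text. The function name `dominant_frequency_image`, the 4000 frames per second rate and the test frequencies are all assumptions made for this example.

```python
import numpy as np

def dominant_frequency_image(video, frame_rate):
    """Return the dominant nonzero frequency (Hz) of each pixel's time series.

    video: array of shape (frames, height, width).
    Hypothetical helper for illustration, not taken from the thesis.
    """
    # Remove each pixel's mean so the DC bin does not dominate the spectrum.
    centered = video - video.mean(axis=0, keepdims=True)
    spectrum = np.abs(np.fft.rfft(centered, axis=0))
    freqs = np.fft.rfftfreq(video.shape[0], d=1.0 / frame_rate)
    # Skip bin 0 (DC) when searching for the spectral peak of each pixel.
    peak_bins = spectrum[1:].argmax(axis=0) + 1
    return freqs[peak_bins]

# Synthetic video (assumed values): the left half of the image oscillates
# at 200 Hz, the right half at 300 Hz.
frame_rate = 4000.0
t = np.arange(400) / frame_rate
video = np.zeros((400, 4, 8))
video[:, :, :4] = np.sin(2 * np.pi * 200.0 * t)[:, None, None]
video[:, :, 4:] = np.sin(2 * np.pi * 300.0 * t)[:, None, None]

freq_image = dominant_frequency_image(video, frame_rate)
distinct = np.unique(freq_image)  # frequencies present across the image
```

In this toy case the frequency image contains two distinct values, one per image half. A single-frequency vibration pattern would yield only one.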
The two-oscillator waveform model is used to build a detection algorithm for secondary oscillators in the waveform and to define the physiologically interpretable “Diplophonia Diagram”. The Diplophonia Diagram yields a classification accuracy of 87.2 % when distinguishing diplophonia from severely dysphonic voices. In contrast, conventional hoarseness features perform poorly on this task. Latent class analysis is used to evaluate the ground truth from a probabilistic point of view. The expert annotations achieve very high sensitivity (96.5 %) and perfect specificity (100 %).

The Diplophonia Diagram is the best available automatic method for detecting diplophonic phonation intervals from speech. It is based on model structure optimization, audio waveform modeling and analysis-by-synthesis, which enable a more suitable description of diplophonic signals than conventional hoarseness features. Analysis-by-synthesis and waveform modeling have previously been carried out in voice research, but the systematic investigation of model structure optimization with respect to perceived voice quality is novel. For diplophonia, the switch between one and two oscillators is crucial. The optimal model structure is a qualitative outcome that may be interpreted physiologically, and one may conjecture that model structure optimization is also useful for describing voice phenomena other than diplophonia. The resulting descriptors might be more readily accepted by clinicians than the conventional ones.

Useful definitions of diplophonia focus on the levels of perception, acoustics and glottal vibration. It is suggested that clinical voice assessment avoid relying solely on the perceptual definition. The glottal vibration level connects with distal causes, which is of high clinical interest but difficult to assess. The definition at the acoustic level via two-oscillator waveform models is favored and used for in vivo testing.
Finally, it is suggested that the definitions and terminology of voice phenomena be updated with respect to the different levels of description.
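
The choice between one and two oscillators described in the abstract can be pictured with a small least-squares experiment. The sketch below fits sums of harmonic oscillators to a synthetic signal and compares the residual error of a one-oscillator and a two-oscillator structure. It is a minimal sketch under strong assumptions and not the thesis's algorithm: the fundamental frequencies are taken as known, and the helper names, sample rate, signal amplitudes and frequencies are invented for the example.

```python
import numpy as np

def harmonic_basis(f0, n_harmonics, t):
    """Cosine and sine columns for the first n_harmonics of fundamental f0."""
    k = np.arange(1, n_harmonics + 1)
    phase = 2 * np.pi * f0 * np.outer(t, k)
    return np.hstack([np.cos(phase), np.sin(phase)])

def relative_error(signal, fundamentals, n_harmonics, t):
    """Fit a sum of harmonic oscillators by least squares.

    Returns residual energy divided by signal energy.
    Hypothetical helper for illustration, not taken from the thesis.
    """
    basis = np.hstack([harmonic_basis(f0, n_harmonics, t) for f0 in fundamentals])
    coeffs, *_ = np.linalg.lstsq(basis, signal, rcond=None)
    residual = signal - basis @ coeffs
    return float(np.sum(residual**2) / np.sum(signal**2))

# Synthetic "diplophonic" signal (assumed values): a primary oscillator at
# 150 Hz with one overtone, plus a weaker secondary oscillator at 190 Hz.
sample_rate = 8000.0
t = np.arange(1600) / sample_rate
signal = (np.sin(2 * np.pi * 150.0 * t)
          + 0.5 * np.sin(2 * np.pi * 300.0 * t)
          + 0.6 * np.sin(2 * np.pi * 190.0 * t + 0.3))

err_one = relative_error(signal, [150.0], 3, t)         # one-oscillator structure
err_two = relative_error(signal, [150.0, 190.0], 3, t)  # two-oscillator structure
```

Here the one-oscillator fit leaves the secondary oscillator's energy in the residual, while the two-oscillator fit explains the signal almost completely. On real recordings the fundamentals would have to be estimated, and noise would make the comparison less clear-cut.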

The full text of this thesis can be found here.