Signal Processing and Speech Communication Laboratory

Voice conversion for the processing of pathological speech (FWF Stand-Alone Project PAT5948223)

Period
2025 — 2029
Funding
Fonds zur Förderung der wissenschaftlichen Forschung, FWF (Österreich)
Partners
  • Philipp Aichinger (Department of Otorhinolaryngology, Medical University of Vienna)
  • Tomoki Toda (Nagoya University, Japan)

Impaired speech production poses significant communication hurdles, often impacting career prospects and quality of life. This project envisions (i) predicting the effects of clinical treatment of impaired speech, and (ii) better-sounding substitution speech, i.e., electrolarynx (EL) speech, for individuals with otherwise unfavorable prospects.

Objectives:

Objective 1 is to predict post-treatment speech audio recordings, i.e., readings of a German standard text. The input data are paired text-parallel pre-treatment speech recordings and electronic patient records (EPRs), i.e., unstructured textual clinical reports containing information about the speech impairment and the treatment plan. Objective 2 is to improve the speech of laryngectomees in terms of naturalness and emotional expressiveness, i.e., authentic-sounding prosody, at low processing latency.

Methods:

The overarching methodological approach is voice conversion (VC), which refers to audio signal processing techniques that aim to change characteristics of a recorded speaker's voice, e.g., speaker identity, emotion, or accent.

For Objective 1, various VC setups are employed to map pre-treatment to post-treatment speech, including CNNs, a U-Net, and encoder-mapping-decoder VC. Predicted post-treatment speaker embeddings are also used to control a multi-speaker text-to-speech synthesizer for VC. In addition, EPRs are used to condition the VC. Data from approximately 25,000 patients are obtained from the Vienna General Hospital, Austria.
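To illustrate the encoder-mapping-decoder idea in the abstract, the sketch below shows the data flow only: mel-spectrogram frames are encoded into a latent space, the latents are mapped from the pre-treatment to the post-treatment domain, and the result is decoded back to frames. This is a hypothetical minimal example with randomly initialized linear layers; the project's actual networks, dimensions, and training are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 80-band mel-spectrogram frames, 16-dimensional latent space.
N_MEL, N_LATENT = 80, 16

# Randomly initialized linear layers stand in for trained networks.
W_enc = rng.standard_normal((N_LATENT, N_MEL)) * 0.1     # encoder
W_map = rng.standard_normal((N_LATENT, N_LATENT)) * 0.1  # pre -> post mapping
W_dec = rng.standard_normal((N_MEL, N_LATENT)) * 0.1     # decoder

def convert(frames: np.ndarray) -> np.ndarray:
    """Encoder-mapping-decoder VC sketch; frames has shape (T, N_MEL)."""
    z = np.tanh(frames @ W_enc.T)   # encode each frame to a latent vector
    z_post = np.tanh(z @ W_map.T)   # map pre-treatment latents toward post-treatment
    return z_post @ W_dec.T         # decode back to mel frames

pre = rng.standard_normal((100, N_MEL))  # 100 frames of "pre-treatment" speech
post = convert(pre)
print(post.shape)  # (100, 80)
```

In a real system, the encoder, mapping, and decoder would be deep networks trained on the paired pre-/post-treatment recordings, and a vocoder would turn the predicted frames back into audio.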

For Objective 2, first, an EL emulator is trained to convert speech recordings into aligned EL speech audio. Second, an emotional voice conversion (EVC) network is trained to approximate emotional healthy speech from EL speech. Transcribed conversational speech is used as training data; the emotion labels are manually corrected outputs of text2emotion. Third, automatic emotion control of the EVC is attempted by combining automatic speech recognition with a language model to infer the intended emotion from the input EL speech. Finally, the EVC is trained in a closed-loop setting using variable-pitch EL speech as input. The data used are the transcribed Austrian German Parallel Electro-Larynx – Healthy Speech Corpus (ELHE) and the Graz Corpus of Read and Spontaneous Speech (GRASS).
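The four stages above can be composed as a pipeline: emulator, emotion prediction, and EVC. The toy sketch below only shows that composition; every stage function is a hypothetical placeholder (in the project these would be trained networks and an ASR/language-model front end, none of which is implemented here).

```python
import numpy as np

def el_emulator(healthy: np.ndarray) -> np.ndarray:
    """Stage 1 placeholder: emulate monotone EL speech from healthy speech
    (here simply removing the mean stands in for flattening prosody)."""
    return healthy - healthy.mean()

def predict_emotion(el_speech: np.ndarray) -> str:
    """Stage 3 placeholder: ASR plus a language model would infer the
    intended emotion; a toy energy-based rule stands in here."""
    return "happy" if el_speech.std() > 1.0 else "neutral"

def evc(el_speech: np.ndarray, emotion: str) -> np.ndarray:
    """Stage 2 placeholder: an EVC network conditioned on the emotion label;
    a simple gain stands in for the conversion."""
    gain = {"neutral": 1.0, "happy": 1.2, "sad": 0.8}[emotion]
    return gain * el_speech

def closed_loop(healthy: np.ndarray) -> np.ndarray:
    """Stage 4: full pipeline, healthy speech -> emulated EL -> emotional output."""
    el = el_emulator(healthy)
    return evc(el, predict_emotion(el))
```

Training the EVC against the emulator's output in such a loop lets the conversion network see EL-like input without requiring a laryngectomee to record every training utterance.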

Level of originality:

No prior attempts at predicting post-treatment speech recordings, or at improving EL speech using EVC, have been reported in the scientific literature. However, now is the right time to break ground: VC technology has matured enough to pursue the proposed ambitious objectives with excellent prospects.

Primary researchers involved:

Philipp Aichinger leads the project in Vienna, collaborating with a postdoc and a PhD student dedicated to treatment effect prediction. Martin Hagmüller leads the project in Graz and supervises a PhD student working on EVC for EL speech. Members of the advisory board include Tomoki Toda (Nagoya University, Japan).