Lightweight and Perceptually-Guided Voice Conversion for Electro-Laryngeal Speech
- Published: Mon, Jun 01, 2026
- Tags: rotm

Electro-laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, all of which reduce naturalness and intelligibility. We propose a lightweight adaptation of StreamVC, a state-of-the-art real-time voice conversion model, to this setting: we remove its pitch and energy modules and combine self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data. We pretrained the model on over 500 hours of healthy German speech and developed a custom alignment pipeline, based on Whisper and dynamic time warping (DTW), to handle the large acoustic mismatch between EL and HE recordings. The model was then fine-tuned on aligned EL–HE speech pairs using perceptual and intelligibility-guided losses. A comparison of loss configurations through automatic metrics and a 22-participant listening test identified the best-performing variant and highlighted prosody and intelligibility as the key remaining challenges in EL-to-HE voice conversion.
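To illustrate the DTW step mentioned above, here is a minimal sketch of textbook dynamic time warping between two feature sequences. It is not the alignment pipeline used in the paper: the Whisper-based part is omitted entirely, and the function name `dtw_path`, the Euclidean frame distance, and the toy one-dimensional features are assumptions chosen for illustration.

```python
import math


def dtw_path(a, b):
    """Align two feature sequences with classic dynamic time warping.

    a, b: lists of feature vectors (each a list of floats, same dimension).
    Returns (total_cost, path), where path is a list of (i, j) index pairs
    running from (0, 0) to (len(a) - 1, len(b) - 1).
    Hypothetical helper for illustration, not the paper's implementation.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance (assumed)
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
            )

    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy example: the second sequence holds its middle frame one step longer.
el_frames = [[0.0], [1.0], [2.0]]
he_frames = [[0.0], [1.0], [1.0], [2.0]]
total_cost, alignment = dtw_path(el_frames, he_frames)
```

In this sketch the warping path maps each frame of one recording to one or more frames of the other, which is the kind of frame-level correspondence that fine-tuning on parallel pairs requires. Real EL and HE features are high-dimensional and far less similar than this toy input.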
Objective and subjective evaluations across different loss configurations confirm that the choice of losses matters: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), drastically reduces the character error rate (CER) of EL inputs, raises the naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion (VC) architectures to EL voice rehabilitation, while also identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.
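For readers unfamiliar with the CER metric, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between a reference transcript and a recognizer's hypothesis, divided by the reference length. The function names and the example strings are assumptions for illustration; the paper's actual evaluation setup (recognizer, text normalization) is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (unit-cost
    insertions, deletions, and substitutions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters so far
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical transcripts: one substituted character out of ten.
score = cer("guten tag.", "guten tak.")
```

Because the distance is normalized by the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.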
The paper was accepted at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026, Barcelona, Spain) and appears in the ICASSP 2026 Proceedings published by IEEE.
