wav2scape: An easy-to-use tool for analyzing speech data based on self-supervised representations
- Acronym: ABESPN
- Type: Tool
Overview
wav2scape is a comprehensive tool for analyzing acoustic similarities and distances between speech categories using state-of-the-art self-supervised speech representations. Built on the wav2vec 2.0 framework [1] and utilizing the multilingual XLSR-53 model trained on 56,000 hours of speech data [2], wav2scape enables researchers to explore natural groupings and patterns in speech data across multiple dimensions. Its methodology is directly informed by recent research from our lab [3].
Main Purpose and Applications
The primary purpose of wav2scape is to process audio recordings and generate similarity matrices based on the usage frequencies of shared discrete speech representations. The tool is highly flexible and can be applied to analyze various acoustic dimensions, including:
- Languages and language varieties (e.g., comparing different dialects or regional accents)
- Speaking styles (e.g., read vs. conversational speech)
- Prosodic or prominence features
- Individual speakers or speaker groups
- Other user-defined or extra-linguistic categories
Technical Approach
wav2scape processes speech data through a five-stage pipeline (see the code sketch after this list):
- Feature Extraction: Audio is processed through the pre-trained XLSR-53 model
- Quantization: Continuous representations are converted to discrete codebook indices using the model’s quantization module
- Usage Analysis: The frequency of codebook usage is computed for each category combination (CategoryA_CategoryB), resulting in a 102,400-dimensional normalized vector that represents the probability distribution over quantized speech representations, i.e., P(codebook_index | CategoryA_CategoryB)
- Similarity Computation: The Jensen-Shannon divergence between these distributions is used to measure the similarity of category combinations
- Visualization: PCA is applied for dimensionality reduction and visualization
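To make the pipeline concrete, here is a minimal sketch of its core steps in Python. It is not wav2scape's actual implementation: it assumes the Hugging Face transformers port of XLSR-53 (facebook/wav2vec2-large-xlsr-53), reads discrete indices directly from the model's Gumbel vector quantizer (two groups of 320 codewords, hence 320 × 320 = 102,400 possible index pairs), and uses hypothetical file names. The Jensen-Shannon divergence used below is JSD(P, Q) = ½ KL(P‖M) + ½ KL(Q‖M) with M = ½ (P + Q).

```python
import numpy as np
import soundfile as sf
import torch
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import PCA
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining

MODEL = "facebook/wav2vec2-large-xlsr-53"  # multilingual XLSR-53 checkpoint
fe = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2ForPreTraining.from_pretrained(MODEL).eval()


def codebook_ids(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Map a mono waveform to one discrete codebook id per frame."""
    inputs = fe(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        # The quantizer operates on the CNN features, not the transformer output.
        feats = model.wav2vec2(inputs.input_values).extract_features
        q = model.quantizer  # Gumbel vector quantizer: 2 groups x 320 codewords
        logits = q.weight_proj(feats).view(1, -1, q.num_groups, q.num_vars)
        idx = logits.argmax(dim=-1).squeeze(0)  # (frames, 2): one index per group
    # Fuse the two group indices into a single id in [0, 320 * 320 = 102400).
    return (idx[:, 0] * q.num_vars + idx[:, 1]).numpy()


def usage_distribution(ids: np.ndarray, dim: int = 320 * 320) -> np.ndarray:
    """Normalized usage histogram, i.e. P(codebook_index | CategoryA_CategoryB)."""
    counts = np.bincount(ids, minlength=dim).astype(float)
    return counts / counts.sum()


def distribution_for(paths: list[str]) -> np.ndarray:
    """Pool all frames of one category combination (files assumed 16 kHz mono)."""
    ids = np.concatenate([codebook_ids(sf.read(p)[0]) for p in paths])
    return usage_distribution(ids)


# Hypothetical file lists for two category combinations:
p_dist = distribution_for(["utt001_FR_read.wav", "utt002_FR_read.wav"])
q_dist = distribution_for(["utt001_FR_conv.wav", "utt002_FR_conv.wav"])

# scipy's jensenshannon() returns the square root of the JS divergence.
print("JS distance:", jensenshannon(p_dist, q_dist))

# 2-D map of all category combinations (one distribution per row).
coords = PCA(n_components=2).fit_transform(np.vstack([p_dist, q_dist]))
```

In the tool itself, one such distribution is accumulated per CategoryA_CategoryB combination, and the pairwise divergence values form the similarity matrix that is then reduced and plotted via PCA.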
Key Features
- User-friendly workflow: Simple drag-and-drop interface with automated analysis
- Cross-platform compatibility: Available for Windows, macOS, and Linux
- CPU-based processing: Runs without a GPU for maximum accessibility
- Comprehensive output: Similarity matrices, PCA scatter plots, and detailed statistical logs
- Flexible input: Supports multiple audio formats (.wav, .mp3, .ogg, .flac)
Data Requirements
wav2scape works with mono audio files (16 kHz recommended) containing short speech segments or utterances (typically 1-10 seconds). Files must follow a specific naming convention, *_CategoryA_CategoryB.ext, where CategoryA represents the primary grouping dimension and CategoryB the secondary dimension.
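To make the convention concrete, a small hypothetical helper (not part of wav2scape) that extracts the two category labels from a filename such as utt001_FR_read.wav might look like this:

```python
import os
import re

# Hypothetical helper illustrating the *_CategoryA_CategoryB.ext convention:
# the last two underscore-separated fields before the extension are the labels.
PATTERN = re.compile(r"^.*_(?P<cat_a>[^_]+)_(?P<cat_b>[^_.]+)\.[^.]+$")

def parse_categories(filename: str) -> tuple[str, str]:
    match = PATTERN.match(os.path.basename(filename))
    if match is None:
        raise ValueError(f"{filename!r} does not follow *_CategoryA_CategoryB.ext")
    return match.group("cat_a"), match.group("cat_b")

print(parse_categories("utt001_FR_read.wav"))  # -> ('FR', 'read')
```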
You can find the GitHub repository here: GitHub Link
References
[1] Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
[2] Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. (2021). Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Interspeech 2021 (pp. 2426–2430). doi: 10.21437/Interspeech.2021-329
[3] Linke, J., Kadar, M., Dosinszky, G., Mihajlik, P., Kubin, G., & Schuppler, B. (2023). What do self-supervised speech representations encode? An analysis of languages, varieties, speaking styles and speakers. In Interspeech 2023 (pp. 5371–5375). doi: 10.21437/Interspeech.2023-951