wav2scape: An easy-to-use tool for analyzing speech data based on self-supervised representations
- Acronym: ABESPN
- Type: Tool
Overview
wav2scape is a comprehensive tool for analyzing acoustic similarities and distances between speech categories using state-of-the-art self-supervised speech representations. Built on the wav2vec 2.0 framework [1] and utilizing the multilingual XLSR-53 model trained on 56,000 hours of speech data [2], wav2scape enables researchers to explore natural groupings and patterns in speech data across multiple dimensions. Its methodology is directly informed by recent research from our lab [3].
Main Purpose and Applications
The primary purpose of wav2scape is to process audio recordings and generate similarity matrices based on the usage frequencies of shared discrete speech representations. The tool is highly flexible and can be applied to analyze various acoustic dimensions, including:
- Languages and language varieties (e.g., comparing different dialects or regional accents)
- Speaking styles (e.g., read vs. conversational speech)
- Prosodic or prominence features
- Individual speakers or speaker groups
- Other user-defined or extra-linguistic categories
Technical Approach
wav2scape processes speech data through a five-stage pipeline (see the code sketch after this list):
- Feature Extraction: Audio is processed through the pre-trained XLSR-53 model
- Quantization: Continuous representations are converted to discrete codebook indices using the model’s quantization module
- Usage Analysis: The frequency of codebook usage is computed for each category combination (CategoryA_CategoryB), resulting in a 102,400-dimensional normalized vector that represents the probability distribution over quantized speech representations, i.e., P(codebook_index | CategoryA_CategoryB)
- Similarity Computation: The Jensen-Shannon divergence between these distributions is used to measure the similarity of category combinations
- Visualization: PCA is applied for dimensionality reduction and visualization
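To make the pipeline concrete, here is a minimal sketch of its core steps in Python. It is not wav2scape's actual implementation: it assumes the Hugging Face transformers port of XLSR-53 (facebook/wav2vec2-large-xlsr-53), reads discrete indices directly from the model's Gumbel vector quantizer (two groups of 320 codewords, hence 320 × 320 = 102,400 possible index pairs), and uses hypothetical file names. The Jensen-Shannon divergence used below is JSD(P, Q) = ½ KL(P‖M) + ½ KL(Q‖M) with M = ½ (P + Q).

```python
import numpy as np
import soundfile as sf
import torch
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import PCA
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining

MODEL = "facebook/wav2vec2-large-xlsr-53"  # multilingual XLSR-53 checkpoint
fe = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2ForPreTraining.from_pretrained(MODEL).eval()


def codebook_ids(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Map a mono waveform to one discrete codebook id per frame."""
    inputs = fe(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        # The quantizer operates on the CNN features, not the transformer output.
        feats = model.wav2vec2(inputs.input_values).extract_features
        q = model.quantizer  # Gumbel vector quantizer: 2 groups x 320 codewords
        logits = q.weight_proj(feats).view(1, -1, q.num_groups, q.num_vars)
        idx = logits.argmax(dim=-1).squeeze(0)  # (frames, 2): one index per group
    # Fuse the two group indices into a single id in [0, 320 * 320 = 102400).
    return (idx[:, 0] * q.num_vars + idx[:, 1]).numpy()


def usage_distribution(ids: np.ndarray, dim: int = 320 * 320) -> np.ndarray:
    """Normalized usage histogram, i.e. P(codebook_index | CategoryA_CategoryB)."""
    counts = np.bincount(ids, minlength=dim).astype(float)
    return counts / counts.sum()


def distribution_for(paths: list[str]) -> np.ndarray:
    """Pool all frames of one category combination (files assumed 16 kHz mono)."""
    ids = np.concatenate([codebook_ids(sf.read(p)[0]) for p in paths])
    return usage_distribution(ids)


# Hypothetical file lists for two category combinations:
p_dist = distribution_for(["utt001_FR_read.wav", "utt002_FR_read.wav"])
q_dist = distribution_for(["utt001_FR_conv.wav", "utt002_FR_conv.wav"])

# scipy's jensenshannon() returns the square root of the JS divergence.
print("JS distance:", jensenshannon(p_dist, q_dist))

# 2-D map of all category combinations (one distribution per row).
coords = PCA(n_components=2).fit_transform(np.vstack([p_dist, q_dist]))
```

In the tool itself, one such distribution is accumulated per CategoryA_CategoryB combination, and the pairwise divergence values form the similarity matrix that is then reduced and plotted via PCA.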
Key Features
- User-friendly workflow: Simple drag-and-drop interface with automated analysis
- Cross-platform compatibility: Available for Windows, macOS, and Linux
- CPU-based processing: Runs without a GPU for maximum accessibility
- Comprehensive output: Similarity matrices, PCA scatter plots, and detailed statistical logs
- Flexible input: Supports multiple audio formats (.wav, .mp3, .ogg, .flac)
Data Requirements
wav2scape works with mono audio files (16 kHz recommended) containing short speech segments or utterances (typically 1-10 seconds). Files must follow a specific naming convention, *_CategoryA_CategoryB.ext, where CategoryA represents the primary grouping dimension and CategoryB the secondary dimension.
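To make the convention concrete, a small hypothetical helper (not part of wav2scape) that extracts the two category labels from a filename such as utt001_FR_read.wav might look like this:

```python
import os
import re

# Hypothetical helper illustrating the *_CategoryA_CategoryB.ext convention:
# the last two underscore-separated fields before the extension are the labels.
PATTERN = re.compile(r"^.*_(?P<cat_a>[^_]+)_(?P<cat_b>[^_.]+)\.[^.]+$")

def parse_categories(filename: str) -> tuple[str, str]:
    match = PATTERN.match(os.path.basename(filename))
    if match is None:
        raise ValueError(f"{filename!r} does not follow *_CategoryA_CategoryB.ext")
    return match.group("cat_a"), match.group("cat_b")

print(parse_categories("utt001_FR_read.wav"))  # -> ('FR', 'read')
```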
You can find the GitHub repository here: GitHub Link
References
[1] Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
[2] Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. (2021). Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Interspeech 2021 (pp. 2426–2430). doi: 10.21437/Interspeech.2021-329
[3] Linke, J., Kadar, M., Dosinszky, G., Mihajlik, P., Kubin, G., & Schuppler, B. (2023). What do self-supervised speech representations encode? An analysis of languages, varieties, speaking styles and speakers. In Interspeech 2023 (pp. 5371–5375). doi: 10.21437/Interspeech.2023-951