Migration to NVIDIA NeMo Sortformer Diarization
Overview
The application has been updated to use NVIDIA NeMo's Sortformer (nvidia/diar_streaming_sortformer_4spk-v2) instead of the previous pyannote diarization model.
Key Changes
1. Model Architecture
- Old: pyannote.audio pipeline (gated model requiring HF token and acceptance)
- New: NVIDIA NeMo Sortformer (open CC-BY-4.0 license, no gating)
2. Features
- Streaming capability: Real-time processing with configurable latency
- Max 4 speakers: Optimized for up to 4 speakers (performance degrades beyond that)
- Better accuracy: State-of-the-art DER (Diarization Error Rate) on benchmark datasets
- No HF token required: Model downloads directly without authentication
3. Technical Improvements
- Arrival-Order Speaker Cache (AOSC): Tracks speakers by arrival time
- Frame-level processing: 80ms frames (0.08 seconds per frame)
- Configurable streaming: Can adjust latency/accuracy trade-off
Installation
Updated Dependencies
pip install -r requirements.txt
Key additions:
- nemo_toolkit[asr] - NVIDIA NeMo framework with ASR components
- Cython and packaging - Required for NeMo installation
System Requirements
# Install system dependencies (Ubuntu/Debian)
apt-get update && apt-get install -y libsndfile1 ffmpeg
Configuration
Environment Variables
Diarization Model
DIARIZATION_MODEL_NAME=nvidia/diar_streaming_sortformer_4spk-v2
Streaming Configuration (80ms frames)
Current preset: High Latency (10-second latency, favors accuracy over responsiveness)
DIAR_CHUNK_SIZE=124 # Processing chunk size (frames)
DIAR_RIGHT_CONTEXT=1 # Future frames after chunk
DIAR_FIFO_SIZE=124 # Previous frames before chunk
DIAR_UPDATE_PERIOD=124 # Speaker cache update period
DIAR_CACHE_SIZE=188 # Total speaker cache size
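A minimal sketch of how these variables might be read at startup, assuming the application uses `os.environ` and falls back to the High Latency preset when a variable is unset. The `StreamingConfig` dataclass and `load_streaming_config` helper are hypothetical names for illustration, not taken from app.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingConfig:
    """Streaming parameters, all expressed in 80 ms frames (hypothetical container)."""
    chunk_size: int = 124
    right_context: int = 1
    fifo_size: int = 124
    update_period: int = 124
    cache_size: int = 188


def load_streaming_config(env=None) -> StreamingConfig:
    """Read DIAR_* variables, using the High Latency defaults when one is unset."""
    env = os.environ if env is None else env
    d = StreamingConfig()
    return StreamingConfig(
        chunk_size=int(env.get("DIAR_CHUNK_SIZE", d.chunk_size)),
        right_context=int(env.get("DIAR_RIGHT_CONTEXT", d.right_context)),
        fifo_size=int(env.get("DIAR_FIFO_SIZE", d.fifo_size)),
        update_period=int(env.get("DIAR_UPDATE_PERIOD", d.update_period)),
        cache_size=int(env.get("DIAR_CACHE_SIZE", d.cache_size)),
    )


if __name__ == "__main__":
    print(load_streaming_config())
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.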
Available Presets
| Preset | Latency | RTF | CHUNK_SIZE | RIGHT_CONTEXT | FIFO_SIZE | UPDATE_PERIOD | CACHE_SIZE |
|---|---|---|---|---|---|---|---|
| Very High Latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
| High Latency (current) | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
| Low Latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
| Ultra Low Latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
RTF = Real-Time Factor (measured on NVIDIA RTX 6000 Ada)
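The latency column appears to follow from the 80 ms frame length: for every preset, latency equals (CHUNK_SIZE + RIGHT_CONTEXT) × 0.08 s. This relationship is inferred from the table values rather than stated explicitly in this guide, so treat the sketch below as a sanity check for custom settings, not as an authoritative formula.

```python
FRAME_SECONDS = 0.08  # one frame is 80 ms

# (CHUNK_SIZE, RIGHT_CONTEXT) per preset, copied from the table above
PRESETS = {
    "Very High Latency": (340, 40),
    "High Latency": (124, 1),
    "Low Latency": (6, 7),
    "Ultra Low Latency": (3, 1),
}


def input_latency_seconds(chunk_size: int, right_context: int) -> float:
    """Estimated latency: frames buffered before output, times the frame length."""
    return round((chunk_size + right_context) * FRAME_SECONDS, 2)


if __name__ == "__main__":
    for name, (chunk, right) in PRESETS.items():
        print(f"{name}: {input_latency_seconds(chunk, right)} s")
```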
API Changes
Input Format
Same as before - audio file path:
diar_model.diarize(audio=audio_path, batch_size=1)
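For context, a sketch of how the model might be loaded and configured before that call. The class name `SortformerEncLabelModel` and the attribute names on `sortformer_modules` (`chunk_len`, `chunk_right_context`, `fifo_len`, `spkcache_update_period`, `spkcache_len`) are assumptions based on the Hugging Face model card and should be verified against your installed NeMo version. The helper names `apply_streaming_config` and `load_model` are hypothetical.

```python
from types import SimpleNamespace


def apply_streaming_config(diar_model, chunk_size, right_context,
                           fifo_size, update_period, cache_size):
    """Copy the DIAR_* values onto the model's streaming module.

    Attribute names are assumed from the model card, not confirmed here.
    """
    modules = diar_model.sortformer_modules
    modules.chunk_len = chunk_size
    modules.chunk_right_context = right_context
    modules.fifo_len = fifo_size
    modules.spkcache_update_period = update_period
    modules.spkcache_len = cache_size
    return diar_model


def load_model(model_name="nvidia/diar_streaming_sortformer_4spk-v2"):
    """Download (on first use) and load the model; requires nemo_toolkit[asr]."""
    # Imported lazily so apply_streaming_config can be used without NeMo installed.
    from nemo.collections.asr.models import SortformerEncLabelModel
    diar_model = SortformerEncLabelModel.from_pretrained(model_name)
    diar_model.eval()
    return diar_model


if __name__ == "__main__":
    # Stand-in object so the helper can be exercised without downloading the model.
    stub = SimpleNamespace(sortformer_modules=SimpleNamespace())
    apply_streaming_config(stub, 124, 1, 124, 124, 188)
    print(vars(stub.sortformer_modules))
```

With NeMo installed, the intended flow would be `apply_streaming_config(load_model(), ...)` followed by the `diarize` call.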
Output Format
NeMo returns: [[start_seconds, end_seconds, speaker_index], ...]
Example:
[
[0.0, 5.2, 0], # SPEAKER_0 from 0s to 5.2s
[5.3, 10.1, 1], # SPEAKER_1 from 5.3s to 10.1s
[10.2, 15.0, 0], # SPEAKER_0 from 10.2s to 15.0s
]
Converted to:
{
"start": 0.0,
"end": 5.2,
"speaker": "SPEAKER_0"
}
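The conversion can be sketched as follows, assuming the output has exactly the shape described above (a flat list of `[start, end, speaker_index]` triples). If your NeMo version returns a different structure, for example one list per input file, adapt the unpacking. The function name `to_segments` is hypothetical.

```python
def to_segments(nemo_output):
    """Convert [start, end, speaker_index] triples into labeled dicts."""
    segments = []
    for start, end, speaker_index in nemo_output:
        segments.append({
            "start": float(start),
            "end": float(end),
            "speaker": f"SPEAKER_{int(speaker_index)}",
        })
    return segments


if __name__ == "__main__":
    example = [
        [0.0, 5.2, 0],
        [5.3, 10.1, 1],
        [10.2, 15.0, 0],
    ]
    for segment in to_segments(example):
        print(segment)
```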
Model Limitations
- Maximum 4 speakers: Performance degrades with 5+ speakers
- English-optimized: Trained primarily on English datasets (but works with other languages)
- Long recordings: Accuracy may degrade on very long recordings (several hours)
- Audio format: Requires single-channel (mono) 16 kHz audio
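Since ffmpeg is already a system dependency, one way to meet the audio format requirement is to convert inputs before diarization. This is a sketch, not necessarily what app.py does; the helper names are hypothetical. The `-ac 1` and `-ar 16000` flags set one channel and a 16 kHz sample rate, and `-y` overwrites an existing output file.

```python
import subprocess


def build_ffmpeg_command(src: str, dst: str) -> list:
    """Return an ffmpeg command that writes a mono 16 kHz copy of src to dst."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]


def convert_to_mono_16k(src: str, dst: str) -> str:
    """Run the conversion; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return dst


if __name__ == "__main__":
    print(" ".join(build_ffmpeg_command("input.mp3", "input_16k.wav")))
```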
Performance Benchmarks
DIHARD III Eval (1-4 speakers)
- DER: 13.24%
CALLHOME (2 speakers)
- DER: 6.57%
CALLHOME (full, 2-6 speakers)
- DER: 10.70%
Troubleshooting
NeMo Installation Issues
# Install prerequisites first
pip install Cython packaging
# Then install NeMo
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
Model Download Issues
The model (~700MB) downloads automatically from Hugging Face on first use. If download fails:
- Check internet connection
- Verify disk space (~1GB free recommended)
- Try manual download from: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2
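The disk-space check and a manual download can also be scripted. The sketch below assumes the `huggingface_hub` package is available (it is not listed among the dependencies above) and uses its `snapshot_download` function; the helper names and the 1 GB threshold are illustrative.

```python
import shutil


def has_free_space(path: str = ".", required_bytes: int = 1_000_000_000) -> bool:
    """Check that the filesystem holding `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


def download_model(repo_id: str = "nvidia/diar_streaming_sortformer_4spk-v2") -> str:
    """Fetch the model files into the local Hugging Face cache and return the path."""
    # Imported lazily so the disk check works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id)


if __name__ == "__main__":
    if has_free_space():
        print("Enough free space for the model download")
    else:
        print("Less than ~1GB free; free up space before downloading")
```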
Memory Issues
If running out of memory:
- Reduce DIAR_CACHE_SIZE (default: 188)
- Use "Low Latency" preset (smaller buffers)
- Process shorter audio segments
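Processing shorter segments means splitting the recording into windows, diarizing each one, and shifting the resulting timestamps back onto the original timeline. A minimal sketch of that bookkeeping follows, with hypothetical helper names. One caveat: each window is diarized independently, so SPEAKER_0 in one window is not guaranteed to be the same person as SPEAKER_0 in another; reconciling labels across windows is outside the scope of this sketch.

```python
def split_windows(total_seconds: float, window_seconds: float):
    """Return consecutive (start, end) windows covering [0, total_seconds]."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows


def shift_segments(segments, offset_seconds: float):
    """Move window-relative segment times onto the full-recording timeline."""
    return [
        {**seg,
         "start": seg["start"] + offset_seconds,
         "end": seg["end"] + offset_seconds}
        for seg in segments
    ]


if __name__ == "__main__":
    for window in split_windows(25.0, 10.0):
        print(window)
```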
Migration Steps for Hugging Face Space
- Update requirements.txt: Already done
- Update app.py: Already done
- Remove HF_TOKEN requirement: No longer needed for diarization
- Restart/Rebuild Space: Click "Restart" or "Rebuild" in Space settings
- First run: Model will download automatically (~700MB, one-time)
References
- NVIDIA NeMo Repository
- Sortformer Paper
- Streaming Sortformer Paper
- Model Card on Hugging Face
- NeMo Documentation
License
NVIDIA Sortformer is licensed under CC-BY-4.0 (Creative Commons Attribution 4.0 International).
- Commercial use allowed
- Modification allowed
- Distribution allowed
- Attribution required