# Migration to NVIDIA NeMo Sortformer Diarization

## Overview

The application now uses **NVIDIA NeMo's Sortformer** (`nvidia/diar_streaming_sortformer_4spk-v2`) instead of the previous pyannote diarization model.

## Key Changes

### 1. Model Architecture

- **Old**: pyannote.audio pipeline (gated model requiring an HF token and acceptance of its user conditions)
- **New**: NVIDIA NeMo Sortformer (open CC-BY-4.0 license, no gating)

### 2. Features

- **Streaming capability**: Real-time processing with configurable latency
- **Max 4 speakers**: Optimized for up to 4 speakers (performance degrades beyond that)
- **Better accuracy**: State-of-the-art DER (Diarization Error Rate) on benchmark datasets
- **No HF token required**: Model downloads directly without authentication

### 3. Technical Improvements

- **Arrival-Order Speaker Cache (AOSC)**: Tracks speakers by arrival time
- **Frame-level processing**: 80ms frames (0.08 seconds per frame)
- **Configurable streaming**: Can adjust latency/accuracy trade-off

## Installation

### Updated Dependencies

```bash
pip install -r requirements.txt
```

Key additions:

- `nemo_toolkit[asr]` - NVIDIA NeMo framework with ASR components
- `Cython` and `packaging` - Required for NeMo installation

### System Requirements

```bash
# Install system dependencies (Ubuntu/Debian)
apt-get update && apt-get install -y libsndfile1 ffmpeg
```

## Configuration

### Environment Variables

#### Diarization Model

```bash
DIARIZATION_MODEL_NAME=nvidia/diar_streaming_sortformer_4spk-v2
```

#### Streaming Configuration (80ms frames)

Current preset: **High Latency** (10 seconds, better accuracy)

```bash
DIAR_CHUNK_SIZE=124      # Processing chunk size (frames)
DIAR_RIGHT_CONTEXT=1     # Future frames after chunk
DIAR_FIFO_SIZE=124       # Previous frames before chunk
DIAR_UPDATE_PERIOD=124   # Speaker cache update period
DIAR_CACHE_SIZE=188      # Total speaker cache size
```

#### Available Presets

| Preset | Latency | RTF | CHUNK_SIZE | RIGHT_CONTEXT | FIFO_SIZE | UPDATE_PERIOD | CACHE_SIZE |
|--------|---------|-----|------------|---------------|-----------|---------------|------------|
| Very High Latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
| **High Latency** (current) | **10.0s** | **0.005** | **124** | **1** | **124** | **124** | **188** |
| Low Latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
| Ultra Low Latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |

*RTF = Real-Time Factor (measured on NVIDIA RTX 6000 Ada)*

The latency values in the table equal (CHUNK_SIZE + RIGHT_CONTEXT) × 0.08 seconds. For the current preset, (124 + 1) × 0.08 = 10.0s.

## API Changes

### Input Format

Same as before: an audio file path.

```python
diar_model.diarize(audio=audio_path, batch_size=1)
```

### Output Format

NeMo returns a list of segments: `[[start_seconds, end_seconds, speaker_index], ...]`

Example:

```python
[
    [0.0, 5.2, 0],    # SPEAKER_0 from 0s to 5.2s
    [5.3, 10.1, 1],   # SPEAKER_1 from 5.3s to 10.1s
    [10.2, 15.0, 0],  # SPEAKER_0 from 10.2s to 15.0s
]
```

Each segment is converted to:

```json
{
  "start": 0.0,
  "end": 5.2,
  "speaker": "SPEAKER_0"
}
```

## Model Limitations

1. **Maximum 4 speakers**: Performance degrades with 5+ speakers
2. **English-optimized**: Trained primarily on English datasets (but works with other languages)
3. **Long recordings**: May degrade on very long recordings (several hours)
4. **Audio format**: Requires single-channel (mono) 16kHz audio

## Performance Benchmarks

### DIHARD III Eval (1-4 speakers)

- DER: 13.24%

### CALLHOME (2 speakers)

- DER: 6.57%

### CALLHOME (full, 2-6 speakers)

- DER: 10.70%

## Troubleshooting

### NeMo Installation Issues

```bash
# Install prerequisites first
pip install Cython packaging

# Then install NeMo (quoted so the shell does not interpret the brackets)
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
```

### Model Download Issues

The model (~700MB) downloads automatically from Hugging Face on first use.

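
Because the first run pulls roughly 700MB, a quick free-space check before starting can catch one common cause of failed downloads. The following is a minimal sketch, not part of the application: the function name `has_enough_disk_space`, the 1GB threshold (taken from the recommendation in this guide), and the default path `"."` are all assumptions for illustration. Point it at whichever directory holds your model cache.

```python
import shutil

# Assumed threshold: the ~1GB of free space this guide recommends.
REQUIRED_FREE_BYTES = 1_000_000_000


def has_enough_disk_space(path: str = ".", required_bytes: int = REQUIRED_FREE_BYTES) -> bool:
    """Return True if the filesystem containing `path` has at least `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    # "." is a placeholder; use the directory where the model cache lives.
    if has_enough_disk_space("."):
        print("Enough free disk space for the model download.")
    else:
        print("Less than ~1GB free; the model download may fail.")
```

The check only reports free space at the moment it runs, so it is a hint rather than a guarantee.
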
If the download fails:

- Check your internet connection
- Verify disk space (~1GB free recommended)
- Try a manual download from: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2

### Memory Issues

If you run out of memory:

- Reduce `DIAR_CACHE_SIZE` (default: 188)
- Use the "Low Latency" preset (smaller processing chunks)
- Process shorter audio segments

## Migration Steps for Hugging Face Space

1. **Update requirements.txt**: Already done ✓
2. **Update app.py**: Already done ✓
3. **Remove HF_TOKEN requirement**: No longer needed for diarization
4. **Restart/Rebuild Space**: Click "Restart" or "Rebuild" in Space settings
5. **First run**: Model will download automatically (~700MB, one-time)

## References

- [NVIDIA NeMo Repository](https://github.com/NVIDIA/NeMo)
- [Sortformer Paper](https://arxiv.org/abs/2409.06656)
- [Streaming Sortformer Paper](https://arxiv.org/abs/2507.18446)
- [Model Card on Hugging Face](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2)
- [NeMo Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/)

## License

NVIDIA Sortformer is licensed under **CC-BY-4.0** (Creative Commons Attribution 4.0 International).

- ✓ Commercial use allowed
- ✓ Modification allowed
- ✓ Distribution allowed
- ✓ Attribution required