# Migration to NVIDIA NeMo Sortformer Diarization

## Overview
The application has been updated to use **NVIDIA NeMo's Sortformer** (`nvidia/diar_streaming_sortformer_4spk-v2`) instead of the previous pyannote diarization model.

## Key Changes

### 1. Model Architecture
- **Old**: pyannote.audio pipeline (gated model that requires an HF token and acceptance of the model's terms)
- **New**: NVIDIA NeMo Sortformer (open CC-BY-4.0 license, no gating)

### 2. Features
- **Streaming capability**: Real-time processing with configurable latency
- **Max 4 speakers**: Optimized for up to 4 speakers (accuracy degrades beyond that)
- **Better accuracy**: State-of-the-art DER (Diarization Error Rate) on benchmark datasets (see Performance Benchmarks below)
- **No HF token required**: Model downloads directly without authentication

### 3. Technical Improvements
- **Arrival-Order Speaker Cache (AOSC)**: Tracks speakers by arrival time
- **Frame-level processing**: 80ms frames (0.08 seconds per frame)
- **Configurable streaming**: Can adjust latency/accuracy trade-off

## Installation

### Updated Dependencies
```bash
pip install -r requirements.txt
```

Key additions:
- `nemo_toolkit[asr]` - NVIDIA NeMo framework with ASR components
- `Cython` and `packaging` - Required for NeMo installation

### System Requirements
```bash
# Install system dependencies (Ubuntu/Debian)
apt-get update && apt-get install -y libsndfile1 ffmpeg
```

## Configuration

### Environment Variables

#### Diarization Model
```bash
DIARIZATION_MODEL_NAME=nvidia/diar_streaming_sortformer_4spk-v2
```

#### Streaming Configuration (80ms frames)
Current preset: **High Latency** (10.0 s latency, favoring accuracy over responsiveness)

```bash
DIAR_CHUNK_SIZE=124        # Processing chunk size (frames)
DIAR_RIGHT_CONTEXT=1       # Future frames after chunk
DIAR_FIFO_SIZE=124         # Previous frames before chunk
DIAR_UPDATE_PERIOD=124     # Speaker cache update period
DIAR_CACHE_SIZE=188        # Total speaker cache size
```
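
As a rough illustration of how these variables could be consumed, the sketch below reads them from the environment and falls back to the High Latency values listed above. `StreamingConfig` and `load_streaming_config` are hypothetical helper names used only for this example; they are not part of NeMo, and the actual `app.py` may read the variables differently.

```python
import os
from dataclasses import dataclass

# Defaults mirror the "High Latency" preset documented above.
_DEFAULTS = {
    "DIAR_CHUNK_SIZE": 124,
    "DIAR_RIGHT_CONTEXT": 1,
    "DIAR_FIFO_SIZE": 124,
    "DIAR_UPDATE_PERIOD": 124,
    "DIAR_CACHE_SIZE": 188,
}


@dataclass(frozen=True)
class StreamingConfig:  # hypothetical helper, not a NeMo class
    chunk_size: int
    right_context: int
    fifo_size: int
    update_period: int
    cache_size: int


def load_streaming_config(env=None) -> StreamingConfig:
    """Read the DIAR_* variables (all in 80 ms frames) with documented defaults."""
    env = os.environ if env is None else env

    def read(name: str) -> int:
        value = int(env.get(name, _DEFAULTS[name]))
        if value < 0:
            raise ValueError(f"{name} must not be negative, got {value}")
        return value

    return StreamingConfig(
        chunk_size=read("DIAR_CHUNK_SIZE"),
        right_context=read("DIAR_RIGHT_CONTEXT"),
        fifo_size=read("DIAR_FIFO_SIZE"),
        update_period=read("DIAR_UPDATE_PERIOD"),
        cache_size=read("DIAR_CACHE_SIZE"),
    )


if __name__ == "__main__":
    print(load_streaming_config())
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.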

#### Available Presets

| Preset | Latency | RTF | CHUNK_SIZE | RIGHT_CONTEXT | FIFO_SIZE | UPDATE_PERIOD | CACHE_SIZE |
|--------|---------|-----|------------|---------------|-----------|---------------|------------|
| Very High Latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
| **High Latency** (current) | **10.0s** | **0.005** | **124** | **1** | **124** | **124** | **188** |
| Low Latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
| Ultra Low Latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |

*RTF = Real-Time Factor (measured on NVIDIA RTX 6000 Ada)*
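
The Latency column is consistent with `(CHUNK_SIZE + RIGHT_CONTEXT) × 0.08 s`, given the 80 ms frame length. That relationship is inferred from the table values rather than stated elsewhere in this document, so treat the sketch below as a sanity check for custom settings, not as an official formula. The preset keys are illustrative names.

```python
FRAME_SECONDS = 0.08  # 80 ms per frame, as documented above

# Chunk size and right context copied from the presets table.
PRESETS = {
    "very_high_latency": {"chunk_size": 340, "right_context": 40},
    "high_latency": {"chunk_size": 124, "right_context": 1},
    "low_latency": {"chunk_size": 6, "right_context": 7},
    "ultra_low_latency": {"chunk_size": 3, "right_context": 1},
}


def latency_seconds(chunk_size: int, right_context: int) -> float:
    """Approximate latency: frames the model waits for, times the frame length."""
    return round((chunk_size + right_context) * FRAME_SECONDS, 2)


if __name__ == "__main__":
    for name, params in PRESETS.items():
        print(f"{name}: {latency_seconds(**params):.2f} s")
```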

## API Changes

### Input Format
Same as before - audio file path:
```python
diar_model.diarize(audio=audio_path, batch_size=1)
```

### Output Format
NeMo returns: `[[start_seconds, end_seconds, speaker_index], ...]`

Example:
```python
[
    [0.0, 5.2, 0],      # SPEAKER_0 from 0s to 5.2s
    [5.3, 10.1, 1],     # SPEAKER_1 from 5.3s to 10.1s
    [10.2, 15.0, 0],    # SPEAKER_0 from 10.2s to 15.0s
]
```

Converted to:
```json
{
    "start": 0.0,
    "end": 5.2,
    "speaker": "SPEAKER_0"
}
```
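
The conversion above can be sketched as follows. This assumes the raw result has exactly the `[start_seconds, end_seconds, speaker_index]` shape shown in this section; `to_segments` is a hypothetical helper name, and the real conversion in `app.py` may differ.

```python
from typing import Dict, Iterable, List, Sequence, Union

Number = Union[int, float]


def to_segments(raw: Iterable[Sequence[Number]]) -> List[Dict[str, object]]:
    """Map [start, end, speaker_index] triples to labeled segment dicts."""
    segments = []
    for start, end, speaker_index in raw:
        segments.append(
            {
                "start": float(start),
                "end": float(end),
                "speaker": f"SPEAKER_{int(speaker_index)}",
            }
        )
    return segments


if __name__ == "__main__":
    example = [[0.0, 5.2, 0], [5.3, 10.1, 1], [10.2, 15.0, 0]]
    for segment in to_segments(example):
        print(segment)
```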

## Model Limitations

1. **Maximum 4 speakers**: Performance degrades with 5+ speakers
2. **English-optimized**: Trained primarily on English datasets; it runs on other languages, but accuracy there may be lower
3. **Long recordings**: Accuracy may degrade on very long recordings (several hours)
4. **Audio format**: Requires single-channel (mono) 16kHz audio
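
For the audio format requirement, a minimal sketch of a pre-flight check is shown below. It uses Python's standard `wave` module, so it only handles PCM WAV files, and it builds an `ffmpeg` command (`-ac 1` for mono, `-ar 16000` for 16 kHz) for anything that needs converting. Both function names are hypothetical helpers for this example.

```python
import wave
from typing import List


def is_mono_16k(wav_path: str) -> bool:
    """Return True if a PCM WAV file is single-channel and sampled at 16 kHz."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000


def ffmpeg_convert_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that downmixes to mono and resamples to 16 kHz."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]


if __name__ == "__main__":
    # Print the command only; run it with subprocess.run(cmd, check=True) if needed.
    print(" ".join(ffmpeg_convert_command("input.mp3", "input_16k_mono.wav")))
```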

## Performance Benchmarks

### DIHARD III Eval (1-4 speakers)
- DER: 13.24%

### CALLHOME (2 speakers)
- DER: 6.57%

### CALLHOME (full, 2-6 speakers)
- DER: 10.70%
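
For readers unfamiliar with the metric, DER is conventionally the sum of false-alarm, missed-speech, and speaker-confusion durations divided by the total reference speech duration. The sketch below shows that arithmetic only; the input values are made up for illustration, and it does not reproduce the scoring setup (collar, overlap handling) behind the figures above.

```python
def diarization_error_rate(
    false_alarm: float, missed: float, confusion: float, total_speech: float
) -> float:
    """DER = (false alarm + missed speech + speaker confusion) / reference speech."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


if __name__ == "__main__":
    # Hypothetical durations in seconds, not taken from any benchmark.
    der = diarization_error_rate(
        false_alarm=1.0, missed=2.0, confusion=3.0, total_speech=60.0
    )
    print(f"DER: {der:.2%}")
```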

## Troubleshooting

### NeMo Installation Issues
```bash
# Install prerequisites first
pip install Cython packaging

# Then install NeMo (quote the URL so the shell does not try to expand the [asr] brackets)
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
```

### Model Download Issues
The model (~700MB) downloads automatically from Hugging Face on first use. If download fails:
- Check internet connection
- Verify disk space (~1GB free recommended)
- Try manual download from: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2

### Memory Issues
If running out of memory:
- Reduce `DIAR_CACHE_SIZE` (default: 188)
- Use the "Low Latency" preset (smaller processing chunks)
- Process shorter audio segments
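
One way to process shorter segments is to split a long recording into fixed-length windows and diarize each window separately. The sketch below only computes the window boundaries; `split_into_windows` is a hypothetical helper, and cutting the audio itself is left to whatever tool the application already uses. Note that speaker indices from independently processed windows are not guaranteed to refer to the same person, so labels may need to be reconciled afterwards.

```python
from typing import List, Tuple


def split_into_windows(
    total_seconds: float, window_seconds: float
) -> List[Tuple[float, float]]:
    """Return consecutive (start, end) windows that cover [0, total_seconds]."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows


if __name__ == "__main__":
    # A 25-minute recording split into 10-minute windows.
    for start, end in split_into_windows(25 * 60.0, 10 * 60.0):
        print(f"{start:.0f}s to {end:.0f}s")
```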

## Migration Steps for Hugging Face Space

1. **Update requirements.txt**: Already done ✓
2. **Update app.py**: Already done ✓
3. **Remove HF_TOKEN requirement**: No longer needed for diarization
4. **Restart/Rebuild Space**: Click "Restart" or "Rebuild" in Space settings
5. **First run**: Model will download automatically (~700MB, one-time)

## References

- [NVIDIA NeMo Repository](https://github.com/NVIDIA/NeMo)
- [Sortformer Paper](https://arxiv.org/abs/2409.06656)
- [Streaming Sortformer Paper](https://arxiv.org/abs/2507.18446)
- [Model Card on Hugging Face](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2)
- [NeMo Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/)

## License

NVIDIA Sortformer is licensed under **CC-BY-4.0** (Creative Commons Attribution 4.0 International).
- ✓ Commercial use allowed
- ✓ Modification allowed
- ✓ Distribution allowed
- ✓ Attribution required