Mahmoud Elsamadony

Migration to NVIDIA NeMo Sortformer Diarization

Overview

The application has been updated to use NVIDIA NeMo's Sortformer (nvidia/diar_streaming_sortformer_4spk-v2) instead of the previous pyannote diarization model.

Key Changes

1. Model Architecture

Old: pyannote.audio pipeline (gated model that requires an HF token and acceptance of the model's terms on Hugging Face)
New: NVIDIA NeMo Sortformer (open CC-BY-4.0 license, no gating)

2. Features

Streaming capability: Real-time processing with configurable latency
Max 4 speakers: Optimized for up to 4 speakers (performance degrades beyond)
Strong accuracy: Reported DER (Diarization Error Rate) of 6.57% on CALLHOME (2 speakers) and 13.24% on DIHARD III Eval (1-4 speakers); see Performance Benchmarks
No HF token required: Model downloads directly without authentication

3. Technical Improvements

Arrival-Order Speaker Cache (AOSC): Tracks speakers by arrival time
Frame-level processing: 80ms frames (0.08 seconds per frame)
Configurable streaming: Can adjust latency/accuracy trade-off

Installation

Updated Dependencies

pip install -r requirements.txt

Key additions:

nemo_toolkit[asr] - NVIDIA NeMo framework with ASR components
Cython and packaging - Required for NeMo installation

System Requirements

# Install system dependencies (Ubuntu/Debian)
apt-get update && apt-get install -y libsndfile1 ffmpeg

Configuration

Environment Variables

Diarization Model

DIARIZATION_MODEL_NAME=nvidia/diar_streaming_sortformer_4spk-v2

Streaming Configuration (80ms frames)

Current preset: High Latency (10 seconds, better accuracy)

DIAR_CHUNK_SIZE=124        # Processing chunk size (frames)
DIAR_RIGHT_CONTEXT=1       # Future frames after chunk
DIAR_FIFO_SIZE=124         # Previous frames before chunk
DIAR_UPDATE_PERIOD=124     # Speaker cache update period
DIAR_CACHE_SIZE=188        # Total speaker cache size
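
The variables above could be read and applied along the following lines. This is a minimal sketch, not the code in app.py: the helper names are hypothetical, and the attribute names in ATTRIBUTE_FOR (chunk_len, chunk_right_context, fifo_len, spkcache_update_period, spkcache_len) are an assumption based on the model card's streaming example, so check them against the installed NeMo version.

```python
import os

# Defaults mirror the "High Latency" preset; all values are counts of 80 ms frames.
DEFAULTS = {
    "DIAR_CHUNK_SIZE": 124,
    "DIAR_RIGHT_CONTEXT": 1,
    "DIAR_FIFO_SIZE": 124,
    "DIAR_UPDATE_PERIOD": 124,
    "DIAR_CACHE_SIZE": 188,
}

# ASSUMPTION: attribute names on diar_model.sortformer_modules, taken from the
# model card's streaming example; verify against your NeMo version.
ATTRIBUTE_FOR = {
    "DIAR_CHUNK_SIZE": "chunk_len",
    "DIAR_RIGHT_CONTEXT": "chunk_right_context",
    "DIAR_FIFO_SIZE": "fifo_len",
    "DIAR_UPDATE_PERIOD": "spkcache_update_period",
    "DIAR_CACHE_SIZE": "spkcache_len",
}


def read_streaming_config(env=None):
    """Read the DIAR_* variables as positive integers, falling back to DEFAULTS."""
    env = os.environ if env is None else env
    config = {}
    for name, default in DEFAULTS.items():
        value = int(env.get(name, default))
        if value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value}")
        config[name] = value
    return config


def apply_streaming_config(modules, config):
    """Copy each configured value onto the (assumed) streaming attribute."""
    for name, value in config.items():
        setattr(modules, ATTRIBUTE_FOR[name], value)
```

With a loaded model, usage would presumably look like `apply_streaming_config(diar_model.sortformer_modules, read_streaming_config())`.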

Available Presets

Preset	Latency	RTF	CHUNK_SIZE	RIGHT_CONTEXT	FIFO_SIZE	UPDATE_PERIOD	CACHE_SIZE
Very High Latency	30.4s	0.002	340	40	40	300	188
High Latency (current)	10.0s	0.005	124	1	124	124	188
Low Latency	1.04s	0.093	6	7	188	144	188
Ultra Low Latency	0.32s	0.180	3	1	188	144	188

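RTF = Real-Time Factor: processing time divided by audio duration, so lower is faster (measured on NVIDIA RTX 6000 Ada). CHUNK_SIZE, RIGHT_CONTEXT, FIFO_SIZE, UPDATE_PERIOD and CACHE_SIZE are given in 80ms frames.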
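
The latency column is consistent with (CHUNK_SIZE + RIGHT_CONTEXT) × 0.08 seconds for every preset in the table. The sketch below reproduces that arithmetic so a custom setting can be estimated before trying it; the function name is hypothetical, and the formula is inferred from the table rather than taken from NeMo's code.

```python
FRAME_SECONDS = 0.08  # one Sortformer frame, as stated above

# (CHUNK_SIZE, RIGHT_CONTEXT) for each preset in the table
PRESETS = {
    "Very High Latency": (340, 40),
    "High Latency": (124, 1),
    "Low Latency": (6, 7),
    "Ultra Low Latency": (3, 1),
}


def input_latency_seconds(chunk_size, right_context):
    """Estimated latency: frames buffered before a chunk can be processed."""
    return (chunk_size + right_context) * FRAME_SECONDS


if __name__ == "__main__":
    for name, (chunk, right_context) in PRESETS.items():
        print(f"{name}: {input_latency_seconds(chunk, right_context):.2f}s")
```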

API Changes

Input Format

Same as before - audio file path:

diar_model.diarize(audio=audio_path, batch_size=1)
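
For context, a sketch of how diar_model might be created before that call. It assumes the SortformerEncLabelModel class shown on the model card; the wrapper names load_diarizer and diarize_file are hypothetical, and the NeMo import is deferred so the file can be imported without NeMo installed.

```python
import os

DEFAULT_MODEL = "nvidia/diar_streaming_sortformer_4spk-v2"


def load_diarizer(model_name=None):
    """Load the Sortformer model named by DIARIZATION_MODEL_NAME (or the default)."""
    # Imported lazily: NeMo is heavy and only needed when a model is loaded.
    from nemo.collections.asr.models import SortformerEncLabelModel

    name = model_name or os.environ.get("DIARIZATION_MODEL_NAME", DEFAULT_MODEL)
    model = SortformerEncLabelModel.from_pretrained(name)
    model.eval()  # inference mode
    return model


def diarize_file(diar_model, audio_path):
    """Run diarization on one audio file, using the call shown above."""
    return diar_model.diarize(audio=audio_path, batch_size=1)
```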

Output Format

NeMo returns: [[start_seconds, end_seconds, speaker_index], ...]

Example:

[
    [0.0, 5.2, 0],      # SPEAKER_0 from 0s to 5.2s
    [5.3, 10.1, 1],     # SPEAKER_1 from 5.3s to 10.1s
    [10.2, 15.0, 0],    # SPEAKER_0 from 10.2s to 15.0s
]

Each segment is converted to a dictionary:

{
    "start": 0.0,
    "end": 5.2,
    "speaker": "SPEAKER_0"
}
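
The conversion can be sketched as below, assuming the raw output has the [start_seconds, end_seconds, speaker_index] shape described above. The function name is hypothetical, and if the installed NeMo version returns segments in another shape (for example as strings), the unpacking would need adapting.

```python
def to_segment_dicts(raw_segments):
    """Convert [start, end, speaker_index] triples to labelled dictionaries."""
    segments = []
    for start, end, speaker_index in raw_segments:
        segments.append(
            {
                "start": float(start),
                "end": float(end),
                "speaker": f"SPEAKER_{int(speaker_index)}",
            }
        )
    return segments


example = [
    [0.0, 5.2, 0],
    [5.3, 10.1, 1],
    [10.2, 15.0, 0],
]
converted = to_segment_dicts(example)
```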

Model Limitations

Maximum 4 speakers: Performance degrades with 5+ speakers
English-optimized: Trained primarily on English datasets; it can be run on other languages, but accuracy may be lower
Long recordings: Accuracy may degrade on very long recordings (several hours)
Audio format: Requires single-channel (mono) 16kHz audio
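
Since ffmpeg is already a system dependency, one way to meet the mono 16kHz requirement is to convert the input before diarization. This is a sketch with hypothetical function names, not necessarily how app.py does it; it relies only on ffmpeg's standard -ac (channels) and -ar (sample rate) options.

```python
import subprocess


def build_ffmpeg_command(src, dst):
    """Build an ffmpeg command that writes dst as mono, 16 kHz audio."""
    return [
        "ffmpeg",
        "-y",           # overwrite dst if it already exists
        "-i", src,      # input file
        "-ac", "1",     # one audio channel (mono)
        "-ar", "16000", # 16 kHz sample rate
        dst,
    ]


def convert_to_mono_16k(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```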

Performance Benchmarks

DIHARD III Eval (1-4 speakers)

DER: 13.24%

CALLHOME (2 speakers)

DER: 6.57%

CALLHOME (full, 2-6 speakers)

DER: 10.70%

Troubleshooting

NeMo Installation Issues

# Install prerequisites first
pip install Cython packaging

# Then install NeMo
# Quoted so shells such as zsh do not try to expand the [asr] brackets
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"

Model Download Issues

The model (~700MB) downloads automatically from Hugging Face on first use. If download fails:

Check internet connection
Verify disk space (~1GB free recommended)
Try manual download from: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2

Memory Issues

If running out of memory:

Reduce DIAR_CACHE_SIZE (default: 188)
Use the "Low Latency" preset (much smaller chunks: 6 frames instead of 124; note that its FIFO_SIZE is larger, 188 instead of 124)
Process shorter audio segments

Migration Steps for Hugging Face Space

Update requirements.txt: Already done ✓
Update app.py: Already done ✓
Remove HF_TOKEN requirement: No longer needed for diarization
Restart/Rebuild Space: Click "Restart" or "Rebuild" in Space settings
First run: Model will download automatically (~700MB, one-time)

License

NVIDIA Sortformer is licensed under CC-BY-4.0 (Creative Commons Attribution 4.0 International).

✓ Commercial use allowed
✓ Modification allowed
✓ Distribution allowed
✓ Attribution required