Mahmoud Elsamadony

Migration to NVIDIA NeMo Sortformer Diarization

Overview

The application has been updated to use NVIDIA NeMo's Sortformer (nvidia/diar_streaming_sortformer_4spk-v2) instead of the previous pyannote diarization model.

Key Changes

1. Model Architecture

  • Old: pyannote.audio pipeline (gated model requiring HF token and acceptance)
  • New: NVIDIA NeMo Sortformer (open CC-BY-4.0 license, no gating)

2. Features

  • Streaming capability: Real-time processing with configurable latency
  • Max 4 speakers: Optimized for up to 4 speakers (performance degrades beyond that)
  • Better accuracy: State-of-the-art DER (Diarization Error Rate) on benchmark datasets
  • No HF token required: Model downloads directly without authentication

3. Technical Improvements

  • Arrival-Order Speaker Cache (AOSC): Tracks speakers by arrival time
  • Frame-level processing: 80ms frames (0.08 seconds per frame)
  • Configurable streaming: Can adjust latency/accuracy trade-off

Installation

Updated Dependencies

pip install -r requirements.txt

Key additions:

  • nemo_toolkit[asr] - NVIDIA NeMo framework with ASR components
  • Cython and packaging - Required for NeMo installation

System Requirements

# Install system dependencies (Ubuntu/Debian)
apt-get update && apt-get install -y libsndfile1 ffmpeg

Configuration

Environment Variables

Diarization Model

DIARIZATION_MODEL_NAME=nvidia/diar_streaming_sortformer_4spk-v2

Streaming Configuration (80ms frames)

Current preset: High Latency (10 seconds, better accuracy)

DIAR_CHUNK_SIZE=124        # Processing chunk size (frames)
DIAR_RIGHT_CONTEXT=1       # Future frames after chunk
DIAR_FIFO_SIZE=124         # Previous frames before chunk
DIAR_UPDATE_PERIOD=124     # Speaker cache update period
DIAR_CACHE_SIZE=188        # Total speaker cache size
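
As an illustration of how these variables could be consumed, here is a minimal sketch that reads them from the environment and falls back to the High Latency values above. The helper name `load_diar_config` and the validation rules are assumptions made for this example; the actual handling in app.py may differ.

```python
import os

# Defaults mirror the "High Latency" preset listed in this document.
DEFAULTS = {
    "DIAR_CHUNK_SIZE": 124,
    "DIAR_RIGHT_CONTEXT": 1,
    "DIAR_FIFO_SIZE": 124,
    "DIAR_UPDATE_PERIOD": 124,
    "DIAR_CACHE_SIZE": 188,
}


def load_diar_config(env=None):
    """Return the streaming settings as ints, falling back to DEFAULTS.

    `env` is any mapping of names to strings; os.environ is used when omitted.
    (Hypothetical helper, not part of NeMo or of the original app.)
    """
    env = os.environ if env is None else env
    config = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        if raw is None or raw.strip() == "":
            config[name] = default
            continue
        value = int(raw)  # raises ValueError on non-numeric input
        if value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value}")
        config[name] = value
    return config
```

Passing a plain dict instead of `os.environ` makes the function easy to check in isolation, for example `load_diar_config({"DIAR_CHUNK_SIZE": "6"})`.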

Available Presets

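| Preset                 | Latency | RTF   | CHUNK_SIZE | RIGHT_CONTEXT | FIFO_SIZE | UPDATE_PERIOD | CACHE_SIZE |
|------------------------|---------|-------|------------|---------------|-----------|---------------|------------|
| Very High Latency      | 30.4s   | 0.002 | 340        | 40            | 40        | 300           | 188        |
| High Latency (current) | 10.0s   | 0.005 | 124        | 1             | 124       | 124           | 188        |
| Low Latency            | 1.04s   | 0.093 | 6          | 7             | 188       | 144           | 188        |
| Ultra Low Latency      | 0.32s   | 0.180 | 3          | 1             | 188       | 144           | 188        |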

RTF = Real-Time Factor (measured on NVIDIA RTX 6000 Ada)
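
The latency figures in the presets appear consistent with (CHUNK_SIZE + RIGHT_CONTEXT) × 0.08 s. That relationship is inferred here from the preset values and the 80ms frame length rather than stated explicitly in this document, so treat the sketch below as a sanity check for custom settings, not as a specification.

```python
FRAME_SECONDS = 0.08  # 80 ms per frame, as stated in this document

# (CHUNK_SIZE, RIGHT_CONTEXT) pairs copied from the presets above
PRESETS = {
    "Very High Latency": (340, 40),
    "High Latency": (124, 1),
    "Low Latency": (6, 7),
    "Ultra Low Latency": (3, 1),
}


def input_latency_seconds(chunk_size, right_context):
    """Estimate latency as the buffered frame count times the frame length.

    Assumption: latency is driven by chunk_size + right_context frames.
    """
    return (chunk_size + right_context) * FRAME_SECONDS


if __name__ == "__main__":
    for name, (chunk, right) in PRESETS.items():
        print(f"{name}: {input_latency_seconds(chunk, right):.2f}s")
```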

API Changes

Input Format

Same as before - audio file path:

diar_model.diarize(audio=audio_path, batch_size=1)

Output Format

NeMo returns: [[start_seconds, end_seconds, speaker_index], ...]

Example:

[
    [0.0, 5.2, 0],      # SPEAKER_0 from 0s to 5.2s
    [5.3, 10.1, 1],     # SPEAKER_1 from 5.3s to 10.1s
    [10.2, 15.0, 0],    # SPEAKER_0 from 10.2s to 15.0s
]

Converted to:

{
    "start": 0.0,
    "end": 5.2,
    "speaker": "SPEAKER_0"
}
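
The conversion can be sketched as follows, assuming the raw output has exactly the `[start_seconds, end_seconds, speaker_index]` shape described above. The helper name `to_segments` is hypothetical; if your NeMo version returns a different shape (for example, one list per input file), the parsing would need to be adapted.

```python
def to_segments(nemo_output):
    """Convert [[start, end, speaker_index], ...] into a list of dicts.

    Hypothetical helper written for this guide; it assumes the output
    shape documented above and does no other validation.
    """
    segments = []
    for start, end, speaker_index in nemo_output:
        segments.append(
            {
                "start": float(start),
                "end": float(end),
                "speaker": f"SPEAKER_{int(speaker_index)}",
            }
        )
    return segments
```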

Model Limitations

  1. Maximum 4 speakers: Performance degrades with 5+ speakers
  2. English-optimized: Trained primarily on English datasets (but works with other languages)
  3. Long recordings: Accuracy may degrade on very long recordings (several hours)
  4. Audio format: Requires single-channel (mono) 16kHz audio
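
To satisfy the mono 16kHz requirement, input files can be converted with ffmpeg (already listed under System Requirements) before diarization. The following is a minimal sketch; the function names and the choice to overwrite the output file (`-y`) are assumptions made for this example.

```python
import subprocess


def build_ffmpeg_command(src, dst):
    """Build an ffmpeg command that writes `dst` as mono, 16 kHz audio."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # input file
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        dst,
    ]


def convert_to_mono_16k(src, dst):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Usage would look like `convert_to_mono_16k("meeting.mp3", "meeting_16k.wav")`, after which the WAV path is passed to the diarizer.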

Performance Benchmarks

DIHARD III Eval (1-4 speakers)

  • DER: 13.24%

CALLHOME (2 speakers)

  • DER: 6.57%

CALLHOME (full, 2-6 speakers)

  • DER: 10.70%
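
For reading these figures: DER is conventionally defined as the share of reference speech time that is attributed incorrectly, summing false alarms, missed speech, and speaker confusion. The formula below is the standard definition; the scoring settings used for the numbers above (such as collar size or overlap handling) are not given in this document.

```latex
\mathrm{DER} =
  \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}
       {T_{\text{total reference speech}}}
```

Lower is better; a DER of 6.57% means roughly 6.57 seconds of error per 100 seconds of reference speech.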

Troubleshooting

NeMo Installation Issues

# Install prerequisites first
pip install Cython packaging

# Then install NeMo
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"

Model Download Issues

The model (~700MB) downloads automatically from Hugging Face on first use. If download fails:

Memory Issues

If running out of memory:

  • Reduce DIAR_CACHE_SIZE (default: 188)
  • Use "Low Latency" preset (smaller buffers)
  • Process shorter audio segments
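
For the last option, one simple approach is to split a recording into fixed-length time windows and diarize each window separately. The sketch below only computes the window boundaries; the helper name `split_windows` is hypothetical. Note that speaker indices produced for separately processed windows are not guaranteed to refer to the same person, so labels may need to be reconciled afterwards.

```python
def split_windows(total_seconds, window_seconds):
    """Return (start, end) pairs covering [0, total_seconds] without gaps.

    The final window is shorter when the duration is not an exact multiple
    of window_seconds.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = float(total_seconds)
    windows = []
    start = 0.0
    while start < total:
        end = min(start + float(window_seconds), total)
        windows.append((start, end))
        start = end
    return windows
```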

Migration Steps for Hugging Face Space

  1. Update requirements.txt: Already done ✓
  2. Update app.py: Already done ✓
  3. Remove HF_TOKEN requirement: No longer needed for diarization
  4. Restart/Rebuild Space: Click "Restart" or "Rebuild" in Space settings
  5. First run: Model will download automatically (~700MB, one-time)

License

NVIDIA Sortformer is licensed under CC-BY-4.0 (Creative Commons Attribution 4.0 International).

  • ✓ Commercial use allowed
  • ✓ Modification allowed
  • ✓ Distribution allowed
  • ✓ Attribution required