---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
- es
- zh
- hi
- ar
- fr
- de
- ja
- ru
- pt
- ko
- it
- nl
- pl
- tr
- uk
- ro
- cs
- hu
- sv
- da
- fi
- th
- vi
- id
tags:
- automatic-speech-recognition
- streaming-asr
- cache-aware
- multilingual
- FastConformer
- RNNT
- onnx
base_model: nvidia/nemotron-3.5-asr-streaming-0.6b
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---

# Nemotron-3.5-ASR-Streaming-Multilingual-0.6B — ONNX (INT8)

Cache-aware **streaming** multilingual speech recognition. A 0.6 B FastConformer-RNNT encoder with a 128-slot **language prompt**, exported to ONNX with **dynamic INT8** encoder weights (per-channel QInt8). This is the **smallest, fastest, lowest-RAM CPU build**, at a modest, uneven quality cost (see below). For best quality across all languages, use the [FP16 build](https://huggingface.co/soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-FP16).

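To make "per-channel QInt8" concrete, here is a minimal pure-Python sketch of symmetric per-channel weight quantization. It is an illustration only, not the export script used for this build: the helper names `quantize_per_channel` and `dequantize` are hypothetical, and the scheme ONNX Runtime actually applies may differ in details such as zero points or the clipping range.

```python
def quantize_per_channel(weight):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # One scale per row, chosen so the largest magnitude maps to 127.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map INT8 values back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Toy weight matrix: one small-magnitude channel and one large-magnitude channel.
weight = [[0.02, -0.01, 0.005], [1.5, -3.0, 0.75]]
q, scales = quantize_per_channel(weight)
restored = dequantize(q, scales)

# Worst-case reconstruction error per channel; bounded by half that channel's scale.
row_err = [
    max(abs(a - b) for a, b in zip(orig_row, rest_row))
    for orig_row, rest_row in zip(weight, restored)
]
```

Because each row has its own scale, the small-magnitude channel keeps fine resolution instead of sharing the coarse step size of the large one.
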
- **Architecture**: cache-aware FastConformer encoder (24 layers, 1024 hidden, 8× subsampling) + RNN-T decoder/joint
- **Streaming**: 320 ms chunk, 240 ms lookahead, left attention context 56, right context 3
- **Languages**: 100+ via the prompt dictionary (`languages.json`); benchmarked on 6 below
- **Audio**: 16 kHz mono, 128-bin log-mel front end

## Model

| | |
|---|---|
| Parameters | ~0.6 B |
| Format | ONNX (external-data weights) |
| Precision | INT8 (per-channel dynamic, encoder) + FP32 decoder/joint |
| Bundle size | ~720 MB |
| Sample rate | 16 kHz mono |
| Chunk / lookahead | 320 ms / 240 ms |

## Files

| File | Size | Description |
|---|---|---|
| `encoder.onnx` + `encoder.onnx.data` | ~627 MB | Cache-aware FastConformer encoder (INT8 weights) |
| `decoder.onnx` + `decoder.onnx.data` | ~60 MB | RNN-T prediction network (FP32) |
| `joint.onnx` + `joint.onnx.data` | ~38 MB | RNN-T joint network (FP32) |
| `config.json` | <1 KB | Model + streaming config (mel, chunk, cache sizes) |
| `languages.json` | ~2 KB | Locale → prompt-slot dictionary (128 slots) |
| `vocab.json` | ~230 KB | 13 087-token BPE vocabulary |

## Performance

FLEURS test, 320 ms streaming, **CPU**, n=30 per language. INT8 (per-channel) versus the FP16 build. Japanese is scored by **CER**.

| Language | INT8 WER % | INT8 CER % | FP16 WER % |
|---|---|---|---|
| English (en-US) | 15.49 | 10.12 | 9.92 |
| German (de-DE) | 13.72 | 6.74 | 12.68 |
| French (fr-FR) | 18.01 | 7.31 | 15.93 |
| Arabic (ar-EG) | 14.19 | 3.86 | 14.02 |
| Hindi (hi-IN) | 7.37 | 4.17 | 7.37 |
| Japanese (ja-JP) | n/a | 17.01 | 16.28 (CER) |

**Quality caveat.** INT8 is near-lossless for Arabic / Hindi / Japanese but costs **~+5.6 WER on English** and **~+2 WER on French**. We investigated per-tensor, per-channel, and mixed-precision (attention kept FP32) recipes; English stays ~15 across all of them, so the sensitivity is in the FFN bulk, not attention or quantization granularity.

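The deltas quoted in the caveat can be recomputed from the performance table. The short sketch below does that arithmetic; the numbers are copied from the table, and it assumes the FP16 figure for Japanese is a CER, in line with the note that Japanese is scored by CER.

```python
# (INT8, FP16) error rates in percent, copied from the performance table.
# All rows are WER except ja-JP, which is assumed to be CER on both sides.
results = {
    "en-US": (15.49, 9.92),
    "de-DE": (13.72, 12.68),
    "fr-FR": (18.01, 15.93),
    "ar-EG": (14.19, 14.02),
    "hi-IN": (7.37, 7.37),
    "ja-JP": (17.01, 16.28),
}

# Absolute degradation of INT8 relative to FP16, in percentage points.
deltas = {lang: round(int8 - fp16, 2) for lang, (int8, fp16) in results.items()}

# Print the languages from most to least affected.
for lang, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {delta:+.2f} points")
```

English and French carry most of the loss, while the other four languages stay within about one point of FP16.
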
Use this build when size, CPU speed, or RAM matters most; otherwise prefer FP16.

Resource profile (8.4 s utterance, ONNX Runtime CPU): encoder ~45 ms/chunk (RTF ~0.14), peak RSS ~1.2 GB. That is roughly **1.9× faster with about half the RAM** of the FP32 build on CPU.

## Usage

```python
import onnxruntime as ort

so = ort.SessionOptions()
providers = ["CPUExecutionProvider"]
enc = ort.InferenceSession("encoder.onnx", so, providers=providers)
dec = ort.InferenceSession("decoder.onnx", so, providers=providers)
joint = ort.InferenceSession("joint.onnx", so, providers=providers)

# Inspect the exported graph to see the exact input names, shapes and dtypes.
for inp in enc.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Pick the language prompt slot from languages.json, e.g. "en-US" -> 0, "ja-JP" -> 10.
# Front end: 128-bin log-mel (n_fft=512, win=400, hop=160, preemph=0.97), 16 kHz mono.
# Streaming contract (per chunk): feed 320 ms of audio + the carried encoder caches
# (attention / conv / pre-cache), then run the RNN-T greedy loop over the 4 emitted frames.
```

Production streaming, cache management and RNN-T greedy decoding are handled by the **[speech-android](https://github.com/soniqo/speech-android)** SDK.

## Source

Converted from **[nvidia/nemotron-3.5-asr-streaming-0.6b](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b)** (NVIDIA NeMo). Licensed under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

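As a supplement to the Usage section, the chunk arithmetic behind the streaming contract can be sketched in plain Python. The constants come from this card (16 kHz, 320 ms chunks, hop of 160 samples, 8× subsampling); the `iter_chunks` helper and its zero-padding of the final chunk are assumptions for illustration. The 240 ms lookahead and the encoder caches are not modeled here.

```python
SAMPLE_RATE = 16_000   # Hz
CHUNK_MS = 320         # streaming chunk length
HOP = 160              # mel hop in samples (10 ms at 16 kHz)
SUBSAMPLING = 8        # encoder subsampling factor

CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000   # samples per chunk
MEL_FRAMES = CHUNK_SAMPLES // HOP                # mel frames per chunk
ENC_FRAMES = MEL_FRAMES // SUBSAMPLING           # encoder frames per chunk


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks; the last one is zero-padded (an assumption)."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk += [0.0] * (chunk_samples - len(chunk))
        yield chunk


# 8.4 s of silence as a stand-in for real audio.
audio = [0.0] * (8400 * SAMPLE_RATE // 1000)
chunks = list(iter_chunks(audio))
print(CHUNK_SAMPLES, MEL_FRAMES, ENC_FRAMES, len(chunks))
```

Each 320 ms chunk therefore maps to 4 encoder frames, which matches the "4 emitted frames" in the streaming contract.
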
## Related models

| Variant | Repo |
|---|---|
| ONNX · FP16 | [soniqo/…-ONNX-FP16](https://huggingface.co/soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-FP16) |
| ONNX · INT8 (this) | `soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-INT8` |
| LiteRT · FP16 | [soniqo/…-LiteRT-FP16](https://huggingface.co/soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-LiteRT-FP16) |
| LiteRT · INT8 | [soniqo/…-LiteRT-INT8](https://huggingface.co/soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-LiteRT-INT8) |

## Links

- [speech-android](https://github.com/soniqo/speech-android): Android SDK
- [speech-core](https://github.com/soniqo/speech-core): on-device inference core (C++)
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog