Nemotron-3.5-ASR-Streaming-Multilingual-0.6B: ONNX (INT8)
Cache-aware streaming multilingual speech recognition. A 0.6 B FastConformer-RNNT model with a 128-slot language prompt, exported to ONNX with dynamic INT8 encoder weights (per-channel QInt8). This is the smallest, fastest, lowest-RAM CPU build, at a modest, uneven quality cost (see below). For best quality across all languages, use the FP16 build.
- Architecture: cache-aware FastConformer encoder (24 layers, 1024 hidden, 8× subsampling) + RNN-T decoder/joint
- Streaming: 320 ms chunk, 240 ms lookahead, left attention context 56, right context 3
- Languages: 100+ via the prompt dictionary (languages.json); benchmarked on 6 below
- Audio: 16 kHz mono, 128-bin log-mel front end
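The streaming figures above fix how much data each chunk carries. The following is a small sketch of that arithmetic, assuming the 160-sample mel hop listed in the Usage comments and assuming the left/right context values are counted in encoder frames (the card does not say so explicitly).

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 320
LOOKAHEAD_MS = 240
MEL_HOP_SAMPLES = 160   # hop=160 from the front-end settings in Usage
SUBSAMPLING = 8         # 8x subsampling in the encoder
LEFT_CONTEXT_FRAMES = 56  # assumed to be encoder frames


def ms_to_samples(ms, rate=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a sample count."""
    return rate * ms // 1000


chunk_samples = ms_to_samples(CHUNK_MS)              # audio samples per chunk
mel_frames = chunk_samples // MEL_HOP_SAMPLES        # mel frames per chunk
encoder_frames = mel_frames // SUBSAMPLING           # encoder frames per chunk
encoder_frame_ms = MEL_HOP_SAMPLES * SUBSAMPLING * 1000 // SAMPLE_RATE_HZ
lookahead_frames = LOOKAHEAD_MS // encoder_frame_ms  # lookahead in encoder frames
left_context_ms = LEFT_CONTEXT_FRAMES * encoder_frame_ms

print(chunk_samples, mel_frames, encoder_frames, encoder_frame_ms)
print(lookahead_frames, left_context_ms)
```

Under these assumptions a 320 ms chunk is 5120 samples, 32 mel frames, and 4 encoder frames of 80 ms each; the 240 ms lookahead works out to 3 encoder frames, which lines up with the right context of 3.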
Model
| Property | Value |
|---|---|
| Parameters | ~0.6 B |
| Format | ONNX (external-data weights) |
| Precision | INT8 (per-channel dynamic, encoder) + FP32 decoder/joint |
| Bundle size | ~720 MB |
| Sample rate | 16 kHz mono |
| Chunk / lookahead | 320 ms / 240 ms |
Files
| File | Size | Description |
|---|---|---|
| encoder.onnx + encoder.onnx.data | ~627 MB | Cache-aware FastConformer encoder (INT8 weights) |
| decoder.onnx + decoder.onnx.data | ~60 MB | RNN-T prediction network (FP32) |
| joint.onnx + joint.onnx.data | ~38 MB | RNN-T joint network (FP32) |
| config.json | <1 KB | Model + streaming config (mel, chunk, cache sizes) |
| languages.json | ~2 KB | Locale → prompt-slot dictionary (128 slots) |
| vocab.json | ~230 KB | 13 087-token BPE vocabulary |
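Picking the prompt slot for a locale could look roughly like the sketch below. It assumes languages.json is a flat JSON object mapping locale strings to integer slot indices (consistent with the `"en-US" -> 0`, `"ja-JP" -> 10` examples in Usage); the real file layout may differ, and the helper names are hypothetical.

```python
import json
from pathlib import Path

NUM_SLOTS = 128  # size of the language prompt, per the table above


def parse_prompt_slots(text):
    """Parse a locale -> slot mapping, assuming a flat JSON object."""
    table = json.loads(text)
    if not isinstance(table, dict):
        raise ValueError("expected a JSON object of locale -> slot")
    slots = {}
    for locale, slot in table.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(slot, bool) or not isinstance(slot, int):
            raise ValueError(f"slot for {locale!r} is not an integer")
        if not 0 <= slot < NUM_SLOTS:
            raise ValueError(f"slot {slot} for {locale!r} is out of range")
        slots[locale] = slot
    return slots


def load_prompt_slots(path):
    """Read and validate the mapping from a file such as languages.json."""
    return parse_prompt_slots(Path(path).read_text(encoding="utf-8"))


def slot_for(slots, locale):
    """Look up a locale, failing loudly instead of defaulting to a slot."""
    if locale not in slots:
        raise KeyError(f"locale {locale!r} is not in the prompt dictionary")
    return slots[locale]
```

Failing on an unknown locale rather than falling back to a default slot avoids silently transcribing with the wrong language prompt.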
Performance
FLEURS test, 320 ms streaming, CPU, n=30 per language. INT8 (per-channel) versus the FP16 build; Japanese uses CER.
| Language | INT8 WER % | INT8 CER % | (FP16 WER) |
|---|---|---|---|
| English (en-US) | 15.49 | 10.12 | 9.92 |
| German (de-DE) | 13.72 | 6.74 | 12.68 |
| French (fr-FR) | 18.01 | 7.31 | 15.93 |
| Arabic (ar-EG) | 14.19 | 3.86 | 14.02 |
| Hindi (hi-IN) | 7.37 | 4.17 | 7.37 |
| Japanese (ja-JP) | n/a (CER 17.01) | 17.01 | 16.28 |
Quality caveat. INT8 is near-lossless for Arabic, Hindi, and Japanese but costs about +5.6 WER points on English, +2 on French, and +1 on German. We investigated per-tensor, per-channel, and mixed-precision (attention kept in FP32) recipes; English WER stays at ~15 % across all of them, which suggests the sensitivity lies in the FFN bulk rather than in attention or quantization granularity. Use this build when size, CPU speed, or RAM matter most; otherwise prefer FP16.
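The per-language cost of quantization can be read directly off the table. A short sketch that recomputes the INT8 minus FP16 gap from the table values, assuming the Japanese FP16 figure is a CER like its INT8 counterpart:

```python
# (INT8, FP16) error rates in percent, copied from the table above.
# Japanese is CER (assumed for both builds); all others are WER.
RESULTS = {
    "en-US": (15.49, 9.92),
    "de-DE": (13.72, 12.68),
    "fr-FR": (18.01, 15.93),
    "ar-EG": (14.19, 14.02),
    "hi-IN": (7.37, 7.37),
    "ja-JP": (17.01, 16.28),
}


def quantization_deltas(results):
    """Return the INT8 - FP16 gap per locale, in absolute percentage points."""
    return {loc: round(int8 - fp16, 2) for loc, (int8, fp16) in results.items()}


deltas = quantization_deltas(RESULTS)
# Print the largest regressions first.
for loc, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{loc}: {delta:+.2f}")
```

English and French show the largest gaps (about 5.6 and 2.1 points), while Hindi is unchanged.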
Resource profile (8.4 s utterance, ONNX Runtime CPU): encoder ~45 ms/chunk (RTF ~0.14), peak RSS ~1.2 GB; roughly 1.9× faster and about half the RAM of FP32 on CPU.
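The quoted real-time factor follows from the per-chunk encoder time and the chunk length. A minimal sketch of that calculation; it covers the encoder only, and the chunk count assumes the final partial chunk is padded to a full one, which the card does not state:

```python
import math

CHUNK_MS = 320.0
ENCODER_MS_PER_CHUNK = 45.0   # from the resource profile above
UTTERANCE_MS = 8400.0         # the 8.4 s test utterance


def real_time_factor(compute_ms, audio_ms):
    """Compute time divided by audio duration; below 1.0 is faster than real time."""
    return compute_ms / audio_ms


rtf = real_time_factor(ENCODER_MS_PER_CHUNK, CHUNK_MS)
num_chunks = math.ceil(UTTERANCE_MS / CHUNK_MS)          # assumes last chunk is padded
total_encoder_ms = num_chunks * ENCODER_MS_PER_CHUNK     # rough encoder-only estimate

print(f"RTF ~{rtf:.2f}, {num_chunks} chunks, ~{total_encoder_ms:.0f} ms encoder time")
```

Decoder, joint, and front-end time are not included, so end-to-end RTF on a given device will be somewhat higher.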
Usage
import onnxruntime as ort
so = ort.SessionOptions()
enc = ort.InferenceSession("encoder.onnx", so, providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("decoder.onnx", so, providers=["CPUExecutionProvider"])
joint = ort.InferenceSession("joint.onnx", so, providers=["CPUExecutionProvider"])
# Pick the language prompt slot from languages.json, e.g. "en-US" -> 0, "ja-JP" -> 10.
# Front end: 128-bin log-mel (n_fft=512, win=400, hop=160, preemph=0.97), 16 kHz mono.
# Streaming contract (per chunk): feed 320 ms of audio + the carried encoder caches
# (attention / conv / pre-cache), then run the RNN-T greedy loop over the 4 emitted frames.
Production streaming, cache management and RNN-T greedy decoding are handled by the speech-android SDK.
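For readers wiring the three sessions together by hand, the per-chunk greedy loop mentioned in the comments above can be sketched generically. This is not the SDK's implementation: `predict` and `joint` are hypothetical callables standing in for wrappers around decoder.onnx and joint.onnx (the actual tensor names and shapes are not given here), and the blank id, the start token, and the per-frame symbol cap are assumptions.

```python
def greedy_rnnt_chunk(enc_frames, predict, joint, blank_id,
                      last_token, state, max_symbols=10):
    """Greedy RNN-T decoding over the encoder frames of one chunk.

    predict(token, state) -> (dec_out, new_state)    # hypothetical decoder wrapper
    joint(enc_frame, dec_out) -> sequence of logits  # hypothetical joint wrapper

    Returns (tokens, last_token, state); pass the last two back in for the
    next chunk so the prediction network continues where it stopped.
    """
    tokens = []
    dec_out, next_state = predict(last_token, state)
    for frame in enc_frames:
        # Emit up to max_symbols tokens for this frame, then move on.
        for _ in range(max_symbols):
            logits = joint(frame, dec_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: advance to the next encoder frame
            tokens.append(best)
            # The decoder state only advances when a non-blank token is emitted.
            last_token, state = best, next_state
            dec_out, next_state = predict(last_token, state)
    return tokens, last_token, state
```

The `max_symbols` cap guards against a joint network that never predicts blank for a frame; carrying `last_token` and `state` across calls is what keeps transcripts continuous between 320 ms chunks.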
Source
Converted from nvidia/nemotron-3.5-asr-streaming-0.6b (NVIDIA NeMo). Licensed under the NVIDIA Open Model License.
Related models
| Variant | Repo |
|---|---|
| ONNX · FP16 | soniqo/…-ONNX-FP16 |
| ONNX · INT8 (this) | soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-INT8 |
| LiteRT · FP16 | soniqo/…-LiteRT-FP16 |
| LiteRT · INT8 | soniqo/…-LiteRT-INT8 |
Links
- speech-android: Android SDK
- speech-core: on-device inference core (C++)
- soniqo.audio: website
- blog