Nemotron-3.5-ASR-Streaming-Multilingual-0.6B — ONNX (INT8)

Cache-aware streaming multilingual speech recognition. A 0.6 B-parameter FastConformer RNN-T model with a 128-slot language prompt, exported to ONNX with dynamically quantized INT8 encoder weights (per-channel QInt8). This is the smallest, fastest, lowest-RAM CPU build, at a quality cost that is modest overall but uneven across languages (see Performance). For the best quality across all languages, use the FP16 build.

Architecture: cache-aware FastConformer encoder (24 layers, 1024 hidden, 8× subsampling) + RNN-T decoder/joint
Streaming: 320 ms chunk, 240 ms lookahead, left attention context 56, right context 3
Languages: 100+ via the prompt dictionary (languages.json); benchmarked on 6 below
Audio: 16 kHz mono, 128-bin log-mel front end
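
The streaming figures above follow from the front-end geometry. A small worked check, assuming the 10 ms mel hop (hop=160 at 16 kHz) given in the Usage notes and assuming that the left and right attention contexts are counted in encoder frames after 8× subsampling (the card does not state the unit explicitly):

```python
SAMPLE_RATE = 16_000   # Hz
HOP = 160              # samples between mel frames (from the Usage notes)
SUBSAMPLING = 8        # encoder subsampling factor
CHUNK_MS = 320         # streaming chunk duration
LEFT_CONTEXT = 56      # assumed to be in encoder frames
RIGHT_CONTEXT = 3      # assumed to be in encoder frames

mel_frame_ms = HOP * 1000 // SAMPLE_RATE             # 10 ms per mel frame
enc_frame_ms = mel_frame_ms * SUBSAMPLING            # 80 ms per encoder frame
samples_per_chunk = CHUNK_MS * SAMPLE_RATE // 1000   # audio samples per chunk
mel_frames_per_chunk = CHUNK_MS // mel_frame_ms      # mel frames per chunk
enc_frames_per_chunk = CHUNK_MS // enc_frame_ms      # encoder frames per chunk
lookahead_ms = RIGHT_CONTEXT * enc_frame_ms          # right context in ms
left_context_ms = LEFT_CONTEXT * enc_frame_ms        # left context in ms

print(samples_per_chunk, mel_frames_per_chunk, enc_frames_per_chunk)
print(lookahead_ms, left_context_ms)
```

Under these assumptions one chunk yields 4 encoder frames, the 3-frame right context equals the 240 ms lookahead, and the 56-frame left context spans roughly 4.5 s of audio.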

Model


Property	Value
Parameters	~0.6 B
Format	ONNX (external-data weights)
Precision	INT8 (per-channel dynamic, encoder) + FP32 decoder/joint
Bundle size	~720 MB
Sample rate	16 kHz mono
Chunk / lookahead	320 ms / 240 ms
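
The export script is not part of this card. As a rough sketch of how a per-channel dynamic QInt8 encoder could be produced with ONNX Runtime's quantization tooling, assuming an FP32 encoder export at the hypothetical path `encoder_fp32.onnx` (the file names, the `quantize_encoder` helper, and the choice to quantize every supported operator are illustrative assumptions, not the authors' recipe):

```python
from pathlib import Path


def quantize_encoder(src: str = "encoder_fp32.onnx", dst: str = "encoder.onnx") -> Path:
    """Dynamically quantize encoder weights to per-channel QInt8 (sketch)."""
    # Imported lazily so this snippet can be loaded without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=src,
        model_output=dst,
        per_channel=True,               # one scale per output channel
        weight_type=QuantType.QInt8,    # signed 8-bit weights
        use_external_data_format=True,  # keep weights in a separate data file
    )
    return Path(dst)
```

In this sketch only the encoder is quantized; the decoder and joint networks would stay in FP32, matching the Precision row above.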

Files

File	Size	Description
`encoder.onnx` + `encoder.onnx.data`	~627 MB	Cache-aware FastConformer encoder (INT8 weights)
`decoder.onnx` + `decoder.onnx.data`	~60 MB	RNN-T prediction network (FP32)
`joint.onnx` + `joint.onnx.data`	~38 MB	RNN-T joint network (FP32)
`config.json`	<1 KB	Model + streaming config (mel, chunk, cache sizes)
`languages.json`	~2 KB	Locale → prompt-slot dictionary (128 slots)
`vocab.json`	~230 KB	13 087-token BPE vocabulary
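
Selecting the language prompt amounts to a dictionary lookup. A minimal sketch, assuming `languages.json` is a flat JSON object mapping locale strings to integer slots (the schema is not documented here; the two entries below are the examples given in the Usage notes, and `prompt_slot` is a hypothetical helper):

```python
import json

NUM_SLOTS = 128  # size of the language prompt, per the card


def prompt_slot(languages: dict, locale: str) -> int:
    """Return the prompt slot for a locale, validating the range."""
    if locale not in languages:
        raise KeyError(f"locale {locale!r} not found in languages.json")
    slot = int(languages[locale])
    if not 0 <= slot < NUM_SLOTS:
        raise ValueError(f"slot {slot} is outside 0..{NUM_SLOTS - 1}")
    return slot


# In practice the mapping would be read from disk, for example:
#   with open("languages.json", encoding="utf-8") as f:
#       languages = json.load(f)
languages = json.loads('{"en-US": 0, "ja-JP": 10}')
print(prompt_slot(languages, "ja-JP"))
```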

Performance

FLEURS test set, 320 ms streaming, CPU, n=30 per language. INT8 (per-channel) compared with the FP16 build. Japanese is scored by CER rather than WER.

Language	INT8 WER %	INT8 CER %	FP16 WER % (CER for Japanese)
English (en-US)	15.49	10.12	9.92
German (de-DE)	13.72	6.74	12.68
French (fr-FR)	18.01	7.31	15.93
Arabic (ar-EG)	14.19	3.86	14.02
Hindi (hi-IN)	7.37	4.17	7.37
Japanese (ja-JP)	n/a	17.01	16.28
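
For reference, WER and CER are both edit-distance rates, computed over words and characters respectively. The following is a generic sketch of that computation, not the scoring script used for the table; text normalization, dropping whitespace for CER, and corpus-level aggregation are assumptions here:

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def error_rate(pairs: Iterable[Tuple[str, str]],
               tokenize: Callable[[str], List[str]]) -> float:
    """Corpus-level rate in percent: total edits / total reference tokens."""
    errors = total = 0
    for ref, hyp in pairs:
        r, h = tokenize(ref), tokenize(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total


def words(text: str) -> List[str]:
    return text.split()


def chars(text: str) -> List[str]:
    return [c for c in text if not c.isspace()]


pairs = [("the cat sat", "the cat sit")]
wer = error_rate(pairs, words)
cer = error_rate(pairs, chars)
print(f"WER {wer:.2f} %  CER {cer:.2f} %")
```

A single substituted word counts as one full word error but only one character error, which is why the CER column sits well below the WER column for the same language.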

Quality caveat. INT8 is near-lossless for Arabic, Hindi, and Japanese, but costs about +5.6 WER points on English, +2.1 on French, and +1.0 on German. We tried per-tensor, per-channel, and mixed-precision (attention kept in FP32) recipes; English WER stays at ~15 across all of them, which suggests the sensitivity lies in the FFN bulk rather than in attention or quantization granularity. Use this build when size, CPU speed, or RAM matter most; otherwise prefer FP16.

Resource profile (8.4 s utterance, ONNX Runtime CPU): encoder ~45 ms/chunk (RTF ~0.14), peak RSS ~1.2 GB — roughly 1.9× faster and ~half the RAM of FP32 on CPU.
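
The real-time factor quoted above is encoder time per chunk divided by chunk duration. A short worked check using the figures from this card (the per-chunk time covers the encoder only, and processing the utterance as whole 320 ms chunks with the last one padded is an assumption):

```python
import math

CHUNK_MS = 320              # streaming chunk duration
ENCODER_MS_PER_CHUNK = 45   # approximate encoder latency per chunk
UTTERANCE_MS = 8400         # the 8.4 s benchmark utterance

rtf = ENCODER_MS_PER_CHUNK / CHUNK_MS                  # processing time / audio time
num_chunks = math.ceil(UTTERANCE_MS / CHUNK_MS)        # chunks needed for the utterance
encoder_total_ms = num_chunks * ENCODER_MS_PER_CHUNK   # total encoder time

print(f"RTF ~{rtf:.2f}, {num_chunks} chunks, ~{encoder_total_ms} ms encoder time")
```

An RTF below 1 means the encoder finishes each chunk before the next one arrives, leaving the remaining budget for the front end and the RNN-T decoding loop.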

Usage

import onnxruntime as ort

so = ort.SessionOptions()
enc = ort.InferenceSession("encoder.onnx", so, providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("decoder.onnx", so, providers=["CPUExecutionProvider"])
joint = ort.InferenceSession("joint.onnx", so, providers=["CPUExecutionProvider"])

# Pick the language prompt slot from languages.json, e.g. "en-US" -> 0, "ja-JP" -> 10.
# Front end: 128-bin log-mel (n_fft=512, win=400, hop=160, preemph=0.97), 16 kHz mono.
# Streaming contract (per chunk): feed 320 ms of audio + the carried encoder caches
# (attention / conv / pre-cache), then run the RNN-T greedy loop over the 4 emitted frames.

Production streaming, cache management and RNN-T greedy decoding are handled by the speech-android SDK.
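
For readers not using the SDK, the per-chunk RNN-T greedy loop mentioned above can be sketched in plain Python. The `predict` and `joint` callables are hypothetical wrappers around `decoder.onnx` and `joint.onnx` (their real input and output names are not listed in this card), and the blank id, the use of blank as the start token, and the cap on symbols per frame are all assumptions:

```python
from typing import Any, Callable, List, Sequence, Tuple


def rnnt_greedy_chunk(
    enc_frames: Sequence[Any],
    predict: Callable[[int, Any], Tuple[Any, Any]],
    joint: Callable[[Any, Any], Sequence[float]],
    last_token: int,
    state: Any,
    blank_id: int,
    max_symbols_per_frame: int = 10,
) -> Tuple[List[int], int, Any]:
    """Greedy RNN-T decoding over one chunk of encoder frames.

    Returns the emitted token ids plus the (last_token, state) pair to
    carry into the next chunk.
    """
    tokens: List[int] = []
    # Prediction-network output conditioned on the last non-blank token.
    pred_out, next_state = predict(last_token, state)
    for frame in enc_frames:
        for _ in range(max_symbols_per_frame):
            logits = joint(frame, pred_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: move on to the next encoder frame
            # Non-blank: emit it and advance only the prediction network.
            tokens.append(best)
            last_token, state = best, next_state
            pred_out, next_state = predict(last_token, state)
    return tokens, last_token, state


# Toy stand-ins so the loop can be exercised without the ONNX models.
BLANK = 2


def toy_predict(token: int, state: int) -> Tuple[int, int]:
    return token, state + 1  # output is the token itself; state is a counter


def toy_joint(frame: int, pred: int) -> List[float]:
    # Emit the frame's label once, then blank after it has been emitted.
    target = BLANK if pred == frame else frame
    return [1.0 if i == target else 0.0 for i in range(3)]


tokens, last_token, state = rnnt_greedy_chunk(
    [0, 1, 1], toy_predict, toy_joint, last_token=BLANK, state=0, blank_id=BLANK
)
print(tokens, last_token, state)
```

In a streaming setting the returned `last_token` and `state` would be carried across chunks alongside the encoder caches, so that decoding resumes where the previous 4-frame chunk left off.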

Source

Converted from nvidia/nemotron-3.5-asr-streaming-0.6b (NVIDIA NeMo). Licensed under the NVIDIA Open Model License.

Related models

Variant	Repo
ONNX · FP16	soniqo/…-ONNX-FP16
ONNX · INT8 (this)	`soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-INT8`
LiteRT · FP16	soniqo/…-LiteRT-FP16
LiteRT · INT8	soniqo/…-LiteRT-INT8
