Nemotron-3.5-ASR-Streaming-Multilingual-0.6B: ONNX (INT8)

Cache-aware streaming multilingual speech recognition. A 0.6 B-parameter FastConformer-RNNT model with a 128-slot language prompt, exported to ONNX with dynamically quantized INT8 encoder weights (per-channel QInt8). This is the smallest, fastest, lowest-RAM CPU build, at a modest but uneven quality cost (see the quality caveat under Performance). For best quality across all languages, use the FP16 build.

  • Architecture: cache-aware FastConformer encoder (24 layers, 1024 hidden, 8× subsampling) + RNN-T decoder/joint
  • Streaming: 320 ms chunk, 240 ms lookahead, left attention context 56, right context 3
  • Languages: 100+ via the prompt dictionary (languages.json); benchmarked on 6 below
  • Audio: 16 kHz mono, 128-bin log-mel front end
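
The streaming figures above are mutually consistent, and a few lines of arithmetic make that explicit. This is a sketch under two assumptions: the 160-sample hop listed with the front-end settings under Usage, and left/right attention context counted in encoder frames. The constant names are illustrative only.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 160            # mel hop, taken from the front-end settings under Usage
SUBSAMPLING = 8              # encoder subsampling factor
CHUNK_MS = 320
LOOKAHEAD_MS = 240
LEFT_CONTEXT_FRAMES = 56     # assumed to be counted in encoder frames
RIGHT_CONTEXT_FRAMES = 3     # assumed to be counted in encoder frames

mel_frame_ms = 1000 * HOP_SAMPLES // SAMPLE_RATE_HZ         # one mel frame per hop
encoder_frame_ms = mel_frame_ms * SUBSAMPLING               # one encoder frame
frames_per_chunk = CHUNK_MS // encoder_frame_ms             # encoder frames per chunk
samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_MS // 1000       # raw samples per chunk
right_context_ms = RIGHT_CONTEXT_FRAMES * encoder_frame_ms  # lookahead in ms
left_context_ms = LEFT_CONTEXT_FRAMES * encoder_frame_ms    # attention history in ms

print(f"encoder frame: {encoder_frame_ms} ms")
print(f"frames per chunk: {frames_per_chunk}, samples per chunk: {samples_per_chunk}")
print(f"right context: {right_context_ms} ms, left context: {left_context_ms} ms")
```

Under these assumptions a 320 ms chunk corresponds to 4 encoder frames of 80 ms each, the 3-frame right context equals the 240 ms lookahead, and the 56-frame left context covers about 4.5 s of history.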

Model

| Property | Value |
|---|---|
| Parameters | ~0.6 B |
| Format | ONNX (external-data weights) |
| Precision | INT8 (per-channel dynamic, encoder) + FP32 decoder/joint |
| Bundle size | ~720 MB |
| Sample rate | 16 kHz mono |
| Chunk / lookahead | 320 ms / 240 ms |

Files

| File | Size | Description |
|---|---|---|
| encoder.onnx + encoder.onnx.data | ~627 MB | Cache-aware FastConformer encoder (INT8 weights) |
| decoder.onnx + decoder.onnx.data | ~60 MB | RNN-T prediction network (FP32) |
| joint.onnx + joint.onnx.data | ~38 MB | RNN-T joint network (FP32) |
| config.json | <1 KB | Model + streaming config (mel, chunk, cache sizes) |
| languages.json | ~2 KB | Locale → prompt-slot dictionary (128 slots) |
| vocab.json | ~230 KB | 13,087-token BPE vocabulary |
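
Each .onnx graph keeps its weights in a sibling .data file, and ONNX external data is resolved relative to the model file, so the whole bundle should sit in one directory. A minimal pre-flight check, using only the file names listed above; the helper name `missing_bundle_files` is hypothetical and not part of any SDK:

```python
from pathlib import Path

# File names as listed in the Files section of this card.
EXPECTED_FILES = [
    "encoder.onnx", "encoder.onnx.data",
    "decoder.onnx", "decoder.onnx.data",
    "joint.onnx", "joint.onnx.data",
    "config.json", "languages.json", "vocab.json",
]


def missing_bundle_files(bundle_dir):
    """Return the expected bundle files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_bundle_files("path/to/bundle")` before creating any session returns an empty list when the download is complete, which gives a clearer error than a failed weight load.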

Performance

FLEURS test set, 320 ms streaming, CPU, n=30 per language. INT8 (per-channel) versus the FP16 build; Japanese is scored with CER.

| Language | INT8 WER % | INT8 CER % | FP16 WER % (CER for Japanese) |
|---|---|---|---|
| English (en-US) | 15.49 | 10.12 | 9.92 |
| German (de-DE) | 13.72 | 6.74 | 12.68 |
| French (fr-FR) | 18.01 | 7.31 | 15.93 |
| Arabic (ar-EG) | 14.19 | 3.86 | 14.02 |
| Hindi (hi-IN) | 7.37 | 4.17 | 7.37 |
| Japanese (ja-JP) | n/a | 17.01 | 16.28 |

Quality caveat. INT8 is near-lossless for Arabic, Hindi, and Japanese (within about 0.7 points of FP16) but costs about +5.6 WER points on English, +2.1 on French, and +1.0 on German. We tried per-tensor, per-channel, and mixed-precision (attention kept in FP32) recipes; English WER stays at ~15 across all of them, which suggests the sensitivity lies in the bulk feed-forward (FFN) weights rather than in attention or quantization granularity. Use this build when size, CPU speed, or RAM matter most; otherwise prefer FP16.
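
The per-language cost of quantization can be recomputed directly from the table. The sketch below copies the INT8 and FP16 figures from above (WER everywhere except Japanese, where both values are assumed to be CER) and reports absolute and relative deltas; the function name is illustrative.

```python
# (INT8, FP16) error rates in percent, copied from the Performance table.
# WER for every language except ja-JP, where both numbers are assumed to be CER.
RESULTS = {
    "en-US": (15.49, 9.92),
    "de-DE": (13.72, 12.68),
    "fr-FR": (18.01, 15.93),
    "ar-EG": (14.19, 14.02),
    "hi-IN": (7.37, 7.37),
    "ja-JP": (17.01, 16.28),
}


def quantization_cost(results):
    """Return {locale: (absolute delta in points, relative delta in percent)}."""
    cost = {}
    for locale, (int8, fp16) in results.items():
        absolute = int8 - fp16
        relative = 100.0 * absolute / fp16
        cost[locale] = (round(absolute, 2), round(relative, 1))
    return cost


for locale, (absolute, relative) in quantization_cost(RESULTS).items():
    print(f"{locale}: {absolute:+.2f} points ({relative:+.1f}% relative)")
```

With n=30 per language these deltas are coarse estimates, so small differences such as the Arabic one should not be over-interpreted.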

Resource profile (8.4 s utterance, ONNX Runtime CPU): encoder ~45 ms per 320 ms chunk (RTF ~0.14), peak RSS ~1.2 GB. That is roughly 1.9× faster than FP32 on CPU, with about half the RAM.
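
The reported real-time factor follows from the per-chunk latency. A small worked example, assuming the RTF counts encoder time only and that the final partial chunk of the utterance is padded and processed like a full one:

```python
import math

CHUNK_MS = 320              # audio covered by one streaming chunk
ENCODER_MS_PER_CHUNK = 45   # reported encoder latency per chunk
UTTERANCE_MS = 8400         # utterance length used for the profile


def real_time_factor(compute_ms, audio_ms):
    """RTF = processing time divided by audio duration (below 1 is faster than real time)."""
    return compute_ms / audio_ms


per_chunk_rtf = real_time_factor(ENCODER_MS_PER_CHUNK, CHUNK_MS)
num_chunks = math.ceil(UTTERANCE_MS / CHUNK_MS)        # a partial last chunk still costs a full pass
total_encoder_ms = num_chunks * ENCODER_MS_PER_CHUNK
utterance_rtf = real_time_factor(total_encoder_ms, UTTERANCE_MS)

print(f"per-chunk RTF: {per_chunk_rtf:.3f}")
print(f"{num_chunks} chunks, {total_encoder_ms} ms encoder time, utterance RTF: {utterance_rtf:.3f}")
```

Decoder and joint time are not included here, so end-to-end RTF will be somewhat higher than the encoder-only figure.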

Usage

import onnxruntime as ort

so = ort.SessionOptions()
enc = ort.InferenceSession("encoder.onnx", so, providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("decoder.onnx", so, providers=["CPUExecutionProvider"])
joint = ort.InferenceSession("joint.onnx", so, providers=["CPUExecutionProvider"])

# Pick the language prompt slot from languages.json, e.g. "en-US" -> 0, "ja-JP" -> 10.
# Front end: 128-bin log-mel (n_fft=512, win=400, hop=160, preemph=0.97), 16 kHz mono.
# Streaming contract (per chunk): feed 320 ms of audio + the carried encoder caches
# (attention / conv / pre-cache), then run the RNN-T greedy loop over the 4 emitted frames.
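
The first two steps in the comments above, picking a prompt slot and cutting audio into 320 ms chunks, can be sketched with the standard library alone. This assumes languages.json is a flat locale-to-slot mapping (consistent with the "en-US" -> 0 example), that audio arrives as mono 16-bit PCM bytes, and that a short final chunk is zero-padded. The function names are hypothetical.

```python
import json

SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 320
BYTES_PER_SAMPLE = 2    # 16-bit PCM: an assumption about the capture format
CHUNK_BYTES = SAMPLE_RATE_HZ * CHUNK_MS // 1000 * BYTES_PER_SAMPLE
NUM_PROMPT_SLOTS = 128


def load_language_slot(languages_path, locale):
    """Look up the prompt slot for a locale such as "en-US".

    Assumes languages.json is a flat {"locale": slot} mapping.
    """
    with open(languages_path, encoding="utf-8") as f:
        slots = json.load(f)
    if locale not in slots:
        raise KeyError(f"locale {locale!r} not found in {languages_path}")
    slot = int(slots[locale])
    if not 0 <= slot < NUM_PROMPT_SLOTS:
        raise ValueError(f"slot {slot} for {locale!r} is outside 0..{NUM_PROMPT_SLOTS - 1}")
    return slot


def iter_pcm16_chunks(pcm, chunk_bytes=CHUNK_BYTES):
    """Yield 320 ms chunks of mono 16-bit PCM, zero-padding the final chunk."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = bytes(pcm[start:start + chunk_bytes])
        yield chunk.ljust(chunk_bytes, b"\x00")
```

Each yielded chunk still has to be converted to float samples and passed through the log-mel front end before it reaches the encoder; the exact tensor names and cache shapes should be read from the exported graphs and config.json rather than assumed.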

Production streaming, cache management and RNN-T greedy decoding are handled by the speech-android SDK.
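
For readers who are not using the SDK, the RNN-T greedy loop over the encoder frames of one chunk can be sketched as follows. This is a generic greedy decoder, not the SDK's implementation: `predict` and `joint` are hypothetical wrappers that the caller would write around decoder.onnx and joint.onnx, and `blank_id`, the start-of-sequence convention, and the per-frame symbol cap are assumptions to be checked against the model config.

```python
def rnnt_greedy_decode(encoder_frames, predict, joint, blank_id,
                       state=None, last_token=None, max_symbols_per_frame=10):
    """Greedy RNN-T decoding over the encoder frames of one chunk.

    predict(token, state) -> (decoder_output, new_state)   # hypothetical decoder wrapper
    joint(encoder_frame, decoder_output) -> 1-D logits      # hypothetical joint wrapper
    Returns (tokens, state, last_token); pass state and last_token back in
    for the next chunk so the prediction network continues where it stopped.
    """
    if last_token is None:
        last_token = blank_id  # assumed start-of-sequence convention
    tokens = []
    dec_out, next_state = predict(last_token, state)
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            logits = joint(frame, dec_out)
            best = max(range(len(logits)), key=lambda i: logits[i])
            if best == blank_id:
                break  # blank: move to the next frame, predictor state unchanged
            tokens.append(best)
            # Commit the state reached by consuming the previous token,
            # then advance the prediction network with the new token.
            last_token, state = best, next_state
            dec_out, next_state = predict(last_token, state)
    return tokens, state, last_token
```

The cap on symbols per frame guards against a joint network that never emits blank; token ids are mapped to text with vocab.json afterwards.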

Source

Converted from nvidia/nemotron-3.5-asr-streaming-0.6b (NVIDIA NeMo). Licensed under the NVIDIA Open Model License.

Related models

| Variant | Repo |
|---|---|
| ONNX · FP16 | soniqo/…-ONNX-FP16 |
| ONNX · INT8 (this) | soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-INT8 |
| LiteRT · FP16 | soniqo/…-LiteRT-FP16 |
| LiteRT · INT8 | soniqo/…-LiteRT-INT8 |
