---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
- es
- zh
- hi
- ar
- fr
- de
- ja
- ru
- pt
- ko
- it
- nl
- pl
- tr
- uk
- ro
- cs
- hu
- sv
- da
- fi
- th
- vi
- id
tags:
- automatic-speech-recognition
- streaming-asr
- cache-aware
- multilingual
- FastConformer
- RNNT
- onnx
base_model: nvidia/nemotron-3.5-asr-streaming-0.6b
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
---

# Nemotron-3.5-ASR-Streaming-Multilingual-0.6B — ONNX (INT8)

Cache-aware **streaming** multilingual speech recognition. A 0.6 B-parameter FastConformer
RNN-T model with a 128-slot **language prompt**, exported to ONNX with **dynamic INT8** encoder
weights (per-channel QInt8). This is the **smallest, fastest, lowest-RAM CPU build**, at a
modest but uneven quality cost (see the Performance section). For best quality across all
languages, use the
[FP16 build](https://huggingface.co/soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-FP16).

- **Architecture**: cache-aware FastConformer encoder (24 layers, 1024 hidden, 8× subsampling) + RNN-T decoder/joint
- **Streaming**: 320 ms chunk, 240 ms lookahead, left attention context 56, right context 3
- **Languages**: 100+ via the prompt dictionary (`languages.json`); benchmarked on 6 below
- **Audio**: 16 kHz mono, 128-bin log-mel front end
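
The streaming figures above are tied together by simple frame arithmetic. The sketch below is illustrative only: it assumes the 10 ms mel hop (`hop=160` at 16 kHz) quoted in the Usage section and assumes the right context of 3 is counted in encoder frames. The constant names are hypothetical.

```python
SAMPLE_RATE = 16_000       # Hz, from the card
HOP_SAMPLES = 160          # mel hop quoted in the Usage section (10 ms)
SUBSAMPLING = 8            # encoder subsampling factor
CHUNK_MS = 320             # streaming chunk length
RIGHT_CONTEXT_FRAMES = 3   # assumption: counted in encoder frames

mel_frame_ms = 1000 * HOP_SAMPLES / SAMPLE_RATE         # one mel frame = 10 ms
encoder_frame_ms = mel_frame_ms * SUBSAMPLING           # one encoder frame = 80 ms
chunk_samples = SAMPLE_RATE * CHUNK_MS // 1000          # audio samples per chunk
chunk_mel_frames = chunk_samples // HOP_SAMPLES         # mel frames per chunk
chunk_encoder_frames = chunk_mel_frames // SUBSAMPLING  # encoder frames per chunk
lookahead_ms = RIGHT_CONTEXT_FRAMES * encoder_frame_ms  # lookahead implied by right context

print(chunk_samples, chunk_mel_frames, chunk_encoder_frames, lookahead_ms)
```

Under these assumptions a 320 ms chunk is 5120 samples, 32 mel frames, and 4 encoder frames, and a right context of 3 frames matches the 240 ms lookahead.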

## Model

| | |
|---|---|
| Parameters | ~0.6 B |
| Format | ONNX (external-data weights) |
| Precision | INT8 (per-channel dynamic, encoder) + FP32 decoder/joint |
| Bundle size | ~720 MB |
| Sample rate | 16 kHz mono |
| Chunk / lookahead | 320 ms / 240 ms |

## Files

| File | Size | Description |
|---|---|---|
| `encoder.onnx` + `encoder.onnx.data` | ~627 MB | Cache-aware FastConformer encoder (INT8 weights) |
| `decoder.onnx` + `decoder.onnx.data` | ~60 MB | RNN-T prediction network (FP32) |
| `joint.onnx` + `joint.onnx.data` | ~38 MB | RNN-T joint network (FP32) |
| `config.json` | <1 KB | Model + streaming config (mel, chunk, cache sizes) |
| `languages.json` | ~2 KB | Locale → prompt-slot dictionary (128 slots) |
| `vocab.json` | ~230 KB | 13 087-token BPE vocabulary |
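
Resolving a locale to its prompt slot could look like the sketch below. The layout of `languages.json` is not documented on this card, so the code assumes a flat `{"locale": slot}` JSON object; `load_languages` and `prompt_slot` are hypothetical helper names, not part of any SDK.

```python
import json

NUM_PROMPT_SLOTS = 128  # slot count stated on this card


def load_languages(path: str = "languages.json") -> dict:
    """Load the locale -> slot dictionary (assumed to be a flat JSON object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def prompt_slot(languages: dict, locale: str) -> int:
    """Return the prompt slot for a locale, validating the slot range."""
    if locale not in languages:
        raise KeyError(f"locale {locale!r} is not in the language dictionary")
    slot = int(languages[locale])
    if not 0 <= slot < NUM_PROMPT_SLOTS:
        raise ValueError(f"slot {slot} for {locale!r} is outside 0..{NUM_PROMPT_SLOTS - 1}")
    return slot
```

If the file turns out to be nested or keyed differently, only `prompt_slot` needs to change.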

## Performance

FLEURS test, 320 ms streaming, **CPU**, n=30 per language. INT8 (per-channel) is compared
with the FP16 build. Japanese is scored by **CER** only.

| Language | INT8 WER % | INT8 CER % | FP16 WER % |
|---|---|---|---|
| English (en-US) | 15.49 | 10.12 | 9.92 |
| German (de-DE) | 13.72 | 6.74 | 12.68 |
| French (fr-FR) | 18.01 | 7.31 | 15.93 |
| Arabic (ar-EG) | 14.19 | 3.86 | 14.02 |
| Hindi (hi-IN) | 7.37 | 4.17 | 7.37 |
| Japanese (ja-JP) | n/a | 17.01 | 16.28 (CER) |

**Quality caveat.** INT8 is near-lossless for Arabic, Hindi, and Japanese (at most ~+0.7
points) but costs **~+5.6 WER points on English**, **~+2 on French**, and ~+1 on German. We
investigated per-tensor, per-channel, and mixed-precision (attention kept FP32) recipes;
English stays at ~15 WER across all of them, which suggests the sensitivity lies in the
feed-forward (FFN) weights rather than in attention or quantization granularity. Use this
build when size / CPU speed / RAM matter most; otherwise prefer FP16.
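
The per-language cost of INT8 can be recomputed from the Performance table. A minimal sketch, with the numbers copied from that table and the Japanese pair assumed to be CER on both sides:

```python
# (INT8, FP16) error rates in percent, copied from the Performance table.
# Assumption: the Japanese FP16 figure is a CER, like its INT8 counterpart.
results = {
    "en-US": (15.49, 9.92),
    "de-DE": (13.72, 12.68),
    "fr-FR": (18.01, 15.93),
    "ar-EG": (14.19, 14.02),
    "hi-IN": (7.37, 7.37),
    "ja-JP": (17.01, 16.28),
}

# Absolute degradation in percentage points (INT8 minus FP16).
deltas = {locale: round(int8 - fp16, 2) for locale, (int8, fp16) in results.items()}

# Print the languages from most to least affected.
for locale, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{locale}: {delta:+.2f}")
```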

Resource profile (8.4 s utterance, ONNX Runtime CPU): encoder ~45 ms per 320 ms chunk
(encoder-only RTF ~0.14), peak RSS ~1.2 GB. That is roughly **1.9× faster than FP32 on CPU,
with about half the RAM**.
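
The real-time factor quoted above is the encoder time per chunk divided by the chunk duration. A short sketch of that arithmetic, assuming the utterance is processed in whole 320 ms chunks and counting encoder time only (decoder and joint time are not included in the card's figure):

```python
import math

ENCODER_MS_PER_CHUNK = 45   # measured encoder latency from the card
CHUNK_MS = 320              # streaming chunk length
UTTERANCE_MS = 8400         # the 8.4 s benchmark utterance

rtf = ENCODER_MS_PER_CHUNK / CHUNK_MS            # fraction of real time spent in the encoder
n_chunks = math.ceil(UTTERANCE_MS / CHUNK_MS)    # assumption: the last partial chunk is padded
total_encoder_ms = n_chunks * ENCODER_MS_PER_CHUNK

print(f"RTF ~{rtf:.2f}, {n_chunks} chunks, ~{total_encoder_ms} ms encoder time")
```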

## Usage

```python
import onnxruntime as ort

so = ort.SessionOptions()
enc = ort.InferenceSession("encoder.onnx", so, providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("decoder.onnx", so, providers=["CPUExecutionProvider"])
joint = ort.InferenceSession("joint.onnx", so, providers=["CPUExecutionProvider"])

# Pick the language prompt slot from languages.json, e.g. "en-US" -> 0, "ja-JP" -> 10.
# Front end: 16 kHz mono -> 128-bin log-mel (n_fft=512, win=400, hop=160, preemph=0.97).
# Streaming contract (per chunk): feed 320 ms of audio plus the encoder caches carried
# over from the previous chunk (attention / conv / pre-cache), then run the RNN-T greedy
# loop over the 4 encoder frames emitted for that chunk.
```
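
This card does not list the tensor names or shapes of the three graphs, so it is worth inspecting them before wiring up the caches. The helper below is a small sketch built on ONNX Runtime's `get_inputs()` / `get_outputs()` methods; `describe_io` is a hypothetical name.

```python
def describe_io(session) -> dict:
    """Summarize a session's inputs and outputs as (name, type, shape) tuples.

    `session` is expected to be an onnxruntime.InferenceSession; dynamic
    dimensions show up in the shape as strings or None.
    """
    def rows(node_args):
        return [(arg.name, arg.type, list(arg.shape)) for arg in node_args]

    return {
        "inputs": rows(session.get_inputs()),
        "outputs": rows(session.get_outputs()),
    }
```

With the sessions created in the snippet above, `describe_io(enc)` should reveal which inputs carry the audio features and which carry the caches.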

Production streaming, cache management and RNN-T greedy decoding are handled by the
**[speech-android](https://github.com/soniqo/speech-android)** SDK.
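
For readers who want to see the shape of the decoding step, here is a generic greedy RNN-T loop over one chunk's encoder frames. It is a sketch of the standard algorithm, not the SDK's implementation: `decoder_step` and `joint_step` are hypothetical callables that would wrap `decoder.onnx` and `joint.onnx`, and `blank_id` and `max_symbols_per_frame` are assumptions that must be taken from the model's configuration.

```python
from typing import Any, Callable, List, Sequence, Tuple


def rnnt_greedy_chunk(
    enc_frames: Sequence[Any],
    decoder_step: Callable[[int, Any], Tuple[Any, Any]],  # (token, state) -> (dec_out, new_state)
    joint_step: Callable[[Any, Any], Sequence[float]],    # (enc_frame, dec_out) -> logits
    last_token: int,
    state: Any,
    blank_id: int,
    max_symbols_per_frame: int = 10,
) -> Tuple[List[int], int, Any]:
    """Greedy-decode one chunk; return (tokens, last_token, state) to carry forward."""
    tokens: List[int] = []
    # Run the prediction network on the last committed token.
    dec_out, next_state = decoder_step(last_token, state)
    for frame in enc_frames:
        for _ in range(max_symbols_per_frame):
            logits = joint_step(frame, dec_out)
            best = max(range(len(logits)), key=logits.__getitem__)  # argmax
            if best == blank_id:
                break  # blank: move on to the next encoder frame
            # Non-blank: emit the token and commit the decoder state.
            tokens.append(best)
            last_token, state = best, next_state
            dec_out, next_state = decoder_step(last_token, state)
    return tokens, last_token, state
```

The returned `last_token` and `state` are fed back in for the next chunk, alongside the encoder caches, so the hypothesis continues across chunk boundaries.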

## Source

Converted from **[nvidia/nemotron-3.5-asr-streaming-0.6b](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b)**
(NVIDIA NeMo). Licensed under the
[NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

## Related models

| Variant | Repo |
|---|---|
| ONNX · FP16 | [soniqo/…-ONNX-FP16](https://huggingface.co/soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-FP16) |
| ONNX · INT8 (this) | `soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-ONNX-INT8` |
| LiteRT · FP16 | [soniqo/…-LiteRT-FP16](https://huggingface.co/soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-LiteRT-FP16) |
| LiteRT · INT8 | [soniqo/…-LiteRT-INT8](https://huggingface.co/soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-LiteRT-INT8) |

## Links

- [speech-android](https://github.com/soniqo/speech-android) — Android SDK
- [speech-core](https://github.com/soniqo/speech-core) — on-device inference core (C++)
- [soniqo.audio](https://soniqo.audio) — website
- [blog](https://soniqo.audio/blog) — blog