Nemotron-3.5 ASR Streaming 0.6B — MLX 8bit

INT8-quantized (`mlx.nn.quantize(group_size=64, bits=8)`) port of NVIDIA's Nemotron-3.5 streaming ASR for Apple Silicon. Accuracy is essentially unchanged relative to the bf16 build, with a 40% smaller on-disk size and a 32% lower peak memory footprint while streaming.
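To make the `group_size=64, bits=8` setting concrete, here is a minimal pure-Python sketch of group-wise affine quantization: every run of 64 consecutive weights shares one scale and one offset, and each weight is stored as an integer in 0..255. This is an illustration of the general technique, not MLX's actual kernel or storage layout; the function names are hypothetical.

```python
import random

def quantize_group(weights, bits=8):
    """Affine-quantize one group so that w ~= scale * q + bias, q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return q, scale, lo

def dequantize_group(q, scale, bias):
    """Map stored integers back to approximate float weights."""
    return [scale * v + bias for v in q]

def quantize_row(row, group_size=64, bits=8):
    """Split a weight row into groups and quantize each group independently."""
    if len(row) % group_size:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

def max_abs_error(row, groups):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    restored = [w for q, s, b in groups for w in dequantize_group(q, s, b)]
    return max(abs(a - b) for a, b in zip(row, restored))

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights, not model data
groups = quantize_row(row)
print(len(groups), max_abs_error(row, groups))
```

Because each group has its own scale, the rounding error of any weight is bounded by half of that group's scale, which is why small groups keep 8-bit quantization close to the unquantized weights.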

Model


Parameters	600 M (8 bit weights, group=64)
Architecture	FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel
Languages	40
Sample rate	16 kHz mono
Streaming chunk	320 ms
On-disk size	732 MB
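
The streaming geometry in the table implies a fixed chunk length in samples. The sketch below works that out and slices a mono signal into chunks; zero-padding the final partial chunk is an assumption made for illustration, and the actual pipeline may handle the tail differently. `iter_chunks` is a hypothetical helper, not part of the bundle.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 320

# 16 000 samples/s * 0.320 s = 5120 samples per chunk
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000

def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks; the last one is zero-padded (assumption)."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        if len(chunk) < chunk_samples:
            chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk

one_second = [0.0] * SAMPLE_RATE_HZ  # placeholder audio: 1 s of silence
chunks = list(iter_chunks(one_second))
print(CHUNK_SAMPLES, len(chunks))
```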

Files

File	Size	Description
`model.safetensors`	732 MB	8 bit grouped-quantized weights
`vocab.json`	100 KB	id → SentencePiece piece
`lang2slot.json`	2 KB	Language tag → prompt slot
`config.json`	<1 KB	Quantization config + arch/streaming geometry

Performance

Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test set.

Accuracy

lang	WER %	CER %	Δ WER vs fp32 source
en_us	10.28	4.32	+0.95
de_de	10.59	5.06	+0.37
fr_fr	11.20	4.79	+0.07
ar_eg	13.95	3.87	+0.68
hi_in	5.68	4.47	+0.42
ja_jp	17.98	11.97	+1.01
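
For readers who want to reproduce figures of this kind, WER and CER are conventionally computed as edit distance divided by reference length. The sketch below shows that standard definition at corpus level. The card does not state which text normalization was applied, so none is applied here, and counting spaces as characters for CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference tokens."""
    # Assumption: CER counts every character, including spaces.
    split = (lambda s: s.split()) if unit == "word" else list
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = split(ref), split(hyp)
        edits += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * edits / total

refs = ["the cat sat on the mat"]   # made-up example, not FLEURS data
hyps = ["the cat sit on mat"]
print(error_rate(refs, hyps, "word"), error_rate(refs, hyps, "char"))
```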

Streaming throughput + memory

metric	value
RTF (encode + decode)	0.044
p50 chunk latency	13.5 ms
p99 chunk latency	16.8 ms
RSS post-load	262 MB (mmap)
RSS peak (mid-stream)	997 MB
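
The throughput numbers can be related to the 320 ms chunk size with a little arithmetic. The sketch below assumes RTF is defined as processing time divided by audio duration, which is the usual convention but is not spelled out in this card.

```python
CHUNK_MS = 320.0   # streaming chunk length from the model table
RTF = 0.044        # reported real-time factor (encode + decode)
P99_MS = 16.8      # reported p99 chunk latency

def rtf(processing_ms, audio_ms):
    """Real-time factor: values below 1.0 mean faster than real time."""
    return processing_ms / audio_ms

mean_ms_per_chunk = RTF * CHUNK_MS       # implied average compute per chunk
speedup = 1.0 / RTF                      # how many times faster than real time
p99_budget_used = rtf(P99_MS, CHUNK_MS)  # share of the chunk window used at p99

print(f"{mean_ms_per_chunk:.2f} ms per chunk, {speedup:.1f}x real time")
print(f"p99 latency uses {p99_budget_used:.2%} of the chunk window")
```

Under that assumption, the implied average of about 14 ms per chunk is in line with the reported p50 and p99 latencies.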

Usage

Python / MLX

from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-8bit")
# Load and assemble the model as for the bf16 sibling; quantization is re-applied via `mlx.nn.quantize`.
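
Once the bundle is downloaded, the JSON side files can be read with the standard library. The sketch below is a guess at a convenient loader rather than the author's code. It assumes `vocab.json` is an object keyed by stringified token ids and that pieces use the usual SentencePiece `▁` word-boundary marker; `load_bundle_metadata` and `pieces_to_text` are hypothetical names.

```python
import json
from pathlib import Path

SPM_SPACE = "\u2581"  # SentencePiece word-boundary marker

def load_bundle_metadata(bundle_dir):
    """Read config.json, vocab.json and lang2slot.json from a downloaded bundle."""
    root = Path(bundle_dir)

    def read(name):
        return json.loads((root / name).read_text(encoding="utf-8"))

    # Assumption: vocab.json maps stringified ids to pieces (JSON keys are strings).
    vocab = {int(k): v for k, v in read("vocab.json").items()}
    return read("config.json"), vocab, read("lang2slot.json")

def pieces_to_text(token_ids, vocab):
    """Join pieces and turn the boundary marker back into spaces."""
    return "".join(vocab[i] for i in token_ids).replace(SPM_SPACE, " ").strip()

# Usage, with `bundle` being the path returned by snapshot_download:
# config, vocab, lang2slot = load_bundle_metadata(bundle)
```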

Swift (speech-swift)

The speech-swift SDK ships the CoreML INT8 variant (NemotronStreamingASR target). Using this MLX 8bit bundle from Swift requires mlx-swift wiring. For typical app use, CoreML INT8 is the recommended on-device path; its accuracy matches MLX 8bit to within 0.3 pp WER.

CLI

brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US

Source

Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.
