Nemotron-3.5 ASR Streaming 0.6B — MLX 8bit

INT8-quantized port of NVIDIA's Nemotron-3.5 streaming ASR for Apple Silicon, produced with mlx.nn.quantize(group_size=64, bits=8). Essentially lossless relative to the bf16 build, with a 40 % smaller on-disk footprint and 32 % lower peak memory while streaming.
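To make the group_size=64, bits=8 setting concrete, here is a minimal pure-Python sketch of affine group-wise quantization: each group of 64 weights is stored as 8-bit codes plus one scale and one bias. It illustrates the general scheme only and is not MLX's kernel; the function names are hypothetical.

```python
def quantize_group(weights, bits=8):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1               # 255 for 8 bits
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo                # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: w ~= scale * code + bias."""
    return [scale * c + bias for c in codes]


def quantize_row(row, group_size=64, bits=8):
    """Split one weight row into groups and quantize each independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def bits_per_weight(bits=8, group_size=64, meta_bits=16):
    """Effective storage cost, assuming a 16-bit scale and bias per group."""
    return bits + 2 * meta_bits / group_size
```

Under the 16-bit scale and bias assumption, bits_per_weight() comes to 8.5, and the reconstruction error of any weight is at most half of its group's scale.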

Model

| Property | Value |
| --- | --- |
| Parameters | 600 M (8-bit weights, group size 64) |
| Architecture | FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel |
| Languages | 40 |
| Sample rate | 16 kHz mono |
| Streaming chunk | 320 ms |
| On-disk size | 732 MB |
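The sample rate and chunk duration above fix how many samples each streaming step consumes. The sketch below works that out and splits a mono signal into fixed-length chunks. The helper names are hypothetical, and zero-padding the final partial chunk is an assumption; this card does not say how the model handles the tail.

```python
SAMPLE_RATE_HZ = 16_000   # from the model table
CHUNK_MS = 320            # from the model table


def samples_per_chunk(sample_rate_hz=SAMPLE_RATE_HZ, chunk_ms=CHUNK_MS):
    """Number of audio samples covered by one streaming chunk."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples, chunk_len, pad_value=0.0):
    """Yield fixed-length chunks; pad the last one (assumed behaviour)."""
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        if len(chunk) < chunk_len:
            chunk.extend([pad_value] * (chunk_len - len(chunk)))
        yield chunk
```

At 16 kHz, a 320 ms chunk is 5120 samples, so one second of audio spans four chunks, the last of which is mostly padding in this sketch.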

Files

| File | Size | Description |
| --- | --- | --- |
| model.safetensors | 732 MB | 8-bit grouped-quantized weights |
| vocab.json | 100 KB | id → SentencePiece piece |
| lang2slot.json | 2 KB | Language tag → prompt slot |
| config.json | <1 KB | Quantization config + arch/streaming geometry |
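As a sketch of how the id → piece mapping might be used, the snippet below turns decoded token ids back into text. It assumes vocab.json is a flat JSON object keyed by the id as a string, which this card does not confirm; the function names are hypothetical, and special tokens such as the RNN-T blank are not handled.

```python
import json


def load_vocab(path):
    """Read vocab.json, ASSUMED to be a JSON object mapping id -> piece."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {int(token_id): piece for token_id, piece in raw.items()}


def ids_to_text(token_ids, vocab):
    """Join SentencePiece pieces; U+2581 marks the start of a word."""
    pieces = [vocab[i] for i in token_ids]
    return "".join(pieces).replace("\u2581", " ").strip()
```

If the file turns out to be a list or uses a nested layout, only load_vocab needs to change; ids_to_text works on any id → piece dictionary.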

Performance

Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test split.

Accuracy

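| lang | WER % | CER % | Δ WER vs fp32 source (pp) |
| --- | --- | --- | --- |
| en_us | 10.28 | 4.32 | +0.95 |
| de_de | 10.59 | 5.06 | +0.37 |
| fr_fr | 11.20 | 4.79 | +0.07 |
| ar_eg | 13.95 | 3.87 | +0.68 |
| hi_in | 5.68 | 4.47 | +0.42 |
| ja_jp | 17.98 | 11.97 | +1.01 |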
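For readers unfamiliar with the metrics, the following sketch shows the standard definitions of WER and CER as edit distance over reference length. The card does not state which scoring tool or text normalization produced the numbers above, so this illustrates the metric rather than the exact evaluation pipeline.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; the reference must be non-empty."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; the reference must be non-empty."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Corpus-level scores are usually computed by summing edits and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.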

Streaming throughput + memory

| metric | value |
| --- | --- |
| RTF (encode + decode) | 0.044 |
| p50 chunk latency | 13.5 ms |
| p99 chunk latency | 16.8 ms |
| RSS post-load | 262 MB (mmap) |
| RSS peak (mid-stream) | 997 MB |
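The real-time factor (RTF) is processing time divided by audio duration, so it can be related to the chunk latencies with simple arithmetic. The sketch below assumes the RTF was measured over the same 320 ms chunks as the latency percentiles, which the card does not state; the helper names are hypothetical.

```python
import math


def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1 means the model runs faster than real time."""
    return processing_seconds / audio_seconds


def mean_chunk_compute_ms(rtf, chunk_ms):
    """Average compute time per chunk implied by a given RTF."""
    return rtf * chunk_ms


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]
```

With an RTF of 0.044 and 320 ms chunks, mean_chunk_compute_ms gives about 14.1 ms per chunk, which sits between the reported p50 (13.5 ms) and p99 (16.8 ms) latencies.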

Usage

Python / MLX

from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-8bit")
# Load and assemble the model as for the bf16 sibling; the quantization is
# replayed with `mlx.nn.quantize` (group_size=64, bits=8).

Swift (speech-swift)

The speech-swift SDK ships the CoreML INT8 variant (NemotronStreamingASR target). Using this MLX 8-bit bundle from Swift requires mlx-swift wiring. For typical app use, the CoreML INT8 variant is the recommended on-device path, and its accuracy matches MLX 8-bit within 0.3 pp WER.

CLI

brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US

Source

Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.
