How to use aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-8bit with MLX:
# Download the model from the Hub (the quotes keep zsh from expanding the brackets)
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir Nemotron-3.5-ASR-Streaming-0.6B-MLX-8bit aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-8bit
INT8-quantized (`mlx.nn.quantize(group_size=64, bits=8)`) port of NVIDIA's Nemotron-3.5 streaming ASR for Apple Silicon. Essentially lossless relative to the bf16 variant, with a 40 % smaller on-disk size and a 32 % lower memory peak while streaming.
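To make the `group_size=64, bits=8` setting concrete, here is a minimal pure-Python sketch of group-wise affine quantization: each run of 64 weights gets its own scale and offset, and every weight is stored as an 8-bit code. It illustrates the general technique only; it is not MLX's kernel or storage layout, and the helper names `quantize_row` and `dequantize_row` are hypothetical.

```python
def quantize_row(values, group_size=64, bits=8):
    """Quantize a flat list of floats group by group (illustrative helper)."""
    if len(values) % group_size:
        raise ValueError("row length must be a multiple of group_size")
    levels = (1 << bits) - 1  # 255 for 8 bits
    groups = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # constant group: avoid dividing by zero
        # Map each value to the nearest integer code in [0, levels].
        codes = [min(levels, max(0, round((v - lo) / scale))) for v in group]
        groups.append((codes, scale, lo))
    return groups


def dequantize_row(groups):
    """Rebuild approximate floats from (codes, scale, offset) triples."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(code * scale + lo for code in codes)
    return restored
```

With this scheme the reconstruction error of any weight is at most half of its group's scale, which is why a small group size such as 64 keeps the error low even when the weight range varies across a row.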
| Property | Value |
|---|---|
| Parameters | 600 M (8-bit weights, group size 64) |
| Architecture | FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel |
| Languages | 40 |
| Sample rate | 16 kHz mono |
| Streaming chunk | 320 ms |
| On-disk size | 732 MB |
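At 16 kHz, a 320 ms streaming chunk corresponds to 5120 samples. The sketch below splits a mono sample sequence into chunks of that size. The `iter_chunks` helper is hypothetical, and zero-padding the final chunk is an assumption made for illustration; the actual pipeline may handle the tail differently.

```python
SAMPLE_RATE = 16_000  # Hz, from the table above
CHUNK_MS = 320        # streaming chunk, from the table above
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 5120 samples per chunk


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-size chunks of a mono signal.

    The last chunk is zero-padded to full length (an assumption, see above).
    """
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk
```

One second of audio (16,000 samples) therefore yields four chunks, the last of which holds 640 real samples followed by padding.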
| File | Size | Description |
|---|---|---|
| `model.safetensors` | 732 MB | 8-bit group-quantized weights |
| `vocab.json` | 100 KB | id → SentencePiece piece |
| `lang2slot.json` | 2 KB | Language tag → prompt slot |
| `config.json` | <1 KB | Quantization config + arch/streaming geometry |
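As an illustration of how the `vocab.json` mapping might be used, the sketch below turns a list of token ids into text. It assumes the file is a JSON object keyed by the id as a string, which is not confirmed by this card, and relies on the SentencePiece convention that `▁` (U+2581) marks a word boundary. The helper names `load_json` and `pieces_to_text` are hypothetical.

```python
import json
from pathlib import Path


def load_json(bundle_dir, name):
    """Read one of the bundle's JSON files, e.g. vocab.json or lang2slot.json."""
    with open(Path(bundle_dir) / name, encoding="utf-8") as f:
        return json.load(f)


def pieces_to_text(token_ids, vocab):
    """Join SentencePiece pieces into plain text.

    Assumes `vocab` maps str(id) -> piece; adjust the lookup if the file
    turns out to be a list or uses integer-like keys differently.
    """
    pieces = [vocab[str(token_id)] for token_id in token_ids]
    return "".join(pieces).replace("\u2581", " ").strip()
```

`lang2slot.json` can be read with the same `load_json` helper; how the resulting slot is fed to the prompt kernel depends on the model code and is not shown here.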
Measured on an M5 Pro (Apple Silicon GPU) with 50 samples per language from the FLEURS test set.
| lang | WER % | CER % | Δ WER vs fp32 source |
|---|---|---|---|
| en_us | 10.28 | 4.32 | +0.95 |
| de_de | 10.59 | 5.06 | +0.37 |
| fr_fr | 11.20 | 4.79 | +0.07 |
| ar_eg | 13.95 | 3.87 | +0.68 |
| hi_in | 5.68 | 4.47 | +0.42 |
| ja_jp | 17.98 | 11.97 | +1.01 |
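For reference, WER and CER are edit-distance ratios over words and characters respectively. The sketch below shows the standard definition for a single utterance. The text normalization and the aggregation across the 50 samples used for the table are not specified in this card, so results from this code are not guaranteed to match the numbers above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for one reference/hypothesis pair."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent for one reference/hypothesis pair."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

CER is usually the more informative column for languages without whitespace word boundaries, such as the `ja_jp` row.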
| metric | value |
|---|---|
| RTF (encode + decode) | 0.044 |
| p50 chunk latency | 13.5 ms |
| p99 chunk latency | 16.8 ms |
| RSS post-load | 262 MB (mmap) |
| RSS peak (mid-stream) | 997 MB |
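The real-time factor (RTF) is processing time divided by audio duration, so values below 1 mean faster than real time. The sketch below computes RTF and nearest-rank percentiles from a list of per-chunk latencies; the helper names are hypothetical and the percentile method used for the table is not stated in this card.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; < 1 is faster than real time."""
    return processing_seconds / audio_seconds


def percentile(values, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]
```

As a sanity check, a 13.5 ms median latency per 320 ms chunk gives `real_time_factor(0.0135, 0.320)`, roughly 0.042, which is in line with the reported RTF of 0.044.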
from huggingface_hub import snapshot_download
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-MLX-8bit")
# Load and assemble the model as for the bf16 sibling; quantization is replayed via `mlx.nn.quantize`.
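Before assembling the model, it can help to confirm that the downloaded bundle is complete and to peek at the raw tensors. The following is a minimal sketch, not the card's loading pipeline: `check_bundle` and `load_raw_weights` are hypothetical helpers, the file list comes from the table above, and the only MLX call used is `mlx.core.load`, which reads a `.safetensors` file into a dict of arrays.

```python
from pathlib import Path

# File names taken from the file table in this card.
EXPECTED_FILES = ("model.safetensors", "vocab.json", "lang2slot.json", "config.json")


def check_bundle(bundle_dir):
    """Return the expected files that are missing from the bundle directory."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_raw_weights(bundle_dir):
    """Load the tensors as a dict of name -> mlx array (needs `mlx` installed)."""
    import mlx.core as mx  # imported lazily so check_bundle works without MLX

    return mx.load(str(Path(bundle_dir) / "model.safetensors"))
```

With `bundle` from the snippet above, `check_bundle(bundle)` should return an empty list, and `sorted(load_raw_weights(bundle))[:10]` lists the first few tensor names. Building the actual network and applying `mlx.nn.quantize` before loading these weights is left to the model code.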
The speech-swift SDK ships the CoreML INT8 variant (`NemotronStreamingASR` target). Using this MLX 8-bit bundle from Swift requires mlx-swift wiring. For typical app use, CoreML INT8 is the recommended on-device path, and it matches the MLX 8-bit accuracy to within 0.3 pp WER.
brew install soniqo/tap/speech
# CLI defaults to the CoreML INT8 bundle (--engine nemotron); MLX variants
# are loaded via the Python pipeline above.
speech transcribe recording.wav --engine nemotron --language en-US
Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.