Nemotron-3.5 ASR Streaming 0.6B — CoreML INT8
Cache-aware streaming Conformer + RNN-T from NVIDIA, ported to CoreML (.mlmodelc) for on-device inference on Apple Silicon. 600 M parameters, 40 language locales, native punctuation and capitalization. The encoder is INT8 palettized; the decoder and joint network are FP16. Only the compiled .mlmodelc bundles are shipped. The .mlpackage is omitted because on-device MLModel.compileModel() produces non-deterministic output across iOS simulator and device runtimes.
Model
| Property | Value |
|---|---|
| Parameters | 600 M |
| Architecture | FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel |
| Languages | 40 (see frontmatter) |
| Sample rate | 16 kHz mono |
| Streaming chunk | 320 ms (att_context_size = [56, 3]) |
| Encoder quantization | INT8 palettized |
| Decoder / joint dtype | FP16 |
| On-disk size | 612 MB |
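The chunk geometry in the table can be sanity-checked with a few lines of arithmetic. The sketch below assumes an 80 ms encoder frame (10 ms features with 8x subsampling, as in other FastConformer models) and that a chunk spans the right-context frames plus one; neither figure is stated in this card, and `chunk_geometry` is a hypothetical helper, not part of any SDK.

```python
SAMPLE_RATE_HZ = 16_000
ENCODER_FRAME_MS = 80  # assumption: 10 ms feature hop x 8x subsampling; not stated in this card


def chunk_geometry(att_context_size, frame_ms=ENCODER_FRAME_MS, sample_rate=SAMPLE_RATE_HZ):
    """Derive chunk duration and sample count from a [left, right] attention context."""
    left, right = att_context_size
    frames_per_chunk = right + 1  # assumption: current frame plus right-context frames
    chunk_ms = frames_per_chunk * frame_ms
    return {
        "frames_per_chunk": frames_per_chunk,
        "chunk_ms": chunk_ms,
        "samples_per_chunk": sample_rate * chunk_ms // 1000,
        "left_context_ms": left * frame_ms,
    }


geometry = chunk_geometry([56, 3])
print(geometry)
```

Under these assumptions, [56, 3] gives 4 frames per chunk, which matches the 320 ms figure above and corresponds to 5120 samples of 16 kHz audio per call.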
Files
| File | Size | Description |
|---|---|---|
| encoder.mlmodelc/ | 565 MB | 24-layer cache-aware Conformer encoder + prompt kernel |
| decoder.mlmodelc/ | 29 MB | RNN-T prediction net (2-layer LSTM, 640 dim) |
| joint.mlmodelc/ | 18 MB | Joint network (vocab=13088) |
| vocab.json | 230 KB | SentencePiece pieces, id → string |
| languages.json | 2 KB | Language tag → prompt slot (e.g. "en-US": 0) |
| config.json | <1 KB | Streaming geometry + dims for the loader |
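Since vocab.json maps token ids to SentencePiece pieces, turning decoder output into text is a join plus a marker substitution. This is a minimal sketch, assuming the file is either a JSON object keyed by stringified ids or a plain list (the exact layout is not documented here) and that pieces use the standard `▁` word-boundary marker. `load_vocab` and `pieces_to_text` are hypothetical helpers, and the toy vocabulary is invented for illustration.

```python
import json

WORD_BOUNDARY = "\u2581"  # SentencePiece's "▁" marker for a leading space


def load_vocab(path):
    """Load id -> piece; accepts a JSON object keyed by id or a plain list."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    if isinstance(raw, list):
        return dict(enumerate(raw))
    return {int(k): v for k, v in raw.items()}


def pieces_to_text(token_ids, vocab):
    """Concatenate pieces and turn word-boundary markers into spaces."""
    joined = "".join(vocab[i] for i in token_ids)
    return joined.replace(WORD_BOUNDARY, " ").strip()


# Toy vocabulary for illustration only; real ids come from vocab.json.
toy_vocab = {0: "\u2581Hello", 1: ",", 2: "\u2581wor", 3: "ld", 4: "."}
text = pieces_to_text([0, 1, 2, 3, 4], toy_vocab)
print(text)
```

Special tokens (blank, language prompts, and so on) would need filtering before the join; which ids those are is not specified in this card.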
Performance
Measured on an M5 Pro (Apple Silicon) with 50 samples per language from the FLEURS test split, streaming in 320 ms chunks, compute units .all. Scoring uses the Whisper EnglishTextNormalizer for English, BasicTextNormalizer(split_letters=True) for hi/ja (char-level), and BasicTextNormalizer for de/fr/ar.
Accuracy
| lang | WER % | CER % | Δ WER vs fp32 source |
|---|---|---|---|
| en_us | 9.59 | 4.26 | +0.26 |
| de_de | 10.41 | 5.37 | +0.19 |
| fr_fr | 12.18 | 4.84 | +1.05 |
| ar_eg | 13.37 | 3.80 | +0.10 |
| hi_in | 4.42 | 3.61 | −0.84 |
| ja_jp | 17.66 | 12.09 | +0.69 |
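Quantization is close to lossless: WER shifts by between −0.84 and +1.05 points relative to the fp32 source, with fr_fr showing the largest regression. For ja/hi the published WER is char-level (matches NVIDIA's CJK methodology); CER is the more interpretable number for those scripts.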
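For readers who want to reproduce this kind of score, both metrics reduce to an edit distance divided by the reference length. The sketch below is not the scoring script used for the table: it skips the Whisper normalizers entirely, and treating char-level scoring as "drop spaces, compare characters" is an assumption. `edit_distance` and `error_rate` are illustrative names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_text, hyp_text, char_level=False):
    """Return WER in percent, or a character-level rate when char_level=True."""
    if char_level:
        ref = list(ref_text.replace(" ", ""))
        hyp = list(hyp_text.replace(" ", ""))
    else:
        ref, hyp = ref_text.split(), hyp_text.split()
    if not ref:
        raise ValueError("reference is empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


wer = error_rate("the cat sat on the mat", "the cat sat on a mat")
cer = error_rate("the cat sat", "the bat sat", char_level=True)
print(round(wer, 2), round(cer, 2))
```

Corpus-level numbers are normally computed by summing edits and reference lengths over all utterances rather than averaging per-utterance rates.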
Streaming throughput + memory
M5 Pro, 60 s long-form en_us audio, single thread, .all compute units:
| metric | value |
|---|---|
| RTF (encode + decode) | 0.068 |
| p50 chunk latency | 18.6 ms |
| p99 chunk latency | 23.4 ms |
| RSS post-load | 1046 MB |
| RSS peak (mid-stream) | 1238 MB |
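To put the figures in context: an RTF of 0.068 means the 60 s clip takes roughly 4.1 s of compute, and a p99 chunk latency of 23.4 ms leaves most of each 320 ms chunk window idle. The card does not say how these numbers were collected; the sketch below shows one common way to derive them from per-chunk wall-clock timings, using the usual definition of RTF as processing time divided by audio duration. `summarize_latency` is a hypothetical helper and the sample timings are made up.

```python
import statistics


def summarize_latency(chunk_latencies_ms, chunk_ms=320.0):
    """Summarize per-chunk processing times for a stream of fixed-size chunks."""
    audio_ms = chunk_ms * len(chunk_latencies_ms)
    # quantiles(n=100) returns the 99 cut points p1..p99 (needs >= 2 samples).
    cuts = statistics.quantiles(chunk_latencies_ms, n=100, method="inclusive")
    return {
        "rtf": sum(chunk_latencies_ms) / audio_ms,
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
    }


# Made-up timings for illustration; real values would come from time.perf_counter()
# wrapped around each encoder + decoder call.
summary = summarize_latency([18.0, 19.0, 20.0, 21.0])
print(summary)
```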
Usage
Python (coremltools)
```python
import json

import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download

# Download the compiled bundle (or reuse the locally cached copy).
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-CoreML-INT8")

encoder = ct.models.CompiledMLModel(f"{bundle}/encoder.mlmodelc")
decoder = ct.models.CompiledMLModel(f"{bundle}/decoder.mlmodelc")
joint = ct.models.CompiledMLModel(f"{bundle}/joint.mlmodelc")

# Look up the prompt slot for the language tag and build a one-hot mask.
with open(f"{bundle}/languages.json", encoding="utf-8") as f:
    slots = json.load(f)["promptDictionary"]
lang_mask = np.zeros((1, 128), dtype=np.float32)
lang_mask[0, slots["en-US"]] = 1.0

# Feed 320 ms chunks; persist caches across calls (see inference reference below).
```
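Once the encoder has produced frames for a chunk, the decoder and joint network are typically driven by a greedy RNN-T search. The sketch below shows that standard technique in a framework-agnostic form; it is not the reference implementation. The input and output tensor names of the CoreML models are not documented here, so `predict_fn` and `joint_fn` are hypothetical callables standing in for wrappers around `decoder.predict(...)` and `joint.predict(...)`, and encoder cache handling is left out.

```python
import numpy as np


def greedy_rnnt_decode(enc_frames, predict_fn, joint_fn, blank_id,
                       state=None, last_token=None, max_symbols_per_frame=10):
    """Greedy RNN-T search over one chunk of encoder frames.

    Returns (tokens, state, last_token) so the caller can carry the
    prediction-network state into the next chunk.
    """
    if last_token is None:
        last_token = blank_id  # assumption: blank doubles as start-of-sequence
    tokens = []
    pred_out, next_state = predict_fn(last_token, state)
    for frame in enc_frames:
        for _ in range(max_symbols_per_frame):
            token = int(np.argmax(joint_fn(frame, pred_out)))
            if token == blank_id:
                break  # blank: move on to the next encoder frame
            tokens.append(token)
            # Commit the emitted token, then advance the prediction network.
            last_token, state = token, next_state
            pred_out, next_state = predict_fn(last_token, state)
    return tokens, state, last_token


# Toy stand-ins so the loop can run without the real models.
BLANK = 3


def toy_predict(token, state):
    return token, (state or 0) + 1  # "embedding" is the token id; state counts steps


def toy_joint(frame, pred_out):
    logits = np.zeros(4, dtype=np.float32)
    logits[BLANK if pred_out == frame else frame] = 1.0
    return logits


first, state, last = greedy_rnnt_decode([0, 0], toy_predict, toy_joint, BLANK)
second, state, last = greedy_rnnt_decode([1, 2], toy_predict, toy_joint, BLANK, state, last)
print(first + second)
```

Passing `state` and `last_token` back in on the next call is what keeps hypotheses continuous across chunk boundaries; the encoder's own caches need the same treatment.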
Swift (speech-swift SDK)
```swift
import NemotronStreamingASR

// Default model id is this repo; pass `modelId:` to pin
let model = try await NemotronStreamingASRModel.fromPretrained()

// Batch
let text = try model.transcribeAudio(audio, sampleRate: 16000, language: "en-US")

// Streaming (yields partials as audio is fed)
for await partial in model.transcribeStream(audio: audio, sampleRate: 16000, language: "ja-JP") {
    print(partial.text, partial.isFinal)
}
```
Language is a BCP-47 tag (en-US, de-DE, fr-FR, ja-JP, hi-IN, ...) — resolved via languages.json. The Swift wrapper persists the streaming KV/conv caches across pushAudio calls within a StreamingSession.
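The accuracy table uses FLEURS-style codes (en_us) while the APIs expect BCP-47 tags (en-US), so a small normalization step before the languages.json lookup can save a confusing KeyError. This is a sketch only: `normalize_tag` and `resolve_slot` are hypothetical helpers, they handle plain language-region tags and nothing with script subtags, and the second entry in the toy mapping is invented.

```python
def normalize_tag(tag):
    """Turn 'en_us' or 'EN-us' into 'en-US' (language-region tags only)."""
    lang, _, region = tag.replace("_", "-").partition("-")
    return f"{lang.lower()}-{region.upper()}" if region else lang.lower()


def resolve_slot(tag, slots):
    """Return the prompt slot for a language tag, with a readable error if unknown."""
    normalized = normalize_tag(tag)
    if normalized not in slots:
        raise KeyError(f"unsupported language {tag!r}; known tags: {sorted(slots)}")
    return slots[normalized]


# "en-US": 0 is the example given in this card; the second entry is made up.
toy_slots = {"en-US": 0, "ja-JP": 1}
slot = resolve_slot("en_us", toy_slots)
print(slot)
```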
CLI (Homebrew)
```shell
brew install soniqo/tap/speech

# Single-shot transcription on a wav file (uses this CoreML INT8 bundle by default)
speech transcribe recording.wav --engine nemotron --language en-US

# Other languages
speech transcribe meeting.wav --engine nemotron --language de-DE
speech transcribe interview.wav --engine nemotron --language ja-JP
```
speech is a universal binary; the model is downloaded on first use and cached under ~/Library/Caches/qwen3-speech/.
Source
Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.
Links
- speech-swift: Apple SDK with NemotronStreamingASR
- soniqo.audio
- blog