Nemotron-3.5 ASR Streaming 0.6B — CoreML INT8

Cache-aware streaming Conformer + RNN-T from NVIDIA, ported to CoreML (.mlmodelc) for on-device inference on Apple Silicon. 600 M parameters, 40 language locales, native punctuation and capitalization. The encoder is INT8 palettized; the decoder and joint network stay in FP16. Only compiled .mlmodelc bundles are shipped; .mlpackage is omitted because on-device MLModel.compileModel() produces non-deterministic output across iOS simulator and device runtimes.

Model


Parameters	600 M
Architecture	FastConformer-CacheAware-RNN-T with language-conditioning prompt kernel
Languages	40 (see frontmatter)
Sample rate	16 kHz mono
Streaming chunk	320 ms (`att_context_size = [56, 3]`)
Encoder quantization	INT8 palettized
Decoder / joint dtype	FP16
On-disk size	612 MB
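
The 320 ms chunk size can be related to `att_context_size` with a little arithmetic. The sketch below assumes an 80 ms encoder frame (a 10 ms feature hop with 8x subsampling, which is typical for FastConformer but is not stated in this card); `ENCODER_FRAME_MS` and `chunk_geometry` are illustrative names, not part of the bundle.

```python
SAMPLE_RATE_HZ = 16_000      # from the model table
ATT_CONTEXT_SIZE = (56, 3)   # (left, right) context in encoder frames, from the model table
ENCODER_FRAME_MS = 80        # ASSUMPTION: 10 ms hop x 8x subsampling, not stated in this card


def chunk_geometry(att_context_size, frame_ms=ENCODER_FRAME_MS,
                   sample_rate_hz=SAMPLE_RATE_HZ):
    """Derive streaming chunk sizes from the attention context."""
    left, right = att_context_size
    chunk_frames = right + 1              # current frame plus right-context frames
    chunk_ms = chunk_frames * frame_ms
    return {
        "chunk_frames": chunk_frames,
        "chunk_ms": chunk_ms,
        "chunk_samples": sample_rate_hz * chunk_ms // 1000,
        "left_context_ms": left * frame_ms,
    }


geometry = chunk_geometry(ATT_CONTEXT_SIZE)
print(geometry)
```

Under that assumption the right context of 3 frames gives 4 frames per chunk, which matches the 320 ms listed above and corresponds to 5120 samples at 16 kHz.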

Files

File	Size	Description
`encoder.mlmodelc/`	565 MB	24-layer cache-aware Conformer encoder + prompt kernel
`decoder.mlmodelc/`	29 MB	RNN-T prediction net (2-layer LSTM, 640 dim)
`joint.mlmodelc/`	18 MB	Joint network (vocab=13088)
`vocab.json`	230 KB	SentencePiece pieces, id → string
`languages.json`	2 KB	Language tag → prompt slot (e.g. `"en-US": 0`)
`config.json`	<1 KB	Streaming geometry + dims for the loader

Performance

M5 Pro (Apple Silicon), 50 samples per language from the FLEURS test split, streaming in 320 ms chunks, compute units .all. Scoring uses the Whisper EnglishTextNormalizer for English, BasicTextNormalizer for de/fr/ar, and BasicTextNormalizer(split_letters=True) for hi/ja, which makes the WER for those two languages character-level.

Accuracy

lang	WER %	CER %	Δ WER vs fp32 source
en_us	9.59	4.26	+0.26
de_de	10.41	5.37	+0.19
fr_fr	12.18	4.84	+1.05
ar_eg	13.37	3.80	+0.10
hi_in	4.42	3.61	−0.84
ja_jp	17.66	12.09	+0.69

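On this test set the INT8 encoder stays within about one WER point of the fp32 source: the per-language delta ranges from −0.84 (hi_in) to +1.05 (fr_fr). For ja and hi the reported WER is computed at character level, matching NVIDIA's CJK methodology, so CER is the more interpretable number for those scripts.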
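
For readers who want to reproduce the metric definitions, here is a minimal sketch of WER and CER as edit distance over words and characters. The `normalize` function is a deliberately simple stand-in (lowercasing and whitespace collapsing), not the Whisper normalizers used for the table, so its numbers will not match the table exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    # Stand-in normalizer (ASSUMPTION): the card uses Whisper's normalizers instead.
    return " ".join(text.lower().split())


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    ref = list(normalize(reference).replace(" ", ""))
    hyp = list(normalize(hypothesis).replace(" ", ""))
    return edit_distance(ref, hyp) / max(len(ref), 1)


print(wer("The cat sat", "the cat sit"))
print(cer("こんにちは", "こんにちわ"))
```

Scripts without whitespace word boundaries, such as Japanese, have no natural word tokens, which is why a character-level score is the more meaningful one there.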

Streaming throughput + memory

M5 Pro, 60 s long-form en_us audio, single thread, .all compute units:

metric	value
RTF (encode + decode)	0.068
p50 chunk latency	18.6 ms
p99 chunk latency	23.4 ms
RSS post-load	1046 MB
RSS peak (mid-stream)	1238 MB
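
To put these figures in context, the sketch below applies the usual definition of real-time factor (processing time divided by audio duration) and compares per-chunk latency with the 320 ms chunk duration. The helper names are illustrative only.

```python
CHUNK_MS = 320.0  # streaming chunk duration from the model table


def processing_seconds(rtf, audio_seconds):
    """RTF = processing time / audio duration, so processing = RTF * audio."""
    return rtf * audio_seconds


def chunk_headroom(latency_ms, chunk_ms=CHUNK_MS):
    """Ratio of chunk duration to the time spent processing that chunk."""
    return chunk_ms / latency_ms


compute_s = processing_seconds(0.068, 60.0)  # RTF and clip length from the table
p50_headroom = chunk_headroom(18.6)
p99_headroom = chunk_headroom(23.4)

print(f"{compute_s:.2f} s of compute for 60 s of audio")
print(f"p50 headroom {p50_headroom:.1f}x, p99 headroom {p99_headroom:.1f}x")
```

With the reported numbers this works out to roughly 4.08 s of compute for the 60 s clip, and even the p99 chunk finishes more than 13 times faster than the audio it covers arrives.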

Usage

Python (coremltools)

import json

import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download

# Download the bundle (cached locally after the first call).
bundle = snapshot_download("aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-CoreML-INT8")

# Load the three compiled models.
encoder = ct.models.CompiledMLModel(f"{bundle}/encoder.mlmodelc")
decoder = ct.models.CompiledMLModel(f"{bundle}/decoder.mlmodelc")
joint   = ct.models.CompiledMLModel(f"{bundle}/joint.mlmodelc")

# Map the BCP-47 tag to its prompt slot and build the one-hot language mask.
with open(f"{bundle}/languages.json", encoding="utf-8") as f:
    slots = json.load(f)["promptDictionary"]
lang_mask = np.zeros((1, 128), dtype=np.float32)
lang_mask[0, slots["en-US"]] = 1.0

# Feed 320 ms chunks to the encoder and persist its caches across calls.
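
The chunk-and-carry pattern described in that last comment can be sketched without any model at all. In the example below `step` is a hypothetical callable standing in for one encoder call; the real input and output names, the cache tensors, whether the encoder takes raw samples or features, and whether the final chunk should be zero-padded are not documented in this card and are assumptions here. The 5120-sample default is 320 ms at 16 kHz.

```python
def chunk_bounds(num_samples, chunk_samples=5120):
    """Yield (start, end, pad) for consecutive fixed-size chunks."""
    for start in range(0, num_samples, chunk_samples):
        end = min(start + chunk_samples, num_samples)
        yield start, end, chunk_samples - (end - start)


def stream(audio, step, initial_cache, chunk_samples=5120):
    """Run `step` chunk by chunk, threading the cache from call to call."""
    cache = initial_cache
    for start, end, pad in chunk_bounds(len(audio), chunk_samples):
        # ASSUMPTION: the last, shorter chunk is zero-padded to full length.
        chunk = list(audio[start:end]) + [0.0] * pad
        output, cache = step(chunk, cache)  # the new cache replaces the old one
        yield output


def toy_step(chunk, cache):
    """Hypothetical stand-in for an encoder call: counts chunks seen so far."""
    seen = cache + 1
    return (seen, len(chunk)), seen


results = list(stream([0.0] * 12000, toy_step, initial_cache=0))
print(results)
```

The point of the design is that the cache returned by one call is the cache passed to the next; resetting it between chunks would discard the left context that the cache-aware encoder relies on.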

Swift (speech-swift SDK)

import NemotronStreamingASR

// Default model id is this repo; pass `modelId:` to pin
let model = try await NemotronStreamingASRModel.fromPretrained()

// Batch
let text = try model.transcribeAudio(audio, sampleRate: 16000, language: "en-US")

// Streaming (yields partials as audio is fed)
for await partial in model.transcribeStream(audio: audio, sampleRate: 16000, language: "ja-JP") {
    print(partial.text, partial.isFinal)
}

Language is a BCP-47 tag (en-US, de-DE, fr-FR, ja-JP, hi-IN, ...), resolved to a prompt slot via languages.json. The Swift wrapper persists the streaming KV/conv caches across pushAudio calls within a StreamingSession.

CLI (Homebrew)

brew install soniqo/tap/speech

# Single-shot transcription on a wav file (uses this CoreML INT8 bundle by default)
speech transcribe recording.wav --engine nemotron --language en-US

# Other languages
speech transcribe meeting.wav --engine nemotron --language de-DE
speech transcribe interview.wav --engine nemotron --language ja-JP

speech is a universal binary; the model is downloaded on first use and cached under ~/Library/Caches/qwen3-speech/.

Source

Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b.
