Nemotron-3.5-ASR-Streaming-Multilingual 0.6B (OpenVINO IR)

OpenVINO IR export of NVIDIA's nvidia/nemotron-3.5-asr-streaming-0.6b, a cache-aware streaming FastConformer-RNNT with prompt-conditioned multilingual decoding (40+ languages). This is the OpenVINO sibling of FluidInference/Nemotron-3.5-ASR-Streaming-Multilingual-0.6b-CoreML, intended for the eddy-audio runtime (OpenVINO, CPU / Intel NPU / iGPU).

Contents

fp32/   FP32 IR        (~2.5 GB)  graph the WER/CER table below was measured on
fp16/   FP16 IR        (~1.3 GB)  identical transcripts, half the size, NPU-friendly
int8/   INT8-weight IR (~749 MB)  weight-only encoder (per-channel sym); WER matches FP32

Each folder holds 4 IR pairs plus the tokenizer + metadata:

file                             role
nemotron_preprocessor.xml/.bin   128-bin log-mel front end (16 kHz)
nemotron_encoder.xml/.bin        24-layer cache-aware streaming FastConformer (d_model=1024) + language prompt_kernel
nemotron_decoder.xml/.bin        RNNT prediction network (2× LSTM @ 640)
nemotron_joint.xml/.bin          joint network → vocab 13088 (blank=13087)
nemotron_vocab.json              SentencePiece pieces (▁ word-boundary marker)
metadata.json                    shapes, cache dims, prompt_dictionary (lang → prompt_id), lang-tag token ids
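
As a rough illustration of how the blank id, the ▁ marker, and the lang-tag token ids fit together when turning token ids into text, here is a minimal sketch. The function name ids_to_text, the id-to-piece lookup, and the toy vocabulary (including the "<en-US>" tag piece) are hypothetical; the actual layout of nemotron_vocab.json and metadata.json may differ.

```python
BLANK_ID = 13087          # blank index from the joint-network row above
WORD_BOUNDARY = "\u2581"  # the SentencePiece "▁" word-boundary marker


def ids_to_text(token_ids, pieces, lang_tag_ids):
    """Turn RNNT output ids into plain text.

    pieces:       hypothetical id -> piece lookup built from nemotron_vocab.json
    lang_tag_ids: the lang-tag token ids listed in metadata.json
    """
    # Drop blanks (in case the decode loop recorded them) and language tags.
    skip = set(lang_tag_ids) | {BLANK_ID}
    kept = [pieces[i] for i in token_ids if i not in skip]
    # Concatenate pieces, then turn word-boundary markers into spaces.
    return "".join(kept).replace(WORD_BOUNDARY, " ").strip()


# Toy vocabulary for illustration only; the real ids and pieces differ.
toy_pieces = {0: "\u2581hel", 1: "lo", 2: "\u2581world", 3: "<en-US>"}
print(ids_to_text([3, 0, 13087, 1, 2, 13087], toy_pieces, lang_tag_ids=[3]))
# prints: hello world
```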

Streaming / cache-aware I/O

  • Chunk: chunk_mel_frames=112 (+pre_encode_cache=9 → total_mel_frames=121), att_context_size=[56,0].
  • Encoder cache carried across chunks: cache_channel [1,24,56,1024], cache_time [1,24,1024,8], cache_len [1].
  • Language is selected per chunk via an integer prompt_id (see prompt_dictionary in metadata.json; auto = 101).
  • The 39 language-tag token ids in metadata.json["lang_tag_token_ids"] are stripped from the decoded text.
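
The chunk arithmetic above can be sketched in plain Python. This is only an illustration under assumptions: it supposes that the 9 pre_encode_cache frames are the mel frames immediately preceding each chunk, and that missing frames at the start and end of an utterance are zero-padded. The helper names plan_chunks and zero_cache_sizes are hypothetical and not part of any runtime.

```python
from math import prod

CHUNK_MEL_FRAMES = 112   # new mel frames consumed per encoder call
PRE_ENCODE_CACHE = 9     # mel frames of left context prepended to each chunk
TOTAL_MEL_FRAMES = CHUNK_MEL_FRAMES + PRE_ENCODE_CACHE  # 121

# Encoder cache shapes as listed above.
CACHE_SHAPES = {
    "cache_channel": (1, 24, 56, 1024),
    "cache_time": (1, 24, 1024, 8),
    "cache_len": (1,),
}


def plan_chunks(num_mel_frames):
    """Return one (left_pad, start, end, right_pad) tuple per encoder call.

    Frames [start, end) come from the utterance; left_pad and right_pad are
    the zero frames needed to reach TOTAL_MEL_FRAMES (assumed padding policy).
    """
    plans = []
    for new_start in range(0, num_mel_frames, CHUNK_MEL_FRAMES):
        start = max(0, new_start - PRE_ENCODE_CACHE)
        end = min(num_mel_frames, new_start + CHUNK_MEL_FRAMES)
        left_pad = PRE_ENCODE_CACHE - (new_start - start)
        right_pad = TOTAL_MEL_FRAMES - left_pad - (end - start)
        plans.append((left_pad, start, end, right_pad))
    return plans


def zero_cache_sizes():
    """Element count of each cache tensor when initialised to zeros."""
    return {name: prod(shape) for name, shape in CACHE_SHAPES.items()}


for plan in plan_chunks(300):
    print(plan)
print(zero_cache_sizes())
```

For a 300-frame utterance this yields three encoder calls: the first is left-padded by 9 frames, the last is right-padded by 36. Every call sees exactly 121 mel frames, and the caches returned by one call would be fed into the next.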

Accuracy (FLEURS, full test splits, FP32, greedy, forced language)

Measured on the OpenVINO IR (CPU). Reference column is the FluidAudio CoreML build of the same base model.

Lang          Metric   OpenVINO   FluidAudio ref   Δ       n
en_us         WER      11.78      12.09            −0.31   647
es_419        WER      6.99       9.01             −2.02   908
fr_fr         WER      12.92      15.18            −2.26   676
cmn_hans_cn   CER      21.05      24.54            −3.49   945
ja_jp         CER      15.12      16.86            −1.74   650
weighted      mixed    13.70      15.79            −2.09   3826

The export matches or beats the reference on every language, so there is no conversion regression. FP16 produces transcripts identical to FP32.
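
The "weighted" row is consistent with a mean of the per-language scores weighted by n. This is inferred from the numbers rather than stated, and it mixes WER and CER, so treat it as a summary figure only. A short check using the OpenVINO column:

```python
# (language, score, n) rows copied from the OpenVINO column of the table above.
rows = [
    ("en_us", 11.78, 647),
    ("es_419", 6.99, 908),
    ("fr_fr", 12.92, 676),
    ("cmn_hans_cn", 21.05, 945),
    ("ja_jp", 15.12, 650),
]


def weighted_mean(rows):
    """Mean of the scores, weighted by the per-language sample count n."""
    total_n = sum(n for _, _, n in rows)
    return sum(score * n for _, score, n in rows) / total_n


print(sum(n for _, _, n in rows))      # 3826
print(round(weighted_mean(rows), 2))   # 13.7
```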

Throughput: RTFx ≈ 3.66× audio-weighted on a single CPU (Intel Xeon E5-2699 v4, 4 vCPU); higher on Intel NPU / iGPU.

INT8 weight-only (int8/)

Weight-only INT8 compression of the encoder (per-channel symmetric, data-free; mirrors the CoreML linear_quantize_weights build). Only the weight constants become int8 with fp16 scales; activations stay fp16. The 24 Conformer relative-position projections (self_attn.linear_pos) are kept FP16: int8-compressing them trips an OpenVINO CPU compile bug, and they carry little weight mass. Decoder / joint / preprocessor stay FP16.
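
To make "per-channel symmetric, data-free" concrete, here is a generic sketch of that scheme in plain Python. It is not the code used to produce this export, and it makes two assumptions: one scale per output channel (row), and an integer range of [-127, 127]. Scales are kept as Python floats here rather than fp16.

```python
def quantize_per_channel_sym(weight):
    """weight: list of rows, one row per output channel.

    Returns (q_rows, scales): int8-range integers plus one scale per row.
    No calibration data is needed; only the weights themselves are used.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # Symmetric: zero maps to zero, so a single scale per row suffices.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # guard all-zero rows
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Reconstruct approximate float weights, as done before each matmul."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.50, -1.27, 0.03], [0.002, -0.001, 0.004]]
q, s = quantize_per_channel_sym(weight)
restored = dequantize(q, s)
print(q)
print(restored)
```

Because each row gets its own scale, the small-magnitude second row keeps roughly the same relative precision as the first. The reconstruction error per weight is bounded by half of that row's scale.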

Measured on the same FLEURS full test splits, forced language, greedy β€” directly comparable to the FP32 table above:

Lang          Metric   FP32    INT8-wo   Δ       n
en_us         WER      11.78   11.93     +0.15   647
es_419        WER      6.99    7.05      +0.06   908
fr_fr         WER      12.92   12.99     +0.07   676
cmn_hans_cn   CER      21.05   20.94     −0.11   945
ja_jp         CER      15.12   15.15     +0.03   650
weighted      mixed    13.70   13.73     +0.03   3826

Essentially lossless: the largest swing is +0.15 WER (within run-to-run noise) and Chinese CER improves slightly.

What INT8 buys you, and what it doesn't:

                           FP32      FP16      INT8-wo
encoder .bin               2.5 GB    1.25 GB   732 MB
folder size                ~2.5 GB   ~1.3 GB   ~749 MB
peak RSS (CPU)             5.8 GB    3.9 GB    2.1 GB
RTFx (audio-wtd, 4 vCPU)   3.66×     ≈3.66×    2.96×

INT8 is a memory / disk win (~½ the RAM of FP16), not a speed win: on x86 CPU the int8 weights decompress to float per-op, so it runs slower than FP16. The speed payoff is on Intel NPU, where native int8 execution applies. Choose int8/ for memory-constrained or NPU deployments; choose fp16/ for fastest CPU inference.
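
The trade-off can be quantified directly from the comparison table. The snippet below only redoes that arithmetic on the published figures; it measures nothing itself.

```python
# Values copied from the comparison table above (4 vCPU, CPU inference).
peak_rss_gb = {"fp32": 5.8, "fp16": 3.9, "int8": 2.1}
rtfx = {"fp32": 3.66, "fp16": 3.66, "int8": 2.96}

rss_ratio = peak_rss_gb["int8"] / peak_rss_gb["fp16"]
slowdown = 1 - rtfx["int8"] / rtfx["fp16"]

print(f"INT8 peak RSS is {rss_ratio:.0%} of FP16")           # 54%
print(f"INT8 throughput is {slowdown:.0%} lower than FP16")  # 19%
```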

License

Inherits openmdw-1.1 from the base model. See the base model card for terms.
