Shenava — Rizeh v1.0 (32M) · cache-aware streaming · native-Rust (tract)
Cache-aware streaming CTC export of Shenava-Rizeh-v1.0
that runs in the pure-Rust tract engine — no C++, no ONNX Runtime.
Part of VisualEars / Shenava: offline, on-device, streaming Persian ASR for the Deaf/Hard-of-Hearing.
Quality: near-exact (12.11% golden-6669 WER). RTF ≈ 0.027 (30.0 ms/chunk on x86 CPU; chunk = 1.12 s audio).
⚠️ Requires patched tract (until upstreamed)
Stock tract rejects NeMo cache-aware streaming graphs in two inference-layer spots. Fix = a 23-line, 2-file patch
(shenava_tract_streaming.patch, included) — PR open at sonos/tract#2441.
Build tract with the patch, then load model.onnx normally. The graph itself is valid (identical decode to ONNX Runtime).
Streaming contract
Per-step inputs / outputs (fixed shapes, greedy CTC):
- audio_signal [1,80,121]: un-normalized log-mel chunk (NeMo featurizer, normalize=NA)
- length [1] i64: true valid frames in the chunk
- cache_last_channel [1,16,70,256], cache_last_time [1,16,256,8], cache_last_channel_len [1] i64: start as zeros / 0
- → logprobs [1,T',1025] + next caches
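The per-step state in the contract above can be held in a small struct. This is a minimal sketch, not the project's code: `StreamState`, `numel`, and `update` are hypothetical names, and the caches are assumed to be f32 (the contract only states the dtype of the i64 tensors).

```rust
/// Fixed tensor shapes taken from the streaming contract.
const AUDIO_SHAPE: [usize; 3] = [1, 80, 121];
const CACHE_CHANNEL_SHAPE: [usize; 4] = [1, 16, 70, 256];
const CACHE_TIME_SHAPE: [usize; 4] = [1, 16, 256, 8];

/// Number of elements in a tensor of the given shape.
fn numel(shape: &[usize]) -> usize {
    shape.iter().product()
}

/// Hypothetical container for the caches threaded between steps.
/// Assumption: cache tensors are f32, stored flat in row-major order.
struct StreamState {
    cache_last_channel: Vec<f32>,
    cache_last_time: Vec<f32>,
    cache_last_channel_len: i64,
}

impl StreamState {
    /// Initial state: all-zero caches and a cache length of 0.
    fn new() -> Self {
        Self {
            cache_last_channel: vec![0.0; numel(&CACHE_CHANNEL_SHAPE)],
            cache_last_time: vec![0.0; numel(&CACHE_TIME_SHAPE)],
            cache_last_channel_len: 0,
        }
    }

    /// Replace the caches with the *_next outputs of the step just run.
    /// The length is taken as i64, matching the cast the contract asks for.
    fn update(&mut self, channel_next: Vec<f32>, time_next: Vec<f32>, len_next: i64) {
        assert_eq!(channel_next.len(), numel(&CACHE_CHANNEL_SHAPE));
        assert_eq!(time_next.len(), numel(&CACHE_TIME_SHAPE));
        self.cache_last_channel = channel_next;
        self.cache_last_time = time_next;
        self.cache_last_channel_len = len_next;
    }
}
```

A caller would create one `StreamState::new()` per utterance, pass its three fields as inputs on each step, and call `update` with the step's outputs before feeding the next chunk.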
Chunking: feed 121-mel-frame chunks with a shift of 112 (9-frame pre-encode overlap).
- The first chunk is 105 frames: pad it to 121. Pad the tail chunk too, and always pass the true length.
- Thread the *_next caches back in at each step (cast cache_last_channel_len_next to i64).
- Greedy CTC: carry the previous token across chunk boundaries when collapsing repeats; blank id = 1024; map ids via tokens.txt; ▁ → space.
Numbers are spoken-form → ITN
The model spells numbers out (هشت, not ۸). Apply persian_itn.py at display time to convert spoken-form numbers to Persian digits (cardinals, هزار/میلیون/میلیارد, «و», and compounds).
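To illustrate the idea behind this step, here is a minimal Rust sketch of spoken-cardinal to Persian-digit conversion. It is not a port of persian_itn.py: the word table is a small illustrative subset, the function names are hypothetical, and the input is assumed to be a phrase already known to be a number (the sketch does not decide whether نه means "nine" or "no").

```rust
/// Value of a Persian number word below 1000 (illustrative subset only).
fn unit_value(word: &str) -> Option<u64> {
    Some(match word {
        "یک" => 1, "دو" => 2, "سه" => 3, "چهار" => 4, "پنج" => 5,
        "شش" => 6, "هفت" => 7, "هشت" => 8, "نه" => 9, "ده" => 10,
        "بیست" => 20, "سی" => 30, "چهل" => 40, "پنجاه" => 50,
        "صد" => 100, "دویست" => 200, "سیصد" => 300,
        _ => return None,
    })
}

/// Value of a scale word (thousand, million, billion).
fn scale_value(word: &str) -> Option<u64> {
    match word {
        "هزار" => Some(1_000),
        "میلیون" => Some(1_000_000),
        "میلیارد" => Some(1_000_000_000),
        _ => None,
    }
}

/// Parses a whitespace-separated cardinal phrase; None on any unknown word.
fn parse_cardinal(phrase: &str) -> Option<u64> {
    let (mut total, mut group, mut seen) = (0u64, 0u64, false);
    for word in phrase.split_whitespace() {
        if word == "و" {
            continue; // the conjunction only joins the parts
        }
        if let Some(v) = unit_value(word) {
            group += v;
        } else if let Some(s) = scale_value(word) {
            // A bare scale word counts as one of that scale.
            total += group.max(1) * s;
            group = 0;
        } else {
            return None;
        }
        seen = true;
    }
    if seen { Some(total + group) } else { None }
}

/// Renders a number with Persian digits (U+06F0 to U+06F9).
fn to_persian_digits(n: u64) -> String {
    n.to_string()
        .bytes()
        .map(|b| char::from_u32(0x06F0 + u32::from(b - b'0')).unwrap())
        .collect()
}
```

Running this only at display time keeps the recognizer output in spoken form, so a wrong conversion can be corrected without re-decoding.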
Shenava-1 family (all native-Rust streaming)
- Koochik 114M — flagship
- Rizeh 32M — mid
- Rizeh-Pizeh 6.9M — tiniest
Model tree for Reza2kn/Shenava-Rizeh-v1.0-tract-streaming
Base model: nvidia/stt_fa_fastconformer_hybrid_large