Shenava — Rizeh v1.0 (32M) · cache-aware streaming · native-Rust (tract)

Cache-aware streaming CTC export of Shenava-Rizeh-v1.0 that runs in the pure-Rust tract engine — no C++, no ONNX Runtime. Part of VisualEars / Shenava: offline, on-device, streaming Persian ASR for the Deaf/Hard-of-Hearing.

Quality: near-exact (12.11% WER on golden-6669). RTF ≈ 0.027 (30.0 ms per chunk on an x86 CPU; one chunk = 1.12 s of audio).
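
As a sanity check on the RTF figure, the arithmetic can be sketched as below. The 10 ms frame hop is an assumption (it is what makes 112 new frames equal 1.12 s); the other numbers are the ones quoted in this card.

```python
# Figures quoted in this card; FRAME_HOP_S is an assumption (10 ms mel hop).
FRAME_HOP_S = 0.010
NEW_FRAMES_PER_CHUNK = 112      # shift between consecutive chunks
COMPUTE_S_PER_CHUNK = 0.030     # 30.0 ms measured per chunk

def real_time_factor(compute_s, audio_s):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return compute_s / audio_s

chunk_audio_s = NEW_FRAMES_PER_CHUNK * FRAME_HOP_S
rtf = real_time_factor(COMPUTE_S_PER_CHUNK, chunk_audio_s)
print(f"chunk = {chunk_audio_s:.2f} s, RTF = {rtf:.3f}")  # chunk = 1.12 s, RTF = 0.027
```

An RTF of 0.027 means each second of audio costs roughly 27 ms of CPU time on the quoted machine.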

⚠️ Requires patched tract (until upstreamed)

Stock tract rejects NeMo cache-aware streaming graphs at two inference-layer spots. The fix is a 23-line, 2-file patch (shenava_tract_streaming.patch, included); a PR is open at sonos/tract#2441. Build tract with the patch, then load model.onnx as usual. The graph itself is valid: it decodes identically under ONNX Runtime.

Streaming contract

Per-step inputs / outputs (fixed shapes, greedy CTC):

  • audio_signal [1,80,121] — un-normalized log-mel chunk (NeMo featurizer, normalize=NA)
  • length [1] i64 — true valid frames in the chunk
  • cache_last_channel [1,16,70,256], cache_last_time [1,16,256,8], cache_last_channel_len [1] i64 — start zeros / 0
  • outputs: logprobs [1,T',1025] plus the updated caches (*_next)
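
A minimal sketch of the fixed-shape contract, written in plain Python so it stays engine-agnostic. The flat row-major buffers, the f32 element size, and the helper names (`initial_caches`, `check_step_inputs`) are assumptions for illustration; only the shapes come from the list above.

```python
from math import prod

# Shapes copied from the contract above. Anything not marked i64 is assumed f32.
INPUT_SHAPES = {
    "audio_signal": (1, 80, 121),
    "length": (1,),
    "cache_last_channel": (1, 16, 70, 256),
    "cache_last_time": (1, 16, 256, 8),
    "cache_last_channel_len": (1,),
}
CACHE_NAMES = ("cache_last_channel", "cache_last_time", "cache_last_channel_len")

def initial_caches():
    """Zero-filled flat buffers for the very first step (hypothetical helper)."""
    return {name: [0] * prod(INPUT_SHAPES[name]) for name in CACHE_NAMES}

def check_step_inputs(buffers):
    """Raise ValueError if a flat buffer does not match its fixed shape."""
    for name, shape in INPUT_SHAPES.items():
        if len(buffers[name]) != prod(shape):
            raise ValueError(f"{name}: expected {prod(shape)} elements for {shape}")

caches = initial_caches()
# Memory held by the two tensor caches if stored as f32 (4 bytes per element).
f32_bytes = 4 * (len(caches["cache_last_channel"]) + len(caches["cache_last_time"]))
print(f32_bytes)  # 1277952
```

Checking shapes before every step is cheap and catches the most common streaming bug, which is feeding a short, unpadded chunk into a fixed-shape graph.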

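Chunking: feed 121-mel-frame chunks with a shift of 112 frames (9-frame pre-encode overlap). The first chunk holds only 105 real frames, so pad it to 121; pad the final chunk the same way, and always pass the true frame count in length.

Caches: feed each *_next cache output back in as the matching input on the next step (cast cache_last_channel_len_next to i64).

Greedy CTC: when collapsing repeats, carry the previous token across chunk boundaries. The blank id is 1024. Map ids to pieces with tokens.txt, then replace the ▁ word-boundary marker with a space.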
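
The chunk schedule can be sketched as follows. This is one reading of the rules above, not the reference implementation: it assumes the first chunk starts at frame 0 with no overlap, that every later chunk re-feeds the last 9 frames already seen, and that padding is zeros appended on the right. `plan_chunks` and `cut_chunk` are hypothetical names.

```python
CHUNK_FRAMES = 121   # frames the graph expects per step
SHIFT_FRAMES = 112   # new frames consumed per step after the first
FIRST_FRAMES = 105   # real frames in the first chunk
OVERLAP = CHUNK_FRAMES - SHIFT_FRAMES  # 9-frame pre-encode overlap

def plan_chunks(n_frames):
    """Return (start, valid_len) per step for an utterance of n_frames mel frames."""
    plan = []
    if n_frames <= 0:
        return plan
    valid = min(FIRST_FRAMES, n_frames)
    plan.append((0, valid))
    consumed = valid
    while consumed < n_frames:
        start = consumed - OVERLAP               # re-feed the last 9 frames
        valid = min(CHUNK_FRAMES, n_frames - start)
        plan.append((start, valid))
        consumed = start + valid
    return plan

def cut_chunk(mel, start, valid):
    """Slice [start, start+valid) from each mel row, right-padded with zeros to 121.

    mel is a list of 80 rows (one per mel bin); zero padding is an assumption.
    """
    pad = [0.0] * (CHUNK_FRAMES - valid)
    return [row[start:start + valid] + pad for row in mel]

print(plan_chunks(300))  # [(0, 105), (96, 121), (208, 92)]
```

For each `(start, valid)` pair, `cut_chunk` gives the `audio_signal` payload and `valid` is the value to pass as `length`.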

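Carrying the previous token across chunk boundaries can be sketched like this. It is a minimal illustration of greedy CTC collapse with persistent state; the class and function names are hypothetical, and the "▁" word-boundary marker is assumed to follow the SentencePiece convention.

```python
BLANK_ID = 1024  # blank id stated in this card

class GreedyCtcStream:
    """Greedy CTC collapse that remembers the last frame's id between chunks."""

    def __init__(self, blank_id=BLANK_ID):
        self.blank_id = blank_id
        self.prev_id = blank_id  # nothing emitted yet

    def step(self, logprobs):
        """logprobs: T' rows of per-class scores for one chunk. Returns new ids."""
        out = []
        for row in logprobs:
            best = max(range(len(row)), key=row.__getitem__)  # argmax
            if best != self.blank_id and best != self.prev_id:
                out.append(best)
            self.prev_id = best  # survives past the end of this chunk
        return out

def detokenize(ids, id_to_piece):
    """Join pieces and turn the assumed '▁' (U+2581) marker into spaces."""
    return "".join(id_to_piece[i] for i in ids).replace("\u2581", " ").strip()

# Toy demo with 3 classes, where class 2 plays the role of blank.
def one_hot(i, n=3):
    return [0.0 if j == i else -9.0 for j in range(n)]

dec = GreedyCtcStream(blank_id=2)
first = dec.step([one_hot(0), one_hot(0)])                          # ends on token 0
second = dec.step([one_hot(0), one_hot(2), one_hot(0), one_hot(1)])
print(first, second)  # [0] [0, 1]
```

Without the carried `prev_id`, the token 0 that opens the second chunk would be emitted a second time, which shows up as duplicated word pieces at chunk boundaries.
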
Numbers are spoken-form → ITN

The model spells numbers out (هشت, not ۸). Apply persian_itn.py at display time to convert spoken forms to Persian digits; it covers cardinals, هزار/میلیون/میلیارد, the connective «و», and compounds.
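
To illustrate the idea only, here is a minimal sketch of spoken-cardinal to Persian-digit conversion. It is not persian_itn.py and does not claim to match its behavior: it handles a single, already isolated number phrase, and the function names are hypothetical. A real ITN pass also has to find number spans in running text and disambiguate words such as نه, which means both "nine" and "no".

```python
SMALL = {
    "صفر": 0, "یک": 1, "دو": 2, "سه": 3, "چهار": 4, "پنج": 5, "شش": 6,
    "هفت": 7, "هشت": 8, "نه": 9, "ده": 10, "یازده": 11, "دوازده": 12,
    "سیزده": 13, "چهارده": 14, "پانزده": 15, "شانزده": 16, "هفده": 17,
    "هجده": 18, "نوزده": 19, "بیست": 20, "سی": 30, "چهل": 40, "پنجاه": 50,
    "شصت": 60, "هفتاد": 70, "هشتاد": 80, "نود": 90, "صد": 100, "دویست": 200,
    "سیصد": 300, "چهارصد": 400, "پانصد": 500, "ششصد": 600, "هفتصد": 700,
    "هشتصد": 800, "نهصد": 900,
}
SCALES = {"هزار": 1000, "میلیون": 10**6, "میلیارد": 10**9}

def words_to_int(phrase):
    """Parse one spoken cardinal, e.g. 'هزار و نهصد و نود و هشت' (1998)."""
    total, current = 0, 0
    for word in phrase.split():
        if word == "و":                 # connective "and" carries no value
            continue
        if word in SCALES:              # a bare scale word counts as 1 x scale
            total += max(current, 1) * SCALES[word]
            current = 0
        elif word in SMALL:
            current += SMALL[word]
        else:
            raise ValueError(f"not a number word: {word}")
    return total + current

def to_persian_digits(n):
    """Render a non-negative integer with Persian digits (U+06F0..U+06F9)."""
    return "".join(chr(0x06F0 + int(d)) for d in str(n))

print(to_persian_digits(words_to_int("هزار و نهصد و نود و هشت")))
```

Doing this at display time keeps the raw transcript in spoken form, so the conversion can be changed or disabled without touching the decoder.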

Shenava-1 family (all native-Rust streaming)
