Omnilingual ASR — CTC 300M (MLX 4-bit)

MLX-compatible 4-bit quantization of Meta's Omnilingual ASR CTC-300M model, targeting on-device inference on Apple Silicon (M1/M2/M3/M4).

Omnilingual ASR is a wav2vec 2.0–style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time (no language hint needed).

Model


Parameters	326 M
Format	MLX safetensors (quantized linear layers + fp16 features)
Quantization	4-bit per-group min-max, group size 64
Sample rate	16 kHz (raw waveform input)
Frame rate	50 fps (320× downsampling in CNN frontend)
Max duration	40 s
Languages	1,600+
Vocabulary	10,288 SentencePiece tokens
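
The "4-bit per-group min-max, group size 64" entry can be illustrated with a minimal pure-Python sketch of affine min-max quantization. The function names below are hypothetical, and whether the released weights were produced by exactly this procedure is an assumption; the sketch only shows the general technique, where each group of 64 weights stores 4-bit codes plus one scale and one offset.

```python
def quantize_group(values, bits=4):
    """Affine min-max quantization of one group of floats to `bits`-bit codes."""
    levels = (1 << bits) - 1                  # 15 code steps for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0         # avoid a zero scale for constant groups
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=64, bits=4):
    """Split one weight row into consecutive groups and quantize each independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]
```

Each reconstructed weight differs from the original by at most half a quantization step of its own group. If the scale and offset are stored as 16-bit values (an assumption about the stored dtype), the cost is roughly 4 + 32/64 = 4.5 bits per quantized weight.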

Files

File	Size	Description
`model.safetensors`	193 MB	4-bit quantized transformer weights + fp16 conv frontend
`tokenizer.model`	1.2 MB	SentencePiece tokenizer (unk=3, pad=1, eos=2, bos=0)
`config.json`	<1 KB	Architecture + quantization metadata

Architecture

Raw audio [1, samples]
  → Wav2Vec2FeatureExtractor (7-layer 1D conv, stride 320×)
  → Linear 512 → 1024
  → Wav2Vec2PositionEncoder (weight-normalized conv, kernel 128, groups 16)
  → 24 × StandardTransformerEncoderLayer (pre-norm, dim 1024, heads 16, ffn 4096)
  → LayerNorm
  → Linear 1024 → 10288   (CTC head)
  → logits [1, T, 10288], where T ≈ samples / 320 frames
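
To make the 320× downsampling concrete, the sketch below computes how many output frames the conv frontend yields for a given number of samples. The kernel widths and strides are those of the standard wav2vec 2.0 feature extractor and are an assumption here; the card only states that there are 7 layers with a total stride of 320.

```python
import math

# Assumed: kernel widths and strides of the standard wav2vec 2.0 conv stack.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)

TOTAL_STRIDE = math.prod(STRIDES)  # 5 * 2**6 = 320


def num_frames(num_samples):
    """Number of frames after the 7 unpadded 1D convolutions."""
    n = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        n = (n - kernel) // stride + 1
    return n
```

Under these assumptions, one second of 16 kHz audio gives 49 frames and the 40 s maximum gives 1,999 frames, slightly fewer than samples / 320 because the convolutions are unpadded.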

CTC greedy decoding with duplicate collapsing over the argmax path.
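
A minimal sketch of that decoding step follows: take the argmax token per frame, collapse runs of identical tokens, then drop blanks. The card does not say which token id serves as the CTC blank, so `blank_id` is a required parameter here and the function name is hypothetical.

```python
from itertools import groupby


def ctc_greedy_decode(logits, blank_id):
    """Greedy CTC decode.

    logits: sequence of frames, each a sequence of per-token scores.
    blank_id: id of the CTC blank token (not stated in the card; pass it explicitly).
    Returns the list of decoded token ids.
    """
    # Best token per frame (the argmax path).
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # Collapse consecutive duplicates, then remove blanks.
    collapsed = [token for token, _ in groupby(path)]
    return [token for token in collapsed if token != blank_id]
```

Blanks are removed only after collapsing, so a blank between two identical tokens keeps both of them. The resulting ids can then be turned into text with the SentencePiece model in `tokenizer.model`.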

Performance

FLEURS test set, CTC-300M fp32 on CPU (Apple M-series), 30 utterances/language, aggregate WER via exact-edit-distance scorer (no external text normalization):

Language	WER	Audio	Inference	RTF
English (en_us)	20.0%	289 s	16.3 s	0.056
French (fr_fr)	23.2%	334 s	19.5 s	0.059
German (de_de)	16.5%	361 s	20.8 s	0.058
Arabic (ar_eg)	19.5%	331 s	17.0 s	0.051
Hindi (hi_in)	22.5%	364 s	18.2 s	0.050

Aggregate CPU RTF ≈ 0.05; on M-series GPU via MLX, expect RTF < 0.02. The table reports the fp32 model. 4-bit quantization typically adds <1% absolute WER on wav2vec2-class models, so treat these WERs as a close lower bound (best case) for the quantized variant.
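
For reference, the two metrics in the table can be sketched as follows. This assumes "aggregate WER" means total word-level edit distance divided by total reference words across all utterances, with whitespace tokenization and no normalization; the function names are hypothetical and this is not the scorer that produced the numbers above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def aggregate_wer(pairs):
    """pairs: iterable of (reference, hypothesis) strings."""
    pairs = list(pairs)
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def rtf(inference_seconds, audio_seconds):
    """Real-time factor: values below 1 mean faster than real time."""
    return inference_seconds / audio_seconds
```

As a check against the table, `rtf(16.3, 289)` rounds to 0.056, matching the English row.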

Usage

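import mlx.core as mx
from mlx.utils import tree_unflatten
from safetensors import safe_open

# Read every tensor from the checkpoint into a flat {key: array} dict.
weights = {}
with safe_open("model.safetensors", framework="mlx") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)

# Nest any dotted keys ("a.b.weight") into a parameter tree, the layout
# that an mlx.nn.Module's update() expects.
params = tree_unflatten(list(weights.items()))

# Your MLX wav2vec2 + CTC implementation consumes these parameters.
# Expected input : float32 audio [1, samples] at 16 kHz, zero-mean unit-var
# Expected output: logits [1, T, 10288], then CTC greedy decode and detokenize
#                  with the tokenizer in tokenizer.model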
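
The "zero-mean unit-var" input requirement can be sketched in plain Python as below. The `eps` value and the function name are assumptions, not taken from the upstream preprocessing code. In an MLX pipeline the same arithmetic would normally be done on arrays (for example with `mx.mean` and `mx.var`) rather than on Python lists.

```python
import math


def normalize_waveform(samples, eps=1e-5):
    """Scale a mono waveform to zero mean and (approximately) unit variance.

    samples: non-empty sequence of float samples at 16 kHz.
    eps: small constant for numerical stability (assumed value).
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std for x in samples]
```

Normalization is applied per utterance, over the whole waveform, before the audio reaches the conv frontend.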

Swift inference is provided by speech-swift (see Sources/OmnilingualASR/).

Source

Upstream model: facebook/omniASR-CTC-300M
Paper: Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Meta blog: Omnilingual ASR announcement

License

Apache 2.0 (inherited from upstream).

Guide: soniqo.audio/guides/omnilingual
Docs: soniqo.audio
GitHub: soniqo/speech-swift

Repository: aufklarer/Omnilingual-ASR-CTC-300M-MLX-4bit
Paper: Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages (arXiv:2511.09690, published Nov 12, 2025)