MLX Speech Models collection: speech AI models for Apple Silicon via MLX (ASR, TTS, VAD, diarization, speaker embedding).
How to use aufklarer/Qwen3.5-0.8B-Chat-MLX with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm
# if on a CUDA device, also pip install mlx[cuda]
# Generate text with mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("aufklarer/Qwen3.5-0.8B-Chat-MLX")
prompt = "Once upon a time in"
text = generate(model, tokenizer, prompt=prompt, verbose=True)

How to use aufklarer/Qwen3.5-0.8B-Chat-MLX with MLX LM:
# Install MLX LM
uv tool install mlx-lm
# Generate some text
mlx_lm.generate --model "aufklarer/Qwen3.5-0.8B-Chat-MLX" --prompt "Once upon a time"
Text-only extraction of Qwen3.5-0.8B, quantized for on-device LLM chat on Apple Silicon via MLX.
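Qwen3.5 is a hybrid model with 24 layers arranged as [linear, linear, linear, full] × 6: each block of four has three linear-attention layers followed by one full-attention layer.

| Variant | Size | Path |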
|---|---|---|
| INT4 | 404 MB | int4/model.safetensors |
| INT8 | 786 MB | int8/model.safetensors |
Each variant includes config.json, tokenizer.json, and tokenizer_config.json.
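The layer pattern above can be expanded into an explicit per-layer schedule, which makes it easy to see where the full-attention layers sit. This is a minimal sketch that uses the card's labels "linear" and "full" as plain strings; the function names are hypothetical and are not part of the Qwen3Chat or mlx-lm APIs.

```python
# Sketch: expand the repeating hybrid pattern into one label per layer.
# Assumption: the labels are just strings taken from the card's description;
# no real model code is involved.
PATTERN = ["linear", "linear", "linear", "full"]
REPEATS = 6


def layer_schedule(pattern=PATTERN, repeats=REPEATS):
    """Return one attention-type label per decoder layer, in order."""
    return [kind for _ in range(repeats) for kind in pattern]


def full_attention_indices(schedule):
    """Return the 0-based indices of the layers that use full attention."""
    return [i for i, kind in enumerate(schedule) if kind == "full"]


if __name__ == "__main__":
    schedule = layer_schedule()
    print(len(schedule))                     # 24
    print(full_attention_indices(schedule))  # [3, 7, 11, 15, 19, 23]
```

Every fourth layer (indices 3, 7, ..., 23) uses full attention, which accounts for 6 of the 24 layers.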
Usage from Swift:

import Qwen3Chat
let model = try await Qwen35MLXChat.fromPretrained(quantization: .int4)
let response = try model.generate(
messages: [ChatMessage(role: .user, content: "Hello!")],
sampling: ChatSamplingConfig(temperature: 0.3, maxTokens: 100)
)
Part of the soniqo speech toolkit for Apple Silicon.
Quantized directly from Qwen/Qwen3.5-0.8B with mx.quantize() (group_size=64). Only the text model was extracted; the vision tower was removed. Norm weights were adjusted by +1, and Conv1d weights were transposed to MLX's channels-last layout, (out_channels, kernel_size, in_channels).
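With group_size=64, weights are quantized in groups of 64 values, each group carrying its own scale and bias. The pure-Python sketch below shows that group-wise affine idea, assuming a plain min/max mapping in the spirit of mx.quantize's default affine mode (MLX's exact rounding rules may differ). The function names are hypothetical, and the 16-bit storage assumed for each scale and bias is an assumption, not something the card states.

```python
# Sketch of group-wise affine quantization (one min/max mapping per group).
# Assumption: mirrors the idea behind mx.quantize's affine mode, not its exact
# rounding. Pure Python so it runs without MLX installed.


def quantize_group(values, bits):
    """Map one group of floats to integer codes in [0, 2**bits - 1].

    Returns (codes, scale, bias) such that value ~= scale * code + bias.
    """
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid dividing by zero
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from one quantized group."""
    return [scale * c + bias for c in codes]


def quantize(weights, bits=4, group_size=64):
    """Quantize a flat weight list group by group."""
    if len(weights) % group_size:
        raise ValueError("weight count must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def effective_bits(bits, group_size=64, param_bits=16):
    """Storage bits per weight, counting one scale and one bias per group.

    Assumption: scale and bias are each stored as a 16-bit float.
    """
    return bits + 2 * param_bits / group_size


if __name__ == "__main__":
    import random

    random.seed(0)
    w = [random.uniform(-1.0, 1.0) for _ in range(128)]
    groups = quantize(w, bits=4, group_size=64)
    restored = [x for g in groups for x in dequantize_group(*g)]
    print(len(groups), max(abs(a - b) for a, b in zip(w, restored)))
    print(effective_bits(4), effective_bits(8))  # 4.5 8.5
```

Under the 16-bit assumption, each group of 64 carries 32 bits of metadata, about half a bit per weight, so effective_bits returns 4.5 for 4-bit and 8.5 for 8-bit quantization.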
python scripts/convert_qwen35_chat_mlx.py --output int4/ --bits 4
python scripts/convert_qwen35_chat_mlx.py --output int8/ --bits 8
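Besides quantization, the conversion notes list three tensor rewrites: dropping the vision tower, adding 1 to norm weights, and transposing Conv1d weights. The sketch below is hypothetical and is not the code of convert_qwen35_chat_mlx.py. The tensor names, the VISION_PREFIX value, and the rule that every tensor ending in norm.weight gets +1 are assumptions, and nested Python lists stand in for arrays so it runs without MLX.

```python
# Hypothetical sketch of the non-quantization rewrites from the notes.
# Assumptions: tensor names, VISION_PREFIX, and "+1 for every *norm.weight"
# are illustrative only; nested lists stand in for real arrays.

VISION_PREFIX = "model.visual."  # assumed prefix of vision-tower tensors


def transpose_conv1d(weight):
    """Reorder (out_channels, in_channels, kernel) to (out_channels, kernel, in_channels)."""
    return [
        [[per_out[i][k] for i in range(len(per_out))] for k in range(len(per_out[0]))]
        for per_out in weight
    ]


def convert(state):
    """Return a new name -> tensor dict with the three rewrites applied."""
    out = {}
    for name, tensor in state.items():
        if name.startswith(VISION_PREFIX):
            continue  # text-only extraction: drop vision-tower weights
        if name.endswith("norm.weight"):
            tensor = [w + 1.0 for w in tensor]  # shift norm weights by +1
        elif ".conv1d." in name and name.endswith(".weight"):
            tensor = transpose_conv1d(tensor)  # move channels last for MLX
        out[name] = tensor
    return out


if __name__ == "__main__":
    demo = {
        "model.visual.patch_embed.weight": [0.0],
        "model.layers.0.input_layernorm.weight": [0.0, -0.5],
        "model.layers.0.linear_attn.conv1d.weight": [[[1.0, 2.0, 3.0]]],
    }
    for key, value in convert(demo).items():
        print(key, value)
```

Building a new dictionary instead of mutating the input keeps the original checkpoint untouched, so it can be compared against the converted tensors.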