Qwen3.5-0.8B Chat — MLX (Apple Silicon)

Text-only extraction of Qwen3.5-0.8B, quantized for on-device LLM chat on Apple Silicon via MLX.

Architecture

Qwen3.5 is a hybrid model with 24 layers:

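18× DeltaNet: linear attention with gated delta rule recurrence; the recurrent state has a fixed size, so memory does not grow with sequence length
6× GatedAttention: full scaled dot-product attention with a KV cache and partial RoPE (applied to 25% of each head's dimensions)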
Pattern: [linear, linear, linear, full] × 6
Tied word embeddings (lm_head = embed_tokens)
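
The layer layout above can be enumerated with a short Python sketch. Only the pattern, the repeat count, and the 25% RoPE fraction come from this card. The function names are hypothetical, and the head dimension of 128 is an illustrative value, not a figure taken from the model's config.json.

```python
# Pattern and repeat count as described in this card.
PATTERN = ["linear", "linear", "linear", "full"]
NUM_REPEATS = 6


def layer_types(pattern=PATTERN, repeats=NUM_REPEATS):
    """Return the attention type of every layer, in order."""
    return list(pattern) * repeats


def rotary_dims(head_dim, partial_rotary_factor=0.25):
    """Number of head dimensions that receive RoPE under partial RoPE."""
    return int(head_dim * partial_rotary_factor)


layers = layer_types()

# Only the full-attention layers need a KV cache that grows with the sequence.
kv_cache_layers = [i for i, kind in enumerate(layers) if kind == "full"]

print(len(layers))              # 24
print(layers.count("linear"))   # 18
print(kv_cache_layers)          # [3, 7, 11, 15, 19, 23]
print(rotary_dims(128))         # 32 (head_dim=128 is an assumed example value)
```

With this layout, 6 of the 24 layers hold a cache that grows with context length, while the other 18 carry a fixed-size state.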

Variants

Variant	Size	Path
INT4	404 MB	`int4/model.safetensors`
INT8	786 MB	`int8/model.safetensors`

Each variant includes `config.json`, `tokenizer.json`, and `tokenizer_config.json`.

Usage

import Qwen3Chat

let model = try await Qwen35MLXChat.fromPretrained(quantization: .int4)
let response = try model.generate(
    messages: [ChatMessage(role: .user, content: "Hello!")],
    sampling: ChatSamplingConfig(temperature: 0.3, maxTokens: 100)
)
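
For readers unfamiliar with the sampling parameters, the following Python sketch illustrates what a temperature of 0.3 and a cap on generated tokens usually mean. It is not the toolkit's sampler: every function name here is hypothetical, and the toy logits stand in for a real model.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token id from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(next_logits, prompt_ids, temperature, max_tokens, eos_id, rng):
    """Sample until eos_id appears or max_tokens new tokens have been produced."""
    ids = list(prompt_ids)
    new_ids = []
    for _ in range(max_tokens):
        token = sample_token(next_logits(ids), temperature, rng)
        if token == eos_id:
            break
        ids.append(token)
        new_ids.append(token)
    return new_ids


def toy_next_logits(ids):
    """Hypothetical stand-in for a model: fixed logits over a 4-token vocabulary."""
    return [2.0, 1.0, 0.5, 0.0]


rng = random.Random(0)
print(softmax_with_temperature([2.0, 1.0, 0.5, 0.0], 1.0))
print(softmax_with_temperature([2.0, 1.0, 0.5, 0.0], 0.3))
print(generate(toy_next_logits, [0], temperature=0.3, max_tokens=100,
               eos_id=3, rng=rng))
```

At temperature 0.3 most of the probability mass moves to the highest-scoring token, which tends to give more deterministic replies than temperature 1.0.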

Part of the soniqo speech toolkit for Apple Silicon.

Conversion

Quantized directly from Qwen/Qwen3.5-0.8B using mx.quantize() (group_size=64). The text model is extracted and the vision tower is removed. Norm weights are shifted by +1, and Conv1d weights are transposed to MLX's channels-last layout.

python scripts/convert_qwen35_chat_mlx.py --output int4/ --bits 4
python scripts/convert_qwen35_chat_mlx.py --output int8/ --bits 8
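
The three weight transformations named above can be illustrated in plain Python. This is a sketch under stated assumptions, not the contents of convert_qwen35_chat_mlx.py: the helper names are hypothetical, the quantizer follows the min/max affine scheme described in the MLX documentation for mx.quantize() rather than calling it, and the source Conv1d layout is assumed to be (out, in, kernel) as in PyTorch.

```python
import math


def quantize_group(values, bits):
    """Affine-quantize one group; returns (codes, scale, bias)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [c * scale + bias for c in codes]


def quantize_row(row, bits, group_size=64):
    """Quantize a row in groups that each share one scale and one bias."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


def shift_norm_weight(weight):
    """The '+1' adjustment: add 1 to every norm weight."""
    return [w + 1.0 for w in weight]


def conv1d_to_channels_last(weight):
    """Reorder (out, in, kernel) nested lists into (out, kernel, in)."""
    return [[list(column) for column in zip(*out_filter)]
            for out_filter in weight]


row = [math.sin(i) for i in range(128)]  # illustrative weights, two groups of 64
for bits in (4, 8):
    groups = quantize_row(row, bits)
    restored = [v for g in groups for v in dequantize_group(*g)]
    max_error = max(abs(a - b) for a, b in zip(row, restored))
    print(bits, max_error)
```

The 8-bit variant uses finer quantization steps within each group, which is the trade-off against the smaller INT4 file.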

Guide: soniqo.audio/guides/chat
Docs: soniqo.audio
GitHub: soniqo/speech-swift

