Qwen3.5-0.8B Chat - MLX (Apple Silicon)

Text-only extraction of Qwen3.5-0.8B quantized for on-device LLM chat on Apple Silicon via MLX.

Architecture

Qwen3.5 is a hybrid model with 24 layers:

  • 18× DeltaNet: linear attention with gated delta rule recurrence, O(1) memory per step
  • 6× GatedAttention: full scaled dot-product attention with KV cache, partial RoPE (25%)
  • Pattern: [linear, linear, linear, full] × 6
  • Tied word embeddings (lm_head = embed_tokens)
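
The layer layout above can be sketched in a few lines of Python. This is only an illustration of the stated pattern, not code from the model: the names `PATTERN`, `build_layer_types`, and `rotary_dims` are hypothetical, and the reading of "partial RoPE (25%)" as "a quarter of each head's dimensions are rotated" is an assumption.

```python
# Illustrative sketch of the hybrid layer schedule; all names are hypothetical.
PATTERN = ["linear", "linear", "linear", "full"]  # one repeating unit
NUM_REPEATS = 6                                   # 4 layers x 6 = 24 layers


def build_layer_types(pattern=PATTERN, repeats=NUM_REPEATS):
    """Return the attention type of every layer, in order."""
    return list(pattern) * repeats


def rotary_dims(head_dim, partial_factor=0.25):
    """Head dimensions that receive RoPE, assuming a fractional rotation."""
    return int(head_dim * partial_factor)


layer_types = build_layer_types()
full_layers = [i for i, kind in enumerate(layer_types) if kind == "full"]

print(len(layer_types))               # total number of layers
print(layer_types.count("linear"))    # DeltaNet layers
print(layer_types.count("full"))      # GatedAttention layers
print(full_layers)                    # zero-based indices of full-attention layers
```

Under this layout every fourth layer is a full-attention layer, so only those 6 layers need a KV cache that grows with context length; the 18 DeltaNet layers carry a fixed-size recurrent state.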

Variants

Variant   Size     Path
INT4      404 MB   int4/model.safetensors
INT8      786 MB   int8/model.safetensors

Each variant includes config.json, tokenizer.json, and tokenizer_config.json.
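
As a quick sanity check after downloading, a small script can confirm that a variant directory is complete. This is a minimal sketch that assumes the config and tokenizer files sit next to `model.safetensors` inside `int4/` or `int8/`; the function name `missing_files` is hypothetical and not part of any toolkit.

```python
from pathlib import Path

# File names come from the variant listing; the directory layout is an assumption.
VARIANT_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
]
VARIANTS = ("int4", "int8")


def missing_files(repo_root, variant):
    """Return the expected files that are absent from <repo_root>/<variant>/."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}, expected one of {VARIANTS}")
    variant_dir = Path(repo_root) / variant
    return [name for name in VARIANT_FILES if not (variant_dir / name).is_file()]
```

Calling `missing_files(".", "int4")` from the repository root returns an empty list when the INT4 variant is complete, and the names of any absent files otherwise.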

Usage

import Qwen3Chat

// Load the INT4 variant of the model.
let model = try await Qwen35MLXChat.fromPretrained(quantization: .int4)

// Generate a reply to a single user message, capped at 100 tokens.
let response = try model.generate(
    messages: [ChatMessage(role: .user, content: "Hello!")],
    sampling: ChatSamplingConfig(temperature: 0.3, maxTokens: 100)
)

Part of the soniqo speech toolkit for Apple Silicon.

Conversion

Quantized directly from Qwen/Qwen3.5-0.8B using mx.quantize() with group_size=64. The text model is extracted and the vision tower is removed. Norm weights are adjusted (+1), and Conv1d weights are transposed to MLX's channels-last format.

python scripts/convert_qwen35_chat_mlx.py --output int4/ --bits 4
python scripts/convert_qwen35_chat_mlx.py --output int8/ --bits 8
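
The three weight transformations mentioned above can be illustrated in plain Python. This is a conceptual sketch, not the conversion script and not MLX's implementation: `mx.quantize()` additionally packs the integers into 32-bit words and its exact rounding may differ. The helper names are hypothetical, the "+1" step is read here as storing `1 + w` for each norm weight, and the Conv1d step assumes a PyTorch-style `(out, in, kernel)` layout going to `(out, kernel, in)`.

```python
def quantize_group(weights, bits):
    """Affine-quantize one group of floats to unsigned `bits`-bit integers."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1                      # 15 for 4-bit, 255 for 8-bit
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                       # integer codes, scale, bias


def dequantize_group(codes, scale, bias):
    """Approximate reconstruction of the original floats."""
    return [scale * c + bias for c in codes]


def quantize_row(row, bits, group_size=64):
    """Quantize a weight row in independent groups of `group_size` values."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def shift_norm_weight(weight):
    """Assumed meaning of the +1 adjustment: store (1 + w) per element."""
    return [w + 1.0 for w in weight]


def conv1d_to_channels_last(weight):
    """Transpose nested lists from (out, in, kernel) to (out, kernel, in)."""
    return [[list(col) for col in zip(*out_channel)] for out_channel in weight]
```

Each group keeps its own scale and bias, so a smaller group size tracks local weight ranges more closely at the cost of more stored metadata per weight.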
