Qwen3.5-9B MTPLX Optimized Speed

Fast local 9B inference for Apple Silicon, packaged for MTPLX native Multi-Token-Prediction speculative decoding.

This is the 9B speed checkpoint: a 6-bit MLX body with BF16 MTP heads, tuned as the stronger small-Mac option above the 4B release.

Run It

brew install youssofal/mtplx/mtplx
mtplx start
mtplx run "hello" --model Youssofal/Qwen3.5-9B-MTPLX-Optimized-Speed

For an OpenAI-compatible local server:

mtplx serve --model Youssofal/Qwen3.5-9B-MTPLX-Optimized-Speed --profile sustained --max --port 8000 --no-stats-footer

Why This Exists

MTPLX uses the model's own MTP heads to generate draft tokens, then verifies those tokens with the main model. When the draft heads are well-matched, you get higher throughput without running a separate drafter model.

MTPLX reads mtplx_runtime.json and selects the measured defaults automatically.

Recommended Runtime Defaults

Setting Value
Backend qwen3-next-mtp
Default depth D2
Target sampler temp=0.60, top_p=0.95, top_k=20
Draft sampler temp=0.60, top_p=0.95, top_k=20
Profile sustained
Benchmark fan mode max

Performance

Measured in MTPLX on Apple Silicon using the release runtime path.

Mode TPS Verify time Acceptance
AR baseline 64.96 - -
D1 comparison 92.87 6.83s 0.9120
D2 promoted default 101.32 4.13s 0.9398, 0.8102
D3 comparison 96.30 4.58s 0.9278, 0.7732, 0.6443

Model Build

Component Format
Main body 6-bit MLX affine, group size 64
MTP heads BF16 native MTP sidecar
Sampler target and draft temp=0.60, top_p=0.95, top_k=20

This is not a full-precision checkpoint. It is built for fast local use on Apple Silicon through MTPLX.

Files

  • model-*.safetensors: MLX body shards
  • mtp.safetensors: MTP sidecar
  • mtplx_runtime.json: MTPLX runtime contract and measured defaults
  • MTPLX_PUBLISH_MANIFEST.json: file sizes and artifact metadata
  • tokenizer and config files for local loading
Downloads last month
1,648
Safetensors
Model size
9B params
Tensor type
BF16
U32
F32
MLX
Hardware compatibility
Log In to add your hardware

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support