Qwen3.6-35B-A3B-MTP — ROCmFP4 STRIX (f16 embeddings)

Experimental AMD Strix Halo (gfx1151) quant of Qwen3.6-35B-A3B — a mixture-of-experts model (~3B active params) with the built-in MTP / next-token-prediction head — in the custom ROCmFP4 4-bit format. Tuned for high MTP draft acceptance and long-context, multi-turn use.

⚠️ Ignore HuggingFace's auto-detected quant badge ("F16" / 16-bit) — it's wrong. HF's parser only knows the standard GGUF quant types, so it can't read the custom ROCmFP4 format. It ends up "seeing" only the genuinely-f16 token embeddings and mislabels the whole file as 16-bit. These are ~4.4 bpw 4-bit ROCmFP4 files, not 16-bit. Pick a file by its name in the Files and versions tab (see the two-files table below).

Requires the ROCmFP4 fork (public) — not stock llama.cpp

This file uses the ROCmFP4 tensor types (q4_0_rocmfp4, q4_0_rocmfp4_fast). Stock llama.cpp, LM Studio, Ollama, Jan, koboldcpp, etc. cannot load it. Build and run it with the public fork charlie12345/rocmfp4-llama:

git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

Two files in this repo (pick your trade-off)

File size output head best for
…-STRIX-embF16.gguf 18.8 GB ROCmFP4 4-bit fastest — the original
…-STRIX-embF16-headQ6.gguf ~18.9 GB Q6_K a notch more faithful — small decode cost

The two are identical except one tensor — the output head (output.weight): same STRIX recipe, same f16 embeddings, same F32 MoE router, same MTP head. They differ only in the head.

The Q6-head variant — a step up (experimental)

This raises the output head — the layer that turns the final hidden state into the next-token choice — from the 4-bit ROCmFP4 format to standard Q6_K, leaving everything else untouched. It's the output-side complement to running f16 token embeddings (the input side) — sharpening both ends of the model instead of just one.

A note on evidence — read this honestly. The careful measurements for this change were done on the dense 27B sibling (card here). There, the Q6 head:

  • improved held-out perplexity on both code and prose (small but consistent), and
  • moved the model closer to the original BF16 on a KL-divergence check (more faithful word probabilities; it agrees with BF16's top choice ~96% of the time, so it mostly sharpens confidence on the same token rather than flipping it).

Subjectively, on the 27B, it followed instructions more consistently — reaching for the specific tool asked for, sticking to a task's format/rules. This 35B file applies the exact same one-tensor change, so I expect it to behave the same way — but I have not separately benchmarked the 35B. Treat the 35B quality delta as expected, not yet measured.

The cost. The Q6 head steps off the tuned 4-bit kernel for that one tensor. On the 27B that was 5–7% slower decode at short context (shrinking at long context). On this MoE the hit should be even smaller — the 35B's output head is smaller, and MoE decode is already fast (3B active params/token) — but, again, not separately measured here. Size grows ~0.14 GB.

Build it yourself — same as the original, with one extra flag (--output-tensor-type q6_K):

llama-quantize \
  --token-embedding-type f16 \
  --output-tensor-type q6_K \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  Q4_0_ROCMFP4_STRIX

Part 1 — The model

What this is

  • Base: unsloth/Qwen3.6-35B-A3B-MTP-GGUF BF16, pinned at revision 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d. Arch qwen35moe: 41 blocks, 2048 hidden, 256 experts, with the nextn_predict_layers=1 MTP head (blk.40.nextn.*), so self-speculative draft-MTP survives quantization.
  • Format: ROCmFP4 — a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block. Tensor-aware.
  • This variant (STRIX-embF16): quality-biased STRIX preset + f16 token embeddings (full precision; it's a lookup, so ~zero decode cost). No imatrix.
value
File Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf
Size / bpw 18.8 GB / 4.44 bpw
token_embd F16
MoE router (ffn_gate_inp) F32 (full precision, kept automatically)
experts (ffn_*_exps) q4_0_rocmfp4_fast (custom kernel)
attention K/V (+ fused QKV) q4_0_rocmfp4 (dual-scale)
MTP head preserved (blk.40.nextn.*)

The router stays F32 for free. The quantizer excludes expert-gating tensors from quantization, so routing (which experts each token goes to — a discrete, high-sensitivity decision) keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.

How it was built (reproducible)

llama-quantize \
  --token-embedding-type f16 \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
  Q4_0_ROCMFP4_STRIX

Status & caveats

Experimental research build. Results are hardware-, driver-, model-, and prompt-sensitive, and tuned for AMD Strix Halo — they may not reproduce on other GPUs. This is not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.

Credits & license

  • Base model: Qwen3.6-35B-A3B (Qwen team) — a derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution/use.
  • BF16 GGUF source: unsloth/Qwen3.6-35B-A3B-MTP-GGUF @ 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d.
  • ROCmFP4 format & runtime: charlie12345/rocmfp4-llama (based on llama.cpp, MIT).

Part 2 — Making practical use of it

What I observed

Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):

  • It runs great on the ROCmFP4 fork78–90 t/s decode, MTP draft acceptance ~0.6–0.95 (content-dependent), coherent, and it loads at full 262144 context with ~92 GB free (the KV footprint is modest). MoE decode is naturally fast (only ~3B params active per token), and the F32 router keeps expert selection clean.
  • The companion 27B dense quant (same recipe) is at plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.

Run config (highest MTP acceptance on Strix Halo)

Full-precision (f16) KV is the dominant acceptance lever; 128 GB unified affords it (drop to -ctk q8_0 -ctv q8_0 on less memory).

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server -m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
  --alias qwen3.6-35b-a3b-rocmfp4-mtp --host 0.0.0.0 --port 8080 \
  -dev Vulkan0 -ngl 999 -fa on \
  -c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
  -ctk f16 -ctv f16 \
  -cpent 256 -ctxcp 32 --cache-reuse 256 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 \
  --spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
  --spec-draft-type-k f16 --spec-draft-type-v f16 \
  --spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
  --reasoning on --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja --parallel 1 --metrics --no-mmap

-dev Vulkan0 (KHR_coopmat) beats ROCm here; -ub 256 is the prefill optimum; --spec-type draft-mtp uses the model's built-in MTP head. Temp 0.6 = Qwen3.6 "precise coding" (1.0 for general).

Multi-turn prompt-cache reuse (OpenCode)

Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:

  1. Checkpoints — default -cpent is 8192, so prompts under 8K never checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256.
  2. Thinking--reasoning-format deepseek + --chat-template-kwargs '{"preserve_thinking": true}' keeps <think> across turns with clean content+reasoning_content. (none = raw tags inline but works with any content-echoing client; deepseek-legacy/auto do not reuse.)
  3. Vision--mmproj disables cache reuse; keep it off for text/code.

--jinja is required for the chat template + preserve_thinking.

OpenAI-compatible client (e.g. OpenCode)

Point the client at the server. In single-model mode llama-server ignores the request's model field, so the client's model name is just a label.

  • Base URL: http://<host>:8080/v1 · API key: any non-empty string (e.g. sk-local)
  • Model id (what this server reports): qwen3.6-35b-a3b-rocmfp4-mtp

A patched OpenCode that compacts conversation history without invalidating the prompt cache is at PlunderStruck/opencode — pair it with the checkpoint flags to keep long sessions fast.

Downloads last month
1,563
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF

Quantized
(5)
this model