How to use from
Unsloth Studio
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Quick Links
PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3.6-35B-A3B-MTP
4-BIT ROCmFP4 · MIXTURE-OF-EXPERTS (A3B) · MTP SELF-SPECULATIVE DECODE · SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
4.44 BPW
SIZE
18.8 GB
CONTEXT
262 K
ARCH
MoE · 256 EXPERTS
ACTIVE / HIDDEN
~3B · 2048
DRAFT
MTP n-max 5
BACKEND
VULKAN0
⚠ REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16" badge — its parser can't read ROCmFP4 and mislabels by the genuinely-f16 token embeddings. These are ~4.44 bpw 4-bit files; pick by filename.
01 · FILES
File Size Output head Pick if
…-STRIX-embF16.gguf18.8 GBROCmFP4 4-bitfastest decode (the original)
…-STRIX-embF16-headQ6.gguf~18.9 GBQ6_Ka notch more faithful (small decode cost)

The two are identical except one tensor — the output head (output.weight): same STRIX recipe, same f16 embeddings, same F32 MoE router, same MTP head. They differ only in the head. Both are built without imatrix (see §03).

NOTE // The Q6-head variant raises the output head — the layer that turns the final hidden state into the next-token choice — from 4-bit ROCmFP4 to standard Q6_K, leaving everything else untouched. It's the output-side complement to running f16 token embeddings, sharpening both ends of the model. The Q6-head trade-off is not separately benchmarked on this 35B.
02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  --alias qwen35b-a3b-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk f16 \
  -ctv f16 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}'
Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan (KHR_coopmat) — beats ROCm here for ROCmFP4 on Strix Halo
-ngl 999 · -fa onoffload all layers · flash attention
-c 262144context length (256K) — loads with ~92 GB free; KV footprint is modest
-b 2048 · -ub 256 · -t/-tb 16prefill batch / micro-batch (256 = prefill optimum) · CPU threads
-ctk f16 · -ctv f16f16 KV cache — how we run it; drop to q8_0/q4_0 to use less memory
-cpent · -ctxcp · --cache-reuse · --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0Qwen3.6 "precise coding" sampling (1.0 for general)
--spec-type draft-mtp · --spec-draft-n-max 5built-in MTP head, self-speculative; draft depth 5
--spec-draft-device Vulkan0 · -ngl all · type-k/v f16draft head on Vulkan, fully offloaded, f16 KV
--chat-template-file chat_template.jinjafroggeric unified Qwen3.6 template (tool calls + think-toggle)
--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}keep <think> across turns with clean content+reasoning_content, so cross-turn cache survives
--jinja --parallel 1 --metrics --no-mmapapply template · single slot · metrics · weights in RAM

Multi-turn prompt-cache reuse (OpenCode). Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:

  1. Checkpoints — default -cpent is 8192, so prompts under 8K never checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256.
  2. Thinking--reasoning-format deepseek + --chat-template-kwargs '{"preserve_thinking": true}' keeps <think> across turns with clean content+reasoning_content. (none = raw tags inline but works with any content-echoing client; deepseek-legacy/auto do not reuse.)

--jinja is required for the chat template + preserve_thinking.

OpenAI-compatible client (e.g. OpenCode). In single-model mode llama-server ignores the request's model field, so the client's model name is just a label.

  • Base URL: http://<host>:8080/v1 · API key: any non-empty string (e.g. sk-local)
  • Model id this server reports: qwen35b-a3b-mtp

A patched OpenCode that compacts conversation history without invalidating the prompt cache is at PlunderStruck/opencode — pair it with the checkpoint flags to keep long sessions fast.

03 · PERFORMANCE & QUALITY

Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):

DECODE78–90 t/s (Vulkan / Strix Halo)
MTP DRAFT ACCEPTANCE~0.6–0.95 (content-dependent)
CONTEXT @ LOADfull 262144 with ~92 GB free
QUANTIZATIONnon-imatrix · F32 MoE router

MoE decode is naturally fast — only ~3B params active per token — and the F32 router keeps expert selection clean. The router stays F32 for free: the quantizer excludes expert-gating tensors (ffn_gate_inp) from quantization, so routing — which experts each token goes to, a discrete, high-sensitivity decision — keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.

NOTE // f16 KV is how we run it (128 GB unified affords it; drop to q8_0/q4_0 to save memory). The Q6-head variant is not separately benchmarked on this 35B; the file is ~0.14 GB larger.

The companion 27B dense quant (same recipe) is at plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.

04 · BUILD (REPRODUCIBLE)

Build the fork:

git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

Quantize from the unsloth BF16+MTP GGUF — ROCmFP4 body, genuine f16 embeddings, no imatrix:

# original (ROCmFP4 4-bit output head)
llama-quantize \
  --token-embedding-type f16 \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
  Q4_0_ROCMFP4_STRIX

# headQ6 variant adds the Q6_K output head (one extra flag)
llama-quantize \
  --token-embedding-type f16 \
  --output-tensor-type q6_K \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  Q4_0_ROCMFP4_STRIX

Architecture (qwen35moe): 41 blocks, 2048 hidden, 256 experts, with the nextn_predict_layers=1 MTP head (blk.40.nextn.*) — so self-speculative draft-MTP survives quantization. Format: ROCmFP4 is a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block; tensor-aware. This variant (STRIX-embF16): quality-biased STRIX preset + f16 token embeddings (full precision; a lookup, so ~zero decode cost). Experts (ffn_*_exps) run q4_0_rocmfp4_fast; attention K/V (+ fused QKV) run q4_0_rocmfp4 (dual-scale).

Experimental research build for AMD Strix Halo — hardware-, driver-, model-, and prompt-sensitive, may not reproduce on other GPUs. Not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims. Base BF16 GGUF pinned at revision 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d.

05 · LINEAGE & CREDITS
BASE MODELQwen3.6-35B-A3B (Qwen team) — derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution / use
BF16 GGUF SOURCEunsloth/Qwen3.6-35B-A3B-MTP-GGUF @ 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d
FORMAT + RUNTIMEcharlie12345/rocmfp4-llama (based on llama.cpp, MIT)
CHAT TEMPLATEfroggeric/Qwen-Fixed-Chat-Templates

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month
1,563
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF

Quantized
(5)
this model

Collection including plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF