Instructions to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF", filename="Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Use Docker
docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- LM Studio
- Jan
- Ollama
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Ollama:
ollama run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- Unsloth Studio
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
- Pi
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Docker Model Runner:
docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- Lemonade
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Run and chat with the model
lemonade run user.Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF-BF16
List all available models
lemonade list
Qwen3.6-35B-A3B-MTP — ROCmFP4 STRIX (f16 embeddings)
Experimental AMD Strix Halo (gfx1151) quant of Qwen3.6-35B-A3B — a mixture-of-experts model (~3B active params) with the built-in MTP / next-token-prediction head — in the custom ROCmFP4 4-bit format. Tuned for high MTP draft acceptance and long-context, multi-turn use.
⚠️ Ignore HuggingFace's auto-detected quant badge ("F16" / 16-bit) — it's wrong. HF's parser only knows the standard GGUF quant types, so it can't read the custom ROCmFP4 format. It ends up "seeing" only the genuinely-f16 token embeddings and mislabels the whole file as 16-bit. These are ~4.4 bpw 4-bit ROCmFP4 files, not 16-bit. Pick a file by its name in the Files and versions tab (see the two-files table below).
Requires the ROCmFP4 fork (public) — not stock llama.cpp
This file uses the ROCmFP4 tensor types (
q4_0_rocmfp4,q4_0_rocmfp4_fast). Stock llama.cpp, LM Studio, Ollama, Jan, koboldcpp, etc. cannot load it. Build and run it with the public forkcharlie12345/rocmfp4-llama:git clone https://github.com/charlie12345/rocmfp4-llama cd rocmfp4-llama && git checkout mtp-rocmfp4-strix env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh
Two files in this repo (pick your trade-off)
| File | size | output head | best for |
|---|---|---|---|
…-STRIX-embF16.gguf |
18.8 GB | ROCmFP4 4-bit | fastest — the original |
…-STRIX-embF16-headQ6.gguf |
~18.9 GB | Q6_K | a notch more faithful — small decode cost |
The two are identical except one tensor — the output head (output.weight): same STRIX
recipe, same f16 embeddings, same F32 MoE router, same MTP head. They differ only in the head.
The Q6-head variant — a step up (experimental)
This raises the output head — the layer that turns the final hidden state into the next-token choice — from the 4-bit ROCmFP4 format to standard Q6_K, leaving everything else untouched. It's the output-side complement to running f16 token embeddings (the input side) — sharpening both ends of the model instead of just one.
A note on evidence — read this honestly. The careful measurements for this change were done on the dense 27B sibling (card here). There, the Q6 head:
- improved held-out perplexity on both code and prose (small but consistent), and
- moved the model closer to the original BF16 on a KL-divergence check (more faithful word probabilities; it agrees with BF16's top choice ~96% of the time, so it mostly sharpens confidence on the same token rather than flipping it).
Subjectively, on the 27B, it followed instructions more consistently — reaching for the specific tool asked for, sticking to a task's format/rules. This 35B file applies the exact same one-tensor change, so I expect it to behave the same way — but I have not separately benchmarked the 35B. Treat the 35B quality delta as expected, not yet measured.
The cost. The Q6 head steps off the tuned 4-bit kernel for that one tensor. On the 27B that was
5–7% slower decode at short context (shrinking at long context). On this MoE the hit should be even
smaller — the 35B's output head is smaller, and MoE decode is already fast (3B active params/token)
— but, again, not separately measured here. Size grows ~0.14 GB.
Build it yourself — same as the original, with one extra flag (--output-tensor-type q6_K):
llama-quantize \
--token-embedding-type f16 \
--output-tensor-type q6_K \
Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
Q4_0_ROCMFP4_STRIX
Part 1 — The model
What this is
- Base:
unsloth/Qwen3.6-35B-A3B-MTP-GGUFBF16, pinned at revision5bc3e238d916f48a861bac2f8a1990a0e9b7e98d. Archqwen35moe: 41 blocks, 2048 hidden, 256 experts, with thenextn_predict_layers=1MTP head (blk.40.nextn.*), so self-speculative draft-MTP survives quantization. - Format: ROCmFP4 — a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block. Tensor-aware.
- This variant (
STRIX-embF16): quality-biased STRIX preset + f16 token embeddings (full precision; it's a lookup, so ~zero decode cost). No imatrix.
| value | |
|---|---|
| File | Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf |
| Size / bpw | 18.8 GB / 4.44 bpw |
| token_embd | F16 |
MoE router (ffn_gate_inp) |
F32 (full precision, kept automatically) |
experts (ffn_*_exps) |
q4_0_rocmfp4_fast (custom kernel) |
| attention K/V (+ fused QKV) | q4_0_rocmfp4 (dual-scale) |
| MTP head | preserved (blk.40.nextn.*) |
The router stays F32 for free. The quantizer excludes expert-gating tensors from quantization, so routing (which experts each token goes to — a discrete, high-sensitivity decision) keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.
How it was built (reproducible)
llama-quantize \
--token-embedding-type f16 \
Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
Q4_0_ROCMFP4_STRIX
Status & caveats
Experimental research build. Results are hardware-, driver-, model-, and prompt-sensitive, and tuned for AMD Strix Halo — they may not reproduce on other GPUs. This is not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.
Credits & license
- Base model: Qwen3.6-35B-A3B (Qwen team) — a derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution/use.
- BF16 GGUF source:
unsloth/Qwen3.6-35B-A3B-MTP-GGUF@5bc3e238d916f48a861bac2f8a1990a0e9b7e98d. - ROCmFP4 format & runtime:
charlie12345/rocmfp4-llama(based on llama.cpp, MIT).
Part 2 — Making practical use of it
What I observed
Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):
- It runs great on the ROCmFP4 fork — 78–90 t/s decode, MTP draft acceptance ~0.6–0.95 (content-dependent), coherent, and it loads at full 262144 context with ~92 GB free (the KV footprint is modest). MoE decode is naturally fast (only ~3B params active per token), and the F32 router keeps expert selection clean.
- The companion 27B dense quant (same recipe) is at
plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.
Run config (highest MTP acceptance on Strix Halo)
Full-precision (f16) KV is the dominant acceptance lever; 128 GB unified affords it (drop to
-ctk q8_0 -ctv q8_0 on less memory).
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server -m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
--alias qwen3.6-35b-a3b-rocmfp4-mtp --host 0.0.0.0 --port 8080 \
-dev Vulkan0 -ngl 999 -fa on \
-c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
-ctk f16 -ctv f16 \
-cpent 256 -ctxcp 32 --cache-reuse 256 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--presence-penalty 0.0 --repeat-penalty 1.0 \
--spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
--spec-draft-type-k f16 --spec-draft-type-v f16 \
--spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
--reasoning on --reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking": true}' \
--jinja --parallel 1 --metrics --no-mmap
-dev Vulkan0 (KHR_coopmat) beats ROCm here; -ub 256 is the prefill optimum; --spec-type draft-mtp uses the model's built-in MTP head. Temp 0.6 = Qwen3.6 "precise coding" (1.0 for general).
Multi-turn prompt-cache reuse (OpenCode)
Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:
- Checkpoints — default
-cpentis 8192, so prompts under 8K never checkpoint. Fix:-cpent 256 -ctxcp 32 --cache-reuse 256. - Thinking —
--reasoning-format deepseek+--chat-template-kwargs '{"preserve_thinking": true}'keeps<think>across turns with cleancontent+reasoning_content. (none= raw tags inline but works with any content-echoing client;deepseek-legacy/autodo not reuse.) - Vision —
--mmprojdisables cache reuse; keep it off for text/code.
--jinja is required for the chat template + preserve_thinking.
OpenAI-compatible client (e.g. OpenCode)
Point the client at the server. In single-model mode llama-server ignores the request's
model field, so the client's model name is just a label.
- Base URL:
http://<host>:8080/v1· API key: any non-empty string (e.g.sk-local) - Model id (what this server reports):
qwen3.6-35b-a3b-rocmfp4-mtp
A patched OpenCode that compacts conversation history without invalidating the prompt cache is
at PlunderStruck/opencode — pair it with the
checkpoint flags to keep long sessions fast.
- Downloads last month
- 1,563
16-bit