Instructions to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF",
	filename="Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Use Docker

docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

LM Studio
Jan
Ollama
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Ollama:
```
ollama run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
```

Unsloth Studio

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
```

Lemonade

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF-BF16

List all available models

lemonade list

Qwen3.6-35B-A3B-MTP — ROCmFP4 STRIX (f16 embeddings)

Experimental AMD Strix Halo (gfx1151) quant of Qwen3.6-35B-A3B — a mixture-of-experts model (~3B active params) with the built-in MTP / next-token-prediction head — in the custom ROCmFP4 4-bit format. Tuned for high MTP draft acceptance and long-context, multi-turn use.

⚠️ Ignore HuggingFace's auto-detected quant badge ("F16" / 16-bit) — it's wrong. HF's parser only knows the standard GGUF quant types, so it can't read the custom ROCmFP4 format. It ends up "seeing" only the genuinely-f16 token embeddings and mislabels the whole file as 16-bit. These are ~4.4 bpw 4-bit ROCmFP4 files, not 16-bit. Pick a file by its name in the Files and versions tab (see the two-files table below).

Requires the ROCmFP4 fork (public) — not stock llama.cpp

This file uses the ROCmFP4 tensor types (q4_0_rocmfp4, q4_0_rocmfp4_fast). Stock llama.cpp, LM Studio, Ollama, Jan, koboldcpp, etc. cannot load it. Build and run it with the public fork charlie12345/rocmfp4-llama:
git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

Two files in this repo (pick your trade-off)

File	size	output head	best for
`…-STRIX-embF16.gguf`	18.8 GB	ROCmFP4 4-bit	fastest — the original
`…-STRIX-embF16-headQ6.gguf`	~18.9 GB	Q6_K	a notch more faithful — small decode cost

The two are identical except one tensor — the output head (output.weight): same STRIX recipe, same f16 embeddings, same F32 MoE router, same MTP head. They differ only in the head.

The Q6-head variant — a step up (experimental)

This raises the output head — the layer that turns the final hidden state into the next-token choice — from the 4-bit ROCmFP4 format to standard Q6_K, leaving everything else untouched. It's the output-side complement to running f16 token embeddings (the input side) — sharpening both ends of the model instead of just one.

A note on evidence — read this honestly. The careful measurements for this change were done on the dense 27B sibling (card here). There, the Q6 head:

improved held-out perplexity on both code and prose (small but consistent), and
moved the model closer to the original BF16 on a KL-divergence check (more faithful word probabilities; it agrees with BF16's top choice ~96% of the time, so it mostly sharpens confidence on the same token rather than flipping it).

Subjectively, on the 27B, it followed instructions more consistently — reaching for the specific tool asked for, sticking to a task's format/rules. This 35B file applies the exact same one-tensor change, so I expect it to behave the same way — but I have not separately benchmarked the 35B. Treat the 35B quality delta as expected, not yet measured.

The cost. The Q6 head steps off the tuned 4-bit kernel for that one tensor. On the 27B that was ~~5–7% slower decode at short context (shrinking at long context). On this MoE the hit should be even smaller — the 35B's output head is smaller, and MoE decode is already fast (~~3B active params/token) — but, again, not separately measured here. Size grows ~0.14 GB.

Build it yourself — same as the original, with one extra flag (--output-tensor-type q6_K):

llama-quantize \
  --token-embedding-type f16 \
  --output-tensor-type q6_K \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  Q4_0_ROCMFP4_STRIX

Part 1 — The model

What this is

Base: unsloth/Qwen3.6-35B-A3B-MTP-GGUF BF16, pinned at revision 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d. Arch qwen35moe: 41 blocks, 2048 hidden, 256 experts, with the nextn_predict_layers=1 MTP head (blk.40.nextn.*), so self-speculative draft-MTP survives quantization.
Format: ROCmFP4 — a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block. Tensor-aware.
This variant (STRIX-embF16): quality-biased STRIX preset + f16 token embeddings (full precision; it's a lookup, so ~zero decode cost). No imatrix.

	value
File	`Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf`
Size / bpw	18.8 GB / 4.44 bpw
token_embd	F16
MoE router (`ffn_gate_inp`)	F32 (full precision, kept automatically)
experts (`ffn_*_exps`)	`q4_0_rocmfp4_fast` (custom kernel)
attention K/V (+ fused QKV)	`q4_0_rocmfp4` (dual-scale)
MTP head	preserved (`blk.40.nextn.*`)

The router stays F32 for free. The quantizer excludes expert-gating tensors from quantization, so routing (which experts each token goes to — a discrete, high-sensitivity decision) keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.

How it was built (reproducible)

llama-quantize \
  --token-embedding-type f16 \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
  Q4_0_ROCMFP4_STRIX

Status & caveats

Experimental research build. Results are hardware-, driver-, model-, and prompt-sensitive, and tuned for AMD Strix Halo — they may not reproduce on other GPUs. This is not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.

Credits & license

Base model: Qwen3.6-35B-A3B (Qwen team) — a derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution/use.
BF16 GGUF source: unsloth/Qwen3.6-35B-A3B-MTP-GGUF @ 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d.
ROCmFP4 format & runtime: charlie12345/rocmfp4-llama (based on llama.cpp, MIT).

Part 2 — Making practical use of it

What I observed

Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):

It runs great on the ROCmFP4 fork — 78–90 t/s decode, MTP draft acceptance ~0.6–0.95 (content-dependent), coherent, and it loads at full 262144 context with ~92 GB free (the KV footprint is modest). MoE decode is naturally fast (only ~3B params active per token), and the F32 router keeps expert selection clean.
The companion 27B dense quant (same recipe) is at plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.

Run config (highest MTP acceptance on Strix Halo)

Full-precision (f16) KV is the dominant acceptance lever; 128 GB unified affords it (drop to -ctk q8_0 -ctv q8_0 on less memory).

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server -m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
  --alias qwen3.6-35b-a3b-rocmfp4-mtp --host 0.0.0.0 --port 8080 \
  -dev Vulkan0 -ngl 999 -fa on \
  -c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
  -ctk f16 -ctv f16 \
  -cpent 256 -ctxcp 32 --cache-reuse 256 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 \
  --spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
  --spec-draft-type-k f16 --spec-draft-type-v f16 \
  --spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
  --reasoning on --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja --parallel 1 --metrics --no-mmap

-dev Vulkan0 (KHR_coopmat) beats ROCm here; -ub 256 is the prefill optimum; --spec-type draft-mtp uses the model's built-in MTP head. Temp 0.6 = Qwen3.6 "precise coding" (1.0 for general).

Multi-turn prompt-cache reuse (OpenCode)

Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:

Checkpoints — default -cpent is 8192, so prompts under 8K never checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256.
Thinking — --reasoning-format deepseek + --chat-template-kwargs '{"preserve_thinking": true}' keeps <think> across turns with clean content+reasoning_content. (none = raw tags inline but works with any content-echoing client; deepseek-legacy/auto do not reuse.)
Vision — --mmproj disables cache reuse; keep it off for text/code.

--jinja is required for the chat template + preserve_thinking.

OpenAI-compatible client (e.g. OpenCode)

Point the client at the server. In single-model mode llama-server ignores the request's model field, so the client's model name is just a label.

Base URL: http://<host>:8080/v1 · API key: any non-empty string (e.g. sk-local)
Model id (what this server reports): qwen3.6-35b-a3b-rocmfp4-mtp

A patched OpenCode that compacts conversation history without invalidating the prompt cache is at PlunderStruck/opencode — pair it with the checkpoint flags to keep long sessions fast.

Downloads last month: 1,563

GGUF

Model size

36B params

Architecture

qwen35moe

Hardware compatibility

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF

Base model

Qwen/Qwen3.6-35B-A3B

Quantized

unsloth/Qwen3.6-35B-A3B-MTP-GGUF

Quantized

(5)

this model