Instructions to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF",
	filename="Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Use Docker

docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

LM Studio
Jan
Ollama
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Ollama:
```
ollama run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
```

Unsloth Studio

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
```

Lemonade

How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF-BF16

List all available models

lemonade list

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3.6-35B-A3B-MTP
4-BIT ROCmFP4 · MIXTURE-OF-EXPERTS (A3B) · MTP SELF-SPECULATIVE DECODE · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
4.44 BPW

      SIZE
18.8 GB

      CONTEXT
262 K

    

      ARCH
MoE · 256 EXPERTS

      ACTIVE / HIDDEN
~3B · 2048

      DRAFT
MTP n-max 5

      BACKEND
VULKAN0

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16" badge — its parser can't read ROCmFP4 and mislabels by the genuinely-f16 token embeddings. These are ~4.44 bpw 4-bit files; pick by filename.

01 · FILES

File	Size	Output head	Pick if
`…-STRIX-embF16.gguf`	18.8 GB	ROCmFP4 4-bit	fastest decode (the original)
`…-STRIX-embF16-headQ6.gguf`	~18.9 GB	Q6_K	a notch more faithful (small decode cost)

The two are identical except one tensor — the output head (output.weight): same STRIX recipe, same f16 embeddings, same F32 MoE router, same MTP head. They differ only in the head. Both are built without imatrix (see §03).

NOTE // The Q6-head variant raises the output head — the layer that turns the final hidden state into the next-token choice — from 4-bit ROCmFP4 to standard Q6_K, leaving everything else untouched. It's the output-side complement to running f16 token embeddings, sharpening both ends of the model. The Q6-head trade-off is not separately benchmarked on this 35B.

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  --alias qwen35b-a3b-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk f16 \
  -ctv f16 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}'

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan (KHR_coopmat) — beats ROCm here for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K) — loads with ~92 GB free; KV footprint is modest
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch (256 = prefill optimum) · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache — how we run it; drop to `q8_0`/`q4_0` to use less memory
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0`	Qwen3.6 "precise coding" sampling (1.0 for general)
`--spec-type draft-mtp · --spec-draft-n-max 5`	built-in MTP head, self-speculative; draft depth 5
`--spec-draft-device Vulkan0 · -ngl all · type-k/v f16`	draft head on Vulkan, fully offloaded, f16 KV
`--chat-template-file chat_template.jinja`	froggeric unified Qwen3.6 template (tool calls + think-toggle)
`--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}`	keep `<think>` across turns with clean `content`+`reasoning_content`, so cross-turn cache survives
`--jinja --parallel 1 --metrics --no-mmap`	apply template · single slot · metrics · weights in RAM

Multi-turn prompt-cache reuse (OpenCode). Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:

Checkpoints — default -cpent is 8192, so prompts under 8K never checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256.
Thinking — --reasoning-format deepseek + --chat-template-kwargs '{"preserve_thinking": true}' keeps <think> across turns with clean content+reasoning_content. (none = raw tags inline but works with any content-echoing client; deepseek-legacy/auto do not reuse.)

--jinja is required for the chat template + preserve_thinking.

OpenAI-compatible client (e.g. OpenCode). In single-model mode llama-server ignores the request's model field, so the client's model name is just a label.

Base URL: http://<host>:8080/v1 · API key: any non-empty string (e.g. sk-local)
Model id this server reports: qwen35b-a3b-mtp

A patched OpenCode that compacts conversation history without invalidating the prompt cache is at PlunderStruck/opencode — pair it with the checkpoint flags to keep long sessions fast.

03 · PERFORMANCE & QUALITY

Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):

DECODE	78–90 t/s (Vulkan / Strix Halo)
MTP DRAFT ACCEPTANCE	~0.6–0.95 (content-dependent)
CONTEXT @ LOAD	full 262144 with ~92 GB free
QUANTIZATION	non-imatrix · F32 MoE router

MoE decode is naturally fast — only ~3B params active per token — and the F32 router keeps expert selection clean. The router stays F32 for free: the quantizer excludes expert-gating tensors (ffn_gate_inp) from quantization, so routing — which experts each token goes to, a discrete, high-sensitivity decision — keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.

NOTE // f16 KV is how we run it (128 GB unified affords it; drop to q8_0/q4_0 to save memory). The Q6-head variant is not separately benchmarked on this 35B; the file is ~0.14 GB larger.

The companion 27B dense quant (same recipe) is at plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.

04 · BUILD (REPRODUCIBLE)

Build the fork:

git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

Quantize from the unsloth BF16+MTP GGUF — ROCmFP4 body, genuine f16 embeddings, no imatrix:

# original (ROCmFP4 4-bit output head)
llama-quantize \
  --token-embedding-type f16 \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
  Q4_0_ROCMFP4_STRIX

# headQ6 variant adds the Q6_K output head (one extra flag)
llama-quantize \
  --token-embedding-type f16 \
  --output-tensor-type q6_K \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  Q4_0_ROCMFP4_STRIX

Architecture (qwen35moe): 41 blocks, 2048 hidden, 256 experts, with the nextn_predict_layers=1 MTP head (blk.40.nextn.*) — so self-speculative draft-MTP survives quantization. Format: ROCmFP4 is a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block; tensor-aware. This variant (STRIX-embF16): quality-biased STRIX preset + f16 token embeddings (full precision; a lookup, so ~zero decode cost). Experts (ffn_*_exps) run q4_0_rocmfp4_fast; attention K/V (+ fused QKV) run q4_0_rocmfp4 (dual-scale).

Experimental research build for AMD Strix Halo — hardware-, driver-, model-, and prompt-sensitive, may not reproduce on other GPUs. Not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims. Base BF16 GGUF pinned at revision 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d.

05 · LINEAGE & CREDITS

BASE MODEL	Qwen3.6-35B-A3B (Qwen team) — derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution / use
BF16 GGUF SOURCE	unsloth/Qwen3.6-35B-A3B-MTP-GGUF @ `5bc3e238d916f48a861bac2f8a1990a0e9b7e98d`
FORMAT + RUNTIME	charlie12345/rocmfp4-llama (based on llama.cpp, MIT)
CHAT TEMPLATE	froggeric/Qwen-Fixed-Chat-Templates

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month: 1,563

GGUF

Model size

36B params

Architecture

qwen35moe

Hardware compatibility

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF

Base model

Qwen/Qwen3.6-35B-A3B

Quantized

unsloth/Qwen3.6-35B-A3B-MTP-GGUF

Quantized

(5)

this model

Collection including plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF

ROCmFP4 MTP · Strix Halo

Collection

Self-speculative MTP quants in custom ROCmFP4 4-bit for AMD Strix Halo (gfx1151). Needs the charlie12345/rocmfp4-llama fork. • 5 items • Updated about 10 hours ago

FORMAT ROCmFP4 4-BIT	PRECISION 4.44 BPW	SIZE 18.8 GB	CONTEXT 262 K
ARCH MoE · 256 EXPERTS	ACTIVE / HIDDEN ~3B · 2048	DRAFT MTP n-max 5	BACKEND VULKAN0