Instructions to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF", filename="Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Use Docker
docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- LM Studio
- Jan
- Ollama
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Ollama:
ollama run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- Unsloth Studio
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
- Pi
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Docker Model Runner:
docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- Lemonade
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Run and chat with the model
lemonade run user.Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF-BF16
List all available models
lemonade list
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chattingUsing HuggingFace Spaces for Unsloth
# No setup required# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting ▗▇▇▇▇▇▇▇▖
▗█▘▝██████▖
▗▛ ▝██████▆▆▆▆▆▆▆▆▆▆▅
▟▛ ▗█████████████████▙▖
▄▄▄▄▄▟▛ ▟████████████████████▖
▗██▌ ▚▖ ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘
▗████▖ ▜▖ ▗█▘
▜█████▙ ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙
▜█████▙ ▝████████████▛ ▜▙
▜█████▙ ▝██████████▛ ▃ ▜▙
▀█████▙▖ ▝████████▘ ▟█▙ ▀▙
▝██████▖ ▝▜█████▘ ▟███▙▂▂▂▂▐█
▟███████▖ ▜███▘ ▗███████████▛
▟█████████▄ ▜▛ ▗███████████▀
▝█████▀ ▗▛ ▗██████▀▀▀▀▀▘
▜██▘ ▗▛ ▟█████▛▘
▜█▇▇▇▇▇▇▇▇▇█▖ ▟█████▛
▝█▖ ▟█████▛
▝███████▀
FORMAT ROCmFP4 4-BIT |
PRECISION 4.44 BPW |
SIZE 18.8 GB |
CONTEXT 262 K |
ARCH MoE · 256 EXPERTS |
ACTIVE / HIDDEN ~3B · 2048 |
DRAFT MTP n-max 5 |
BACKEND VULKAN0 |
The custom
q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
The two are identical except one tensor — the output head (output.weight): same STRIX recipe, same f16 embeddings, same F32 MoE router, same MTP head. They differ only in the head. Both are built without imatrix (see §03).
Run from the folder holding the .gguf + chat_template.jinja:
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
--alias qwen35b-a3b-mtp \
--host 0.0.0.0 \
--port 8080 \
-c 262144 \
-ctk f16 \
-ctv f16 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--jinja \
--parallel 1 \
--metrics \
--no-mmap \
--spec-type draft-mtp \
--spec-draft-device Vulkan0 \
--spec-draft-ngl all \
--spec-draft-type-k f16 \
--spec-draft-type-v f16 \
--spec-draft-n-max 5 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.10 \
--chat-template-file chat_template.jinja \
--reasoning on \
--reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking": true}'
Multi-turn prompt-cache reuse (OpenCode). Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:
- Checkpoints — default
-cpentis 8192, so prompts under 8K never checkpoint. Fix:-cpent 256 -ctxcp 32 --cache-reuse 256. - Thinking —
--reasoning-format deepseek+--chat-template-kwargs '{"preserve_thinking": true}'keeps<think>across turns with cleancontent+reasoning_content. (none= raw tags inline but works with any content-echoing client;deepseek-legacy/autodo not reuse.)
--jinja is required for the chat template + preserve_thinking.
OpenAI-compatible client (e.g. OpenCode). In single-model mode llama-server ignores the request's model field, so the client's model name is just a label.
- Base URL:
http://<host>:8080/v1· API key: any non-empty string (e.g.sk-local) - Model id this server reports:
qwen35b-a3b-mtp
A patched OpenCode that compacts conversation history without invalidating the prompt cache is at PlunderStruck/opencode — pair it with the checkpoint flags to keep long sessions fast.
Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):
MoE decode is naturally fast — only ~3B params active per token — and the F32 router keeps expert selection clean. The router stays F32 for free: the quantizer excludes expert-gating tensors (ffn_gate_inp) from quantization, so routing — which experts each token goes to, a discrete, high-sensitivity decision — keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.
The companion 27B dense quant (same recipe) is at plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.
Build the fork:
git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh
Quantize from the unsloth BF16+MTP GGUF — ROCmFP4 body, genuine f16 embeddings, no imatrix:
# original (ROCmFP4 4-bit output head)
llama-quantize \
--token-embedding-type f16 \
Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
Q4_0_ROCMFP4_STRIX
# headQ6 variant adds the Q6_K output head (one extra flag)
llama-quantize \
--token-embedding-type f16 \
--output-tensor-type q6_K \
Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
Q4_0_ROCMFP4_STRIX
Architecture (qwen35moe): 41 blocks, 2048 hidden, 256 experts, with the nextn_predict_layers=1 MTP head (blk.40.nextn.*) — so self-speculative draft-MTP survives quantization. Format: ROCmFP4 is a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block; tensor-aware. This variant (STRIX-embF16): quality-biased STRIX preset + f16 token embeddings (full precision; a lookup, so ~zero decode cost). Experts (ffn_*_exps) run q4_0_rocmfp4_fast; attention K/V (+ fused QKV) run q4_0_rocmfp4 (dual-scale).
Experimental research build for AMD Strix Halo — hardware-, driver-, model-, and prompt-sensitive, may not reproduce on other GPUs. Not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims. Base BF16 GGUF pinned at revision
5bc3e238d916f48a861bac2f8a1990a0e9b7e98d.
Derivative quantization — verify the base model's license before redistribution / use.
- Downloads last month
- 1,563
16-bit
Install Unsloth Studio (macOS, Linux, WSL)
# Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting