Chadrock-35B Ace Saber ROCmFP4 MTP

Chadrock-35B Ace Saber is a ROCmFP4/MTP GGUF for AMD Ryzen AI Max+ 395 / Strix Halo systems.

The model behavior comes from the Ace Saber build by @DJLougen. The runtime speed comes from charlie12345, also known as @Italianclownz on X/Twitter, and his custom ROCmFP4 llama.cpp fork.

This GGUF will not run correctly with stock llama.cpp. You need the custom charlie12345/rocmfp4-llama build because this file uses ROCmFP4 tensor types that upstream llama.cpp does not currently understand.

The model file is already provided here. You do not need to rebuild or quantize the model.

Why This Mix

Ace Saber gives the model its coding, agentic, and tool-use behavior. Chadrock/ROCmFP4 gives it the speed profile needed to feel good locally on AMD unified-memory hardware.

The goal is not just another Qwen3.6 quant. The goal is:

- Ace Saber behavior from @DJLougen
- Qwen3.6 35B-A3B MoE efficiency
- MTP speculative decoding
- ROCmFP4 tensor-aware quantization
- high-throughput local serving on Ryzen AI Max+ 395 / Strix Halo

Technical Metadata

Hugging Face may round the parsed GGUF parameter count to 36B in its automatic badge. This release belongs to the Qwen3.6 35B-A3B MoE family: about 35B total parameters with roughly 3B active parameters per token.

Field	Value
model size	`35B-A3B` MoE
total parameters	`35B` class
active parameters	`~3B` class
architecture	`qwen35moe`
direct upstream GGUF	`GestaltLabs/Qwen3.6-35B-A3B-NSC-ACE-SABER-GGUF-MTP`
base family	`Qwen/Qwen3.6-35B-A3B`
local runtime format	ROCmFP4 Chadrock GGUF

Speed

The table below reports fresh HumanEval generation speed on AMD Ryzen AI Max+ 395 / Strix Halo with the included Ace Saber ROCmFP4 GGUF. The run uses the Chadrock Vulkan d2 quality profile at 32k context, with f16 main/draft KV, draft-MTP depth 2, and deterministic decoding.

Metric	Result
HumanEval tasks	`164`
completion tokens generated	`48,824`
aggregate llama-server prompt speed	`~543.56 tok/s`
aggregate llama-server eval speed	`~101.31 tok/s`
peak live prompt speed	`~544.93 tok/s`
peak live eval speed	`~104.35 tok/s`
run context	`32,768`

These numbers come from the 2026-06-07 EvalPlus HumanEval rerun that produced the 155/164 base pass@1 and 148/164 HumanEval+ pass@1 results. The rerun replaced the older ROCm/q8/q4/depth-3 serving recipe because local tuning found that the faster and more stable Chadrock profile on this Strix Halo box uses Vulkan for both target and draft, f16 KV, larger batch/ubatch, and draft-MTP depth 2.
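As a rough cross-check, the completion-token count and the aggregate eval speed together imply a total decode time. The sketch below only redoes arithmetic on the figures in the Speed table. It assumes the aggregate eval speed is total generated tokens divided by total decode time, which this card does not state explicitly, and it ignores prompt processing.

```python
# Figures copied from the Speed table above.
COMPLETION_TOKENS = 48_824
EVAL_TOK_PER_S = 101.31
TASKS = 164


def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Time needed to generate `tokens` at a steady rate of `tok_per_s`."""
    return tokens / tok_per_s


total_s = decode_seconds(COMPLETION_TOKENS, EVAL_TOK_PER_S)
tokens_per_task = COMPLETION_TOKENS / TASKS

print(f"implied decode time: {total_s:.0f} s ({total_s / 60:.1f} min)")
print(f"mean completion length: {tokens_per_task:.0f} tokens per task")
```

Under that assumption the full 164-task run spends on the order of eight minutes in decoding, with completions averaging a few hundred tokens each.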

HumanEval

This model also posts a strong HumanEval result for a local GGUF run:

Model / row	HumanEval base pass@1	HumanEval+ pass@1
Chadrock-35B Ace Saber ROCmFP4, 32k Vulkan d2 rerun	`155/164 = 94.51%`	`148/164 = 90.24%`
earlier Chadrock-35B Ace Saber ROCmFP4 run	`157/164 = 95.73%`	`149/164 = 90.85%`
recorded stock Qwen3.6-27B UD-Q8_K_XL	`154/164 = 93.90%`	`149/164 = 90.85%`

The fresh 32k rerun still beats the recorded stock 27B row on base HumanEval by one task (155 vs. 154), while both the earlier Chadrock run and the stock 27B row remain one task higher on HumanEval+ (149 vs. 148).
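The percentages in the table are plain passed/total ratios over the 164 HumanEval tasks. A minimal sketch that recomputes them from the counts follows; the dictionary layout and the function name are illustrative and not part of EvalPlus.

```python
# (passed, total) pairs copied from the HumanEval table above.
ROWS = {
    "Chadrock 32k Vulkan d2 rerun": {"base": (155, 164), "plus": (148, 164)},
    "earlier Chadrock run": {"base": (157, 164), "plus": (149, 164)},
    "stock Qwen3.6-27B UD-Q8_K_XL": {"base": (154, 164), "plus": (149, 164)},
}


def pass_at_1(passed: int, total: int) -> float:
    """pass@1 as a percentage, rounded to two decimals like the table."""
    return round(100 * passed / total, 2)


for name, scores in ROWS.items():
    base = pass_at_1(*scores["base"])
    plus = pass_at_1(*scores["plus"])
    print(f"{name}: base {base:.2f}%  plus {plus:.2f}%")
```

With a single greedy sample per task, one task is worth about 0.61 percentage points, so the gaps between these rows are one to two tasks.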

BigCodeBench-Hard

The same tuned Chadrock Vulkan d2 family was also run on BigCodeBench-Hard-Instruct:

Benchmark	Result
BigCodeBench-Hard-Instruct pass@1	`47/148 = 31.76%`
generation wall time	`799 s`
aggregate prompt speed	`~624.06 tok/s`
aggregate generation speed	`~100.12 tok/s`

This is a harder instruction-coding benchmark than HumanEval and is included as a sanity check that the speed-tuned runtime still produces usable code under a broader task mix.

Run With llama-server

Build Charlie's custom llama.cpp once, download this GGUF, then run:

HSA_OVERRIDE_GFX_VERSION=11.5.1 \
GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
/path/to/rocmfp4-llama/build-strix-rocmfp4/bin/llama-server \
  -m Qwen3.6-35B-A3B-NSC-ACE-SABER-MTP-F16-to-ROCmFP4-STRIX_LEAN.gguf \
  --alias chadrock-35b-ace-saber \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  -c 262144 \
  --reasoning off \
  --reasoning-format none \
  --reasoning-budget 0 \
  -ngl 999 \
  -fa on \
  -sm row \
  -dev Vulkan0 \
  -b 8192 \
  -ub 2048 \
  -t 16 \
  -tb 32 \
  -ctk f16 \
  -ctv f16 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-threads 16 \
  --spec-draft-threads-batch 32 \
  --spec-draft-n-max 2 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --poll 100 \
  --poll-batch 1 \
  --spec-draft-poll 1 \
  --spec-draft-poll-batch 1 \
  --temp 0 \
  --min-p 0.0 \
  --top-p 0.9 \
  --top-k 20 \
  --repeat-penalty 1.0 \
  --seed 123 \
  --cache-ram 0 \
  --parallel 1 \
  --no-mmproj \
  --metrics

Use --parallel 1 for MTP. Multi-slot serving changes the MTP behavior and is not the intended profile.

The benchmark table above used -c 32768 for the EvalPlus rerun. For general local coding and agent work, use the full -c 262144 context unless your hardware cannot hold it.
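Once the server is up, it can be queried over HTTP. The sketch below assumes the fork keeps upstream llama-server's OpenAI-compatible /v1/chat/completions endpoint, which this card does not confirm. The host, port, and model alias are taken from the command above; the function names and the max_tokens default are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # --host and --port from the command above
MODEL = "chadrock-35b-ace-saber"       # --alias from the command above


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style chat completion."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example, with the server running:
#   print(chat("Write a Python function that reverses a string."))
```

Because the profile runs with --parallel 1, send requests one at a time rather than concurrently.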

About Ace Saber

The source checkpoint is GestaltLabs/Qwen3.6-35B-A3B-NSC-ACE-SABER, based on Qwen/Qwen3.6-35B-A3B. The direct GGUF-MTP source is GestaltLabs/Qwen3.6-35B-A3B-NSC-ACE-SABER-GGUF-MTP.

Training order:

Qwen3.6-35B-A3B -> NSC-ACE -> SABER

NSC-ACE uses multiple steered rollouts from the same model and rewards convergence across latent behavior modes, especially for tool-call structure, reasoning wrappers, self-consistency, and avoiding repeated loops.

SABER is the final calibration pass. The source model card reports 98.33% HarmBench-300 compliance and final KLD 0.025383937664711.

About Chadrock / ROCmFP4

Charlie's ROCmFP4 method adds AMD-focused GGUF tensor formats and backend paths to llama.cpp.

ROCmFP4 is not stock Q4, MXFP4, or NVFP4. It uses custom Codebook10 4-bit weights, finite unsigned E4M3 scale semantics, tensor-aware presets, ROCm/HIP kernels, Vulkan shader support, and MTP regression guards.

Why it matters: Ryzen AI Max+ 395 / Strix Halo has a large unified-memory pool, but decode speed still depends heavily on bandwidth, tensor layout, and draft-token acceptance. ROCmFP4 is designed to make this class of AMD machine fast enough for serious local long-context use.
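To make the dependence on draft-token acceptance concrete, here is a simplified model of speculative decoding. It is a textbook approximation and not a measurement of this runtime: it assumes each drafted token is accepted independently with the same probability and that verification stops at the first rejection. The acceptance rates in the loop are illustrative values, not results from this model.

```python
def expected_tokens_per_step(accept_prob: float, draft_depth: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each drafted token is accepted independently with probability
    `accept_prob` and that verification stops at the first rejection. The
    target pass always contributes one token of its own, so the result is
    1 + a + a**2 + ... + a**draft_depth.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_depth < 0:
        raise ValueError("draft_depth must be non-negative")
    return sum(accept_prob ** k for k in range(draft_depth + 1))


# Depth 2 matches --spec-draft-n-max 2 in the server command.
for a in (0.5, 0.7, 0.9):
    print(f"accept={a:.1f}  tokens/step={expected_tokens_per_step(a, 2):.2f}")
```

At depth 2 the ceiling is three tokens per target pass, so under this model higher acceptance matters more than deeper drafting once bandwidth is the limit.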

Build The Required llama.cpp

The GGUF is already provided. You only need to build the custom llama.cpp server once:

git clone https://github.com/charlie12345/rocmfp4-llama.git
cd rocmfp4-llama
git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

The server binary will be here:

build-strix-rocmfp4/bin/llama-server

File

File	SHA256
`Qwen3.6-35B-A3B-NSC-ACE-SABER-MTP-F16-to-ROCmFP4-STRIX_LEAN.gguf`	`6a635d1d8ac4af8f2c4ca6ff528bc6bad9b3a6d45e8630ef6e5728f04898eeed`
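After downloading, the file can be checked against the SHA256 above. This is a minimal sketch using only the Python standard library; the helper names are illustrative, and the path passed to verify is wherever you saved the GGUF.

```python
import hashlib
from pathlib import Path

# Values copied from the File table above.
EXPECTED_SHA256 = "6a635d1d8ac4af8f2c4ca6ff528bc6bad9b3a6d45e8630ef6e5728f04898eeed"
GGUF_NAME = "Qwen3.6-35B-A3B-NSC-ACE-SABER-MTP-F16-to-ROCmFP4-STRIX_LEAN.gguf"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()


# Example, run from the download directory:
#   print(verify(Path(GGUF_NAME)))
```

A mismatch usually means an interrupted or corrupted download; fetch the file again before debugging the runtime.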

Credits

- @DJLougen: Ace Saber / NSC-ACE SABER model build.
- charlie12345 / @Italianclownz: ROCmFP4 llama.cpp fork, Strix Halo build path, and AMD-focused MTP runtime work.
- Qwen: base Qwen3.6-35B-A3B model.

Notes

This is an experimental AMD ROCmFP4/MTP build. Performance depends on driver version, clocks, prompt shape, MTP acceptance, and serving flags. The numbers above are local reproducible measurements, not universal llama.cpp claims.
