Instructions to use 88plug/Qwen3.6-35B-A3B-W8A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use 88plug/Qwen3.6-35B-A3B-W8A16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="88plug/Qwen3.6-35B-A3B-W8A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

# The processor wraps the tokenizer and the image preprocessor, so image entries are actually loaded
processor = AutoProcessor.from_pretrained("88plug/Qwen3.6-35B-A3B-W8A16")
model = AutoModelForMultimodalLM.from_pretrained("88plug/Qwen3.6-35B-A3B-W8A16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use 88plug/Qwen3.6-35B-A3B-W8A16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "88plug/Qwen3.6-35B-A3B-W8A16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-35B-A3B-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/88plug/Qwen3.6-35B-A3B-W8A16

SGLang

How to use 88plug/Qwen3.6-35B-A3B-W8A16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "88plug/Qwen3.6-35B-A3B-W8A16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-35B-A3B-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "88plug/Qwen3.6-35B-A3B-W8A16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "88plug/Qwen3.6-35B-A3B-W8A16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use 88plug/Qwen3.6-35B-A3B-W8A16 with Docker Model Runner:
```
docker model run hf.co/88plug/Qwen3.6-35B-A3B-W8A16
```

Qwen3.6-35B-A3B-W8A16

INT8 post-training quantization of Qwen/Qwen3.6-35B-A3B, a vision-language model (images + video + text → text) with a hybrid Gated-DeltaNet + sparse-MoE architecture. About 35 GB on disk. Serves a 1M-token context on a single 80 GB A100 or H100.

Most published Qwen3.6 quants degrade silently because they never address tail-expert calibration starvation. This quant addresses it directly.

What Makes This Different

The Problems with Typical MoE Quants

Published quantizations of Qwen3.6-35B-A3B tend to fail in two specific ways:

1. Long-context Q/K outliers. Rotary embeddings produce large-magnitude outliers in Q and K projections that standard quantization handles poorly. The effect is subtle at short context but compounds significantly beyond 32k tokens — exactly where this model's KV-light architecture is most valuable.

2. Tail-expert calibration starvation. With 256 experts per MoE layer and only 8 routed per token, a standard calibration pass visits the top experts hundreds of times and the tail experts zero times. Weights that were never calibrated get quantized with no signal. The result is random noise injected into low-frequency but high-stakes expert paths.

The Solutions Applied Here

AutoRound W8A16: INT8 weights, BF16 activations. Leaving activations in BF16 removes the dominant source of quality loss in W8A8 methods, and because Q and K are activations, it should also keep the long-context Q/K outliers described above out of the quantized path. The accuracy cost of W8A16 versus BF16 is near-zero, and AutoRound's sign-gradient rounding optimization produces significantly better per-weight calibration than GPTQ or AWQ.

moe_calibrate_all_experts=True — every expert sees calibration data. Calibration tokens are routed through all 256 experts regardless of the router's natural selection. Tail experts that would otherwise be calibrated on zero samples get full signal. This is the single most impactful fix for MoE quantization quality and is rarely applied in public quants.
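To make the starvation effect concrete, here is a minimal toy simulation, not the actual calibration code. It assumes a hypothetical Zipf-like expert popularity (real router statistics will differ) and counts how often each of the 256 experts sees a calibration token under natural top-8 routing versus forced all-expert routing. The function names and the skew exponent are illustrative choices.

```python
import math
import random

NUM_EXPERTS = 256   # experts per MoE layer (from this card)
TOP_K = 8           # routed experts per token (from this card)

def route_top_k(log_popularity, rng, k=TOP_K):
    """Toy router: pick k distinct experts with the Gumbel top-k trick."""
    noisy = []
    for lp in log_popularity:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        noisy.append(lp - math.log(-math.log(u)))
    order = sorted(range(len(noisy)), key=noisy.__getitem__, reverse=True)
    return order[:k]

def visit_counts(num_tokens, calibrate_all, seed=0):
    """Count how many calibration tokens reach each expert."""
    rng = random.Random(seed)
    # ASSUMPTION: hypothetical Zipf-like popularity; not measured on the model.
    log_pop = [-1.5 * math.log(rank + 1) for rank in range(NUM_EXPERTS)]
    counts = [0] * NUM_EXPERTS
    for _ in range(num_tokens):
        chosen = range(NUM_EXPERTS) if calibrate_all else route_top_k(log_pop, rng)
        for expert in chosen:
            counts[expert] += 1
    return counts

natural = visit_counts(2000, calibrate_all=False)
forced = visit_counts(2000, calibrate_all=True)
print("natural routing: min/max visits =", min(natural), max(natural))
print("all experts:     min/max visits =", min(forced), max(forced))
```

Under natural routing the visit counts are heavily skewed toward the popular experts, while forcing every token through every expert gives each expert the same number of samples.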

Mixed calibration corpus. 75% UltraChat-200k (instruction-following fidelity) + 25% WikiText-103 (long-context fidelity). 1,024 samples at 2,048 tokens each. Text-only calibration data.
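The calibration budget implied by these numbers can be worked out directly. The sketch below assumes every sample fills the full 2,048-token window, which the card implies but does not state outright; the helper name is illustrative.

```python
CALIBRATION_SAMPLES = 1024
SEQ_LEN = 2048
MIX = {"UltraChat-200k": 0.75, "WikiText-103": 0.25}

def samples_per_source(total, mix):
    """Split the sample budget across sources according to the mix shares."""
    counts = {name: int(total * share) for name, share in mix.items()}
    if sum(counts.values()) != total:
        raise ValueError("mix shares do not divide the sample budget evenly")
    return counts

counts = samples_per_source(CALIBRATION_SAMPLES, MIX)
total_tokens = CALIBRATION_SAMPLES * SEQ_LEN  # upper bound if every sample is full length
print(counts)
print(total_tokens)
```

That is 768 UltraChat-200k samples and 256 WikiText-103 samples, for at most about 2.1M calibration tokens.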

What Stays at BF16

Layer	Reason
`linear_attn.*`	Gated DeltaNet — must stay BF16 per vLLM #40252
`mlp.gate` router weights	Argmax-critical; quantization noise corrupts routing
`shared_expert`	Always-active; no routing protection, high impact
Embedding + LM head	Standard practice; disproportionate perplexity impact

Vision calibration note: Calibration corpus is text-only. The vision encoder (ViT) receives RTN-style INT8 quantization with no calibration signal, which is near-lossless at 8-bit. Text quality is fully calibrated; vision quality is RTN INT8.

Architecture Notes

Qwen3.6-35B-A3B is a vision-language model built on a hybrid Gated-DeltaNet + sparse-MoE backbone. Key characteristics relevant to serving:

Modalities: Image + Video + Text → Text (pipeline tag: image-text-to-text)
Vision encoder: 27-layer ViT, patch_size=16, spatial_merge_size=2, hidden_size=1152
40 LLM layers: 10 full-attention (with KV cache) + 30 Gated DeltaNet linear-attention
256 experts per MoE layer: 8 routed + 1 shared always-active
2 GQA KV heads on full-attention layers only
KV cache exists only for 10/40 LLM layers — dramatically lower KV memory than standard models

At 1M token context with fp8 KV cache, KV memory is approximately 5 GB — versus 80+ GB for a comparable dense model. The quantization preserves this architectural advantage while halving weight memory.

Memory Requirements

Configuration	BF16	This Quant (W8A16)
Weights (disk/VRAM)	~70 GB	~35 GB
KV cache @ 32k ctx (fp8)	~0.2 GB	~0.2 GB
KV cache @ 128k ctx (fp8)	~0.6 GB	~0.6 GB
KV cache @ 262k ctx (fp8)	~1.3 GB	~1.3 GB
KV cache @ 1M ctx (fp8)	~5.0 GB	~5.0 GB
Total VRAM @ 262k ctx	~72 GB	~37 GB
Total VRAM @ 1M ctx	~75 GB	~40 GB
Minimum GPU	2× A100 80GB	1× A100/H100 80GB

KV cache figures are for the 10 full-attention layers only (GQA, 2 KV heads). Linear-attention layers carry state in a fixed recurrent buffer independent of sequence length.
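The KV cache rows can be reproduced with a short calculation. This is a sketch under one loud assumption: a head dimension of 128, which this card does not state but which matches the table when sizes are read as GiB. The layer and head counts come from the architecture notes.

```python
FULL_ATTENTION_LAYERS = 10  # only these layers keep a KV cache (from this card)
KV_HEADS = 2                # GQA KV heads (from this card)
HEAD_DIM = 128              # ASSUMPTION: not stated in this card
BYTES_PER_ELEMENT = 1       # fp8 KV cache

def kv_cache_gib(context_tokens):
    """KV cache size in GiB for one sequence of the given length."""
    # Factor 2 accounts for storing both K and V.
    bytes_per_token = FULL_ATTENTION_LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_ELEMENT
    return bytes_per_token * context_tokens / 2**30

for label, tokens in [("32k", 32_768), ("128k", 131_072), ("262k", 262_144), ("1M", 1_048_576)]:
    print(f"{label:>4}: {kv_cache_gib(tokens):.2f} GiB")
```

Under this assumption the four context lengths come to about 0.16, 0.6, 1.25, and 5.0 GiB per sequence, in line with the table. The figures scale linearly with the number of concurrent sequences.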

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

Footguns to avoid: Do NOT use --quantization turboquant (vLLM #41560). Do NOT use --tensor-parallel-size > 2.

262k Context — High Throughput (Recommended)

Native context, no rope scaling, maximum quality.

docker run --gpus device=0 -p 8080:8000 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-35B-A3B-W8A16 \
  --served-model-name qwen35 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 16 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --generation-config vllm

1M Context — Long-Document / Agentic (YaRN ×4)

Requires the full rope_parameters block via --hf-overrides — vLLM does not synthesize YaRN config automatically for this architecture.

docker run --gpus device=0 -p 8080:8000 \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-35B-A3B-W8A16 \
  --served-model-name qwen35 \
  --kv-cache-dtype fp8 \
  --max-model-len 1048576 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.97 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --hf-overrides '{"text_config":{"rope_parameters":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144,"rope_theta":10000000,"partial_rotary_factor":0.25,"mrope_interleaved":true,"mrope_section":[11,11,10]}}}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

--max-num-batched-tokens 4096 is required at 1M context on an 80 GB GPU; higher values run out of memory (OOM).
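Hand-writing the --hf-overrides JSON is error-prone, so one option is to generate it. The sketch below rebuilds the same rope_parameters block from the command above and checks that the YaRN factor times the native context equals the requested --max-model-len. The variable names are illustrative.

```python
import json

NATIVE_CONTEXT = 262_144   # native context length used in the 262k command
YARN_FACTOR = 4.0          # static YaRN scaling factor from the 1M command

rope_parameters = {
    "rope_type": "yarn",
    "factor": YARN_FACTOR,
    "original_max_position_embeddings": NATIVE_CONTEXT,
    "rope_theta": 10000000,
    "partial_rotary_factor": 0.25,
    "mrope_interleaved": True,
    "mrope_section": [11, 11, 10],
}

# The extended context is the native context scaled by the YaRN factor.
max_model_len = int(NATIVE_CONTEXT * YARN_FACTOR)

# Compact separators give a single shell-friendly argument.
hf_overrides = json.dumps(
    {"text_config": {"rope_parameters": rope_parameters}},
    separators=(",", ":"),
)
print(max_model_len)
print(hf_overrides)
```

The printed JSON can be passed in single quotes to --hf-overrides, and the printed length to --max-model-len.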

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen35",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)

Recommended Sampling Parameters

Mode	Temperature	Top-P	Top-K	Min-P	Use When
Thinking (default)	0.6	0.95	20	0.0	Reasoning, math, code
Non-thinking	0.7	0.8	20	0.0	Chat, creative, fast response

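Toggle thinking per request with chat_template_kwargs={"enable_thinking": True} or chat_template_kwargs={"enable_thinking": False}. The model defaults to thinking enabled; the 262k serving command above changes the server default to disabled via --default-chat-template-kwargs.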
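The two presets in the table can be wrapped in a small helper for the Python client. This is a sketch: the helper name is hypothetical, and it assumes the server accepts top_k and min_p as extra request fields through extra_body, as vLLM's OpenAI-compatible server does for its additional sampling parameters.

```python
# Presets copied from the sampling table, keyed by whether thinking is enabled.
SAMPLING = {
    True:  {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    False: {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0},
}

def request_kwargs(thinking):
    """Build keyword arguments for client.chat.completions.create()."""
    preset = SAMPLING[thinking]
    return {
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        # ASSUMPTION: non-OpenAI fields are forwarded to the server via extra_body.
        "extra_body": {
            "top_k": preset["top_k"],
            "min_p": preset["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }

print(request_kwargs(thinking=True))
```

With the client from the Python Client section, a call would then look like client.chat.completions.create(model="qwen35", messages=messages, **request_kwargs(True)).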

Quality

Targets

Metric	Target
KL divergence KL(quant‖BF16)	< 0.005
MMLU recovery vs BF16	≥ 99.7%
GSM8K-Platinum recovery vs BF16	≥ 99.7%
RULER@128k recovery vs BF16	≥ 99%

These targets drove the recipe design — W8A16 instead of W8A8 to eliminate activation quantization error, full expert calibration to eliminate tail-expert noise, and mixed calibration corpus to preserve both instruction-following and long-context fidelity.
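For anyone running their own evals, the two kinds of target can be computed as follows. This is a minimal pure-Python sketch for a single token position with toy logits; a real evaluation would average the KL term over many positions, and the benchmark scores in the usage line are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(quant_logits, bf16_logits):
    """KL(quant || BF16) for one next-token distribution, in nats."""
    p = softmax(quant_logits)
    q = softmax(bf16_logits)
    # Toy sketch: assumes no probability underflows to exactly zero in q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def recovery_percent(quant_score, bf16_score):
    """Benchmark recovery of the quantized model relative to BF16."""
    return 100.0 * quant_score / bf16_score

print(kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.1]))
print(recovery_percent(79.8, 80.0))  # hypothetical scores, not measured results
```

The direction matters: the table specifies KL(quant‖BF16), so the quantized distribution is the first argument.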

Full benchmark results will be added after publication. If you run evals, please open an issue or PR.

vs. Other Qwen3.6-35B-A3B Quants

No other publisher has released MMLU-Pro, GPQA, or RULER numbers for any Qwen3.6 quant. This is the complete published landscape as of May 2026:

Quant	Format	Size	KL-mean	KL-max	PPL	Notes
88plug W8A16 (this)	compressed-tensors	~35 GB	< 0.005 (target)	—	—	vLLM native, GPU only
bartowski Q8_0	GGUF	37.8 GB	0.0059	9.72	6.720	llama.cpp
mudler APEX I-Balanced	GGUF imatrix	24 GB	0.0103	4.53	6.727	llama.cpp
mudler APEX I-Quality	GGUF imatrix	22 GB	0.0141	—	6.735	llama.cpp
RedHatAI NVFP4	NVFP4	~20 GB	—	—	—	Blackwell-only
QuantTrio AWQ	AWQ	24 GB	—	—	—	No benchmarks published
Qwen FP8 (official)	FP8	~38 GB	—	—	—	vLLM only

Why compressed-tensors beats GGUF for GPU inference:

Marlin INT8 kernel in vLLM is 30–50% faster than llama.cpp GGUF Q8_0 at batch > 1
No CPU↔GPU weight transfer — weights stay on GPU, activations stay BF16
RULER@128k and RULER@262k: no competitor has published numbers at either length; this quant aims to be the first

SGLang

SGLang v0.5.8 offers RadixAttention for prefix-heavy workloads. Run against the BF16 base model — compressed-tensors is vLLM-native only.

Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified. Confirm before production use.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

llama.cpp (GGUF)

For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base checkpoint — not from compressed-tensors weights. Vision requires a separate mmproj GGUF (libmtmd).

# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)

# Convert base model (text trunk); the first argument is a local checkpoint directory
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
  --outtype bf16 \
  --outfile Qwen3.6-35B-A3B-BF16.gguf

# Vision projector (required for image input)
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
  --mmproj --outfile Qwen3.6-35B-A3B-mmproj.gguf

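# Quantize text trunk
llama-quantize Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-Q8_0.gguf Q8_0

# IQ4_XS needs an importance matrix computed from the calibration text first
llama-imatrix -m Qwen3.6-35B-A3B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-IQ4_XS.gguf IQ4_XS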

# Serve (text + vision)
llama-server \
  --model Qwen3.6-35B-A3B-Q8_0.gguf \
  --mmproj Qwen3.6-35B-A3B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 131072 \
  --port 8081

Benchmarks

Results pending. Will be published before first HuggingFace release.

Engine	Format	Batch	ctx	tok/s	TTFT p50	TTFT p99	VRAM
vLLM v0.21.0	W8A16	1	32k	—	—	—	—
vLLM v0.21.0	W8A16	8	32k	—	—	—	—
vLLM v0.21.0	W8A16	1	128k	—	—	—	—
SGLang v0.5.8	BF16 (baseline)	1	32k	—	—	—	—
llama.cpp b9297	Q8_0 GGUF	1	32k	—	—	—	—
llama.cpp b9297	IQ4_XS GGUF	1	32k	—	—	—	—

Hardware: A6000 48 GB, CUDA 12.9, driver 570.

Limitations

Static YaRN for 1M context. The 1M serving command uses static YaRN scaling (factor=4.0) applied at inference time via --hf-overrides. This is not fine-tuned YaRN — it is a zero-cost extrapolation. Quality at the outermost context window (750k–1M tokens) may degrade relative to fine-tuned long-context variants. For critical long-context workloads, validate on your specific task before production deployment.

Linear attention state reset. The 30 Gated-DeltaNet layers maintain recurrent state, not a KV cache. This state is reset between requests. Stateful multi-turn inference within a session works correctly; cross-request state continuity is not supported by the current vLLM serving path.

No activation quantization. W8A16 means activations run in BF16. Memory bandwidth savings are on the weight side only; activation memory is unchanged versus BF16. This is a deliberate quality-vs-compression tradeoff.

Calibration distribution. Calibration used UltraChat-200k and WikiText-103. Tasks with significantly different token distributions (e.g., code-heavy, mathematical, or non-English) may see slightly higher KL divergence than the headline targets. Recalibration with domain-specific data is straightforward using the recipe below.

Quantization Recipe (Reproducibility)

# Core configuration
quantization_method = "autoround"
w_bits = 8
w_dtype = "int8"
a_dtype = "bf16"   # activations NOT quantized

moe_calibrate_all_experts = True  # every expert sees calibration data

calibration_dataset = {
    "ultrachat_200k": 0.75,
    "wikitext_103_raw": 0.25,
}
calibration_samples = 1024
calibration_seqlen = 2048

# BF16-preserved layers
skip_layers = [
    "linear_attn.*",     # Gated DeltaNet
    "mlp.gate",          # MoE router
    "shared_expert",     # always-active expert
    "embed_tokens",      # embedding
    "lm_head",           # output projection
]
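To see which modules the skip list would leave in BF16, the patterns can be tested against module names. This is a sketch only: the matching rule (each pattern treated as a regex anchored on dotted-name boundaries) and the example module names are assumptions, and the actual quantization tool may interpret the patterns differently.

```python
import re

# Patterns copied from the recipe above.
SKIP_LAYERS = ["linear_attn.*", "mlp.gate", "shared_expert", "embed_tokens", "lm_head"]

def stays_bf16(module_name):
    """Return True if any skip pattern matches on a dotted-name boundary."""
    for pattern in SKIP_LAYERS:
        # ASSUMPTION: patterns are regexes that must start and end at a "." or string edge.
        if re.search(r"(^|\.)" + pattern + r"($|\.)", module_name):
            return True
    return False

# Hypothetical module names, for illustration only.
examples = [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert.down_proj",
    "model.layers.0.mlp.experts.3.gate_proj",
    "model.layers.3.self_attn.q_proj",
    "lm_head",
]
for name in examples:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'INT8'}")
```

Anchoring on name boundaries keeps the router pattern mlp.gate from also matching a routed expert's gate_proj, which should be quantized.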

Related Work

Qwen/Qwen3.6-35B-A3B — base model
AutoRound — sign gradient-based weight rounding optimization
vLLM compressed-tensors — inference backend
vLLM #40252 — Gated DeltaNet BF16 requirement

Citation

If you use this model, please cite the base model:

@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Qwen3.6-35B-A3B-W4A16 (INT4, ~28 GB) · Qwen3.6-35B-A3B-W8A16 (INT8, ~35 GB)

Browse all releases → huggingface.co/88plug

Downloads last month: 501

Safetensors
Model size: 35B params
Tensor types: I64, I32, BF16

Model tree for 88plug/Qwen3.6-35B-A3B-W8A16
Base model: Qwen/Qwen3.6-35B-A3B
Quantized (473): this model