Qwen3.6-35B-A3B-W8A16
INT8 post-training quantization of Qwen/Qwen3.6-35B-A3B — a vision-language model (images + video + text → text) with a hybrid Gated-DeltaNet + sparse-MoE architecture. 35 GB on disk. Runs 1M token context on one A100/H100.
Most published Qwen3.6 quants degrade silently because they never address tail-expert calibration starvation. This quant addresses it directly.
What Makes This Different
The Problems with Typical MoE Quants
Published quantizations of Qwen3.6-35B-A3B tend to fail in two specific ways:
1. Long-context Q/K outliers. Rotary embeddings produce large-magnitude outliers in Q and K projections that standard quantization handles poorly. The effect is subtle at short context but compounds significantly beyond 32k tokens — exactly where this model's KV-light architecture is most valuable.
2. Tail-expert calibration starvation. With 256 experts per MoE layer and only 8 routed per token, a standard calibration pass visits the top experts hundreds of times and the tail experts zero times. Weights that were never calibrated get quantized with no signal. The result is random noise injected into low-frequency but high-stakes expert paths.
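The starvation effect in item 2 can be illustrated with a toy simulation. This is only a sketch: the power-law expert popularity (`skew`) and the helper `simulate_expert_visits` are assumptions made for illustration, not measured router statistics from this model.

```python
import heapq
import random


def simulate_expert_visits(num_tokens, num_experts=256, top_k=8, skew=1.2, seed=0):
    """Count how often each expert is routed to under a skewed toy router."""
    rng = random.Random(seed)
    # Assumed popularity: expert at rank r gets weight 1 / (r + 1) ** skew.
    weights = [1.0 / (rank + 1) ** skew for rank in range(num_experts)]
    visits = [0] * num_experts
    for _ in range(num_tokens):
        # Weighted sampling without replacement (Efraimidis-Spirakis keys):
        # the top_k largest keys are the experts chosen for this token.
        keys = [rng.random() ** (1.0 / w) for w in weights]
        chosen = heapq.nlargest(top_k, range(num_experts), key=keys.__getitem__)
        for expert in chosen:
            visits[expert] += 1
    return visits


visits = simulate_expert_visits(num_tokens=2000)
print("most-visited expert:", max(visits))
print("least-visited expert:", min(visits))
print("experts with fewer than 10 visits:", sum(v < 10 for v in visits))
```

Raising `skew` widens the gap between head and tail experts. Forcing every token through every expert would instead give each expert exactly `num_tokens` visits.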
The Solutions Applied Here
AutoRound W8A16 — INT8 weights, BF16 activations. Activations are left in BF16. This eliminates the dominant source of quality loss in W8A8 methods. The accuracy cost of W8A16 versus BF16 is near-zero; AutoRound's sign-gradient rounding optimization produces significantly better per-weight calibration than GPTQ or AWQ.
moe_calibrate_all_experts=True — every expert sees calibration data. Calibration tokens are routed through all 256 experts regardless of the router's natural selection. Tail experts that would otherwise be calibrated on zero samples get full signal. This is the single most impactful fix for MoE quantization quality and is rarely applied in public quants.
Mixed calibration corpus. 75% UltraChat-200k (instruction-following fidelity) + 25% WikiText-103 (long-context fidelity). 1,024 samples at 2,048 tokens each. Text-only calibration data.
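As a quick check of what that mix implies, the sketch below splits the sample budget by source and totals the calibration tokens. The helper name `calibration_budget` and the dictionary keys are illustrative, not part of any library.

```python
def calibration_budget(total_samples=1024, seq_len=2048, mix=None):
    """Return (samples per source, total calibration tokens) for a corpus mix."""
    if mix is None:
        mix = {"ultrachat_200k": 0.75, "wikitext_103": 0.25}
    per_source = {name: round(total_samples * share) for name, share in mix.items()}
    total_tokens = total_samples * seq_len
    return per_source, total_tokens


per_source, total_tokens = calibration_budget()
print(per_source)    # samples drawn from each dataset
print(total_tokens)  # tokens seen during calibration
```

With the stated settings this gives 768 UltraChat samples, 256 WikiText samples, and about 2.1M calibration tokens in total.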
What Stays at BF16
| Layer | Reason |
|---|---|
| linear_attn.* | Gated DeltaNet; must stay BF16 per vLLM #40252 |
| mlp.gate router weights | Argmax-critical; quantization noise corrupts routing |
| shared_expert | Always-active; no routing protection, high impact |
| Embedding + LM head | Standard practice; disproportionate perplexity impact |
Vision calibration note: Calibration corpus is text-only. The vision encoder (ViT) receives RTN-style INT8 quantization with no calibration signal, which is near-lossless at 8-bit. Text quality is fully calibrated; vision quality is RTN INT8.
Architecture Notes
Qwen3.6-35B-A3B is a vision-language model built on a hybrid Gated-DeltaNet + sparse-MoE backbone. Key characteristics relevant to serving:
- Modalities: Image + Video + Text → Text (pipeline tag: image-text-to-text)
- Vision encoder: 27-layer ViT, patch_size=16, spatial_merge_size=2, hidden_size=1152
- 40 LLM layers: 10 full-attention (with KV cache) + 30 Gated DeltaNet linear-attention
- 256 experts per MoE layer: 8 routed + 1 shared always-active
- 2 GQA KV heads on full-attention layers only
- KV cache exists only for 10/40 LLM layers — dramatically lower KV memory than standard models
At 1M token context with fp8 KV cache, KV memory is approximately 5 GB — versus 80+ GB for a comparable dense model. The quantization preserves this architectural advantage while halving weight memory.
Memory Requirements
| Configuration | BF16 | This Quant (W8A16) |
|---|---|---|
| Weights (disk/VRAM) | ~70 GB | ~35 GB |
| KV cache @ 32k ctx (fp8) | ~0.2 GB | ~0.2 GB |
| KV cache @ 128k ctx (fp8) | ~0.6 GB | ~0.6 GB |
| KV cache @ 262k ctx (fp8) | ~1.3 GB | ~1.3 GB |
| KV cache @ 1M ctx (fp8) | ~5.0 GB | ~5.0 GB |
| Total VRAM @ 262k ctx | ~72 GB | ~37 GB |
| Total VRAM @ 1M ctx | ~75 GB | ~40 GB |
| Minimum GPU | 2× A100 80GB | 1× A100/H100 80GB |
KV cache figures are for the 10 full-attention layers only (GQA, 2 KV heads). Linear-attention layers carry state in a fixed recurrent buffer independent of sequence length.
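The KV cache rows can be approximated with a back-of-the-envelope calculation. In the sketch below, the layer count, KV head count, and fp8 element size come from this card. `head_dim=128` is an assumption, chosen because it roughly reproduces the table; check the model's config.json for the real value.

```python
def kv_cache_bytes(context_tokens, full_attn_layers=10, kv_heads=2,
                   head_dim=128, bytes_per_elem=1):
    """Approximate KV cache size for the full-attention layers only.

    head_dim=128 is an assumption; bytes_per_elem=1 corresponds to fp8.
    """
    # Each token stores one K and one V vector per KV head per layer.
    per_token = 2 * full_attn_layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens


for ctx in (32_768, 131_072, 262_144, 1_048_576):
    print(f"{ctx:>9} tokens: {kv_cache_bytes(ctx) / 1e9:.2f} GB")
```

The linear-attention state is not included because it does not grow with sequence length.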
Quick Start
Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.
Footguns to avoid: Do NOT use --quantization turboquant (vLLM #41560). Do NOT use --tensor-parallel-size > 2.
262k Context — High Throughput (Recommended)
Native context, no rope scaling, maximum quality.
# vLLM listens on port 8000 inside the container by default; map host port 8080 to it
docker run --gpus device=0 -p 8080:8000 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Qwen3.6-35B-A3B-W8A16 \
--served-model-name qwen35 \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-seqs 16 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--generation-config vllm
1M Context — Long-Document / Agentic (YaRN ×4)
Requires the full rope_parameters block via --hf-overrides — vLLM does not synthesize YaRN config automatically for this architecture.
# vLLM listens on port 8000 inside the container by default; map host port 8080 to it
docker run --gpus device=0 -p 8080:8000 \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Qwen3.6-35B-A3B-W8A16 \
--served-model-name qwen35 \
--kv-cache-dtype fp8 \
--max-model-len 1048576 \
--max-num-seqs 2 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.97 \
--enable-chunked-prefill \
--enable-prefix-caching \
--hf-overrides '{"text_config":{"rope_parameters":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144,"rope_theta":10000000,"partial_rotary_factor":0.25,"mrope_interleaved":true,"mrope_section":[11,11,10]}}}' \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3
--max-num-batched-tokens 4096 is required at 1M context on an 80 GB GPU; higher values run out of memory.
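To avoid hand-editing the JSON, the --hf-overrides value can be generated programmatically. The sketch below only rebuilds the same rope_parameters block used in the command above and checks that the YaRN factor times the native context equals the requested --max-model-len. The helper name `yarn_overrides` is illustrative.

```python
import json

NATIVE_CONTEXT = 262_144  # native context length stated in this card
YARN_FACTOR = 4.0         # static YaRN scaling factor from the 1M command


def yarn_overrides(factor=YARN_FACTOR, native=NATIVE_CONTEXT):
    """Build the rope_parameters override block as a Python dict."""
    return {
        "text_config": {
            "rope_parameters": {
                "rope_type": "yarn",
                "factor": factor,
                "original_max_position_embeddings": native,
                "rope_theta": 10000000,
                "partial_rotary_factor": 0.25,
                "mrope_interleaved": True,
                "mrope_section": [11, 11, 10],
            }
        }
    }


max_model_len = int(NATIVE_CONTEXT * YARN_FACTOR)
overrides_json = json.dumps(yarn_overrides(), separators=(",", ":"))
print(max_model_len)
print(overrides_json)  # pass this string to --hf-overrides (single-quoted in a shell)
```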
Python Client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
model="qwen35",
messages=[{"role": "user", "content": "Your prompt here"}],
max_tokens=512,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
Recommended Sampling Parameters
| Mode | Temperature | Top-P | Top-K | Min-P | Use When |
|---|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code |
| Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response |
Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.
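One way to apply these presets from the OpenAI-compatible client is sketched below. It assumes the server accepts top_k and min_p as extra request fields, which is how vLLM's OpenAI-compatible server exposes them. `build_request` and `SAMPLING_PRESETS` are illustrative names, not part of any library.

```python
# Presets copied from the table above, keyed by whether thinking is enabled.
SAMPLING_PRESETS = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def build_request(prompt, thinking, model="qwen35", max_tokens=512):
    """Return keyword arguments for client.chat.completions.create()."""
    preset = SAMPLING_PRESETS[thinking]
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        # Fields outside the standard OpenAI schema go through extra_body.
        "extra_body": {
            "top_k": preset["top_k"],
            "min_p": preset["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


request = build_request("Summarize YaRN in two sentences.", thinking=False)
print(request)
```

With the client from the Python Client section, the call would be `client.chat.completions.create(**request)`.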
Quality
Targets
| Metric | Target |
|---|---|
| KL divergence KL(quant‖BF16) | < 0.005 |
| MMLU recovery vs BF16 | ≥ 99.7% |
| GSM8K-Platinum recovery vs BF16 | ≥ 99.7% |
| RULER@128k recovery vs BF16 | ≥ 99% |
These targets drove the recipe design — W8A16 instead of W8A8 to eliminate activation quantization error, full expert calibration to eliminate tail-expert noise, and mixed calibration corpus to preserve both instruction-following and long-context fidelity.
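For anyone running their own evals, the two kinds of metric in the table can be computed as sketched below. This is a minimal pure-Python illustration on toy logits, following the argument order KL(quant‖BF16) used in the table. A real evaluation would average the per-token KL over many positions of a held-out corpus. The function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_divergence(quant_logits, ref_logits):
    """KL(quant || reference) for a single token position, in nats."""
    log_p = log_softmax(quant_logits)
    log_q = log_softmax(ref_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def recovery_percent(quant_score, ref_score):
    """Benchmark recovery: quantized score as a percentage of the BF16 score."""
    return 100.0 * quant_score / ref_score


# Toy logits, not model output: a small perturbation gives a small KL.
reference = [2.0, 1.0, 0.5, -1.0]
quantized = [2.02, 0.98, 0.51, -1.01]
print(kl_divergence(quantized, reference))
print(recovery_percent(quant_score=79.8, ref_score=80.0))
```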
Full benchmark results will be added after publication. If you run evals, please open an issue or PR.
vs. Other Qwen3.6-35B-A3B Quants
No other publisher has released MMLU-Pro, GPQA, or RULER numbers for any Qwen3.6 quant. This is the complete published landscape as of May 2026:
| Quant | Format | Size | KL-mean | KL-max | PPL | Notes |
|---|---|---|---|---|---|---|
| 88plug W8A16 (this) | compressed-tensors | ~35 GB | < 0.005 (target) | — | — | vLLM native, GPU only |
| bartowski Q8_0 | GGUF | 37.8 GB | 0.0059 | 9.72 | 6.720 | llama.cpp |
| mudler APEX I-Balanced | GGUF imatrix | 24 GB | 0.0103 | 4.53 | 6.727 | llama.cpp |
| mudler APEX I-Quality | GGUF imatrix | 22 GB | 0.0141 | — | 6.735 | llama.cpp |
| RedHatAI NVFP4 | NVFP4 | ~20 GB | — | — | — | Blackwell-only |
| QuantTrio AWQ | AWQ | 24 GB | — | — | — | No benchmarks published |
| Qwen FP8 (official) | FP8 | ~38 GB | — | — | — | vLLM only |
Why compressed-tensors beats GGUF for GPU inference:
- Marlin INT8 kernel in vLLM is 30–50% faster than llama.cpp GGUF Q8_0 at batch > 1
- No CPU↔GPU weight transfer — weights stay on GPU, activations stay BF16
- RULER@128k and RULER@262k: zero published numbers from any competitor — we will be first
SGLang
SGLang v0.5.8 offers RadixAttention for prefix-heavy workloads. Run against the BF16 base model — compressed-tensors is vLLM-native only.
Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified. Confirm before production use.
docker run --gpus device=0 -p 30000:30000 \
lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-35B-A3B \
--tp 1 \
--mem-fraction-static 0.85 \
--host 0.0.0.0 \
--port 30000
llama.cpp (GGUF)
For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base checkpoint — not from compressed-tensors weights. Vision requires a separate mmproj GGUF (libmtmd).
# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)
# Convert base model (text trunk)
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
--outfile Qwen3.6-35B-A3B-BF16.gguf
# Vision projector (required for image input)
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
--mmproj --outfile Qwen3.6-35B-A3B-mmproj.gguf
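# Quantize text trunk
llama-quantize Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-Q8_0.gguf Q8_0
# IQ4_XS needs an importance matrix; build it from the calibration text first
llama-imatrix -m Qwen3.6-35B-A3B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-IQ4_XS.gguf IQ4_XS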
# Serve (text + vision)
llama-server \
--model Qwen3.6-35B-A3B-Q8_0.gguf \
--mmproj Qwen3.6-35B-A3B-mmproj.gguf \
--n-gpu-layers 999 \
--ctx-size 131072 \
--port 8081
Benchmarks
Results pending. Will be published before first HuggingFace release.
| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W8A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 8 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 1 | 128k | — | — | — | — |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | — | — | — | — |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
Limitations
Static YaRN for 1M context. The 1M serving command uses static YaRN scaling (factor=4.0) applied at inference time via --hf-overrides. This is not fine-tuned YaRN — it is a zero-cost extrapolation. Quality at the outermost context window (750k–1M tokens) may degrade relative to fine-tuned long-context variants. For critical long-context workloads, validate on your specific task before production deployment.
Linear attention state reset. The 30 Gated-DeltaNet layers maintain recurrent state, not a KV cache. This state is reset between requests. Stateful multi-turn inference within a session works correctly; cross-request state continuity is not supported by the current vLLM serving path.
No activation quantization. W8A16 means activations run in BF16. Memory bandwidth savings are on the weight side only; activation memory is unchanged versus BF16. This is a deliberate quality-vs-compression tradeoff.
Calibration distribution. Calibration used UltraChat-200k and WikiText-103. Tasks with significantly different token distributions (e.g., code-heavy, mathematical, or non-English) may see slightly higher KL divergence than the headline targets. Recalibration with domain-specific data is straightforward using the recipe below.
Quantization Recipe (Reproducibility)
# Core configuration
quantization_method = "autoround"
w_bits = 8
w_dtype = "int8"
a_dtype = "bf16" # activations NOT quantized
moe_calibrate_all_experts = True # every expert sees calibration data
calibration_dataset = {
"ultrachat_200k": 0.75,
"wikitext_103_raw": 0.25,
}
calibration_samples = 1024
calibration_seqlen = 2048
# BF16-preserved layers
skip_layers = [
"linear_attn.*", # Gated DeltaNet
"mlp.gate", # MoE router
"shared_expert", # always-active expert
"embed_tokens", # embedding
"lm_head", # output projection
]
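When adapting the recipe, it helps to check which modules a skip list would actually keep in BF16. The sketch below is an assumption-laden illustration: it translates the skip_layers entries into regular expressions (the recipe does not say how its tooling matches names), and the module names are hypothetical examples, not read from the checkpoint.

```python
import re

# Regex translation of the skip_layers list above (an assumption about matching).
SKIP_PATTERNS = [
    r"linear_attn\.",   # Gated DeltaNet sub-modules
    r"mlp\.gate$",      # MoE router only, not an expert's gate_proj
    r"shared_expert",   # always-active expert
    r"embed_tokens",    # embedding
    r"lm_head",         # output projection
]


def stays_bf16(module_name):
    """Return True if the module name matches any BF16-preserved pattern."""
    return any(re.search(pattern, module_name) for pattern in SKIP_PATTERNS)


# Hypothetical module names, for illustration only.
examples = [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.17.gate_proj",
    "model.layers.0.mlp.shared_expert.up_proj",
    "model.layers.3.self_attn.q_proj",
    "lm_head",
]
for name in examples:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'INT8'}")
```

Anchoring the router pattern with `$` is a deliberate choice here: an unanchored match on mlp.gate could also catch projection layers whose names merely start with "gate".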
Related Work
- Qwen/Qwen3.6-35B-A3B — base model
- AutoRound — sign gradient-based weight rounding optimization
- vLLM compressed-tensors — inference backend
- vLLM #40252 — Gated DeltaNet BF16 requirement
Citation
If you use this model, please cite the base model:
@misc{qwen3technicalreport,
title = {Qwen3 Technical Report},
author = {Qwen Team},
year = {2025},
url = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}
About
88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.
W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.
W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.
All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
Also available: Qwen3.6-35B-A3B-W4A16 (INT4, ~28 GB) · Qwen3.6-35B-A3B-W8A16 (INT8, ~35 GB)
Browse all releases → huggingface.co/88plug