Qwen3.6-35B-A3B-W8A16

INT8 post-training quantization of Qwen/Qwen3.6-35B-A3B — a vision-language model (images + video + text → text) with a hybrid Gated-DeltaNet + sparse-MoE architecture. 35 GB on disk. Runs 1M token context on one A100/H100.

Most published Qwen3.6 quants silently degrade because they skip the tail-expert calibration starvation problem. This quant addresses it directly.


What Makes This Different

The Problems with Typical MoE Quants

Published quantizations of Qwen3.6-35B-A3B tend to fail in two specific ways:

1. Long-context Q/K outliers. Rotary embeddings produce large-magnitude outliers in Q and K projections that standard quantization handles poorly. The effect is subtle at short context but compounds significantly beyond 32k tokens — exactly where this model's KV-light architecture is most valuable.

2. Tail-expert calibration starvation. With 256 experts per MoE layer and only 8 routed per token, a standard calibration pass visits the top experts hundreds of times and the tail experts zero times. Weights that were never calibrated get quantized with no signal. The result is random noise injected into low-frequency but high-stakes expert paths.

The Solutions Applied Here

AutoRound W8A16 — INT8 weights, BF16 activations. Activations are left in BF16. This eliminates the dominant source of quality loss in W8A8 methods. The accuracy cost of W8A16 versus BF16 is near-zero; AutoRound's sign-gradient rounding optimization produces significantly better per-weight calibration than GPTQ or AWQ.
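To make the W8A16 layout concrete, here is a minimal sketch of weight-only INT8 quantization. It uses plain round-to-nearest with one symmetric scale per output channel, which is not AutoRound (AutoRound additionally optimizes the rounding decisions). The function names are illustrative, and float32 stands in for BF16 because NumPy has no bfloat16 type.

```python
import numpy as np

def quantize_w8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization with round-to-nearest."""
    # One scale per output row; 127 is the largest symmetric int8 magnitude.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a16_matmul(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize the weights on the fly; the activations x are never quantized."""
    w_hat = q.astype(np.float32) * scale
    return x @ w_hat.T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)  # toy weight matrix
x = rng.standard_normal((4, 128)).astype(np.float32)   # toy activations

q, scale = quantize_w8_per_channel(w)
max_weight_err = np.abs(w - q.astype(np.float32) * scale).max()
reference = x @ w.T
rel_output_err = np.linalg.norm(w8a16_matmul(x, q, scale) - reference) / np.linalg.norm(reference)
print(f"max weight error: {max_weight_err:.5f}")
print(f"relative output error: {rel_output_err:.5f}")
```

With round-to-nearest, each weight is off by at most half a quantization step of its channel, so the only error source is the weight grid itself.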

moe_calibrate_all_experts=True — every expert sees calibration data. Calibration tokens are routed through all 256 experts regardless of the router's natural selection. Tail experts that would otherwise be calibrated on zero samples get full signal. This is the single most impactful fix for MoE quantization quality and is rarely applied in public quants.
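The idea can be illustrated with a toy MoE layer. This is a sketch under assumptions, not the quantization tool's actual implementation: the function `moe_forward`, the shapes, and the skewed router are all made up for illustration. In all-experts mode every token is pushed through every expert so that calibration hooks would observe the full batch, while the layer output is still built only from the router's top-k choice.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, top_k, calibrate_all_experts=False):
    """Toy MoE layer. Returns (output, tokens_seen_per_expert)."""
    num_experts = len(expert_ws)
    logits = x @ router_w                                  # (tokens, experts)
    top_idx = np.argsort(-logits, axis=1)[:, :top_k]       # routed experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    gates = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)              # softmax over the top-k

    out = np.zeros_like(x)
    seen = np.zeros(num_experts, dtype=int)
    for e in range(num_experts):
        token_ids, slot = np.nonzero(top_idx == e)
        if calibrate_all_experts:
            # Every token passes through the expert, so its weights see the full batch.
            expert_out = x @ expert_ws[e]
            seen[e] = x.shape[0]
            routed_out = expert_out[token_ids]
        else:
            # Natural routing: the expert only sees tokens the router sent to it.
            seen[e] = token_ids.size
            routed_out = x[token_ids] @ expert_ws[e]
        # Only routed tokens contribute to the layer output in either mode.
        out[token_ids] += gates[token_ids, slot][:, None] * routed_out
    return out, seen

rng = np.random.default_rng(0)
tokens, dim, num_experts, top_k = 64, 16, 32, 2
x = rng.standard_normal((tokens, dim))
# Scale router columns so later experts are picked less often (a made-up skew).
router_w = rng.standard_normal((dim, num_experts)) * np.linspace(2.0, 0.1, num_experts)
expert_ws = rng.standard_normal((num_experts, dim, dim))

out_routed, seen_routed = moe_forward(x, router_w, expert_ws, top_k)
out_all, seen_all = moe_forward(x, router_w, expert_ws, top_k, calibrate_all_experts=True)
print("tokens seen per expert, natural routing:  ", seen_routed)
print("tokens seen per expert, all-experts mode:", seen_all)
```

The two modes produce the same layer output; only the amount of data each expert observes during calibration changes.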

Mixed calibration corpus. 75% UltraChat-200k (instruction-following fidelity) + 25% WikiText-103 (long-context fidelity). 1,024 samples at 2,048 tokens each. Text-only calibration data.
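As a quick sanity check on the calibration budget, the arithmetic below works out the per-source sample counts and the token volume per MoE layer. It assumes every sample fills the full 2,048-token window; the variable names mirror the recipe but are otherwise illustrative.

```python
calibration_samples = 1024
calibration_seqlen = 2048
mix = {"ultrachat_200k": 0.75, "wikitext_103_raw": 0.25}
num_experts, routed_per_token = 256, 8

# Split the sample budget according to the corpus mix.
samples_per_source = {name: round(calibration_samples * frac) for name, frac in mix.items()}

# Total calibration tokens, assuming every sample is exactly calibration_seqlen long.
total_tokens = calibration_samples * calibration_seqlen

# Average tokens per expert under natural top-8 routing if routing were uniform.
mean_tokens_per_expert = total_tokens * routed_per_token / num_experts

print(samples_per_source)      # {'ultrachat_200k': 768, 'wikitext_103_raw': 256}
print(total_tokens)            # 2097152
print(mean_tokens_per_expert)  # 65536.0
```

The 65,536 figure is only an average under uniform routing. Real routing is skewed, so tail experts receive far less than that unless every expert is forced to see all 2,097,152 tokens.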

What Stays at BF16

Layer | Reason
linear_attn.* | Gated DeltaNet; must stay BF16 per vLLM #40252
mlp.gate router weights | Argmax-critical; quantization noise corrupts routing
shared_expert | Always-active; no routing protection, high impact
Embedding + LM head | Standard practice; disproportionate perplexity impact

Vision calibration note: Calibration corpus is text-only. The vision encoder (ViT) receives RTN-style INT8 quantization with no calibration signal, which is near-lossless at 8-bit. Text quality is fully calibrated; vision quality is RTN INT8.


Architecture Notes

Qwen3.6-35B-A3B is a vision-language model built on a hybrid Gated-DeltaNet + sparse-MoE backbone. Key characteristics relevant to serving:

  • Modalities: Image + Video + Text → Text (pipeline tag: image-text-to-text)
  • Vision encoder: 27-layer ViT, patch_size=16, spatial_merge_size=2, hidden_size=1152
  • 40 LLM layers: 10 full-attention (with KV cache) + 30 Gated DeltaNet linear-attention
  • 256 experts per MoE layer: 8 routed + 1 shared always-active
  • 2 GQA KV heads on full-attention layers only
  • KV cache exists only for 10/40 LLM layers — dramatically lower KV memory than standard models

At 1M token context with fp8 KV cache, KV memory is approximately 5 GB — versus 80+ GB for a comparable dense model. The quantization preserves this architectural advantage while halving weight memory.


Memory Requirements

Configuration | BF16 | This Quant (W8A16)
Weights (disk/VRAM) | ~70 GB | ~35 GB
KV cache @ 32k ctx (fp8) | ~0.2 GB | ~0.2 GB
KV cache @ 128k ctx (fp8) | ~0.6 GB | ~0.6 GB
KV cache @ 262k ctx (fp8) | ~1.3 GB | ~1.3 GB
KV cache @ 1M ctx (fp8) | ~5.0 GB | ~5.0 GB
Total VRAM @ 262k ctx | ~72 GB | ~37 GB
Total VRAM @ 1M ctx | ~75 GB | ~40 GB
Minimum GPU | 2× A100 80GB | 1× A100/H100 80GB

KV cache figures are for the 10 full-attention layers only (GQA, 2 KV heads). Linear-attention layers carry state in a fixed recurrent buffer independent of sequence length.
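The KV figures can be reproduced with a short calculation. This is a sketch: the layer count, KV head count, and one byte per fp8 value come from this card, but head_dim=128 is an assumption that is not stated here (it happens to land close to the table), so check the model's config.json before relying on it.

```python
def kv_cache_gib(context_tokens, full_attn_layers=10, kv_heads=2,
                 head_dim=128, bytes_per_value=1):
    """KV cache size in GiB for the full-attention layers only.

    head_dim=128 is an assumed value; bytes_per_value=1 corresponds to fp8.
    """
    # K and V each store kv_heads * head_dim values per token per layer.
    bytes_per_token = 2 * full_attn_layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30

for ctx in (32_768, 131_072, 262_144, 1_048_576):
    print(f"{ctx:>9} tokens: {kv_cache_gib(ctx):.2f} GiB")
```

Because only 10 of the 40 layers hold a KV cache and each has 2 KV heads, the per-token cost under these assumptions is about 5 KiB, which is why the cache stays in single-digit gigabytes even at 1M tokens.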


Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

Footguns to avoid: Do NOT use --quantization turboquant (vLLM #41560). Do NOT use --tensor-parallel-size > 2.

262k Context — High Throughput (Recommended)

Native context, no rope scaling, maximum quality.

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-35B-A3B-W8A16 \
  --served-model-name qwen35 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 16 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --generation-config vllm

1M Context — Long-Document / Agentic (YaRN ×4)

Requires the full rope_parameters block via --hf-overrides — vLLM does not synthesize YaRN config automatically for this architecture.

docker run --gpus device=0 -p 8080:8080 \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-35B-A3B-W8A16 \
  --served-model-name qwen35 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 1048576 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.97 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --hf-overrides '{"text_config":{"rope_parameters":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144,"rope_theta":10000000,"partial_rotary_factor":0.25,"mrope_interleaved":true,"mrope_section":[11,11,10]}}}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

--max-num-batched-tokens 4096 is required at 1M context on an 80 GB GPU; higher values run out of memory.
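Hand-writing the --hf-overrides JSON inside shell quotes is easy to get wrong. One option, sketched below, is to build the same block in Python and let json and shlex handle the quoting. The values are copied from the command above; the constant names are illustrative.

```python
import json
import shlex

NATIVE_CONTEXT = 262_144   # original_max_position_embeddings from the command above
YARN_FACTOR = 4.0

overrides = {
    "text_config": {
        "rope_parameters": {
            "rope_type": "yarn",
            "factor": YARN_FACTOR,
            "original_max_position_embeddings": NATIVE_CONTEXT,
            "rope_theta": 10000000,
            "partial_rotary_factor": 0.25,
            "mrope_interleaved": True,
            "mrope_section": [11, 11, 10],
        }
    }
}

# The extended window is the native window multiplied by the YaRN factor.
max_model_len = int(NATIVE_CONTEXT * YARN_FACTOR)

# Compact JSON, then shell-quote it so it survives as a single argument.
hf_overrides_arg = shlex.quote(json.dumps(overrides, separators=(",", ":")))
print(f"--max-model-len {max_model_len} --hf-overrides {hf_overrides_arg}")
```

The printed flags can be pasted into the docker command in place of the hand-written ones.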

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen35",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)

Recommended Sampling Parameters

Mode | Temperature | Top-P | Top-K | Min-P | Use When
Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code
Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response

Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.
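The presets can be wired into the client call as sketched below. The helper `build_request` and the `SAMPLING_PRESETS` dict are illustrative names, not part of any library. top_k and min_p are not standard OpenAI parameters, so they are passed through extra_body, which vLLM's OpenAI-compatible server accepts as extra sampling parameters.

```python
# Presets copied from the table above, keyed by enable_thinking.
SAMPLING_PRESETS = {
    True:  {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

def build_request(prompt, enable_thinking, model="qwen35", max_tokens=512):
    """Return keyword arguments for client.chat.completions.create()."""
    preset = SAMPLING_PRESETS[enable_thinking]
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        # Non-OpenAI fields travel in extra_body.
        "extra_body": {
            "top_k": preset["top_k"],
            "min_p": preset["min_p"],
            "chat_template_kwargs": {"enable_thinking": enable_thinking},
        },
    }

request = build_request("Prove that 17 is prime.", enable_thinking=True)
print(request["temperature"], request["top_p"], request["extra_body"])
```

With the client from the previous section, the call becomes client.chat.completions.create(**request).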


Quality

Targets

Metric | Target
KL divergence | KL(quant‖BF16) < 0.005
MMLU | recovery vs BF16 ≥ 99.7%
GSM8K-Platinum | recovery vs BF16 ≥ 99.7%
RULER@128k | recovery vs BF16 ≥ 99%

These targets drove the recipe design — W8A16 instead of W8A8 to eliminate activation quantization error, full expert calibration to eliminate tail-expert noise, and mixed calibration corpus to preserve both instruction-following and long-context fidelity.
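For anyone running their own evals, the two target metrics can be computed as sketched below. This is an assumed formulation, since the card does not spell out the exact evaluation script: KL is taken per token between the quantized and BF16 next-token distributions and averaged, and recovery is the quantized score as a percentage of the BF16 score. The logits and scores in the demo are synthetic placeholders, not measurements.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_token_kl(quant_logits, ref_logits):
    """Mean over tokens of KL(quant || ref), in nats."""
    log_q = log_softmax(quant_logits)
    log_p = log_softmax(ref_logits)
    return float((np.exp(log_q) * (log_q - log_p)).sum(axis=-1).mean())

def recovery_pct(quant_score, ref_score):
    """Quantized benchmark score as a percentage of the reference score."""
    return 100.0 * quant_score / ref_score

# Synthetic stand-ins: 8 token positions over a 1000-entry vocabulary.
rng = np.random.default_rng(0)
ref_logits = rng.standard_normal((8, 1000))
quant_logits = ref_logits + 0.01 * rng.standard_normal((8, 1000))

kl = mean_token_kl(quant_logits, ref_logits)
print(f"mean KL(quant || ref): {kl:.6f}")
print(f"recovery for placeholder scores 79.8 vs 80.0: {recovery_pct(79.8, 80.0):.2f}%")
```

In a real evaluation both sets of logits would come from running the same prompts through the quantized and BF16 checkpoints.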

Full benchmark results will be added after publication. If you run evals, please open an issue or PR.

vs. Other Qwen3.6-35B-A3B Quants

No other publisher has released MMLU-Pro, GPQA, or RULER numbers for any Qwen3.6 quant. This is the complete published landscape as of May 2026:

Quant | Format | Size | KL-mean | KL-max | PPL | Notes
88plug W8A16 (this) | compressed-tensors | ~35 GB | < 0.005 (target) | - | - | vLLM native, GPU only
bartowski Q8_0 | GGUF | 37.8 GB | 0.0059 | 9.72 | 6.720 | llama.cpp
mudler APEX I-Balanced | GGUF imatrix | 24 GB | 0.0103 | 4.53 | 6.727 | llama.cpp
mudler APEX I-Quality | GGUF imatrix | 22 GB | 0.0141 | - | 6.735 | llama.cpp
RedHatAI NVFP4 | NVFP4 | ~20 GB | - | - | - | Blackwell-only
QuantTrio AWQ | AWQ | 24 GB | - | - | - | No benchmarks published
Qwen FP8 (official) | FP8 | ~38 GB | - | - | - | vLLM only

Why compressed-tensors beats GGUF for GPU inference:

  • Marlin INT8 kernel in vLLM is 30–50% faster than llama.cpp GGUF Q8_0 at batch > 1
  • No CPU↔GPU weight transfer — weights stay on GPU, activations stay BF16
  • RULER@128k and RULER@262k: zero published numbers from any competitor — we will be first

SGLang

SGLang v0.5.8 offers RadixAttention for prefix-heavy workloads. Run against the BF16 base model — compressed-tensors is vLLM-native only.

Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified. Confirm before production use.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp (GGUF)

For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base checkpoint — not from compressed-tensors weights. Vision requires a separate mmproj GGUF (libmtmd).

# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)

# Convert base model (text trunk)
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
  --outfile Qwen3.6-35B-A3B-BF16.gguf

# Vision projector (required for image input)
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
  --mmproj --outfile Qwen3.6-35B-A3B-mmproj.gguf

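# Quantize text trunk
llama-quantize Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-Q8_0.gguf Q8_0

# IQ4_XS needs an importance matrix, computed from the calibration text first
llama-imatrix -m Qwen3.6-35B-A3B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-IQ4_XS.gguf IQ4_XS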

# Serve (text + vision)
llama-server \
  --model Qwen3.6-35B-A3B-Q8_0.gguf \
  --mmproj Qwen3.6-35B-A3B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 131072 \
  --port 8081

Benchmarks

Results pending. Will be published before first HuggingFace release.

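Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM
vLLM v0.21.0 | W8A16 | 1 | 32k | - | - | - | -
vLLM v0.21.0 | W8A16 | 8 | 32k | - | - | - | -
vLLM v0.21.0 | W8A16 | 1 | 128k | - | - | - | -
SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | - | - | - | -
llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | - | - | - | -
llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | - | - | - | -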

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


Limitations

Static YaRN for 1M context. The 1M serving command uses static YaRN scaling (factor=4.0) applied at inference time via --hf-overrides. This is not fine-tuned YaRN — it is a zero-cost extrapolation. Quality at the outermost context window (750k–1M tokens) may degrade relative to fine-tuned long-context variants. For critical long-context workloads, validate on your specific task before production deployment.

Linear attention state reset. The 30 Gated-DeltaNet layers maintain recurrent state, not a KV cache. This state is reset between requests. Stateful multi-turn inference within a session works correctly; cross-request state continuity is not supported by the current vLLM serving path.

No activation quantization. W8A16 means activations run in BF16. Memory bandwidth savings are on the weight side only; activation memory is unchanged versus BF16. This is a deliberate quality-vs-compression tradeoff.

Calibration distribution. Calibration used UltraChat-200k and WikiText-103. Tasks with significantly different token distributions (e.g., code-heavy, mathematical, or non-English) may see slightly higher KL divergence than the headline targets. Recalibration with domain-specific data is straightforward using the recipe below.


Quantization Recipe (Reproducibility)

# Core configuration
quantization_method = "autoround"
w_bits = 8
w_dtype = "int8"
a_dtype = "bf16"   # activations NOT quantized

moe_calibrate_all_experts = True  # every expert sees calibration data

calibration_dataset = {
    "ultrachat_200k": 0.75,
    "wikitext_103_raw": 0.25,
}
calibration_samples = 1024
calibration_seqlen = 2048

# BF16-preserved layers
skip_layers = [
    "linear_attn.*",     # Gated DeltaNet
    "mlp.gate",          # MoE router
    "shared_expert",     # always-active expert
    "embed_tokens",      # embedding
    "lm_head",           # output projection
]
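To check which modules a skip list like this would keep in BF16, a small matcher is enough. This is a sketch under assumptions: how the quantization tool actually interprets these patterns is not documented here, so the matcher treats each pattern as a run of whole dotted name components (with a trailing ".*" meaning "and everything beneath"). The module names in the demo are hypothetical examples, not read from the checkpoint.

```python
import re

skip_layers = ["linear_attn.*", "mlp.gate", "shared_expert", "embed_tokens", "lm_head"]

def stays_bf16(module_name, patterns=skip_layers):
    """True if module_name contains a skip pattern as whole dotted components."""
    for pattern in patterns:
        # "linear_attn.*" covers linear_attn and all of its children.
        prefix = pattern[:-2] if pattern.endswith(".*") else pattern
        # Anchor on component boundaries so "mlp.gate" does not match "gate_proj".
        if re.search(rf"(^|\.){re.escape(prefix)}(\.|$)", module_name):
            return True
    return False

# Hypothetical module names for illustration only.
examples = [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert.up_proj",
    "model.layers.0.mlp.experts.7.gate_proj",
    "model.layers.3.self_attn.q_proj",
    "lm_head",
]
for name in examples:
    print(f"{name}: {'BF16' if stays_bf16(name) else 'INT8'}")
```

Matching on component boundaries is a deliberate choice: a plain substring or regex search on "mlp.gate" would also catch names such as gate_proj inside the routed experts, which are meant to be quantized.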


Citation

If you use this model, please cite the base model:

@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Qwen3.6-35B-A3B-W4A16 (INT4, ~28 GB) · Qwen3.6-35B-A3B-W8A16 (INT8, ~35 GB)

Browse all releases → huggingface.co/88plug
