GLM 4.7 (EXL3 Quants)

⚠️ Proper support for GLM-4.7 reasoning requires PR #295 in TabbyAPI; see my modified Dockerfile below.

This repo contains:

  • base quants (2, 3, 4, 5, 6, 8 bits) for Exllamav3 (using SOTA random Hadamard transforms and Trellis quantization for high-quality reconstruction)
  • layer- and tensor-level KL-divergence measurements for bit-allocation optimization given a target size
  • theoretical research related to quantization, in particular MoE quantization

Motivation

The goals are:

  • to provide the best possible quants for what is arguably the top general model of 2025
  • to serve as a reference for quantization strategies (as of 2025 knowledge)

The base model is 355B parameters, which when 4-bit quantized should take about 177 GB (~165 GiB), leaving almost 20GB for context, a perfect situation when you have 192 GB of VRAM (i.e. 8x 3090/4090, 6x 5090, 4x RTX A6000, 4x RTX 6000 Ada or 2x RTX Pro 6000 Blackwell). Too bad all the 4-bit quants for my usual framework of choice, vllm, start at 191~200 GiB of VRAM (AWQ / GPTQ / NVFP4 are actually ~4.5 bpw).
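
For reference, a minimal sketch of the weights-only arithmetic behind these numbers (it ignores quantization format overhead such as scales and the output head, so real file sizes differ slightly):

```python
def weights_size(params: float, bpw: float) -> tuple[float, float]:
    """Return the (GB, GiB) footprint of `params` weights stored at `bpw` bits per weight."""
    size_bytes = params * bpw / 8
    return size_bytes / 1e9, size_bytes / 2**30

gb, gib = weights_size(355e9, 4.0)
print(f"355B @ 4.0 bpw ~ {gb:.0f} GB ~ {gib:.0f} GiB")  # ~ 178 GB ~ 165 GiB
```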

So while looking for a new backend that could leverage tensor parallelism, I landed on Exllamav3. Even better, it already had the proper tools in place to fully quantize Mixture-of-Experts (MoE) models, unlike vllm/llmcompressor, which requires extra code to ensure all experts are activated (otherwise their activations might be quantized away as unimportant if your calibration dataset is not comprehensive).

Visual showcase of why ensuring quantization of all MoE experts is important


Artifacts

Base Quants

The base quants use the new "MCG" multiplier from https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-3395345415

| Quant | Size | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|-------|------|----------------------|----------------------|------------|-------|-------|-------|-------|-------|
| 2bpw-H6 | 83 GiB | 0.65096196 | 0.75914080 | 9.36106675 | 0.7315 | 0.3852 | 0.1653 | 0.0628 | 0.0221 |
| 3bpw-H6 | 124 GiB | 0.27578034 | 0.28499938 | 6.95262863 | 0.8388 | 0.5717 | 0.3306 | 0.1713 | 0.0805 |
| 4bpw-H6 | 165 GiB | 0.13722391 | 0.13577676 | 6.60474035 | 0.8947 | 0.6948 | 0.4810 | 0.3007 | 0.1754 |
| 5bpw-H8 | 206 GiB | 0.10889671 | 0.10216227 | 6.41035355 | 0.9168 | 0.7520 | 0.5609 | 0.3905 | 0.2481 |
| 6bpw-H8 | 247 GiB | 0.08202591 | 0.0784423 | 6.32611481 | 0.9334 | 0.7951 | 0.6274 | 0.4597 | 0.3190 |
| 8bpw-H8 | 328 GiB | 0.07552261 | 0.07230427 | 6.38240525 | 0.9396 | 0.8172 | 0.6598 | 0.5048 | 0.3666 |
| FP16 | 656 GiB | | | 6.49784813 | | | | | |
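
For reference, a minimal sketch of how the two KL-divergence directions and a top-k agreement metric can be computed from per-token logits. This is not the exllamav3 eval script, and it assumes top-k agreement means both models produce the same ordered top-k token ids; exllamav3's exact definition may differ:

```python
import torch
import torch.nn.functional as F

def compare_logits(logits_q: torch.Tensor, logits_ref: torch.Tensor, k: int = 5):
    """logits_*: [num_tokens, vocab_size] logits from the quantized and FP16 models."""
    logp_q = F.log_softmax(logits_q.float(), dim=-1)
    logp_r = F.log_softmax(logits_ref.float(), dim=-1)
    # KL(quant, FP16) and KL(FP16, quant), averaged over token positions
    kl_q_r = (logp_q.exp() * (logp_q - logp_r)).sum(dim=-1).mean()
    kl_r_q = (logp_r.exp() * (logp_r - logp_q)).sum(dim=-1).mean()
    # Fraction of positions where the ordered top-k token ids match exactly
    same_topk = (logits_q.topk(k).indices == logits_ref.topk(k).indices).all(dim=-1)
    return kl_q_r.item(), kl_r_q.item(), same_topk.float().mean().item()
```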

Optimized Quants

🛈 Despite its higher KL-divergence, even the 2.10bpw quant looks quite smart for creative writing
(a brief test on a scenario with 1 narrator and 6 leads).

🛈 HuggingFace reports file sizes in GB while VRAM is measured in GiB; there is a factor (1024/1000)³ ≈ 1.0737 between the two.

| Quant | Size | Context / VRAM | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|-------|------|----------------|----------------------|----------------------|------------|-------|-------|-------|-------|-------|
| 2.10bpw-tuned🂱 | 86 GiB | 131072 tokens, k5v4, for 96 GiB VRAM | 0.54398251 | 0.61162654 | 7.15544606 | 0.7584 | 0.4237 | 0.1948 | 0.0801 | 0.0306 |
| 3.15bpw-tuned🂱 | 129 GiB | 102400 tokens, k5v4, for 144 GiB VRAM | 0.21854555 | 0.21465828 | 6.35729832 | 0.8573 | 0.6119 | 0.3776 | 0.2107 | 0.1071 |
| 3.84bpw-tuned🂱 | 158 GiB | 202752 tokens (max), k6v5, for 192 GiB VRAM | 0.15823333 | 0.15401253 | 6.41935951 | 0.8854 | 0.6743 | 0.4587 | 0.2832 | 0.1638 |
  • "opt🂡" for automatically optimized quants
  • "tuned🂱" for hand-tuned quants

They can be downloaded with the HuggingFace CLI using the following command:

hf download mratsim/GLM-4.7-EXL3 --revision 3.84bpw-tuned --local-dir /path/to/your/models/directory

Unfortunately, as of December 2025, automatically optimized quants are not able to beat hand-tuned heuristics and research-based mixed-precision quantization, for (I suspect) one or both of the following reasons:

  • The optimization algorithm does no backtracking, i.e. it is single-pass and does not compare the current layer's importance with that of past layers (see the toy sketch after this list).
  • It does not take synergies into account. Just like LLMs have emergent properties with scale, it might be that up-quantizing certain projections together significantly improves KL-divergence, even if each individual up-quant looks like noise when measured on its own.
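
A toy illustration of the first point (this is not the actual optimize.py algorithm, and the tensor names and numbers below are made up): a single-pass greedy allocator spends its budget on whichever tensor currently offers the best measured KL reduction per extra GiB and never revisits earlier choices, so it cannot capture cross-layer synergies.

```python
# Hypothetical per-tensor measurements: (extra GiB, KL-div reduction) when
# up-quantizing that tensor by one step. Purely illustrative values.
candidates = {
    "blk0.self_attn":  (1.2, 0.020),
    "blk0.shared_exp": (0.8, 0.015),
    "blk1.routed_exp": (3.0, 0.012),
    "blk2.down_proj":  (1.5, 0.010),
}

def greedy_upquant(candidates: dict[str, tuple[float, float]], budget_gib: float) -> list[str]:
    """Single-pass greedy: pick the best KL reduction per GiB until the budget is spent."""
    chosen = []
    ranked = sorted(candidates.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    for name, (cost, gain) in ranked:
        if cost <= budget_gib:
            chosen.append(name)
            budget_gib -= cost
    return chosen

print(greedy_upquant(candidates, budget_gib=3.0))  # ['blk0.shared_exp', 'blk0.self_attn']
```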

Modified Dockerfile

You can build a custom tabbyAPI image that integrates PR #295:

Dockerfile
# Use an official CUDA runtime with Ubuntu as a parent image
FROM nvidia/cuda:12.8.1-runtime-ubuntu24.04

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    ca-certificates \
    python3.12 \
    python3-pip \
    python3.12-venv \
    git \
    && rm -rf /var/lib/apt/lists/*

# Create a virtual environment
RUN python3 -m venv /opt/venv

# Activate the venv and set the PATH
ENV PATH="/opt/venv/bin:$PATH"

# Upgrade pip
RUN pip install --no-cache-dir --upgrade pip

# Set the working directory in the container
WORKDIR /app

# Clone tabbyAPI repository
RUN git clone https://github.com/theroyallab/tabbyAPI.git /app

# Configure git user (required for merge)
RUN git config --global user.email "docker@example.com" && \
    git config --global user.name "Docker Build"

# Fetch and merge PR #295
RUN git fetch origin pull/295/head:pr-295 && \
    git merge --strategy-option theirs pr-295

# Install tabbyAPI with the cu12 and extras dependency groups from pyproject.toml
RUN pip install --no-cache-dir ".[cu12,extras]"

# Make port 5000 available to the world outside this container
EXPOSE 5000

# Set the entry point
ENTRYPOINT ["python3"]

# Run main.py when the container launches
CMD ["main.py"]

Detailed measurements of KL-div improvements

Exllamav3 offers tools to measure per-layer (with `-l2`) or even per-tensor (with `-l3`) contributions to KL-divergence improvements. These measurements may take 2 to 5 hours when comparing 2 quants, up to 12 hours when comparing 3 quants, and up to 24 hours of compute when comparing all quants.

The resulting JSON file can be fed to https://github.com/turboderp-org/exllamav3/blob/v0.0.14/util/optimize.py with a target bpw to output an optimized quant.

Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and is less likely to overfit the calibration set. Having shared experts or self_attn layers use 6-bit or even 8-bit weights provides a very large improvement in KL-divergence. Even a measurement over all available quants currently doesn't match the manual tuning results.

Quantization theory and heuristics for manual tuning


Layers to quantize

Quantization should be focused on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:

LayerNorm in Quantization. Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.

This is also reported in Intel's and NVIDIA's quantization repositories. In any case, EXL3 can only quantize linear layers.
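
To see concretely what that covers, here is a minimal sketch that builds the model skeleton on the meta device (no weights allocated) and lists which modules are Linear, and therefore quantizable by EXL3, versus the norm layers that stay in full precision. It assumes the base checkpoint `zai-org/GLM-4.7` instantiates cleanly through transformers:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("zai-org/GLM-4.7", trust_remote_code=True)
with torch.device("meta"):  # build the module tree without allocating 656 GiB of weights
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

linear_names, norm_names = [], []
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        linear_names.append(name)        # quantizable by EXL3
    elif "norm" in type(module).__name__.lower():
        norm_names.append(name)          # LayerNorm/RMSNorm: keep in FP16/BF16

print(f"{len(linear_names)} Linear modules, {len(norm_names)} norm modules left unquantized")
print(linear_names[:5])
```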

Tensors to up-quantize

If there are enough bits, down projections (`down_proj`) should be prioritized for up-quantization.

According to [4]

Fig. 3: Maximum absolute value over layers for a LLaMA3-8B. Each color represent a different projection and we clearly see that down_proj has the biggest spikes in input and output. We also observe that RMSNorm propagate spikes through the entire model

According to [5]

Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting that weight outliers are concentrated in the down-projection matrices Wdown ℓ of the second layer and the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last two layers.
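
The weight-outlier side of this (the observation from [5]) can be checked directly on a checkpoint by recording the largest absolute weight per projection type. A minimal sketch, assuming safetensors shards and the usual q/k/v/o/gate/up/down projection naming; the shard path in the usage comment is a placeholder:

```python
from collections import defaultdict
from safetensors import safe_open

PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj")

def scan_shard(path: str, max_abs=None):
    """Record the largest absolute weight seen per projection type in one safetensors shard."""
    max_abs = max_abs if max_abs is not None else defaultdict(float)
    with safe_open(path, framework="pt") as f:
        for key in f.keys():
            for proj in PROJECTIONS:
                if proj in key:
                    max_abs[proj] = max(max_abs[proj], f.get_tensor(key).abs().max().item())
    return max_abs

# max_abs = scan_shard("model-00001-of-000XX.safetensors")  # repeat for every shard
# for proj, v in sorted(max_abs.items(), key=lambda kv: -kv[1]):
#     print(f"{proj:10s} max|w| = {v:.2f}")
```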

Mixture-of-Experts quantization (MoE)

Mixture-of-Experts models require specific quantization techniques.

Mixed-precision quantization

Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers results in larger gains than spending them on FFN layers. According to [3], for 2-bit quantization:

  • quantizing expert FFN layers does not seriously impact model quality
  • quantizing cross-attention has some impact
  • quantizing self-attention has a large impact
  • quantizing dense FFN layers has a very significant impact

Hence, to preserve model quality, we should choose not to quantize (or only lightly quantize) dense FFN layers and self-attention layers.


Layers with high-impact

According to [2], giving more bits to the first k blocks has a significantly higher impact on model quality than giving the same bits to the last k blocks.
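
One way to express these heuristics in code is a small list of pattern → bit-width rules. The patterns and bit-widths below are illustrative assumptions, not the exact recipe of any specific quant in this repo:

```python
import re

# Heuristic bit allocation: attention, shared experts and early blocks get more
# bits; routed expert FFNs absorb most of the compression.
RULES = [
    (r"\.self_attn\.",             6),  # self-attention: large quality impact
    (r"\.shared_experts\.",        8),  # shared experts: always active, cheap to up-quant
    (r"layers\.[0-2]\.mlp\.",      6),  # first k blocks matter more [2]
    (r"\.experts\.\d+\.down_proj", 4),  # routed experts: prioritize down_proj if bits remain
    (r"\.experts\.\d+\.",          3),  # remaining routed expert projections
]

def bits_for(tensor_name: str, default: int = 4) -> int:
    """Return the bit-width for a tensor; the first matching rule wins."""
    for pattern, bits in RULES:
        if re.search(pattern, tensor_name):
            return bits
    return default

print(bits_for("model.layers.10.self_attn.o_proj.weight"))         # 6
print(bits_for("model.layers.10.mlp.experts.7.down_proj.weight"))  # 4
```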

Expert quantization

When quantizing MoE models, calibrating activations is tricky because only a subset of experts is activated per token.

EXL3 has the tooling in place to ensure all experts are activated during quantization, though it is unclear whether the calibration dataset should also be expanded to be diverse enough that all experts have a high likelihood of seeing the full range of values they can exhibit, to avoid clipping.
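
One way to sanity-check that coverage is to count how many calibration tokens each routed expert actually processes. Below is a minimal sketch using PyTorch forward hooks; it assumes a Mixtral/GLM-style implementation where each expert module is called only with the hidden states routed to it, and the `.experts.<i>.down_proj` module naming is an assumption that depends on the model implementation:

```python
from collections import Counter

def attach_expert_counters(model) -> Counter:
    """Attach hooks that count how many tokens reach each routed expert."""
    counts = Counter()

    def make_hook(name):
        def hook(module, inputs, output):
            hidden = inputs[0]          # assumed [routed_tokens, hidden_dim]
            counts[name] += hidden.shape[0]
        return hook

    for name, module in model.named_modules():
        # One hook per expert is enough; down_proj exists in every routed expert MLP
        if ".experts." in name and name.endswith("down_proj"):
            counts[name] = 0
            module.register_forward_hook(make_hook(name))
    return counts

# After running the calibration data through the model:
# dead_experts = [name for name, n in counts.items() if n == 0]
```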

References

  1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)
    Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
    https://arxiv.org/pdf/2506.12044

  2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)
    Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen
    https://arxiv.org/pdf/2406.08155v1

  3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)
    Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla
    https://arxiv.org/pdf/2310.02410

  4. Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)
    Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello
    https://arxiv.org/pdf/2504.21553

  5. Systematic Outliers in Large Language Models (2025)
    Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
    https://arxiv.org/pdf/2502.06415v2
