This repo contains specialized MoE quants for GLM-4.7. The idea is that, because the FFN expert tensors are so large compared to the rest of the model, better quality should be achievable at a smaller overall model size than a comparable naive quantization. To that end, the default quantization type is kept at high quality (Q8_0 to Q5_K), while the FFN_UP and FFN_GATE conditional expert tensors are quantized more aggressively, along with the FFN_DOWN tensors (which are kept one quant level higher in each mix).

The mixture naming convention is [Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN], e.g. Q8_0-Q4_K-Q4_K-Q5_K; a sketch of the corresponding per-tensor overrides follows the list below. This means:

  • Q8_0 is the default type (attention, shared expert, etc.)
  • Q4_K was used for the FFN_UP and FFN_GATE conditional expert tensors
  • Q5_K was used for the FFN_DOWN conditional expert tensors
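
As an illustration of how such a mix could be expressed, the sketch below decomposes a mixture name into per-tensor overrides for llama.cpp's llama-quantize. The `--tensor-type` override flag and the MoE expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, `ffn_down_exps`) are assumptions about a recent llama.cpp build, not the exact commands used to produce this repo's files.

```python
# Sketch: turn "[Default]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN]" into llama-quantize args.
# The --tensor-type flag and expert tensor names are assumptions (recent llama.cpp).

def split_mixture(mixture: str) -> tuple[str, list[str]]:
    """Return (default_type, override_args) for a mixture name like 'Q8_0-Q4_K-Q4_K-Q5_K'."""
    default, ffn_up, ffn_gate, ffn_down = mixture.split("-")
    overrides = [
        "--tensor-type", f"ffn_up_exps={ffn_up}",
        "--tensor-type", f"ffn_gate_exps={ffn_gate}",
        "--tensor-type", f"ffn_down_exps={ffn_down}",
    ]
    return default, overrides

default_type, overrides = split_mixture("Q8_0-Q4_K-Q4_K-Q5_K")
# llama-quantize argument order: [options] <input.gguf> <output.gguf> <default type>
cmd = ["./llama-quantize", *overrides,
       "GLM-4.7-BF16.gguf", "GLM-4.7-Q8_0-Q4_K-Q4_K-Q5_K.gguf", default_type]
print(" ".join(cmd))
```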

I've labeled each mix with the closest standard quant name (by BPW) I could reasonably discern.

| Quant | Size | Mixture | PPL | KLD |
|---|---|---|---|---|
| Q8_0 | 354.79 GiB (8.50 BPW) | Q8_0 | 8.6821 ± 0.15706 | 0 |
| Q5_K_M | 250.15 GiB (6.00 BPW) | Q8_0-Q5_K-Q5_K-Q6_K | 8.6823 ± 0.15710 | 0.01157 ± 0.00068 |
| Q4_K_M | 209.77 GiB (5.03 BPW) | Q8_0-Q4_K-Q4_K-Q5_K | 8.7467 ± 0.15845 | 0.01726 ± 0.00058 |
| IQ4_XS | 165.28 GiB (3.96 BPW) | Q8_0-IQ3_S-IQ3_S-IQ4_XS | 8.8664 ± 0.16071 | 0.04375 ± 0.00107 |
| IQ2_M | 107.12 GiB (2.57 BPW) | Q5_K-IQ2_XXS-IQ2_XXS-IQ3_XXS | 9.8248 ± 0.17931 | 0.19464 ± 0.00315 |
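
The BPW figures follow directly from file size and parameter count. A minimal sanity check, assuming GLM-4.7's roughly 358B total parameters (an approximation, so expect small rounding drift against the table):

```python
# Reproduce the BPW column: bits per weight = file size in bits / parameter count.
GIB = 2**30
N_PARAMS = 358e9  # approximate total parameter count

for name, size_gib in [("Q8_0", 354.79), ("Q5_K_M", 250.15), ("Q4_K_M", 209.77),
                       ("IQ4_XS", 165.28), ("IQ2_M", 107.12)]:
    print(f"{name}: {size_gib * GIB * 8 / N_PARAMS:.2f} BPW")
```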

(Figure: PPL ratio vs. KLD across the quant mixtures)
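
For context on the KLD column: it is the mean token-level KL divergence between the quantized model's output distribution and a reference run (the zero entry for Q8_0 suggests Q8_0 itself is the baseline here; with llama.cpp this is typically done by saving baseline logits via llama-perplexity's --kl-divergence-base and re-running each quant with --kl-divergence). The sketch below shows the per-token quantity being averaged; it is a NumPy illustration under those assumptions, not llama.cpp's actual implementation.

```python
# Per-token KL divergence D(p_ref || p_quant) over the vocabulary; the KLD column
# is the average of this quantity over all evaluated tokens. Illustration only.
import numpy as np

def token_kld(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """KL(ref || quant) for one token position, given raw logits over the vocab."""
    ref_logp = ref_logits - np.logaddexp.reduce(ref_logits)      # log-softmax
    quant_logp = quant_logits - np.logaddexp.reduce(quant_logits)
    return float(np.sum(np.exp(ref_logp) * (ref_logp - quant_logp)))

rng = np.random.default_rng(0)
vocab = 151552                                       # illustrative vocabulary size
ref = rng.normal(size=vocab)
quant = ref + rng.normal(scale=0.05, size=vocab)     # mimic mild quantization noise
print(f"per-token KLD ≈ {token_kld(ref, quant):.5f}")
```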

Shout Outs
