This repo contains specialized MoE quants for GLM-4.7. The idea is that, because the FFN expert tensors are so large compared to the rest of the model, better quality should be achievable at a smaller overall model size than a comparable naive quantization. To that end, the default quantization type is kept at high quality (Q8_0 to Q5_K), while the FFN_UP and FFN_GATE conditional expert tensors are quantized more aggressively, along with the FFN_DOWN tensors (which are kept one quant level higher in each mix).

The mixture naming convention is [Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN], e.g. Q8_0-Q4_K-Q4_K-Q5_K; a sketch of the corresponding per-tensor overrides follows the list below. This means:

  • Q8_0 is the default type (attention, shared expert, etc.)
  • Q4_K was used for the FFN_UP and FFN_GATE conditional expert tensors
  • Q5_K was used for the FFN_DOWN conditional expert tensors
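
As an illustration of how such a mix could be expressed, the sketch below decomposes a mixture name into per-tensor overrides for llama.cpp's llama-quantize. The `--tensor-type` override flag and the MoE expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, `ffn_down_exps`) are assumptions about a recent llama.cpp build, not the exact commands used to produce this repo's files.

```python
# Sketch: turn "[Default]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN]" into llama-quantize args.
# The --tensor-type flag and expert tensor names are assumptions (recent llama.cpp).

def split_mixture(mixture: str) -> tuple[str, list[str]]:
    """Return (default_type, override_args) for a mixture name like 'Q8_0-Q4_K-Q4_K-Q5_K'."""
    default, ffn_up, ffn_gate, ffn_down = mixture.split("-")
    overrides = [
        "--tensor-type", f"ffn_up_exps={ffn_up}",
        "--tensor-type", f"ffn_gate_exps={ffn_gate}",
        "--tensor-type", f"ffn_down_exps={ffn_down}",
    ]
    return default, overrides

default_type, overrides = split_mixture("Q8_0-Q4_K-Q4_K-Q5_K")
# llama-quantize argument order: [options] <input.gguf> <output.gguf> <default type>
cmd = ["./llama-quantize", *overrides,
       "GLM-4.7-BF16.gguf", "GLM-4.7-Q8_0-Q4_K-Q4_K-Q5_K.gguf", default_type]
print(" ".join(cmd))
```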

I've labeled each mix with the closest standard quant name (by BPW) I could reasonably discern.

| Quant | Size | Mixture | PPL | KLD |
|---|---|---|---|---|
| Q8_0 | 354.79 GiB (8.50 BPW) | Q8_0 | 8.6821 ± 0.15706 | 0 |
| Q5_K_M | 250.15 GiB (6.00 BPW) | Q8_0-Q5_K-Q5_K-Q6_K | 8.6823 ± 0.15710 | 0.01157 ± 0.00068 |
| Q4_K_M | 209.77 GiB (5.03 BPW) | Q8_0-Q4_K-Q4_K-Q5_K | 8.7467 ± 0.15845 | 0.01726 ± 0.00058 |
| IQ4_XS | 165.28 GiB (3.96 BPW) | Q8_0-IQ3_S-IQ3_S-IQ4_XS | 8.8664 ± 0.16071 | 0.04375 ± 0.00107 |
| IQ2_M | 107.12 GiB (2.57 BPW) | Q5_K-IQ2_XXS-IQ2_XXS-IQ3_XXS | 9.8248 ± 0.17931 | 0.19464 ± 0.00315 |
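
The BPW figures follow directly from file size and parameter count. A minimal sanity check, assuming GLM-4.7's roughly 358B total parameters (an approximation, so expect small rounding drift against the table):

```python
# Reproduce the BPW column: bits per weight = file size in bits / parameter count.
GIB = 2**30
N_PARAMS = 358e9  # approximate total parameter count

for name, size_gib in [("Q8_0", 354.79), ("Q5_K_M", 250.15), ("Q4_K_M", 209.77),
                       ("IQ4_XS", 165.28), ("IQ2_M", 107.12)]:
    print(f"{name}: {size_gib * GIB * 8 / N_PARAMS:.2f} BPW")
```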

(Figure: PPL ratio vs. KLD across the quant mixtures)
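
For context on the KLD column: it is the mean token-level KL divergence between the quantized model's output distribution and a reference run (the zero entry for Q8_0 suggests Q8_0 itself is the baseline here; with llama.cpp this is typically done by saving baseline logits via llama-perplexity's --kl-divergence-base and re-running each quant with --kl-divergence). The sketch below shows the per-token quantity being averaged; it is a NumPy illustration under those assumptions, not llama.cpp's actual implementation.

```python
# Per-token KL divergence D(p_ref || p_quant) over the vocabulary; the KLD column
# is the average of this quantity over all evaluated tokens. Illustration only.
import numpy as np

def token_kld(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """KL(ref || quant) for one token position, given raw logits over the vocab."""
    ref_logp = ref_logits - np.logaddexp.reduce(ref_logits)      # log-softmax
    quant_logp = quant_logits - np.logaddexp.reduce(quant_logits)
    return float(np.sum(np.exp(ref_logp) * (ref_logp - quant_logp)))

rng = np.random.default_rng(0)
vocab = 151552                                       # illustrative vocabulary size
ref = rng.normal(size=vocab)
quant = ref + rng.normal(scale=0.05, size=vocab)     # mimic mild quantization noise
print(f"per-token KLD ≈ {token_kld(ref, quant):.5f}")
```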

Shout Outs
