This repo contains specialized MoE quants for GLM-4.7. The idea: given how large the FFN expert tensors are compared to the rest of the tensors in the model, it should be possible to achieve better quality at a smaller overall size than a comparable naive quantization. To that end, the default quantization type is kept high quality (Q8_0 down to Q5_K) while the FFN_UP and FFN_GATE expert tensors are quantized more aggressively, with the FFN_DOWN expert tensors kept one step above them.
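As a rough sketch of how a mix like this could be produced, assuming a recent llama.cpp build whose llama-quantize tool supports per-tensor overrides via --tensor-type (the file names, patterns, and exact flags here are illustrative assumptions, not the commands used for this repo):

```python
# Illustrative sketch only: assumes llama-quantize with --tensor-type
# support; file names and tensor-name patterns are assumptions.
import subprocess

subprocess.run([
    "./llama-quantize",
    # Override the routed-expert FFN tensors; everything else falls
    # back to the default type given as the last argument.
    "--tensor-type", "ffn_up_exps=q4_k",
    "--tensor-type", "ffn_gate_exps=q4_k",
    "--tensor-type", "ffn_down_exps=q5_k",
    "GLM-4.7-BF16.gguf",
    "GLM-4.7-Q8_0-Q4_K-Q4_K-Q5_K.gguf",
    "q8_0",  # default type for attention, shared expert, etc.
], check=True)
```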
The mixture convention is as follows: [Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN], e.g. Q8_0-Q4_K-Q4_K-Q5_K. This means:
- Q8_0 is the default type (attention, shared expert, etc.)
- Q4_K was used for the FFN_UP and FFN_GATE conditional expert tensors
- Q5_K was used for the FFN_DOWN conditional expert tensors
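If you want to sanity-check a downloaded file, a short sketch like the one below can tally which quant type each tensor group actually received. It uses the `gguf` Python package that ships with llama.cpp (`pip install gguf`); the file name is illustrative:

```python
# Tally quant types per tensor group in a GGUF file. Assumes the gguf
# package from llama.cpp; the file name below is a placeholder.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("GLM-4.7-Q8_0-Q4_K-Q4_K-Q5_K.gguf")

counts = Counter()
for tensor in reader.tensors:
    # Routed-expert FFN tensors are named blk.N.ffn_{up,gate,down}_exps.weight;
    # everything else (attention, shared expert, embeddings, ...) is "other".
    group = next((g for g in ("ffn_up_exps", "ffn_gate_exps", "ffn_down_exps")
                  if g in tensor.name), "other")
    counts[(group, tensor.tensor_type.name)] += 1

for (group, qtype), n in sorted(counts.items()):
    print(f"{group:13s} {qtype:8s} x{n}")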
I've labeled each mix with the closest standard quant name I could reasonably discern from its overall BPW. PPL is perplexity, and KLD is the KL divergence from the Q8_0 baseline's logits (hence 0 for Q8_0 itself).
| Quant | Size | Mixture | PPL | KLD |
|---|---|---|---|---|
| Q8_0 | 354.79 GiB (8.50 BPW) | Q8_0 | 8.6821 ± 0.15706 | 0 |
| Q5_K_M | 250.15 GiB (6.00 BPW) | Q8_0-Q5_K-Q5_K-Q6_K | 8.6823 ± 0.15710 | 0.01157 ± 0.00068 |
| Q4_K_M | 209.77 GiB (5.03 BPW) | Q8_0-Q4_K-Q4_K-Q5_K | 8.7467 ± 0.15845 | 0.01726 ± 0.00058 |
| IQ4_XS | 165.28 GiB (3.96 BPW) | Q8_0-IQ3_S-IQ3_S-IQ4_XS | 8.8664 ± 0.16071 | 0.04375 ± 0.00107 |
| IQ2_M | 107.12 GiB (2.57 BPW) | Q5_K-IQ2_XXS-IQ2_XXS-IQ3_XXS | 9.8248 ± 0.17931 | 0.19464 ± 0.00315 |
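As a rough cross-check of the BPW column, the blended bits-per-weight of each mix can be estimated from the per-type BPW values of llama.cpp's block formats. The ~95% routed-expert parameter share and the equal up/gate/down weighting below are assumptions for illustration, not measured from the model:

```python
# Back-of-the-envelope check of the table's BPW column. Per-type BPW
# values follow llama.cpp's block layouts; the 95% expert-parameter
# share and equal up/gate/down split are illustrative assumptions.
BPW = {
    "Q8_0": 8.5, "Q6_K": 6.5625, "Q5_K": 5.5, "Q4_K": 4.5,
    "IQ4_XS": 4.25, "IQ3_S": 3.4375, "IQ3_XXS": 3.0625, "IQ2_XXS": 2.0625,
}

def mix_bpw(default, up, gate, down, expert_frac=0.95):
    """Blended BPW for a [Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN] mix."""
    expert = (BPW[up] + BPW[gate] + BPW[down]) / 3
    return expert_frac * expert + (1 - expert_frac) * BPW[default]

for name, mix in {
    "Q5_K_M": ("Q8_0", "Q5_K", "Q5_K", "Q6_K"),
    "Q4_K_M": ("Q8_0", "Q4_K", "Q4_K", "Q5_K"),
    "IQ4_XS": ("Q8_0", "IQ3_S", "IQ3_S", "IQ4_XS"),
    "IQ2_M":  ("Q5_K", "IQ2_XXS", "IQ2_XXS", "IQ3_XXS"),
}.items():
    print(f"{name}: ~{mix_bpw(*mix):.2f} BPW")
```

Under these assumptions all four mixes land within a few hundredths of the measured BPW in the table, which is consistent with the routed experts dominating the parameter count.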