---
language:
- en
library_name: transformers
tags:
- glm
- MOE
- pruning
- compression
license: mit
name: cerebras/GLM-4.5-Air-REAP-82B-A12B
description: >
  This model was obtained by uniformly pruning 25% of experts in GLM-4.5-Air using the REAP method.
readme: >
  https://huggingface.co/cerebras/GLM-4.5-Air-REAP-82B-A12B/blob/main/README.md
license_link: https://huggingface.co/zai-org/GLM-4.5-Air/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.5-Air
---

<p align="center">
  <em><strong>REAP</strong> the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
  <img src="https://i.imgur.com/rmzG3gg.png" alt="REAP" width="75%">
</p>

# GLM-4.5-Air-REAP-82B-A12B

## ✨ Highlights

Introducing **GLM-4.5-Air-REAP-82B-A12B**, a **memory-efficient compressed variant** of GLM-4.5-Air that maintains near-identical performance while being **25% lighter**.

This model was created using **REAP (Router-weighted Expert Activation Pruning)**, a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over the remaining experts. Key features include:

- **Near-Lossless Performance**: Maintains almost identical accuracy on code generation, agentic coding, and function calling tasks compared to the full 106B model
- **25% Memory Reduction**: Compressed from 106B to 82B parameters, significantly lowering deployment costs and memory requirements
- **Preserved Capabilities**: Retains all core functionalities, including code generation, agentic workflows, repository-scale understanding, and function calling
- **Drop-in Compatibility**: Works with vanilla vLLM; no source modifications or custom patches required
- **Optimized for Real-World Use**: Particularly effective for resource-constrained environments, local deployments, and academic research

---

## 📋 Model Overview

**GLM-4.5-Air-REAP-82B-A12B** has the following specifications (a quick config sanity check is shown below the list):

- **Base Model**: GLM-4.5-Air
- **Compression Method**: REAP (Router-weighted Expert Activation Pruning)
- **Compression Ratio**: 25% expert pruning
- **Type**: Sparse Mixture-of-Experts (SMoE) Causal Language Model
- **Number of Parameters**: 82B total, 12B activated per token
- **Number of Layers**: 46
- **Number of Attention Heads (GQA)**: 96 for Q and 8 for KV
- **Number of Experts**: 96 (uniformly pruned from 128)
- **Number of Activated Experts**: 8 per token
- **Context Length**: 131,072 tokens
- **License**: MIT
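
To double-check these numbers against the shipped configuration, the config can be inspected directly. A minimal sketch, assuming a recent `transformers` release with GLM-4.5 (Glm4Moe) support; the field names (e.g. `n_routed_experts`) are assumptions and should be verified against this repository's `config.json`:

```python
# Quick sanity check of the pruned expert count (field names may vary by
# transformers version, hence the getattr fallback).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("cerebras/GLM-4.5-Air-REAP-82B-A12B")

for field in ("n_routed_experts", "num_experts_per_tok", "num_hidden_layers", "max_position_embeddings"):
    print(field, getattr(cfg, field, "n/a"))
```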
					
						
---

## 📊 Evaluations

<table>
<thead>
<tr>
<th align="left">Benchmark</th>
<th align="center">GLM-4.5-Air</th>
<th align="center"><a href="https://huggingface.co/cerebras/GLM-4.5-Air-REAP-82B-A12B">GLM-4.5-Air-REAP-82B-A12B</a></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Compression</strong></td>
<td align="center">–</td>
<td align="center">25%</td>
</tr>
<tr>
<td colspan="3" align="center"><strong>Coding</strong></td>
</tr>
<tr>
<td><strong>HumanEval</strong></td>
<td align="center">92.7</td>
<td align="center">89.6</td>
</tr>
<tr>
<td><strong>HumanEval+</strong></td>
<td align="center">86.0</td>
<td align="center">84.8</td>
</tr>
<tr>
<td><strong>MBPP</strong></td>
<td align="center">86.2</td>
<td align="center">84.4</td>
</tr>
<tr>
<td><strong>MBPP+</strong></td>
<td align="center">69.8</td>
<td align="center">69.6</td>
</tr>
<tr>
<td colspan="3" align="center"><strong>Reasoning</strong></td>
</tr>
<tr>
<td><strong>LiveCodeBench</strong> (25.01 - 25.05, thinking)</td>
<td align="center">39.6</td>
<td align="center">42.9</td>
</tr>
<tr>
<td><strong>GPQA diamond</strong> (thinking)</td>
<td align="center">65.2</td>
<td align="center">65.2</td>
</tr>
<tr>
<td><strong>AIME24</strong> (thinking)</td>
<td align="center">83.3</td>
<td align="center">80.0</td>
</tr>
<tr>
<td><strong>MATH-500</strong> (thinking)</td>
<td align="center">94.8</td>
<td align="center">94.8</td>
</tr>
<tr>
<td colspan="3" align="center"><strong>Tool Calling</strong></td>
</tr>
<tr>
<td><strong>BFCL-v3</strong></td>
<td align="center">73.4</td>
<td align="center">71.8</td>
</tr>
<tr>
<td><strong>BFCL-v3</strong> (thinking)</td>
<td align="center">76.8</td>
<td align="center">76.3</td>
</tr>
<tr>
<td><strong>τ²-bench</strong> (airline)</td>
<td align="center">63.3</td>
<td align="center">64.0</td>
</tr>
<tr>
<td><strong>τ²-bench</strong> (retail)</td>
<td align="center">72.8</td>
<td align="center">75.1</td>
</tr>
<tr>
<td><strong>τ²-bench</strong> (telecom)</td>
<td align="center">28.4</td>
<td align="center">30.7</td>
</tr>
<tr>
<td><strong>τ²-bench</strong> (telecom, thinking)</td>
<td align="center">27.2</td>
<td align="center">26.9</td>
</tr>
</tbody>
</table>

*This checkpoint maintains almost identical performance while being 25% lighter.*

For more details on the evaluation setup, refer to the [REAP arXiv preprint](https://arxiv.org/abs/2510.13999).

---

## 🚀 Deployment

You can deploy the model directly using the **latest vLLM** (v0.11.0); no source modifications or custom patches are required.

```bash
vllm serve cerebras/GLM-4.5-Air-REAP-82B-A12B \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --enable-auto-tool-choice \
  --enable-expert-parallel
```

If you run into out-of-memory errors when serving this model, try setting a lower value for the `--max-num-seqs` flag (e.g., 64).
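
Once the server is up, it exposes an OpenAI-compatible API. Below is a minimal client sketch, assuming the default `localhost:8000` endpoint and the served model name from the command above; the prompt and sampling settings are only examples:

```python
# Send a chat completion request to the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="cerebras/GLM-4.5-Air-REAP-82B-A12B",
    messages=[{"role": "user", "content": "Write a Python function that checks whether a number is prime."}],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)
```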
					
						
## 🧩 Model Creation

This checkpoint was created by applying the **REAP (Router-weighted Expert Activation Pruning)** method uniformly across all Mixture-of-Experts (MoE) blocks of **GLM-4.5-Air**, with a **25% pruning rate**.

### How REAP Works

REAP selects experts to prune based on a novel **saliency criterion** that considers both:
- **Router gate values**: How frequently and strongly the router activates each expert
- **Expert activation norms**: The magnitude of each expert's output contributions

This dual consideration ensures that experts contributing minimally to a layer's output are pruned, while those that play critical roles in the model's computations are preserved.
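
As a rough, simplified illustration of that idea (not the exact REAP implementation; see the paper and codebase linked below), a per-expert score can be computed over a calibration batch as the gate-weighted average of expert output norms, and the lowest-scoring experts become pruning candidates:

```python
import torch

def expert_saliency(gate_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """Simplified router-weighted expert-activation saliency.

    gate_weights:   [num_tokens, num_experts] router gate values
                    (0 for experts outside a token's top-k).
    expert_outputs: [num_tokens, num_experts, hidden] expert outputs
                    collected on a calibration batch.
    Returns one score per expert.
    """
    norms = expert_outputs.norm(dim=-1)                  # output magnitude per token/expert
    weighted = gate_weights * norms                      # router-weighted activation norms
    routed = (gate_weights > 0).sum(dim=0).clamp(min=1)  # tokens routed to each expert
    return weighted.sum(dim=0) / routed                  # [num_experts]

# scores = expert_saliency(gates, outputs)
# prune_ids = torch.argsort(scores)[:num_to_prune]  # lowest-saliency experts
```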
					
						
### Key Advantages

- **One-Shot Compression**: No fine-tuning is required after pruning; the model is immediately ready for deployment
- **Preserved Router Control**: Unlike expert-merging methods, REAP maintains the router's independent, input-dependent control over the remaining experts, avoiding "functional subspace collapse"
- **Generative Task Superiority**: REAP significantly outperforms expert-merging approaches on generative benchmarks (code generation, creative writing, mathematical reasoning) while maintaining competitive performance on discriminative tasks

### Calibration

The model was calibrated using a diverse mixture of domain-specific datasets (a sketch of assembling a similar mixture appears after the list), including:
- Code generation samples ([evol-codealpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1))
- Function calling examples ([xlam-function-calling](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k))
- Agentic multi-turn trajectories ([SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories))
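
A hypothetical sketch of how such a mixture could be assembled with the `datasets` library; split names, per-dataset sample counts, and the plain-string serialization are assumptions rather than the exact recipe used for this checkpoint:

```python
# Assemble a small mixed calibration set from the datasets listed above.
from datasets import load_dataset

sources = [
    "theblackcat102/evol-codealpaca-v1",
    "Salesforce/xlam-function-calling-60k",
    "SWE-bench/SWE-smith-trajectories",
]

calibration_texts = []
for name in sources:
    ds = load_dataset(name, split="train").shuffle(seed=0).select(range(256))
    # Serialize each record to text; a real pipeline would apply the chat
    # template or dataset-specific formatting instead.
    calibration_texts.extend(str(example) for example in ds)

print(f"{len(calibration_texts)} calibration samples")
```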
					
						
🔗 For more details, refer to the following resources:

- [🧾 arXiv Preprint](https://arxiv.org/abs/2510.13999)
- [🧾 REAP Blog](https://www.cerebras.ai/blog/reap)
- [💻 REAP Codebase (GitHub)](https://github.com/CerebrasResearch/reap)

---

## ⚖️ License

This model is derived from **[`zai-org/GLM-4.5-Air`](https://huggingface.co/zai-org/GLM-4.5-Air)** and distributed under the **MIT license**.

---

## 🧾 Citation

If you use this checkpoint, please cite the REAP paper:

```bibtex
@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```