---
tags:
- moe
- minimax
- bfloat16
- sglang
- gguf
license: mit
datasets:
- nick007x/github-code-2025
- tatsu-lab/alpaca
base_model:
- MiniMaxAI/MiniMax-M2
---

![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)

# THRIFT — Targeted Reduction for Inference and Fine-Tuning

A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.

## TLDR

We, the over-caffeinated researchers at VibeStud.io, wanted to create a 50% pruned version of the SOTA MiniMax M2 that is best suited for local/air-gapped coding. With this version, we achieved ~25%. A 50% pruned version is under development, while a not-so-sucky team of ours is working on a 50% pruned version of Kimi K2 Thinking. We're writing the paper and expanding the evaluation set to substantiate the results. Check back later, cheers!

## Why it's useful

* **Lower latency:** Snappier responses for interactive apps and chatbots.
* **Smaller memory footprint:** Runs on cheaper GPUs or with fewer resources per replica.
* **Higher throughput:** Serve more concurrent users at the same cost.
* **Deployment-friendly:** Drop-in replacement for the base model in most inference stacks.
* **Adaptable:** Supports light fine-tuning to match your domain and style guidelines.

## Intended use

* General chat and coding assistance
* Enterprise assistants with strict latency/VRAM budgets
* Batch or realtime serving in cloud and on-prem environments
* Edge or cost-sensitive deployments where efficiency matters

## When to use it

* You're constrained by GPU memory or need shorter response times
* You want to increase QPS without scaling infrastructure
* You need a model that is "good enough" for most tasks at a better cost profile

---

# Model Comparison Report

**Models Under Evaluation**

| Model | Type |
| :---- | :---- |
| ModelCloud/MiniMax-M2-BF16 | Base Model |
| VibeStudio/MiniMax-M2-THRIFT | Compressed/Optimized |

**Evaluation Date: November 7, 2025**

## 📊 Results Comparison

### 1) Multiple Choice Q&A (lm-eval)

**Overall MMLU Performance**

| Model | MMLU Overall | Humanities | STEM | Social Sciences | Other |
| :---- | ----: | ----: | ----: | ----: | ----: |
| MiniMax-M2-BF16 | 83.16% | 77.45% | 80.91% | 90.02% | 87.29% |
| MiniMax-M2-THRIFT | 77.72% | 70.14% | 77.61% | 86.84% | 80.27% |
| **Δ (Difference)** | **-5.44%** | **-7.31%** | **-3.30%** | **-3.18%** | **-7.02%** |

**Individual Task Performance**

| Task | BF16 (Base) | THRIFT-BF16 | Difference |
| :---- | ----: | ----: | ----: |
| arc_challenge | 73.21% | 61.01% | -12.20% ⬇️ |
| arc_easy | 88.30% | 83.08% | -5.22% ⬇️ |
| boolq | 87.95% | 84.95% | -3.00% ⬇️ |
| hellaswag | 83.00% | 77.09% | -5.91% ⬇️ |
| mmlu | 83.16% | 77.72% | -5.44% ⬇️ |
| openbookqa | 48.60% | 43.00% | -5.60% ⬇️ |
| rte | 75.45% | 80.14% | **+4.69% ⬆️** |
| winogrande | 76.48% | 74.90% | -1.58% ⬇️ |

**Average Accuracy Drop: -4.28%**
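The scores above were produced with the lm-eval harness. The exact harness version, few-shot settings, and batch sizes behind these runs are not recorded in this card, so the command below is only a sketch of how one could reproduce them, assuming the lm-evaluation-harness CLI with the Hugging Face backend; a model of this size will likely also need multi-GPU settings (for example `parallelize=True`) or an inference-server backend.

```shell
# Sketch only: assumes the lm-evaluation-harness CLI; the exact settings used for
# the numbers above (few-shot counts, harness version, batch size) are not recorded.
pip install lm-eval

lm_eval \
  --model hf \
  --model_args pretrained=VibeStudio/MiniMax-M2-THRIFT,dtype=bfloat16,trust_remote_code=True \
  --tasks mmlu,arc_challenge,arc_easy,boolq,hellaswag,openbookqa,rte,winogrande \
  --batch_size auto \
  --output_path results/minimax-m2-thrift
```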
### 2) Code Generation (EvalPlus)

**MBPP Results**

| Model | MBPP (base) | MBPP+ (extended) |
| :---- | ----: | ----: |
| MiniMax-M2-BF16 | 73.8% | 64.0% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |

**HumanEval Results**

| Model | HumanEval (base) | HumanEval+ (extended) |
| :---- | ----: | ----: |
| MiniMax-M2-BF16 | ✅ Complete | ✅ Complete |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |

### 3) Math Benchmarks

**GSM8K Results**

| Model | Accuracy | Problems |
| :---- | ----: | ----: |
| MiniMax-M2-BF16 | 92.72% | 1,319 |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 1,319 |

**MATH-500 Results**

| Model | Overall | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
| :---- | ----: | ----: | ----: | ----: | ----: | ----: |
| MiniMax-M2-BF16 | 87.2% | 90.7% | 95.56% | 82.86% | 85.16% | 85.82% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 | 🔄 | 🔄 | 🔄 | 🔄 |

### 4) LiveCodeBench (Live Coding Problems)

| Model | pass@1 | Problems | Status |
| :---- | ----: | ----: | :---- |
| **MiniMax-M2-BF16** | **35.71%** | 182 | ✅ Complete |
| **MiniMax-M2-THRIFT** | 🔄 Coming Soon | 182 | ⏳ Not Started Yet |

---

## 📈 Analysis (Preliminary)

### Key Findings

**MMLU Performance Drop**

* THRIFT-BF16 shows a **-5.44%** drop in overall MMLU
* Largest drop: **arc_challenge (-12.20%)**
* Smallest drop: **winogrande (-1.58%)**
* **RTE improved by +4.69%** 🎉

**Subject-Specific Performance**

* Best preservation: **Social Sciences (-3.18%)**
* Most degraded: **Humanities (-7.31%)**
* STEM: **moderate drop (-3.30%)**

**Compression Trade-off**

* THRIFT-BF16 (compressed) vs. BF16 (base)
* Average accuracy loss: **~4–5%**
* Expected for compressed/pruned models

**MMLU Category Breakdown**

| Category | BF16 (Base) | THRIFT-BF16 | Difference | Status |
| :---- | ----: | ----: | ----: | :---- |
| High School Government | 97.93% | 94.82% | -3.11% | ✅ Still Excellent |
| High School Psychology | 95.41% | 93.58% | -1.83% | ✅ Well Preserved |
| Marketing | 95.73% | 91.88% | -3.85% | ✅ Good |
| Professional Medicine | 92.28% | 79.78% | -12.50% | ⚠️ Notable Drop |
| Clinical Knowledge | 92.83% | 85.66% | -7.17% | ⚠️ Moderate Drop |

---

## SGLang Deployment with Python

We recommend using a virtual environment (such as **venv**, **conda**, or **uv**) and installing SGLang in a fresh Python environment to avoid dependency conflicts:

```shell
git clone -b v0.5.4.post1 https://github.com/sgl-project/sglang.git
cd sglang

# Install the python packages
pip install --upgrade pip
pip install -e "python"
```

Run one of the following commands to start the SGLang server. SGLang will automatically download and cache the MiniMax-M2 model from Hugging Face. The commands below mirror the upstream MiniMax-M2 instructions; to serve this pruned variant instead, point `--model-path` at `VibeStudio/MiniMax-M2-THRIFT`.

**4-GPU deployment command:**

```shell
python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2 \
    --tp-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 8000 \
    --mem-fraction-static 0.85
```

**8-GPU deployment command:**

```shell
python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2 \
    --tp-size 8 \
    --ep-size 8 \
    --tool-call-parser minimax-m2 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --reasoning-parser minimax-append-think \
    --port 8000 \
    --mem-fraction-static 0.85
```

## Testing Deployment

After startup, you can test the SGLang OpenAI-compatible API with the following command:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
      {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
    ]
  }'
```
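Because the endpoint is OpenAI-compatible, streamed responses should also work. A minimal sketch, assuming the same host, port, and model name as the command above:

```shell
# Sketch: stream tokens from the same OpenAI-compatible endpoint started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2",
    "stream": true,
    "messages": [
      {"role": "user", "content": [{"type": "text", "text": "Write a haiku about GPUs."}]}
    ]
  }'
```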
## Benchmarks

Coming soon.

## Research paper

Coming soon.

---

## License

This model is derived from MiniMax-M2 and distributed under the MIT License: [http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE](http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE)

---

## Credits

Model conversion and HF Transformers code by @Qubitum at ModelCloud.

## References (BibTeX)

```
@article{cai2025thinking,
  title        = {Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series},
  author       = {Cai, Wenrui and Wang, Chengyu and Yan, Junbing and Huang, Jun and Fang, Xiangzhong},
  journal      = {arXiv preprint arXiv:2511.01354},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2511.01354},
  primaryclass = {cs.CL}
}

@misc{lasby-reap,
  title     = {{REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression}},
  author    = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  year      = {2025},
  publisher = {arXiv},
  note      = {arXiv:2510.13999v1 [cs]},
  url       = {https://arxiv.org/abs/2510.13999v1}
}

@article{yang2025wandapp,
  title        = {Wanda++: Pruning Large Language Models via Regional Gradients},
  author       = {Yang, Yifan and Zhen, Kai and Ganesh, Bhavana and Galstyan, Aram and Huybrechts, Goeric and M{\"u}ller, Markus and K{\"u}bler, Jonas M. and Swaminathan, Rupak Vignesh and Mouchtaris, Athanasios and Bodapati, Sravan Babu and Susanj, Nathan and Zhang, Zheng and FitzGerald, Jack and Kumar, Abhishek},
  journal      = {arXiv preprint arXiv:2503.04992},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2503.04992},
  primaryclass = {cs.CL}
}

@article{li2025tyr,
  title        = {Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization},
  author       = {Li, G. and Xu, Yixing and Li, Zeping and Liu, Ji and Yin, Xuanwu and Li, Dong and Barsoum, Emad},
  journal      = {arXiv preprint arXiv:2503.09657},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2503.09657},
  primaryclass = {cs.CL}
}

@article{xia2023sheared,
  title        = {Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning},
  author       = {Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi},
  journal      = {arXiv preprint arXiv:2310.06694},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2310.06694},
  primaryclass = {cs.CL}
}

@article{ma2023llmpruner,
  title        = {LLM-Pruner: On the Structural Pruning of Large Language Models},
  author       = {Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  journal      = {arXiv preprint arXiv:2305.11627},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2305.11627},
  primaryclass = {cs.CL}
}

@article{sun2023wanda,
  title        = {A Simple and Effective Pruning Approach for Large Language Models},
  author       = {Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J. Zico},
  journal      = {arXiv preprint arXiv:2306.11695},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2306.11695},
  primaryclass = {cs.CL}
}

@article{frantar2023sparsegpt,
  title        = {SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot},
  author       = {Frantar, Elias and Alistarh, Dan},
  journal      = {arXiv preprint arXiv:2301.00774},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2301.00774},
  primaryclass = {cs.CL}
}

@article{dettmers2023qlora,
  title        = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author       = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal      = {arXiv preprint arXiv:2305.14314},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2305.14314},
  primaryclass = {cs.LG}
}
```