THRIFT — Targeted Reduction for Inference and Fine-Tuning
A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.
TLDR
We, the over-caffeinated researchers at VibeStud.io, set out to build a 50% pruned version of the SOTA MiniMax M2 that is well suited for local/air-gapped coding. This release achieves roughly 25% pruning; a 50% pruned version is under development, while another (not so sucky) team of ours is working on a 50% pruned version of Kimi K2 Thinking. We're writing the paper and expanding the evaluation set to substantiate the results. Check back later, cheers!
Why it’s useful
- Lower latency: Snappier responses for interactive apps and chatbots.
- Smaller memory footprint: Runs on cheaper GPUs or with fewer resources per replica.
- Higher throughput: Serve more concurrent users at the same cost.
- Deployment-friendly: Drop-in replacement for the base model in most inference stacks.
- Adaptable: Supports light fine-tuning to match your domain and style guidelines.
Intended use
- General chat and coding assistance
- Enterprise assistants with strict latency/VRAM budgets
- Batch or realtime serving in cloud and on-prem environments
- Edge or cost-sensitive deployments where efficiency matters
When to use it
- You’re constrained by GPU memory or need shorter response times
- You want to increase QPS without scaling infrastructure
- You need a model that is “good enough” for most tasks at a better cost profile
Model Comparison Report
Models Under Evaluation
| Model | Type |
|---|---|
| ModelCloud/MiniMax-M2-BF16 | Base Model |
| VibeStudio/MiniMax-M2-THRIFT | Compressed/Optimized |
Evaluation Date: November 7, 2025
📊 Results Comparison
1) Multiple Choice Q&A (lm-eval)
Overall MMLU Performance
| Model | MMLU Overall | Humanities | STEM | Social Sciences | Other |
|---|---|---|---|---|---|
| MiniMax-M2-BF16 | 83.16% | 77.45% | 80.91% | 90.02% | 87.29% |
| MiniMax-M2-THRIFT | 77.72% | 70.14% | 77.61% | 86.84% | 80.27% |
| Δ (Difference) | -5.44% | -7.31% | -3.30% | -3.18% | -7.02% |
Individual Task Performance
| Task | BF16 (Base) | THRIFT-BF16 | Difference |
|---|---|---|---|
| arc_challenge | 73.21% | 61.01% | -12.20% ⬇️ |
| arc_easy | 88.30% | 83.08% | -5.22% ⬇️ |
| boolq | 87.95% | 84.95% | -3.00% ⬇️ |
| hellaswag | 83.00% | 77.09% | -5.91% ⬇️ |
| mmlu | 83.16% | 77.72% | -5.44% ⬇️ |
| openbookqa | 48.60% | 43.00% | -5.60% ⬇️ |
| rte | 75.45% | 80.14% | +4.69% ⬆️ |
| winogrande | 76.48% | 74.90% | -1.58% ⬇️ |
Average Accuracy Drop: -4.28%
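These multiple-choice numbers come from the lm-evaluation-harness. If you want to reproduce them, a typical invocation looks like the sketch below; the vLLM backend, tensor_parallel_size=4, and automatic batch sizing are illustrative assumptions, not necessarily the exact configuration behind the table above:
pip install "lm-eval[vllm]"
lm_eval --model vllm \
  --model_args pretrained=VibeStudio/MiniMax-M2-THRIFT,tensor_parallel_size=4,trust_remote_code=True \
  --tasks mmlu,arc_challenge,arc_easy,boolq,hellaswag,openbookqa,rte,winogrande \
  --batch_size auto
Swap the pretrained= path for ModelCloud/MiniMax-M2-BF16 to regenerate the base-model column.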
2) Code Generation (EvalPlus)
MBPP Results
| Model | MBPP (base) | MBPP+ (extended) |
|---|---|---|
| MiniMax-M2-BF16 | 73.8% | 64.0% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
HumanEval Results
| Model | HumanEval (base) | HumanEval+ (extended) |
|---|---|---|
| MiniMax-M2-BF16 | ✅ Complete | ✅ Complete |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
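The missing EvalPlus entries can be generated with the EvalPlus CLI; the sketch below assumes the vLLM backend and greedy decoding, and exact flags differ between EvalPlus releases, so check the EvalPlus documentation before running:
pip install "evalplus[vllm]"
evalplus.evaluate --model VibeStudio/MiniMax-M2-THRIFT --dataset humaneval --backend vllm --greedy
evalplus.evaluate --model VibeStudio/MiniMax-M2-THRIFT --dataset mbpp --backend vllm --greedy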
3) Math Benchmarks
GSM8K Results
| Model | Accuracy | Problems |
|---|---|---|
| MiniMax-M2-BF16 | 92.72% | 1,319 |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 1,319 |
MATH-500 Results
| Model | Overall | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|
| MiniMax-M2-BF16 | 87.2% | 90.7% | 95.56% | 82.86% | 85.16% | 85.82% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 | 🔄 | 🔄 | 🔄 | 🔄 |
4) LiveCodeBench (Live Coding Problems)
| Model | pass@1 | Problems | Status |
|---|---|---|---|
| MiniMax-M2-BF16 | 35.71% | 182 | ✅ Complete |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 182 | ⏳ Not Started Yet |
📈 Analysis (Preliminary)
Key Findings
Multiple-Choice Q&A Performance Drop
- THRIFT-BF16 shows a -5.44% overall MMLU drop
- Largest drop: arc_challenge (-12.20%)
- Smallest drop: winogrande (-1.58%)
- RTE improved by +4.69% 🎉
Subject-Specific Performance (MMLU)
- Best preservation: Social Sciences (-3.18%)
- Most degraded: Humanities (-7.31%), followed by Other (-7.02%)
- STEM: moderate drop (-3.30%)
Compression Trade-off
- THRIFT-BF16 (compressed) vs BF16 (base)
- Average accuracy loss: ~4–5%
- In line with what is typically observed for pruned/compressed models
MMLU Category Breakdown
| Category | BF16 (Base) | THRIFT-BF16 | Difference | Status |
|---|---|---|---|---|
| High School Government | 97.93% | 94.82% | -3.11% | ✅ Still Excellent |
| High School Psychology | 95.41% | 93.58% | -1.83% | ✅ Well Preserved |
| Marketing | 95.73% | 91.88% | -3.85% | ✅ Good |
| Professional Medicine | 92.28% | 79.78% | -12.50% | ⚠️ Notable Drop |
| Clinical Knowledge | 92.83% | 85.66% | -7.17% | ⚠️ Moderate Drop |
SGLang Deployment with Python
We recommend installing SGLang in a fresh Python environment (venv, conda, or uv) to avoid dependency conflicts.
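For example, a minimal setup with the standard venv module (the .venv directory name is just a convention):
python3 -m venv .venv
source .venv/bin/activate
Then clone and install SGLang from source: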
git clone -b v0.5.4.post1 https://github.com/sgl-project/sglang.git
cd sglang
# Install the python packages
pip install --upgrade pip
pip install -e "python"
Run the following command to start the SGLang server. SGLang will automatically download and cache the MiniMax-M2 model from Hugging Face.
4-GPU deployment command:
python -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M2 \
--tp-size 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--host 0.0.0.0 \
--trust-remote-code \
--port 8000 \
--mem-fraction-static 0.85
8-GPU deployment command:
python -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M2 \
--tp-size 8 \
--ep-size 8 \
--tool-call-parser minimax-m2 \
--trust-remote-code \
--host 0.0.0.0 \
--reasoning-parser minimax-append-think \
--port 8000 \
--mem-fraction-static 0.85
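The launch commands above point at the base MiniMaxAI/MiniMax-M2 repository. To serve the pruned checkpoint instead, swap in the THRIFT repository id; we assume the same parsers and flags carry over, and you should adjust --tp-size and --mem-fraction-static to your hardware:
python -m sglang.launch_server \
--model-path VibeStudio/MiniMax-M2-THRIFT \
--tp-size 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--host 0.0.0.0 \
--trust-remote-code \
--port 8000 \
--mem-fraction-static 0.85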
Testing Deployment
After startup, you can test the SGLang OpenAI-compatible API with the following command:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M2",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
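You can also confirm the server is up and check the registered model id with the standard OpenAI-compatible models endpoint, which SGLang exposes alongside /v1/chat/completions:
curl http://localhost:8000/v1/models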
Benchmarks
Coming soon.
Research paper
Coming soon.
License
This model is derived from MiniMax-M2 and is distributed under the MIT License (https://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE).
Credits
Model conversion and HF Transformers code by @Qubitum at ModelCloud.