---
tags:
- moe
- minimax
- bfloat16
- sglang
- gguf
license: mit
datasets:
- nick007x/github-code-2025
- tatsu-lab/alpaca
base_model:
- MiniMaxAI/MiniMax-M2
---

![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)

# THRIFT — Targeted Reduction for Inference and Fine-Tuning

A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.

## TLDR

We, the over-caffeinated researchers at VibeStud.io, wanted to create a 50% pruned version of the SOTA MiniMax M2 that is best suited for local/air-gapped coding. With this version, we achieved ~25%. A 50% pruned version is under development, while a not-so-sucky team of ours is working on a 50% pruned version of Kimi K2 Thinking. We're writing the paper and expanding the evaluation set to substantiate the results. Check back later, cheers!

## Why it's useful

* **Lower latency:** Snappier responses for interactive apps and chatbots.
* **Smaller memory footprint:** Runs on cheaper GPUs or with fewer resources per replica.
* **Higher throughput:** Serve more concurrent users at the same cost.
* **Deployment-friendly:** Drop-in replacement for the base model in most inference stacks.
* **Adaptable:** Supports light fine-tuning to match your domain and style guidelines.

## Intended use

* General chat and coding assistance
* Enterprise assistants with strict latency/VRAM budgets
* Batch or realtime serving in cloud and on-prem environments
* Edge or cost-sensitive deployments where efficiency matters

## When to use it

* You're constrained by GPU memory or need shorter response times
* You want to increase QPS without scaling infrastructure
* You need a model that is "good enough" for most tasks at a better cost profile

---

# Model Comparison Report

**Models Under Evaluation**

| Model | Type |
| :---- | :---- |
| ModelCloud/MiniMax-M2-BF16 | Base Model |
| VibeStudio/MiniMax-M2-THRIFT | Compressed/Optimized |

**Evaluation Date: November 7, 2025**

## 📊 Results Comparison

### 1) Multiple Choice Q&A (lm-eval)

**Overall MMLU Performance**

| Model | MMLU Overall | Humanities | STEM | Social Sciences | Other |
| :---- | ----: | ----: | ----: | ----: | ----: |
| MiniMax-M2-BF16 | 83.16% | 77.45% | 80.91% | 90.02% | 87.29% |
| MiniMax-M2-THRIFT | 77.72% | 70.14% | 77.61% | 86.84% | 80.27% |
| **Δ (Difference)** | **-5.44%** | **-7.31%** | **-3.30%** | **-3.18%** | **-7.02%** |

**Individual Task Performance**

| Task | BF16 (Base) | THRIFT-BF16 | Difference |
| :---- | ----: | ----: | ----: |
| arc_challenge | 73.21% | 61.01% | -12.20% ⬇️ |
| arc_easy | 88.30% | 83.08% | -5.22% ⬇️ |
| boolq | 87.95% | 84.95% | -3.00% ⬇️ |
| hellaswag | 83.00% | 77.09% | -5.91% ⬇️ |
| mmlu | 83.16% | 77.72% | -5.44% ⬇️ |
| openbookqa | 48.60% | 43.00% | -5.60% ⬇️ |
| rte | 75.45% | 80.14% | **+4.69% ⬆️** |
| winogrande | 76.48% | 74.90% | -1.58% ⬇️ |

**Average Accuracy Drop: -4.28%**
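The scores above were produced with the lm-eval harness. The exact harness version, few-shot settings, and batch sizes behind these runs are not recorded in this card, so the command below is only a sketch of how one could reproduce them, assuming the lm-evaluation-harness CLI with the Hugging Face backend; a model of this size will likely also need multi-GPU settings (for example `parallelize=True`) or an inference-server backend.

```shell
# Sketch only: assumes the lm-evaluation-harness CLI; the exact settings used for
# the numbers above (few-shot counts, harness version, batch size) are not recorded.
pip install lm-eval

lm_eval \
  --model hf \
  --model_args pretrained=VibeStudio/MiniMax-M2-THRIFT,dtype=bfloat16,trust_remote_code=True \
  --tasks mmlu,arc_challenge,arc_easy,boolq,hellaswag,openbookqa,rte,winogrande \
  --batch_size auto \
  --output_path results/minimax-m2-thrift
```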
### 2) Code Generation (EvalPlus)

**MBPP Results**

| Model | MBPP (base) | MBPP+ (extended) |
| :---- | ----: | ----: |
| MiniMax-M2-BF16 | 73.8% | 64.0% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |

**HumanEval Results**

| Model | HumanEval (base) | HumanEval+ (extended) |
| :---- | ----: | ----: |
| MiniMax-M2-BF16 | ✅ Complete | ✅ Complete |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |

### 3) Math Benchmarks

**GSM8K Results**

| Model | Accuracy | Problems |
| :---- | ----: | ----: |
| MiniMax-M2-BF16 | 92.72% | 1,319 |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 1,319 |

**MATH-500 Results**

| Model | Overall | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
| :---- | ----: | ----: | ----: | ----: | ----: | ----: |
| MiniMax-M2-BF16 | 87.2% | 90.7% | 95.56% | 82.86% | 85.16% | 85.82% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 | 🔄 | 🔄 | 🔄 | 🔄 |

### 4) LiveCodeBench (Live Coding Problems)

| Model | pass@1 | Problems | Status |
| :---- | ----: | ----: | :---- |
| **MiniMax-M2-BF16** | **35.71%** | 182 | ✅ Complete |
| **MiniMax-M2-THRIFT** | 🔄 Coming Soon | 182 | ⏳ Not Started Yet |

---

## 📈 Analysis (Preliminary)

### Key Findings

**MMLU Performance Drop**

* THRIFT-BF16 shows a **-5.44%** drop in overall MMLU
* Largest drop: **arc_challenge (-12.20%)**
* Smallest drop: **winogrande (-1.58%)**
* **RTE improved by +4.69%** 🎉

**Subject-Specific Performance**

* Best preservation: **Social Sciences (-3.18%)**
* Most degraded: **Humanities (-7.31%)**
* STEM: **moderate drop (-3.30%)**

**Compression Trade-off**

* THRIFT-BF16 (compressed) vs. BF16 (base)
* Average accuracy loss: **~4–5%**
* Expected for compressed/pruned models

**MMLU Category Breakdown**

| Category | BF16 (Base) | THRIFT-BF16 | Difference | Status |
| :---- | ----: | ----: | ----: | :---- |
| High School Government | 97.93% | 94.82% | -3.11% | ✅ Still Excellent |
| High School Psychology | 95.41% | 93.58% | -1.83% | ✅ Well Preserved |
| Marketing | 95.73% | 91.88% | -3.85% | ✅ Good |
| Professional Medicine | 92.28% | 79.78% | -12.50% | ⚠️ Notable Drop |
| Clinical Knowledge | 92.83% | 85.66% | -7.17% | ⚠️ Moderate Drop |

---

## SGLang Deployment with Python

We recommend using a virtual environment (such as **venv**, **conda**, or **uv**) and installing SGLang in a fresh Python environment to avoid dependency conflicts:

```shell
git clone -b v0.5.4.post1 https://github.com/sgl-project/sglang.git
cd sglang

# Install the python packages
pip install --upgrade pip
pip install -e "python"
```

Run one of the following commands to start the SGLang server. SGLang will automatically download and cache the MiniMax-M2 model from Hugging Face. The commands below mirror the upstream MiniMax-M2 instructions; to serve this pruned variant instead, point `--model-path` at `VibeStudio/MiniMax-M2-THRIFT`.

**4-GPU deployment command:**

```shell
python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2 \
    --tp-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 8000 \
    --mem-fraction-static 0.85
```

**8-GPU deployment command:**

```shell
python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2 \
    --tp-size 8 \
    --ep-size 8 \
    --tool-call-parser minimax-m2 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --reasoning-parser minimax-append-think \
    --port 8000 \
    --mem-fraction-static 0.85
```

## Testing Deployment

After startup, you can test the SGLang OpenAI-compatible API with the following command:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
      {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
    ]
  }'
```
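Because the endpoint is OpenAI-compatible, streamed responses should also work. A minimal sketch, assuming the same host, port, and model name as the command above:

```shell
# Sketch: stream tokens from the same OpenAI-compatible endpoint started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2",
    "stream": true,
    "messages": [
      {"role": "user", "content": [{"type": "text", "text": "Write a haiku about GPUs."}]}
    ]
  }'
```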
## Benchmarks

Coming soon.

## Research paper

Coming soon.

---

## License

This model is derived from MiniMax-M2 and distributed under the MIT License: [http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE](http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE)

---

## Credits

Model conversion and HF Transformers code by @Qubitum at ModelCloud.

## References (BibTeX)

```
@article{cai2025thinking,
  title        = {Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series},
  author       = {Cai, Wenrui and Wang, Chengyu and Yan, Junbing and Huang, Jun and Fang, Xiangzhong},
  journal      = {arXiv preprint arXiv:2511.01354},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2511.01354},
  primaryclass = {cs.CL}
}

@misc{lasby-reap,
  title     = {{REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression}},
  author    = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  year      = {2025},
  publisher = {arXiv},
  note      = {arXiv:2510.13999v1 [cs]},
  url       = {https://arxiv.org/abs/2510.13999v1}
}

@article{yang2025wandapp,
  title        = {Wanda++: Pruning Large Language Models via Regional Gradients},
  author       = {Yang, Yifan and Zhen, Kai and Ganesh, Bhavana and Galstyan, Aram and Huybrechts, Goeric and M{\"u}ller, Markus and K{\"u}bler, Jonas M. and Swaminathan, Rupak Vignesh and Mouchtaris, Athanasios and Bodapati, Sravan Babu and Susanj, Nathan and Zhang, Zheng and FitzGerald, Jack and Kumar, Abhishek},
  journal      = {arXiv preprint arXiv:2503.04992},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2503.04992},
  primaryclass = {cs.CL}
}

@article{li2025tyr,
  title        = {Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization},
  author       = {Li, G. and Xu, Yixing and Li, Zeping and Liu, Ji and Yin, Xuanwu and Li, Dong and Barsoum, Emad},
  journal      = {arXiv preprint arXiv:2503.09657},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2503.09657},
  primaryclass = {cs.CL}
}

@article{xia2023sheared,
  title        = {Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning},
  author       = {Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi},
  journal      = {arXiv preprint arXiv:2310.06694},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2310.06694},
  primaryclass = {cs.CL}
}

@article{ma2023llmpruner,
  title        = {LLM-Pruner: On the Structural Pruning of Large Language Models},
  author       = {Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  journal      = {arXiv preprint arXiv:2305.11627},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2305.11627},
  primaryclass = {cs.CL}
}

@article{sun2023wanda,
  title        = {A Simple and Effective Pruning Approach for Large Language Models},
  author       = {Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J. Zico},
  journal      = {arXiv preprint arXiv:2306.11695},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2306.11695},
  primaryclass = {cs.CL}
}

@article{frantar2023sparsegpt,
  title        = {SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot},
  author       = {Frantar, Elias and Alistarh, Dan},
  journal      = {arXiv preprint arXiv:2301.00774},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2301.00774},
  primaryclass = {cs.CL}
}

@article{dettmers2023qlora,
  title        = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author       = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal      = {arXiv preprint arXiv:2305.14314},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2305.14314},
  primaryclass = {cs.LG}
}
```