THRIFT — Targeted Reduction for Inference and Fine-Tuning
A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.
TLDR
We, the over-caffeinated researchers at VibeStud.io, set out to build a 50% pruned version of the SOTA MiniMax M2 that is well suited for local/air-gapped coding. This release achieves roughly 25% pruning; a 50% pruned version is under development, while another (not so sucky) team of ours is working on a 50% pruned version of Kimi K2 Thinking. We're writing the paper and expanding the evaluation set to substantiate the results. Check back later, cheers!
Why it’s useful
- Lower latency: Snappier responses for interactive apps and chatbots.
- Smaller memory footprint: Runs on cheaper GPUs or with fewer resources per replica.
- Higher throughput: Serve more concurrent users at the same cost.
- Deployment-friendly: Drop-in replacement for the base model in most inference stacks.
- Adaptable: Supports light fine-tuning to match your domain and style guidelines.
Intended use
- General chat and coding assistance
- Enterprise assistants with strict latency/VRAM budgets
- Batch or realtime serving in cloud and on-prem environments
- Edge or cost-sensitive deployments where efficiency matters
When to use it
- You’re constrained by GPU memory or need shorter response times
- You want to increase QPS without scaling infrastructure
- You need a model that is “good enough” for most tasks at a better cost profile
Model Comparison Report
Models Under Evaluation
| Model | Type |
|---|---|
| ModelCloud/MiniMax-M2-BF16 | Base Model |
| VibeStudio/MiniMax-M2-THRIFT | Compressed/Optimized |
Evaluation Date: November 7, 2025
📊 Results Comparison
1) Multiple Choice Q&A (lm-eval)
Overall MMLU Performance
| Model | MMLU Overall | Humanities | STEM | Social Sciences | Other |
|---|---|---|---|---|---|
| MiniMax-M2-BF16 | 83.16% | 77.45% | 80.91% | 90.02% | 87.29% |
| MiniMax-M2-THRIFT | 77.72% | 70.14% | 77.61% | 86.84% | 80.27% |
| Δ (Difference) | -5.44% | -7.31% | -3.30% | -3.18% | -7.02% |
Individual Task Performance
| Task | BF16 (Base) | THRIFT-BF16 | Difference |
|---|---|---|---|
| arc_challenge | 73.21% | 61.01% | -12.20% ⬇️ |
| arc_easy | 88.30% | 83.08% | -5.22% ⬇️ |
| boolq | 87.95% | 84.95% | -3.00% ⬇️ |
| hellaswag | 83.00% | 77.09% | -5.91% ⬇️ |
| mmlu | 83.16% | 77.72% | -5.44% ⬇️ |
| openbookqa | 48.60% | 43.00% | -5.60% ⬇️ |
| rte | 75.45% | 80.14% | +4.69% ⬆️ |
| winogrande | 76.48% | 74.90% | -1.58% ⬇️ |
Average Accuracy Drop: -4.28%
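These multiple-choice numbers come from the lm-evaluation-harness. If you want to reproduce them, a typical invocation looks like the sketch below; the vLLM backend, tensor_parallel_size=4, and automatic batch sizing are illustrative assumptions, not necessarily the exact configuration behind the table above:
pip install "lm-eval[vllm]"
lm_eval --model vllm \
  --model_args pretrained=VibeStudio/MiniMax-M2-THRIFT,tensor_parallel_size=4,trust_remote_code=True \
  --tasks mmlu,arc_challenge,arc_easy,boolq,hellaswag,openbookqa,rte,winogrande \
  --batch_size auto
Swap the pretrained= path for ModelCloud/MiniMax-M2-BF16 to regenerate the base-model column.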
2) Code Generation (EvalPlus)
MBPP Results
| Model | MBPP (base) | MBPP+ (extended) |
|---|---|---|
| MiniMax-M2-BF16 | 73.8% | 64.0% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
HumanEval Results
| Model | HumanEval (base) | HumanEval+ (extended) |
|---|---|---|
| MiniMax-M2-BF16 | ✅ Complete | ✅ Complete |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
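The missing EvalPlus entries can be generated with the EvalPlus CLI; the sketch below assumes the vLLM backend and greedy decoding, and exact flags differ between EvalPlus releases, so check the EvalPlus documentation before running:
pip install "evalplus[vllm]"
evalplus.evaluate --model VibeStudio/MiniMax-M2-THRIFT --dataset humaneval --backend vllm --greedy
evalplus.evaluate --model VibeStudio/MiniMax-M2-THRIFT --dataset mbpp --backend vllm --greedy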
3) Math Benchmarks
GSM8K Results
| Model | Accuracy | Problems |
|---|---|---|
| MiniMax-M2-BF16 | 92.72% | 1,319 |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 1,319 |
MATH-500 Results
| Model | Overall | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|
| MiniMax-M2-BF16 | 87.2% | 90.7% | 95.56% | 82.86% | 85.16% | 85.82% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 | 🔄 | 🔄 | 🔄 | 🔄 |
4) LiveCodeBench (Live Coding Problems)
| Model | pass@1 | Problems | Status |
|---|---|---|---|
| MiniMax-M2-BF16 | 35.71% | 182 | ✅ Complete |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 182 | ⏳ Not Started Yet |
📈 Analysis (Preliminary)
Key Findings
Multiple-Choice Q&A Performance Drop
- THRIFT-BF16 shows a -5.44% overall MMLU drop
- Largest drop: arc_challenge (-12.20%)
- Smallest drop: winogrande (-1.58%)
- RTE improved by +4.69% 🎉
Subject-Specific Performance (MMLU)
- Best preservation: Social Sciences (-3.18%)
- Most degraded: Humanities (-7.31%), followed by Other (-7.02%)
- STEM: moderate drop (-3.30%)
Compression Trade-off
- THRIFT-BF16 (compressed) vs BF16 (base)
- Average accuracy loss: ~4–5%
- In line with what is typically observed for pruned/compressed models
MMLU Category Breakdown
| Category | BF16 (Base) | THRIFT-BF16 | Difference | Status |
|---|---|---|---|---|
| High School Government | 97.93% | 94.82% | -3.11% | ✅ Still Excellent |
| High School Psychology | 95.41% | 93.58% | -1.83% | ✅ Well Preserved |
| Marketing | 95.73% | 91.88% | -3.85% | ✅ Good |
| Professional Medicine | 92.28% | 79.78% | -12.50% | ⚠️ Notable Drop |
| Clinical Knowledge | 92.83% | 85.66% | -7.17% | ⚠️ Moderate Drop |
SGLang Deployment with Python
We recommend installing SGLang in a fresh Python environment (venv, conda, or uv) to avoid dependency conflicts.
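For example, a minimal setup with the standard venv module (the .venv directory name is just a convention):
python3 -m venv .venv
source .venv/bin/activate
Then clone and install SGLang from source: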
git clone -b v0.5.4.post1 https://github.com/sgl-project/sglang.git
cd sglang
# Install the python packages
pip install --upgrade pip
pip install -e "python"
Run the following command to start the SGLang server. SGLang will automatically download and cache the MiniMax-M2 model from Hugging Face.
4-GPU deployment command:
python -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M2 \
--tp-size 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--host 0.0.0.0 \
--trust-remote-code \
--port 8000 \
--mem-fraction-static 0.85
8-GPU deployment command:
python -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M2 \
--tp-size 8 \
--ep-size 8 \
--tool-call-parser minimax-m2 \
--trust-remote-code \
--host 0.0.0.0 \
--reasoning-parser minimax-append-think \
--port 8000 \
--mem-fraction-static 0.85
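The launch commands above point at the base MiniMaxAI/MiniMax-M2 repository. To serve the pruned checkpoint instead, swap in the THRIFT repository id; we assume the same parsers and flags carry over, and you should adjust --tp-size and --mem-fraction-static to your hardware:
python -m sglang.launch_server \
--model-path VibeStudio/MiniMax-M2-THRIFT \
--tp-size 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--host 0.0.0.0 \
--trust-remote-code \
--port 8000 \
--mem-fraction-static 0.85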
Testing Deployment
After startup, you can test the SGLang OpenAI-compatible API with the following command:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M2",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
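You can also confirm the server is up and check the registered model id with the standard OpenAI-compatible models endpoint, which SGLang exposes alongside /v1/chat/completions:
curl http://localhost:8000/v1/models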
Benchmarks
Coming soon.
Research paper
Coming soon.
License
This model is derived from MiniMax-M2 and is distributed under the MIT License (https://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE).
Credits
Model conversion and HF Transformers code by @Qubitum at ModelCloud.