
THRIFT — Targeted Reduction for Inference and Fine-Tuning

Developed by VibeStud.io, THRIFT is a performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks.

TLDR

We, the over-caffeinated researchers at VibeStud.io, set out to build a 50% pruned version of the SOTA MiniMax M2 that is best suited for local/air-gapped coding. This release achieves roughly 25% pruning; a 50% pruned version is under development, and another (not-so-sucky) team of ours is working on a 50% pruned version of Kimi K2 Thinking. We're writing the paper and expanding the evaluation set to substantiate the results. Check back later, cheers!

Why it’s useful

  • Lower latency: Snappier responses for interactive apps and chatbots.
  • Smaller memory footprint: Runs on cheaper GPUs or with fewer resources per replica.
  • Higher throughput: Serve more concurrent users at the same cost.
  • Deployment-friendly: Drop-in replacement for the base model in most inference stacks.
  • Adaptable: Supports light fine-tuning to match your domain and style guidelines.

Intended use

  • General chat and coding assistance
  • Enterprise assistants with strict latency/VRAM budgets
  • Batch or realtime serving in cloud and on-prem environments
  • Edge or cost-sensitive deployments where efficiency matters

When to use it

  • You’re constrained by GPU memory or need shorter response times
  • You want to increase QPS without scaling infrastructure
  • You need a model that is “good enough” for most tasks at a better cost profile

Model Comparison Report

Models Under Evaluation

| Model | Type |
|-------|------|
| ModelCloud/MiniMax-M2-BF16 | Base Model |
| VibeStudio/MiniMax-M2-THRIFT | Compressed/Optimized |

Evaluation Date: November 7, 2025

📊 Results Comparison

1) Multiple Choice Q&A (lm-eval)

Overall MMLU Performance

| Model | MMLU Overall | Humanities | STEM | Social Sciences | Other |
|-------|--------------|------------|------|-----------------|-------|
| MiniMax-M2-BF16 | 83.16% | 77.45% | 80.91% | 90.02% | 87.29% |
| MiniMax-M2-THRIFT | 77.72% | 70.14% | 77.61% | 86.84% | 80.27% |
| Δ (Difference) | -5.44% | -7.31% | -3.30% | -3.18% | -7.02% |

Individual Task Performance

| Task | BF16 (Base) | THRIFT-BF16 | Difference |
|------|-------------|-------------|------------|
| arc_challenge | 73.21% | 61.01% | -12.20% ⬇️ |
| arc_easy | 88.30% | 83.08% | -5.22% ⬇️ |
| boolq | 87.95% | 84.95% | -3.00% ⬇️ |
| hellaswag | 83.00% | 77.09% | -5.91% ⬇️ |
| mmlu | 83.16% | 77.72% | -5.44% ⬇️ |
| openbookqa | 48.60% | 43.00% | -5.60% ⬇️ |
| rte | 75.45% | 80.14% | +4.69% ⬆️ |
| winogrande | 76.48% | 74.90% | -1.58% ⬇️ |

Average Accuracy Drop: -4.28%
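
The exact harness invocation is not published here; the following is a minimal sketch of how these tasks could be re-run with the lm-evaluation-harness Python API (the vLLM backend, tensor-parallel degree, and default few-shot settings are assumptions, not the configuration used for the numbers above):

# Hypothetical reproduction sketch using lm-evaluation-harness (pip install lm-eval).
# The backend and parallelism settings below are assumptions, not the exact setup used above.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",  # "hf" also works, but is slower for a model of this size
    model_args=(
        "pretrained=VibeStudio/MiniMax-M2-THRIFT,"
        "trust_remote_code=True,tensor_parallel_size=4"
    ),
    tasks=["mmlu", "arc_challenge", "arc_easy", "boolq",
           "hellaswag", "openbookqa", "rte", "winogrande"],
)

# Print per-task accuracy from the results dictionary
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))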

2) Code Generation (EvalPlus)

MBPP Results

| Model | MBPP (base) | MBPP+ (extended) |
|-------|-------------|------------------|
| MiniMax-M2-BF16 | 73.8% | 64.0% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |

HumanEval Results

| Model | HumanEval (base) | HumanEval+ (extended) |
|-------|------------------|-----------------------|
| MiniMax-M2-BF16 | ✅ Complete | ✅ Complete |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
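
The pending THRIFT numbers can, in principle, be reproduced against the OpenAI-compatible endpoint from the deployment section below. A hedged sketch using EvalPlus's documented data helpers (the server URL, model name, and single-sample greedy generation are assumptions, and a real run would extract code from the reply):

# Hypothetical EvalPlus sample-generation sketch (pip install evalplus openai).
# Assumes an OpenAI-compatible server such as the SGLang deployment described below.
from openai import OpenAI
from evalplus.data import get_human_eval_plus, write_jsonl

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def generate_solution(prompt: str) -> str:
    # One greedy sample per problem; sampling parameters are placeholders.
    resp = client.chat.completions.create(
        model="VibeStudio/MiniMax-M2-THRIFT",  # match the --model-path used at launch
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
# Score afterwards with:  evalplus.evaluate --dataset humaneval --samples samples.jsonl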

3) Math Benchmarks

GSM8K Results

| Model | Accuracy | Problems |
|-------|----------|----------|
| MiniMax-M2-BF16 | 92.72% | 1,319 |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 1,319 |

MATH-500 Results

| Model | Overall | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|-------|---------|---------|---------|---------|---------|---------|
| MiniMax-M2-BF16 | 87.2% | 90.7% | 95.56% | 82.86% | 85.16% | 85.82% |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 | 🔄 | 🔄 | 🔄 | 🔄 |

4) LiveCodeBench (Live Coding Problems)

| Model | pass@1 | Problems | Status |
|-------|--------|----------|--------|
| MiniMax-M2-BF16 | 35.71% | 182 | ✅ Complete |
| MiniMax-M2-THRIFT | 🔄 Coming Soon | 182 | ⏳ Not Started Yet |

📈 Analysis (Preliminary)

Key Findings

Benchmark Performance Drop

  • THRIFT-BF16 shows a -5.44% drop on overall MMLU
  • Largest drop across individual tasks: arc_challenge (-12.20%)
  • Smallest drop: winogrande (-1.58%)
  • RTE improved by +4.69% 🎉

Subject-Specific Performance

  • Best preservation: Social Sciences (-3.18%)
  • Most degraded: Humanities (-7.31%), followed by Other (-7.02%)
  • STEM: moderate drop (-3.30%)

Compression Trade-off

  • THRIFT-BF16 (compressed) vs BF16 (base)
  • Average accuracy loss: ~4–5%
  • An accuracy loss in this range is expected for pruned/compressed models
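
For a rough sense of the memory side of this trade-off, a back-of-the-envelope sketch (the ~230B base parameter count and weights-only BF16 storage are assumptions; real serving adds KV cache and activation overhead):

# Rough, hypothetical weight-memory estimate; parameter counts are assumptions,
# not measurements (the THRIFT figure matches the ~173B reported for this checkpoint).
BYTES_PER_PARAM_BF16 = 2

base_params = 230e9    # assumed total parameters of the base MiniMax-M2
thrift_params = 173e9  # parameters of the THRIFT checkpoint

base_gb = base_params * BYTES_PER_PARAM_BF16 / 1e9      # ~460 GB of weights
thrift_gb = thrift_params * BYTES_PER_PARAM_BF16 / 1e9   # ~346 GB of weights

print(f"Base weights:   ~{base_gb:.0f} GB")
print(f"THRIFT weights: ~{thrift_gb:.0f} GB")
print(f"Parameter reduction: ~{100 * (1 - thrift_params / base_params):.0f}%")  # ~25%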

MMLU Category Breakdown

| Category | BF16 (Base) | THRIFT-BF16 | Difference | Status |
|----------|-------------|-------------|------------|--------|
| High School Government | 97.93% | 94.82% | -3.11% | ✅ Still Excellent |
| High School Psychology | 95.41% | 93.58% | -1.83% | ✅ Well Preserved |
| Marketing | 95.73% | 91.88% | -3.85% | ✅ Good |
| Professional Medicine | 92.28% | 79.78% | -12.50% | ⚠️ Notable Drop |
| Clinical Knowledge | 92.83% | 85.66% | -7.17% | ⚠️ Moderate Drop |

sglang Deployment with Python

We recommend installing SGLang in a fresh Python environment (venv, conda, or uv) to avoid dependency conflicts:

git clone -b v0.5.4.post1 https://github.com/sgl-project/sglang.git
cd sglang

# Install the python packages
pip install --upgrade pip
pip install -e "python"

Run the following command to start the SGLang server. SGLang will automatically download and cache the model from Hugging Face. The commands below use the base MiniMaxAI/MiniMax-M2 path; to serve this model, substitute VibeStudio/MiniMax-M2-THRIFT as the --model-path.

4-GPU deployment command:

python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2 \
    --tp-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 8000 \
    --mem-fraction-static 0.85

8-GPU deployment command:

python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2 \
    --tp-size 8 \
    --ep-size 8 \
    --tool-call-parser minimax-m2 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --reasoning-parser minimax-append-think \
    --port 8000 \
    --mem-fraction-static 0.85

Testing Deployment

After startup, you can test the SGLang OpenAI-compatible API with the following command:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMaxAI/MiniMax-M2",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
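
The same endpoint can also be called from Python with the openai client; a minimal sketch (the base_url, port, and model name assume the launch commands above):

# Minimal Python client sketch for the SGLang OpenAI-compatible server started above.
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2",  # use the same value passed to --model-path
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
)

print(response.choices[0].message.content)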

Benchmarks

Coming soon.

Research paper

Coming soon.


License

This model is derived from MiniMax-M2 and distributed under the MIT License http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE


Credits

Model conversion and HF Transformers code by @Qubitum at ModelCloud.

References (BibTeX)

@article{cai2025thinking,
  title        = {Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series},
  author       = {Cai, Wenrui and Wang, Chengyu and Yan, Junbing and Huang, Jun and Fang, Xiangzhong},
  journal      = {arXiv preprint arXiv:2511.01354},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2511.01354},
  primaryclass = {cs.CL},
  institution  = {Shanghai Jiao Tong University and Alibaba Cloud Computing},
  note         = {License: arXiv.org perpetual non-exclusive license}
}

@misc{lasby-reap,
    title       = {{REAP the Experts: Why Pruning Prevails for One-Shot MoE compression}},
    author      = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
    year        = {2025},
    publisher   = {arXiv},
    note        = {arXiv:2510.13999v1 [cs]},
    url         = {https://arxiv.org/abs/2510.13999v1}, 
}

@article{yang2025wanda++,
  title        = {Wanda++: Pruning Large Language Models via Regional Gradients},
  author       = {Yang, Yifan and Zhen, Kai and Ganesh, Bhavana and Galstyan, Aram and Huybrechts, Goeric and M{\"u}ller, Markus and K{\"u}bler, Jonas M. and Swaminathan, Rupak Vignesh and Mouchtaris, Athanasios and Bodapati, Sravan Babu and Susanj, Nathan and Zhang, Zheng and FitzGerald, Jack and Kumar, Abhishek},
  journal      = {arXiv preprint arXiv:2503.04992},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2503.04992},
  primaryclass = {cs.CL}
}

@article{li2025tyr,
  title        = {Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization},
  author       = {Li, G. and Xu, Yixing and Li, Zeping and Liu, Ji and Yin, Xuanwu and Li, Dong and Barsoum, Emad},
  journal      = {arXiv preprint arXiv:2503.09657},
  year         = {2025},
  eprinttype   = {arXiv},
  eprint       = {2503.09657},
  primaryclass = {cs.CL}
}

@article{xia2023sheared,
  title        = {Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning},
  author       = {Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi},
  journal      = {arXiv preprint arXiv:2310.06694},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2310.06694},
  primaryclass = {cs.CL}
}

@article{ma2023llmpruner,
  title        = {LLM-Pruner: On the Structural Pruning of Large Language Models},
  author       = {Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  journal      = {arXiv preprint arXiv:2305.11627},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2305.11627},
  primaryclass = {cs.CL}
}

@article{sun2023wanda,
  title        = {A Simple and Effective Pruning Approach for Large Language Models},
  author       = {Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J. Zico},
  journal      = {arXiv preprint arXiv:2306.11695},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2306.11695},
  primaryclass = {cs.CL}
}

@article{frantar2023sparsegpt,
  title        = {SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot},
  author       = {Frantar, Elias and Alistarh, Dan},
  journal      = {arXiv preprint arXiv:2301.00774},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2301.00774},
  primaryclass = {cs.CL}
}

@article{dettmers2023qlora,
  title        = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author       = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal      = {arXiv preprint arXiv:2305.14314},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2305.14314},
  primaryclass = {cs.CL}
}

Model size: 173B parameters (Safetensors, BF16/F32 tensors)