# Kimi-K2-Instruct-0905 MLX 5-bit
This is a 5-bit quantized version of moonshotai/Kimi-K2-Instruct-0905 converted to MLX format for efficient inference on Apple Silicon.
## Model Details
- Original Model: Kimi-K2-Instruct-0905 by Moonshot AI
- Quantization: 5-bit (5.502 effective bits per weight)
- Framework: MLX (Apple's machine-learning framework for Apple Silicon)
- Model Size: ~658GB (5-bit quantized; see the size estimate below)
- Optimized for: Apple Silicon (M1/M2/M3/M4 chips)
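As a sanity check on the size figure, the on-disk footprint is roughly total parameters × bits per weight / 8. A back-of-envelope sketch (Kimi K2's ~1T total parameter count comes from the upstream model card; treat all numbers as approximate):

```python
# Rough size estimate for the quantized weights (approximate figures)
total_params = 1.0e12        # Kimi K2 is a ~1T-parameter MoE model (upstream card)
bits_per_weight = 5.502      # effective bits after quantization overhead

size_bytes = total_params * bits_per_weight / 8
print(f"~{size_bytes / 1024**3:.0f} GiB")  # ~640 GiB, in the ballpark of the ~658GB above
```

The exact number depends on the true parameter count and on which layers (embeddings, norms) are left unquantized.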
## Quantization Options
This model is available in multiple quantization levels:
- 8-bit - Highest quality, larger size
- 6-bit - Excellent balance of quality and size
- 5-bit (this model) - Very good quality with reduced size
- 4-bit - Lower memory usage
- 3-bit - Compact with acceptable quality
- 2-bit - Smallest size, fastest inference
## Usage

### Installation

```bash
pip install mlx-lm
```
### Basic Usage
```python
from mlx_lm import load, generate

# Load the quantized model and tokenizer from the Hub
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-5bit")

# Generate text from a plain prompt
prompt = "Hello, please introduce yourself."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```
### Chat Format
```python
# Reuse the model and tokenizer loaded above
messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

# Wrap the conversation in the model's chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```
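For interactive use you may prefer token-by-token output. mlx_lm also ships a `stream_generate` helper; a minimal sketch (the shape of the yielded objects varies across mlx-lm versions, so check the docs for your installed version):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-5bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Print tokens as they are produced instead of waiting for the full response
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=256):
    print(chunk.text, end="", flush=True)  # recent versions yield objects with a .text field
print()
```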
## Performance Considerations
**5-bit Quantization Trade-offs:**
- ✅ Very good quality retention (~92% of original model quality)
- ✅ Great balance between size and performance
- ✅ Smaller than 6-bit with minimal quality loss
- ✅ Suitable for production use
- ⚡ Good inference speed on Apple Silicon
**Recommended Use Cases:**
- Production deployments balancing quality and efficiency
- General-purpose applications
- When you need better quality than 4-bit at a smaller size than 6-bit
- Resource-constrained environments
**Why Choose 5-bit:** The 5-bit quantization offers an excellent middle ground between the higher-quality 6-bit and the more compact 4-bit versions. It retains near-original model quality while using less memory than 6-bit, making it the practical choice when the 6-bit variant does not fit in memory.
## System Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 13.0 or later
- Enough unified memory to hold the ~658GB of quantized weights plus runtime overhead; this exceeds any single current Mac configuration, so distributed inference across multiple machines may be required (a quick pre-flight check is sketched below)
- Python 3.8+
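Before attempting a load, a quick pre-flight check of the hardware can save a long wait. A small sketch, assuming macOS (where the `sysctl hw.memsize` key reports total unified memory):

```python
import platform
import subprocess

# Verify Apple Silicon and report unified memory (macOS-only sysctl key)
assert platform.machine() == "arm64", "Apple Silicon (arm64) required"

mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
print(f"Unified memory: {mem_bytes / 1024**3:.0f} GiB")  # compare against the ~658GB of weights
```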
## Conversion Details
This model was quantized using MLX's conversion tools:

```bash
mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path ./Kimi-K2-Instruct-0905-MLX-5bit \
  -q --q-bits 5 \
  --trust-remote-code
```
Actual quantization: 5.502 bits per weight
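The gap between the nominal 5 bits and the measured 5.502 bits is consistent with group-wise quantization overhead: MLX stores a scale and bias per group of weights. A sketch assuming the default group size of 64 and fp16 scales/biases (both assumptions; check your conversion settings):

```python
# Effective bits per weight under group-wise quantization (illustrative)
q_bits = 5
group_size = 64                 # mlx_lm default --q-group-size (assumption)
scale_bits, bias_bits = 16, 16  # one fp16 scale and bias per group (assumption)

effective_bpw = q_bits + (scale_bits + bias_bits) / group_size
print(effective_bpw)  # 5.5, close to the reported 5.502
```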
## License
This model follows the same license as the original Kimi-K2-Instruct-0905 model. Please refer to the original model card for license details.
## Citation
If you use this model, please cite the original Kimi model:
```bibtex
@misc{kimi-k2-instruct,
  title={Kimi K2 Instruct},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```
## Acknowledgments
- Original model by Moonshot AI
- Quantization performed using MLX by Apple
- Conversion and hosting by richardyoung
## Contact
For issues or questions about this quantized version, please open an issue on the model repository.