# Kimi-K2-Instruct-0905 MLX 5-bit
This is a 5-bit quantized version of moonshotai/Kimi-K2-Instruct-0905 converted to MLX format for efficient inference on Apple Silicon.
## Model Details
- Original Model: Kimi-K2-Instruct-0905 by Moonshot AI
- Quantization: 5-bit (5.502 effective bits per weight)
- Framework: MLX (Apple's machine-learning framework for Apple Silicon)
- Model Size: ~658GB (5-bit quantized; see the size estimate below)
- Optimized for: Apple Silicon (M1/M2/M3/M4 chips)
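As a sanity check on the size figure, the on-disk footprint is roughly total parameters × bits per weight / 8. A back-of-envelope sketch (Kimi K2's ~1T total parameter count comes from the upstream model card; treat all numbers as approximate):

```python
# Rough size estimate for the quantized weights (approximate figures)
total_params = 1.0e12        # Kimi K2 is a ~1T-parameter MoE model (upstream card)
bits_per_weight = 5.502      # effective bits after quantization overhead

size_bytes = total_params * bits_per_weight / 8
print(f"~{size_bytes / 1024**3:.0f} GiB")  # ~640 GiB, in the ballpark of the ~658GB above
```

The exact number depends on the true parameter count and on which layers (embeddings, norms) are left unquantized.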
## Quantization Options
This model is available in multiple quantization levels:
- 8-bit - Highest quality, larger size
- 6-bit - Excellent balance of quality and size
- 5-bit (this model) - Very good quality with reduced size
- 4-bit - Lower memory usage
- 3-bit - Compact with acceptable quality
- 2-bit - Smallest size, fastest inference
## Usage

### Installation

```bash
pip install mlx-lm
```
### Basic Usage
```python
from mlx_lm import load, generate

# Load the quantized model and tokenizer from the Hub
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-5bit")

# Generate text from a plain prompt
prompt = "Hello, please introduce yourself."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```
### Chat Format
```python
# Reuse the model and tokenizer loaded above
messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

# Wrap the conversation in the model's chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```
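For interactive use you may prefer token-by-token output. mlx_lm also ships a `stream_generate` helper; a minimal sketch (the shape of the yielded objects varies across mlx-lm versions, so check the docs for your installed version):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-5bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Print tokens as they are produced instead of waiting for the full response
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=256):
    print(chunk.text, end="", flush=True)  # recent versions yield objects with a .text field
print()
```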
## Performance Considerations
**5-bit Quantization Trade-offs:**
- ✅ Very good quality retention (~92% of original model quality)
- ✅ Great balance between size and performance
- ✅ Smaller than 6-bit with minimal quality loss
- ✅ Suitable for production use
- ⚡ Good inference speed on Apple Silicon
**Recommended Use Cases:**
- Production deployments balancing quality and efficiency
- General-purpose applications
- When you need better quality than 4-bit at a smaller size than 6-bit
- Resource-constrained environments
**Why Choose 5-bit:** The 5-bit quantization offers an excellent middle ground between the higher-quality 6-bit and the more compact 4-bit versions. It retains near-original model quality while using less memory than 6-bit, making it the practical choice when the 6-bit variant does not fit in memory.
## System Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 13.0 or later
- Enough unified memory to hold the ~658GB of quantized weights plus runtime overhead; this exceeds any single current Mac configuration, so distributed inference across multiple machines may be required (a quick pre-flight check is sketched below)
- Python 3.8+
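Before attempting a load, a quick pre-flight check of the hardware can save a long wait. A small sketch, assuming macOS (where the `sysctl hw.memsize` key reports total unified memory):

```python
import platform
import subprocess

# Verify Apple Silicon and report unified memory (macOS-only sysctl key)
assert platform.machine() == "arm64", "Apple Silicon (arm64) required"

mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
print(f"Unified memory: {mem_bytes / 1024**3:.0f} GiB")  # compare against the ~658GB of weights
```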
## Conversion Details
This model was quantized using MLX's conversion tools:

```bash
mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path ./Kimi-K2-Instruct-0905-MLX-5bit \
  -q --q-bits 5 \
  --trust-remote-code
```
Actual quantization: 5.502 bits per weight
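The gap between the nominal 5 bits and the measured 5.502 bits is consistent with group-wise quantization overhead: MLX stores a scale and bias per group of weights. A sketch assuming the default group size of 64 and fp16 scales/biases (both assumptions; check your conversion settings):

```python
# Effective bits per weight under group-wise quantization (illustrative)
q_bits = 5
group_size = 64                 # mlx_lm default --q-group-size (assumption)
scale_bits, bias_bits = 16, 16  # one fp16 scale and bias per group (assumption)

effective_bpw = q_bits + (scale_bits + bias_bits) / group_size
print(effective_bpw)  # 5.5, close to the reported 5.502
```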
## License
This model follows the same license as the original Kimi-K2-Instruct-0905 model. Please refer to the original model card for license details.
## Citation
If you use this model, please cite the original Kimi model:
```bibtex
@misc{kimi-k2-instruct,
  title={Kimi K2 Instruct},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```
## Acknowledgments
- Original model by Moonshot AI
- Quantization performed using MLX by Apple
- Conversion and hosting by richardyoung
## Contact
For issues or questions about this quantized version, please open an issue on the model repository.