Model Overview

  • Model Architecture: Kimi-K2.5
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI300/MI325/MI350/MI355
  • ROCm: 7.1.0
  • Operating System(s): Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark
    • Weight quantization: MOE-only, INT4 Per-Channel & FP8E4M3, Static
    • Activation quantization: MOE-only, FP8E4M3, Dynamic

This model was built from the Kimi-K2.5 model by applying AMD-Quark for INT4-FP8 quantization.

Model Quantization

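The model was quantized from moonshotai/Kimi-K2.5 using AMD-Quark. Only the MoE layers are quantized: their weights use static INT4 per-channel and FP8E4M3 quantization, and their activations use dynamic FP8E4M3 quantization.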
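To make the scheme more concrete, the following is a minimal sketch of symmetric per-channel INT4 weight quantization and of a dynamic activation scale for FP8. It is an illustration only and not AMD-Quark's implementation. The function names, the choice of one scale per weight row, and the FP8 maximum of 448 (the OCP E4M3 value; some hardware variants use a different range) are assumptions made for this example.

```python
# Illustrative sketch only; this is NOT AMD-Quark's implementation.
INT4_MIN, INT4_MAX = -8, 7
FP8_E4M3_MAX = 448.0  # assumption: largest finite value of OCP FP8 E4M3


def quantize_int4_per_channel(weight):
    """Symmetric INT4 quantization with one static scale per row (channel)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Map the largest magnitude in the channel onto the INT4 maximum.
        scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
        q_rows.append(
            [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in row]
        )
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate floating-point weights from INT4 codes."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def dynamic_fp8_scale(activations):
    """Per-tensor scale computed at run time from the current activations."""
    max_abs = max(abs(v) for row in activations for v in row)
    return max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0


weight = [[0.7, -0.35, 0.1], [0.02, 0.014, -0.006]]
q, scales = quantize_int4_per_channel(weight)
restored = dequantize(q, scales)
```

"Static" here means the weight scales are computed once, offline. "Dynamic" means the activation scale is recomputed for each input at inference time.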

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.
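Once a vLLM server for this model is running, it can be queried through vLLM's OpenAI-compatible completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on localhost at port 8000; the helper names `build_completion_request` and `complete` are hypothetical and chosen for this example.

```python
import json
import urllib.request

# Assumption: the server runs locally on port 8000.
BASE_URL = "http://localhost:8000/v1/completions"


def build_completion_request(prompt, max_tokens=64, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": "amd/Kimi-K2.5-W4A8",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first generated completion text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("The capital of France is", max_tokens=8)` would return the generated continuation as a string.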

Evaluation

The model was evaluated on the GSM8K benchmark.

Accuracy

| Benchmark | Kimi-K2.5 | Kimi-K2.5-W4A8 (this model) | Recovery |
|---|---|---|---|
| GSM8K (flexible-extract) | 94.09 | 93.40 | 99.27% |
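The Recovery figure is consistent with the ratio of the quantized score to the baseline score, expressed as a percentage. A small sketch of that arithmetic, where the function name `recovery_percent` is hypothetical:

```python
def recovery_percent(quantized_score, baseline_score):
    """Quantized accuracy as a percentage of the baseline accuracy."""
    return 100.0 * quantized_score / baseline_score


# GSM8K (flexible-extract) scores from the table above.
gsm8k_recovery = recovery_percent(93.40, 94.09)
print(f"{gsm8k_recovery:.2f}%")  # prints 99.27%
```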

Reproduction

The GSM8K results were obtained with the lm-evaluation-harness framework, running in a container based on the Docker image vllm/vllm-openai-rocm:v0.14.0.

First, install vLLM (commit ecb4f822091a64b5084b3a4aff326906487a363f) and lm-eval (version 0.4.10) in the container.

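# Build vLLM from source at the commit used for the evaluation
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout ecb4f822091a64b5084b3a4aff326906487a363f
python3 setup.py develop

# Install the lm-eval version used for the evaluation
pip install lm-eval==0.4.10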

Launching server

VLLM_ROCM_USE_AITER_MLA=0 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0 vllm serve amd/Kimi-K2.5-W4A8 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --enforce-eager
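Loading a model of this size can take a while, so it may help to wait until the server reports ready before starting the evaluation. The following is a minimal sketch that polls the server's `/health` endpoint. The helper name `wait_for_server`, the default URL on port 8000, and the timeout values are assumptions for this example.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(
    health_url="http://localhost:8000/health",  # assumption: default port 8000
    timeout_s=1800.0,
    interval_s=10.0,
):
    """Poll the health endpoint until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not reachable yet (or not healthy); retry after a pause.
            pass
        time.sleep(interval_s)
    return False
```

Calling `wait_for_server()` returns `True` as soon as the endpoint answers with status 200, and `False` if the timeout expires first.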

Evaluating model in a new terminal

lm_eval \
  --model local-completions \
  --model_args "model=amd/Kimi-K2.5-W4A8,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1

License

Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.
