RedHatAI/Kimi-K2.6-FP8-BLOCK
Model Overview
- Model Architecture: moonshotai/Kimi-K2.6 (KimiK25ForConditionalGeneration)
- Input: Text, image, and video
- Output: Text
- Weight Quantization: FP8 (block-wise scaling)
- Activation Quantization: FP8 (dynamic grouped scaling)
- Release Date: 2026-04-29
- Model Developers: RedHatAI
This model is a quantized variant of moonshotai/Kimi-K2.6, exported in compressed-tensors format for vLLM deployment and evaluated on instruction-following, reasoning, function-calling, and agentic coding workloads.
Model Optimizations
This checkpoint applies FP8 block quantization to the weights of transformer linear layers and FP8 dynamic quantization to their activations. The resulting representation is optimized for high-throughput serving while staying close to the base model's accuracy: in a preliminary GSM8K Platinum check it recovered 99.2% of the moonshotai/Kimi-K2.6 score.
The model is exported in compressed-tensors format and is intended for OpenAI-compatible inference with vLLM.
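To make "block-wise scaling" concrete, the following is a minimal sketch of how one scale per weight tile can be derived so that every tile fits the FP8 E4M3 range. It is not the LLM Compressor implementation: the block size, the toy weights, and the helper names `block_scales` and `scale_to_fp8_range` are assumptions chosen for illustration, and rounding to actual FP8 values is omitted.

```python
# Sketch only: per-block scale computation for FP8 (E4M3) block quantization.
# The block size and toy weights are illustrative assumptions, not values
# read from this checkpoint. Rounding to real FP8 values is not modeled.

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 (fn) format


def block_scales(weight, block):
    """Return one scale per (block x block) tile of a 2-D list of floats."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            tile_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # Guard against all-zero tiles so the scale is never zero.
            row_scales.append(max(tile_max, 1e-12) / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales


def scale_to_fp8_range(weight, scales, block):
    """Divide each element by its tile scale so it lies within +/-448."""
    return [
        [w / scales[r // block][c // block] for c, w in enumerate(row)]
        for r, row in enumerate(weight)
    ]


weights = [
    [0.5, -1.0, 8.0, 2.0],
    [0.25, 0.75, -4.0, 1.0],
    [16.0, 2.0, 0.1, 0.2],
    [-3.0, 1.0, 0.05, -0.4],
]
scales = block_scales(weights, block=2)
scaled = scale_to_fp8_range(weights, scales, block=2)
print(scales)
```

Because each tile gets its own scale, a single large weight (16.0 in the lower-left tile here) only reduces the resolution of its own tile rather than of the whole matrix. Dynamic activation scaling follows the same idea, except that the scales are computed from the activations at inference time instead of being stored in the checkpoint.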
Creation
This model was quantized with LLM Compressor and exported as compressed-tensors. The script below is a representative reference script aligned with the published quantization configuration.
Reference quantization script (FP8 block)
from compressed_tensors.entrypoints.convert import CompressedTensorsDequantizer
from llmcompressor import model_free_ptq

MODEL_ID = "moonshotai/Kimi-K2.6"
SAVE_DIR = "Kimi-K2.6-FP8-BLOCK"

ignore = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*vision_tower.*",
    "re:.*embed_tokens$",
    "re:.*norm$",
    "re:.*mm_projector.*",
    "re:.*vision.*",
]

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=ignore,
    converter=CompressedTensorsDequantizer(
        MODEL_ID,
        quant_config_key="text_config.quantization_config",
        ignore=ignore,
    ),
    max_workers=2,
    device="cuda:0",
)
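The `ignore` list in the script above decides which modules stay unquantized. The sketch below shows how those patterns could be checked against module names. It assumes that the `re:` prefix marks a regular expression applied with `re.match` semantics and that unprefixed entries are exact names; the module names and the `is_ignored` helper are hypothetical and are not taken from the checkpoint.

```python
import re

# Same patterns as in the reference quantization script.
IGNORE = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*vision_tower.*",
    "re:.*embed_tokens$",
    "re:.*norm$",
    "re:.*mm_projector.*",
    "re:.*vision.*",
]


def is_ignored(name, patterns=IGNORE):
    """Return True if a module name matches any ignore entry.

    Assumption: "re:" entries are regexes matched from the start of the
    name; other entries must equal the name exactly.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            return True
    return False


# Hypothetical module names, for illustration only.
candidates = [
    "language_model.model.layers.3.mlp.gate",
    "language_model.model.layers.3.mlp.experts.0.gate_proj",
    "language_model.model.layers.3.self_attn.q_a_proj",
    "language_model.model.layers.3.self_attn.q_b_proj",
    "language_model.model.layers.3.input_layernorm",
    "vision_tower.encoder.blocks.0.wqkv",
    "language_model.lm_head",
]
for name in candidates:
    status = "skipped" if is_ignored(name) else "quantized"
    print(f"{status:9s} {name}")
```

Under these assumptions the trailing `$` matters: `re:.*mlp.gate$` skips the MoE router (`mlp.gate`) but still lets expert projections such as `gate_proj` be quantized.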
Deployment
Use with vLLM
vllm serve RedHatAI/Kimi-K2.6-FP8-BLOCK \
  --trust-remote-code \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="RedHatAI/Kimi-K2.6-FP8-BLOCK",
    messages=[{"role": "user", "content": "Explain how transformers use attention."}],
)
print(resp.choices[0].message.content)
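Since the server command enables automatic tool choice, a request can also carry tool definitions. The following is a rough sketch of such a request body using the OpenAI-compatible `tools` field and only the Python standard library. The `get_weather` tool, its schema, and the helper names `build_tool_request` and `send` are hypothetical and exist only for illustration.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"  # same endpoint as the client above
MODEL = "RedHatAI/Kimi-K2.6-FP8-BLOCK"


def build_tool_request(question):
    """Build a chat-completions payload with one hypothetical tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    # Hypothetical tool, not part of the model or vLLM.
                    "name": "get_weather",
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def send(payload):
    """POST the payload to the running server and return the parsed reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer EMPTY",
        },
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)


payload = build_tool_request("What is the weather in Paris?")
print(json.dumps(payload, indent=2))
```

With the server running, `send(payload)["choices"][0]["message"]` should contain either a `tool_calls` entry or plain `content`, depending on whether the model decides to call the tool.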
Evaluation
We evaluated this model with lm-evaluation-harness, lighteval, BFCL v4, and SWE-Bench Lite. For these runs the model was served through a vLLM (0.22.1) OpenAI-compatible endpoint.
| Category | Benchmark | Score |
|---|---|---|
| Reasoning and instruction following | AIME25 (pass@1, avg@8) | 96.25% |
| Reasoning and instruction following | GPQA Diamond (pass@1, avg@3) | 89.39% |
| Reasoning and instruction following | MATH-500 (pass@1, avg@3) | 94.27% |
| Reasoning and instruction following | MMLU-Pro Chat (custom-extract, avg@3) | 86.55% |
| Reasoning and instruction following | GSM8K Platinum CoT (strict-match, avg@3) | 93.13% |
| Reasoning and instruction following | GSM8K Platinum CoT (flexible-extract, avg@3) | 96.31% |
| Reasoning and instruction following | IFEval (prompt-level strict, avg@3) | 95.63% |
| Reasoning and instruction following | IFEval (instruction-level strict, avg@3) | 96.92% |
| Agentic function calling (accuracy) | BFCL v4 non_live | 85.46% |
| Agentic function calling (accuracy) | BFCL v4 live | 79.50% |
| Agentic function calling (accuracy) | BFCL v4 multi_turn | 60.50% |
| Agentic function calling (accuracy) | BFCL v4 memory | 57.42% |
| Agentic function calling (accuracy) | BFCL v4 web_search | 45.00% |
| Agentic coding | SWE-Bench Lite (dev) | 34.78% |
BFCL rows report category accuracy. SWE-Bench follows the official harness score style. For run transparency: 8 of 23 tasks were resolved, and 22 instances produced non-empty graded patches.
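The SWE-Bench Lite score in the table follows directly from the counts stated above, as this short check shows (the variable names are illustrative):

```python
# Resolved rate from the counts reported for the SWE-Bench Lite dev run.
resolved = 8
total = 23

resolved_rate = resolved / total
print(f"{resolved_rate:.2%}")  # 34.78%
```

With only 23 instances in the dev split, a single additional resolved task shifts the score by roughly 4.3 percentage points, so the number is best read as a coarse signal.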
Historical preliminary check
| Benchmark | Base model (moonshotai/Kimi-K2.6) | This model |
|---|---|---|
| GSM8K Platinum accuracy | 94.29% | 93.55% |
| Recovery | - | 99.2% |
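The recovery figure is the quantized score expressed as a percentage of the base score. A quick check with the two accuracies from the table (variable names are illustrative):

```python
# Recovery = quantized accuracy / base accuracy, as a percentage.
base_accuracy = 94.29       # moonshotai/Kimi-K2.6, GSM8K Platinum
quantized_accuracy = 93.55  # this model, GSM8K Platinum

recovery = quantized_accuracy / base_accuracy * 100
print(f"{recovery:.1f}%")  # 99.2%
```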
Reproduction
Representative commands used to produce and aggregate these runs:
vLLM + lm-eval (example)
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Kimi-K2.6-FP8-BLOCK,max_length=40960,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=1234"
lighteval config used
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Kimi-K2.6-FP8-BLOCK"
  base_url: "http://127.0.0.1:8000/v1"
  api_key: "EMPTY"
  timeout: 3600
  max_model_length: 40960
  concurrent_requests: 8
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    seed: 1234
    top_k: 20
    presence_penalty: 1.5

lighteval endpoint litellm litellm_config.yaml \
  "aime25@1@8|0,math_500@1@3|0,gpqa:diamond@1@3|0" \
  --output-dir results_lighteval \
  --save-details
BFCL v4 and SWE-Bench Lite scripts
# BFCL categories: non_live, live, multi_turn, memory, web_search
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 non_live
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 live
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 multi_turn
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 memory
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 web_search

# SWE-Bench Lite dev (full split)
SWEBENCH_SUBSET=lite SWEBENCH_SPLIT=dev SWEBENCH_SLICE= \
  ./scripts/swebench/run_swebench_lite_local.sh kimi_fp8

# Official SWE-bench resolved-rate evaluation
/home/shubhra/environments/mini-swe-agent/bin/python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-Bench_Lite \
  --split dev \
  --predictions_path /home/shubhra/kimik2.6_evals/results/swebench_resolved_eval/kimi_fp8_lite_dev_preds_merged.json \
  --max_workers 4 \
  --run_id kimi_fp8_lite_dev_20260701_resolved
Most lm-eval/lighteval tasks were run with 3 seeds and then averaged; AIME25 was run with 8 seeds. BFCL v4 and SWE-Bench Lite numbers come from the aggregated run artifacts listed below.
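The seed averaging described above amounts to an arithmetic mean over per-seed scores. A minimal sketch follows; the `average_runs` helper is hypothetical and the per-seed values are placeholders for illustration only, not results from this model card.

```python
from statistics import mean


def average_runs(scores):
    """Average per-seed scores for one benchmark (avg@N)."""
    if not scores:
        raise ValueError("at least one run is required")
    return mean(scores)


# Placeholder per-seed accuracies, NOT measured results.
example_runs = [0.90, 0.92, 0.94]
average = average_runs(example_runs)
print(f"avg@{len(example_runs)} = {average:.4f}")
```

For AIME25 the same helper would take eight per-seed scores instead of three.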
Every Eval Ever Artifacts
- every_eval_ever/aime25.json
- every_eval_ever/gpqa_diamond.json
- every_eval_ever/gsm8k_platinum_cot_llama.json
- every_eval_ever/ifeval.json
- every_eval_ever/math_500.json
- every_eval_ever/mmlu_pro_chat.json
- every_eval_ever/bfcl_v4.json
- every_eval_ever/swebench_lite_dev.json