Instructions to use RedHatAI/Kimi-K2.6-FP8-BLOCK with libraries, inference providers, notebooks, and local apps.

Libraries

How to use RedHatAI/Kimi-K2.6-FP8-BLOCK with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="RedHatAI/Kimi-K2.6-FP8-BLOCK", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("RedHatAI/Kimi-K2.6-FP8-BLOCK", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use RedHatAI/Kimi-K2.6-FP8-BLOCK with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RedHatAI/Kimi-K2.6-FP8-BLOCK"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/Kimi-K2.6-FP8-BLOCK",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
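
For readers who prefer Python to curl, the request above can be sketched with the standard library only. This is a minimal sketch, assuming the server from the previous step is listening on localhost:8000; `build_chat_request` and `send` are hypothetical helper names introduced here, not part of vLLM.

```python
import json
import urllib.request

# Assumption: a vLLM server started as above is listening on localhost:8000.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "RedHatAI/Kimi-K2.6-FP8-BLOCK"


def build_chat_request(prompt: str, image_url: str, url: str = VLLM_URL) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl command."""
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `send(build_chat_request("Describe this image in one sentence.", image_url))` should mirror the curl call. Passing a different `url` (for example one on port 30000) would target another OpenAI-compatible server.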

Use Docker

docker model run hf.co/RedHatAI/Kimi-K2.6-FP8-BLOCK

SGLang

How to use RedHatAI/Kimi-K2.6-FP8-BLOCK with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RedHatAI/Kimi-K2.6-FP8-BLOCK" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/Kimi-K2.6-FP8-BLOCK",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RedHatAI/Kimi-K2.6-FP8-BLOCK" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/Kimi-K2.6-FP8-BLOCK",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use RedHatAI/Kimi-K2.6-FP8-BLOCK with Docker Model Runner:
```
docker model run hf.co/RedHatAI/Kimi-K2.6-FP8-BLOCK
```

RedHatAI/Kimi-K2.6-FP8-BLOCK

Model Overview

Model Architecture: moonshotai/Kimi-K2.6 (KimiK25ForConditionalGeneration)
Input: Text, image, and video
Output: Text
Weight Quantization: FP8 (block-wise scaling)
Activation Quantization: FP8 (dynamic grouped scaling)
Release Date: 2026-04-29
Model Developers: RedHatAI

This model is a quantized variant of moonshotai/Kimi-K2.6, exported in compressed-tensors format for vLLM deployment and evaluated on instruction-following, reasoning, function-calling, and agentic coding workloads.

Model Optimizations

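This checkpoint applies FP8 block-wise quantization to the weights of the transformer linear layers and dynamic FP8 quantization to their activations. Modules matched by the ignore list in the reference script (routing gates, lm_head, embeddings, norms, the vision tower, and the multimodal projector, among others) are left unquantized. The checkpoint targets high-throughput serving; in the preliminary GSM8K Platinum check reported below, it recovers 99.2% of the base model's accuracy.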

The model is exported in compressed-tensors format and is intended for OpenAI-compatible inference with vLLM.
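
To make block-wise scaling concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not LLM Compressor's implementation: each block gets one scale equal to its largest absolute value divided by 448 (the largest finite E4M3 value), E4M3 rounding is simulated in software, and the 2x2 block size is chosen only for readability (real block sizes are much larger). The function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value (software simulation)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)  # saturate instead of overflowing
    # E4M3 keeps 3 mantissa bits; -6 is the smallest normal exponent.
    exp = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (exp - 3)
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_blockwise(weight, block=2):
    """Quantize then dequantize a 2-D list with one scale per block.

    Returns (dequantized_weight, scales), where scales maps a block index
    (block_row, block_col) to that block's scale.
    """
    rows, cols = len(weight), len(weight[0])
    dequant = [[0.0] * cols for _ in range(rows)]
    scales = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [
                (r, c)
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            amax = max(abs(weight[r][c]) for r, c in cells)
            # Map the block's largest magnitude onto the FP8 maximum.
            scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
            scales[(r0 // block, c0 // block)] = scale
            for r, c in cells:
                dequant[r][c] = round_to_e4m3(weight[r][c] / scale) * scale
    return dequant, scales


# Hypothetical weights: the left block holds small values, the right block large ones.
weights = [[0.5, -0.25, 100.0, -60.0], [0.1, 0.3, 20.0, 80.0]]
dequantized, block_scales = quantize_blockwise(weights, block=2)
```

Because each block has its own scale, the small values in the left block are not forced to share a scale with the large values in the right block, which is the motivation for block-wise over per-tensor scaling.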

Creation

This model was quantized with LLM Compressor and exported in compressed-tensors format. The script below is a representative reference aligned with the published quantization configuration.

Reference quantization script (FP8 block)

from compressed_tensors.entrypoints.convert import CompressedTensorsDequantizer
from llmcompressor import model_free_ptq

MODEL_ID = "moonshotai/Kimi-K2.6"
SAVE_DIR = "Kimi-K2.6-FP8-BLOCK"

ignore = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*vision_tower.*",
    "re:.*embed_tokens$",
    "re:.*norm$",
    "re:.*mm_projector.*",
    "re:.*vision.*",
]

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=ignore,
    converter=CompressedTensorsDequantizer(
        MODEL_ID,
        quant_config_key="text_config.quantization_config",
        ignore=ignore,
    ),
    max_workers=2,
    device="cuda:0",
)
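
The `ignore` list decides which modules keep their original precision. The sketch below shows how such patterns could select modules, assuming that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries must match exactly. The module names are hypothetical examples, not read from the checkpoint, and `is_ignored` is an illustrative helper rather than a library function.

```python
import re

# Same patterns as in the reference script above.
IGNORE = [
    "re:.*mlp.gate$",
    "re:.*lm_head",
    "re:.*kv_a_proj_with_mqa$",
    "re:.*q_a_proj$",
    "re:.*vision_tower.*",
    "re:.*embed_tokens$",
    "re:.*norm$",
    "re:.*mm_projector.*",
    "re:.*vision.*",
]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if any pattern selects this module (assumed semantics)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[len("re:"):], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "language_model.model.layers.3.mlp.gate",                 # router gate
    "language_model.model.layers.3.mlp.experts.0.gate_proj",  # expert projection
    "language_model.model.layers.3.self_attn.q_a_proj",
    "language_model.model.layers.3.self_attn.q_b_proj",
    "vision_tower.encoder.blocks.0.mlp.fc0",
]
for name in examples:
    print(f"{name}: {'ignored' if is_ignored(name) else 'quantized'}")
```

The trailing `$` matters: under these assumptions `re:.*mlp.gate$` selects a router gate but not an expert's `gate_proj`, so the expert projections are still quantized.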

Deployment

Use with vLLM

vllm serve RedHatAI/Kimi-K2.6-FP8-BLOCK \
  --trust-remote-code \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="RedHatAI/Kimi-K2.6-FP8-BLOCK",
    messages=[{"role": "user", "content": "Explain how transformers use attention."}],
)

print(resp.choices[0].message.content)
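
Since the serve command enables a tool-call parser and automatic tool choice, a tool-calling request may be a useful next step. The following is a sketch using the standard OpenAI chat-completions `tools` format; the `get_weather` tool and the helper names are hypothetical and exist only for illustration.

```python
import json

MODEL = "RedHatAI/Kimi-K2.6-FP8-BLOCK"

# Hypothetical tool for illustration; not part of the model or of vLLM.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [GET_WEATHER_TOOL],
        "tool_choice": "auto",
    }


def parse_tool_call(name: str, arguments: str) -> tuple[str, dict]:
    """Tool-call arguments arrive as a JSON string; decode them."""
    return name, json.loads(arguments)


def ask(question: str) -> list:
    """Send the request to the local server (requires the openai package)."""
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(**build_tool_request(question))
    calls = resp.choices[0].message.tool_calls or []
    return [parse_tool_call(c.function.name, c.function.arguments) for c in calls]
```

With the server running, `ask("What is the weather in Paris?")` would return the parsed tool calls, if the model chooses to call the tool; an empty list means it answered directly.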

Evaluation

We evaluated this model with lm-evaluation-harness, lighteval, BFCL v4, and SWE-Bench Lite, serving it through a vLLM (0.22.1) OpenAI-compatible endpoint.

| Category | Benchmark | Score |
|---|---|---|
| Reasoning and instruction following | AIME25 (pass@1, avg@8) | 96.25% |
| Reasoning and instruction following | GPQA Diamond (pass@1, avg@3) | 89.39% |
| Reasoning and instruction following | MATH-500 (pass@1, avg@3) | 94.27% |
| Reasoning and instruction following | MMLU-Pro Chat (custom-extract, avg@3) | 86.55% |
| Reasoning and instruction following | GSM8K Platinum CoT (strict-match, avg@3) | 93.13% |
| Reasoning and instruction following | GSM8K Platinum CoT (flexible-extract, avg@3) | 96.31% |
| Reasoning and instruction following | IFEval (prompt-level strict, avg@3) | 95.63% |
| Reasoning and instruction following | IFEval (instruction-level strict, avg@3) | 96.92% |
| Agentic function calling (accuracy) | BFCL v4 non_live | 85.46% |
| Agentic function calling (accuracy) | BFCL v4 live | 79.50% |
| Agentic function calling (accuracy) | BFCL v4 multi_turn | 60.50% |
| Agentic function calling (accuracy) | BFCL v4 memory | 57.42% |
| Agentic function calling (accuracy) | BFCL v4 web_search | 45.00% |
| Agentic coding | SWE-Bench Lite (dev) | 34.78% |

BFCL rows report per-category accuracy. The SWE-Bench Lite score is the resolved rate from the official harness: 8 of 23 dev instances were resolved (8/23 ≈ 34.78%), and 22 instances produced non-empty graded patches.

Historical preliminary check

| Benchmark | Base model (`moonshotai/Kimi-K2.6`) | This model |
|---|---|---|
| GSM8K Platinum accuracy | 94.29% | 93.55% |
| Recovery | - | 99.2% |
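
The derived figures in this section can be rechecked with a few lines of arithmetic. This is a small sketch, assuming recovery means the quantized score divided by the base score; `percent` is a helper introduced here for illustration.

```python
def percent(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage."""
    return 100.0 * numerator / denominator


# SWE-Bench Lite (dev): 8 resolved instances out of 23.
swebench_resolved_rate = percent(8, 23)

# GSM8K Platinum recovery: quantized accuracy relative to the base model.
gsm8k_recovery = percent(93.55, 94.29)

print(f"SWE-Bench Lite resolved rate: {swebench_resolved_rate:.2f}%")  # 34.78%
print(f"GSM8K Platinum recovery: {gsm8k_recovery:.1f}%")  # 99.2%
```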

Reproduction

Representative commands used to produce and aggregate these runs:

vLLM + lm-eval (example)

lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Kimi-K2.6-FP8-BLOCK,max_length=40960,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=1234"

lighteval config used (litellm_config.yaml)

model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Kimi-K2.6-FP8-BLOCK"
  base_url: "http://127.0.0.1:8000/v1"
  api_key: "EMPTY"
  timeout: 3600
  max_model_length: 40960
  concurrent_requests: 8
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    seed: 1234
    top_k: 20
    presence_penalty: 1.5

lighteval endpoint litellm litellm_config.yaml \
  "aime25@1@8|0,math_500@1@3|0,gpqa:diamond@1@3|0" \
  --output-dir results_lighteval \
  --save-details

BFCL v4 and SWE-Bench Lite scripts

# BFCL categories: non_live, live, multi_turn, memory, web_search
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 non_live
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 live
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 multi_turn
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 memory
./scripts/bfcl/run_bfcl_local.sh kimi_fp8 web_search

# SWE-Bench Lite dev (full split)
SWEBENCH_SUBSET=lite SWEBENCH_SPLIT=dev SWEBENCH_SLICE= \
  ./scripts/swebench/run_swebench_lite_local.sh kimi_fp8

# Official SWE-bench resolved-rate evaluation
/home/shubhra/environments/mini-swe-agent/bin/python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-Bench_Lite \
  --split dev \
  --predictions_path /home/shubhra/kimik2.6_evals/results/swebench_resolved_eval/kimi_fp8_lite_dev_preds_merged.json \
  --max_workers 4 \
  --run_id kimi_fp8_lite_dev_20260701_resolved

Most lm-eval/lighteval tasks were run with 3 seeds and then averaged; AIME25 was run with 8 seeds. BFCL v4 and SWE-Bench Lite numbers come from the aggregated run artifacts listed below.
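
The seed averaging described above amounts to a plain mean over per-seed scores. A minimal sketch follows; the per-seed values are made-up placeholders rather than results from these runs, and `average_over_seeds` is a hypothetical helper.

```python
from statistics import mean, stdev


def average_over_seeds(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed scores in percent."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Placeholder per-seed accuracies for one task (NOT real results).
per_seed = [92.9, 93.1, 93.4]
avg, spread = average_over_seeds(per_seed)
print(f"avg@{len(per_seed)} = {avg:.2f}% (std {spread:.2f})")
```

Reporting the spread next to the mean makes it easier to judge whether a small gap to the base model is within seed-to-seed noise.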

Every Eval Ever Artifacts

every_eval_ever/aime25.json
every_eval_ever/gpqa_diamond.json
every_eval_ever/gsm8k_platinum_cot_llama.json
every_eval_ever/ifeval.json
every_eval_ever/math_500.json
every_eval_ever/mmlu_pro_chat.json
every_eval_ever/bfcl_v4.json
every_eval_ever/swebench_lite_dev.json


Safetensors

Model size: 1T params
Tensor types: BF16, F8_E4M3, F32

Model tree for RedHatAI/Kimi-K2.6-FP8-BLOCK

Base model: moonshotai/Kimi-K2.6
Quantized (41): this model

Evaluation results (from the source leaderboards)

ScaleAI/SWE-bench_Pro · SWE Bench Pro: 58.6
SWE-bench/SWE-bench_Verified · Swe Bench Resolved: 80.2
Idavidrein/gpqa · Diamond: 90.5
harborframework/terminal-bench-2.0 · Terminalbench 2: 66.7
MathArena/hmmt_feb_2026 · MathArena Hmmt Feb 2026: 92.7
MathArena/aime_2026 · MathArena Aime 2026: 96.4
cais/hle · Hle