<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
  Kimi-K2-Instruct-quantized.w4a16
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>

<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>

## Model Overview
- **Model Architecture:** Mixture-of-Experts (MoE)
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** None
  - **Weight quantization:** INT4
- **Release Date:** 07/15/2025
- **Version:** 1.0
- **Validated on:** RHOAI 2.24, RHAIIS 3.2.1
- **Model Developers:** Red Hat (Neural Magic)


## 1. Model Introduction

This model was obtained by quantizing the weights of **`Kimi-K2-Instruct`** to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 16 (FP16/BF16) to 4, cutting both GPU memory requirements and on-disk model size by approximately 75%.
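
As a rough sanity check on that saving, here is a back-of-the-envelope sketch only; it ignores embeddings, quantization scales, and any tensors kept in higher precision:

```python
# Back-of-the-envelope weight-memory estimate for a 1T-parameter model.
# Illustrative only: embeddings, quantization scales, and tensors kept in
# higher precision are ignored.
total_params = 1e12

bf16_bytes = total_params * 2.0   # 16 bits = 2 bytes per weight
int4_bytes = total_params * 0.5   # 4 bits  = 0.5 bytes per weight

print(f"BF16 weights: ~{bf16_bytes / 1e12:.1f} TB")         # ~2.0 TB
print(f"INT4 weights: ~{int4_bytes / 1e12:.1f} TB")         # ~0.5 TB
print(f"Reduction:    {1 - int4_bytes / bf16_bytes:.0%}")   # 75%
```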

The original `Kimi K2` is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.

### Key Features
- INT4 Quantization: The weights have been quantized to INT4, dramatically reducing the memory footprint and enabling high-throughput, low-latency inference.
- Large-Scale Training: Pre-trained a 1T-parameter MoE model on 15.5T tokens with zero training instability.
- MuonClip Optimizer: The Muon optimizer is applied at unprecedented scale, with novel optimization techniques developed to resolve instabilities while scaling up.
- Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.

### Model Variants
- **Kimi-K2-Base**: The foundation model, a strong starting point for researchers and builders who want full control for fine-tuning and custom solutions.
- **Kimi-K2-Instruct**: The post-trained model, best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
- **RedHatAI/Kimi-K2-Instruct-quantized.w4a16 (This Model)**: An INT4-quantized version of `Kimi-K2-Instruct` for efficient, high-performance inference, validated by Red Hat.

<div align="center">
  <picture>
    <img src="figures/banner.png" width="80%" alt="Evaluation Results">
  </picture>
</div>

## 2. Model Summary

<div align="center">

| | |
|:---:|:---:|
| **Architecture** | Mixture-of-Experts (MoE) |
| **Total Parameters** | 1T |
| **Activated Parameters** | 32B |
| **Number of Layers** (Dense layer included) | 61 |
| **Number of Dense Layers** | 1 |
| **Attention Hidden Dimension** | 7168 |
| **MoE Hidden Dimension** (per Expert) | 2048 |
| **Number of Attention Heads** | 64 |
| **Number of Experts** | 384 |
| **Selected Experts per Token** | 8 |
| **Number of Shared Experts** | 1 |
| **Vocabulary Size** | 160K |
| **Context Length** | 128K |
| **Attention Mechanism** | MLA |
| **Activation Function** | SwiGLU |

</div>

## 3. Preliminary Evaluations

- GSM8k, 5-shot via lm-evaluation-harness
```
moonshotai/Kimi-K2-Instruct = 94.92
RedHatAI/Kimi-K2-Instruct-quantized.w4a16 (this model) = 94.84
```
More evals coming very soon...
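
If you want to reproduce a comparable GSM8k number, one possible route (a sketch only, not necessarily the exact setup used here) is the lm-evaluation-harness Python API on a vLLM backend; the tensor-parallel size and batch size below are assumptions to tune for your hardware:

```python
# Sketch: GSM8k, 5-shot, via lm-evaluation-harness with a vLLM backend.
# tensor_parallel_size and batch_size are assumptions; adjust for your setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=RedHatAI/Kimi-K2-Instruct-quantized.w4a16,"
        "tensor_parallel_size=8"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```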

## 4. Deployment

This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.

Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Kimi-K2-Instruct-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Build the prompt with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
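
For OpenAI-compatible serving, a minimal sketch is to start an endpoint with `vllm serve RedHatAI/Kimi-K2-Instruct-quantized.w4a16 --tensor-parallel-size 8` and query it with the standard OpenAI client; the localhost URL and placeholder API key below assume a default local deployment:

```python
from openai import OpenAI

# Assumes a local `vllm serve` endpoint on the default port; adjust as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Kimi-K2-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```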

<details>
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>

```bash
podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/Kimi-K2-Instruct-quantized.w4a16
```
</details>

<details>
<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>

```yaml
# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.24-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: kimi-k2-instruct-quantized-w4a16 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: kimi-k2-instruct-quantized-w4a16 # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.stage.redhat.io/rhelai1/modelcar-kimi-k2-instruct-quantized-w4a16:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# make sure first to be in the project where you want to deploy the model
# oc project <project-name>

# apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2-instruct-quantized-w4a16",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```

See the [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>

## 5. Creation

We created this model using **MoE-Quant**, a library developed jointly with **ISTA** and tailored for the quantization of very large Mixture-of-Experts (MoE) models.

For more details, please refer to the [MoE-Quant repository](https://github.com/IST-DASLab/MoE-Quant).

---

## 6. Model Usage

### Chat Completion

Once the local inference service is up, you can interact with it through the chat endpoint:

```python
from openai import OpenAI

def simple_chat(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "Please give a brief self-introduction."}]},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=0.6,
        max_tokens=256
    )
    print(response.choices[0].message.content)
```

> [!NOTE]
> The recommended temperature for Kimi-K2-Instruct-quantized.w4a16 is `temperature = 0.6`.
> If no special instructions are required, the system prompt above is a good default.
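
For reference, a minimal sketch of calling `simple_chat` against a local OpenAI-compatible endpoint; the base URL, API key, and served model name below are assumptions to adjust for your deployment:

```python
from openai import OpenAI

# Assumed local vLLM / RHAIIS endpoint and served model name; adjust as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
simple_chat(client, "RedHatAI/Kimi-K2-Instruct-quantized.w4a16")
```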
285
+ ---
286
+
287
+ ### Tool Calling
288
+
289
+ Kimi-K2-Instruct.w4a16 has strong tool-calling capabilities.
290
+ To enable them, you need to pass the list of available tools in each request, then the model will autonomously decide when and how to invoke them.
291
+
292
+ The following example demonstrates calling a weather tool end-to-end:
293
+

```python
import json

from openai import OpenAI

# Your tool implementation
def get_weather(city: str) -> dict:
    return {"weather": "Sunny"}

# Tool schema definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information. Call this when the user asks about the weather.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "Name of the city"
                }
            }
        }
    }
}]

# Map tool names to their implementations
tool_map = {
    "get_weather": get_weather
}

def tool_call_with_client(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What's the weather like in Beijing today? Use the tool to check."}
    ]
    finish_reason = None
    while finish_reason is None or finish_reason == "tool_calls":
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=0.6,
            tools=tools,          # tool list defined above
            tool_choice="auto"
        )
        choice = completion.choices[0]
        finish_reason = choice.finish_reason
        if finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tool_call_name = tool_call.function.name
                tool_call_arguments = json.loads(tool_call.function.arguments)
                tool_function = tool_map[tool_call_name]
                tool_result = tool_function(**tool_call_arguments)
                print("tool_result:", tool_result)

                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": tool_call_name,
                    "content": json.dumps(tool_result)
                })
    print("-" * 100)
    print(choice.message.content)
```

The `tool_call_with_client` function implements the pipeline from user query to tool execution.
This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic.
For streaming output and manual tool parsing, see the [Tool Calling Guide](docs/tool_call_guidance.md).

---

## 7. License

Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).

---

## 8. Third Party Notices

See [THIRD PARTY NOTICES](THIRD_PARTY_NOTICES.md).