<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
  Kimi-K2-Instruct-quantized.w4a16
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>

<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>

## Model Overview
- **Model Architecture:** Mixture-of-Experts (MoE)
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** None
  - **Weight quantization:** INT4
- **Release Date:** 07/15/2025
- **Version:** 1.0
- **Validated on:** RHOAI 2.24, RHAIIS 3.2.1
- **Model Developers:** Red Hat (Neural Magic)


## 1. Model Introduction

This model was obtained by quantizing the weights of **`Kimi-K2-Instruct`** to the INT4 data type. This optimization reduces the number of bits used to represent each weight from 16 (FP16/BF16) to 4, cutting both GPU memory requirements and on-disk model size by approximately 75%.
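
As a rough sanity check on that saving, here is a back-of-the-envelope sketch only; it ignores embeddings, quantization scales, and any tensors kept in higher precision:

```python
# Back-of-the-envelope weight-memory estimate for a 1T-parameter model.
# Illustrative only: embeddings, quantization scales, and tensors kept in
# higher precision are ignored.
total_params = 1e12

bf16_bytes = total_params * 2.0   # 16 bits = 2 bytes per weight
int4_bytes = total_params * 0.5   # 4 bits  = 0.5 bytes per weight

print(f"BF16 weights: ~{bf16_bytes / 1e12:.1f} TB")         # ~2.0 TB
print(f"INT4 weights: ~{int4_bytes / 1e12:.1f} TB")         # ~0.5 TB
print(f"Reduction:    {1 - int4_bytes / bf16_bytes:.0%}")   # 75%
```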

The original `Kimi K2` is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.

### Key Features
- INT4 Quantization: The weights have been quantized to INT4, dramatically reducing the memory footprint and enabling high-throughput, low-latency inference.
- Large-Scale Training: Pre-trained a 1T-parameter MoE model on 15.5T tokens with zero training instability.
- MuonClip Optimizer: The Muon optimizer is applied at unprecedented scale, with novel optimization techniques developed to resolve instabilities while scaling up.
- Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.

### Model Variants
- **Kimi-K2-Base**: The foundation model, a strong starting point for researchers and builders who want full control for fine-tuning and custom solutions.
- **Kimi-K2-Instruct**: The post-trained model, best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
- **RedHatAI/Kimi-K2-Instruct-quantized.w4a16 (This Model)**: An INT4-quantized version of `Kimi-K2-Instruct` for efficient, high-performance inference, validated by Red Hat.

<div align="center">
  <picture>
    <img src="figures/banner.png" width="80%" alt="Evaluation Results">
  </picture>
</div>

## 2. Model Summary

<div align="center">

| | |
|:---:|:---:|
| **Architecture** | Mixture-of-Experts (MoE) |
| **Total Parameters** | 1T |
| **Activated Parameters** | 32B |
| **Number of Layers** (Dense layer included) | 61 |
| **Number of Dense Layers** | 1 |
| **Attention Hidden Dimension** | 7168 |
| **MoE Hidden Dimension** (per Expert) | 2048 |
| **Number of Attention Heads** | 64 |
| **Number of Experts** | 384 |
| **Selected Experts per Token** | 8 |
| **Number of Shared Experts** | 1 |
| **Vocabulary Size** | 160K |
| **Context Length** | 128K |
| **Attention Mechanism** | MLA |
| **Activation Function** | SwiGLU |

</div>

## 3. Preliminary Evaluations

- GSM8k, 5-shot via lm-evaluation-harness
```
moonshotai/Kimi-K2-Instruct = 94.92
RedHatAI/Kimi-K2-Instruct-quantized.w4a16 (this model) = 94.84
```
More evals coming very soon...
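
If you want to reproduce a comparable GSM8k number, one possible route (a sketch only, not necessarily the exact setup used here) is the lm-evaluation-harness Python API on a vLLM backend; the tensor-parallel size and batch size below are assumptions to tune for your hardware:

```python
# Sketch: GSM8k, 5-shot, via lm-evaluation-harness with a vLLM backend.
# tensor_parallel_size and batch_size are assumptions; adjust for your setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=RedHatAI/Kimi-K2-Instruct-quantized.w4a16,"
        "tensor_parallel_size=8"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```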

## 4. Deployment

This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.

Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Kimi-K2-Instruct-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

# Build the prompt with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
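
For OpenAI-compatible serving, a minimal sketch is to start an endpoint with `vllm serve RedHatAI/Kimi-K2-Instruct-quantized.w4a16 --tensor-parallel-size 8` and query it with the standard OpenAI client; the localhost URL and placeholder API key below assume a default local deployment:

```python
from openai import OpenAI

# Assumes a local `vllm serve` endpoint on the default port; adjust as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Kimi-K2-Instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```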

<details>
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>

```bash
podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/Kimi-K2-Instruct-quantized.w4a16
```
</details>

<details>
<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>

```yaml
# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.24-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: kimi-k2-instruct-quantized-w4a16 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: kimi-k2-instruct-quantized-w4a16 # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.stage.redhat.io/rhelai1/modelcar-kimi-k2-instruct-quantized-w4a16:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# make sure first to be in the project where you want to deploy the model
# oc project <project-name>

# apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2-instruct-quantized-w4a16",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```

See the [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>

## 5. Creation

We created this model using **MoE-Quant**, a library developed jointly with **ISTA** and tailored for the quantization of very large Mixture-of-Experts (MoE) models.

For more details, please refer to the [MoE-Quant repository](https://github.com/IST-DASLab/MoE-Quant).

---

## 6. Model Usage

### Chat Completion

Once the local inference service is up, you can interact with it through the chat endpoint:

```python
from openai import OpenAI

def simple_chat(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "Please give a brief self-introduction."}]},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=0.6,
        max_tokens=256
    )
    print(response.choices[0].message.content)
```

> [!NOTE]
> The recommended temperature for Kimi-K2-Instruct-quantized.w4a16 is `temperature = 0.6`.
> If no special instructions are required, the system prompt above is a good default.
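
For reference, a minimal sketch of calling `simple_chat` against a local OpenAI-compatible endpoint; the base URL, API key, and served model name below are assumptions to adjust for your deployment:

```python
from openai import OpenAI

# Assumed local vLLM / RHAIIS endpoint and served model name; adjust as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
simple_chat(client, "RedHatAI/Kimi-K2-Instruct-quantized.w4a16")
```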
285
+ ---
286
+
287
+ ### Tool Calling
288
+
289
+ Kimi-K2-Instruct.w4a16 has strong tool-calling capabilities.
290
+ To enable them, you need to pass the list of available tools in each request, then the model will autonomously decide when and how to invoke them.
291
+
292
+ The following example demonstrates calling a weather tool end-to-end:
293
+

```python
import json

from openai import OpenAI

# Your tool implementation
def get_weather(city: str) -> dict:
    return {"weather": "Sunny"}

# Tool schema definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information. Call this when the user asks about the weather.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "Name of the city"
                }
            }
        }
    }
}]

# Map tool names to their implementations
tool_map = {
    "get_weather": get_weather
}

def tool_call_with_client(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What's the weather like in Beijing today? Use the tool to check."}
    ]
    finish_reason = None
    while finish_reason is None or finish_reason == "tool_calls":
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=0.6,
            tools=tools,          # tool list defined above
            tool_choice="auto"
        )
        choice = completion.choices[0]
        finish_reason = choice.finish_reason
        if finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tool_call_name = tool_call.function.name
                tool_call_arguments = json.loads(tool_call.function.arguments)
                tool_function = tool_map[tool_call_name]
                tool_result = tool_function(**tool_call_arguments)
                print("tool_result:", tool_result)

                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": tool_call_name,
                    "content": json.dumps(tool_result)
                })
    print("-" * 100)
    print(choice.message.content)
```

The `tool_call_with_client` function implements the pipeline from user query to tool execution.
This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic.
For streaming output and manual tool parsing, see the [Tool Calling Guide](docs/tool_call_guidance.md).

---

## 7. License

Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).

---

## 8. Third Party Notices

See [THIRD PARTY NOTICES](THIRD_PARTY_NOTICES.md).