anikifoss committed · fd65c31 · verified · Parent(s): c40e77a

Update README.md

Files changed (1): README.md (+77 −3)

---
quantized_by: anikifoss
pipeline_tag: text-generation
base_model: moonshotai/Kimi-K2-Instruct-0905
license: other
license_name: modified-mit
license_link: LICENSE
base_model_relation: quantized
tags:
- mla
- conversational
---

# Model Card

High-quality quantization of **Kimi-K2-Instruct-0905** without using an imatrix.

## Run

### System Requirements
- 24GB VRAM
- 768GB RAM

### Run with ik_llama.cpp, 32GB VRAM

See [this detailed guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) for how to set up ik_llama.cpp and how to make custom quants.

```bash
./build/bin/llama-server \
--alias anikifoss/Kimi-K2-Instruct-0905-HQ4_K \
--model /mnt/data/Models/anikifoss/anikifoss/Kimi-K2-Instruct-0905-HQ4_K/Kimi-K2-Instruct-0905-HQ4_K-00001-of-00014.gguf \
--no-mmap -rtr \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
--ctx-size 75000 \
-ctk f16 \
-mla 3 -fa \
-amb 512 \
-b 1024 -ub 1024 \
-fmoe \
--n-gpu-layers 99 \
--override-tensor exps=CPU \
--parallel 1 \
--threads 32 \
--threads-batch 64 \
--host 127.0.0.1 \
--port 8090
```
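
Once the server is up, it exposes llama.cpp's OpenAI-compatible HTTP API, so a quick smoke test can look like this (the prompt is just a placeholder; the same request works for the llama.cpp command below):

```bash
# Minimal sanity check against the OpenAI-compatible chat endpoint of
# llama-server, using the host/port and alias from the command above.
curl http://127.0.0.1:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "anikifoss/Kimi-K2-Instruct-0905-HQ4_K",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```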
49
+
50
+ ### Run with llama, 32G VRAM
51
+
52
+ ```bash
53
+ ./build/bin/llama-server \
54
+ --alias anikifoss/Kimi-K2-Instruct-0905-HQ4_K \
55
+ --model /mnt/data/Models/anikifoss/anikifoss/Kimi-K2-Instruct-0905-HQ4_K/Kimi-K2-Instruct-0905-HQ4_K-00001-of-00014.gguf \
56
+ --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
57
+ --ctx-size 75000 \
58
+ -ctk f16 \
59
+ -fa on \
60
+ -b 1024 -ub 1024 \
61
+ --n-gpu-layers 99 \
62
+ --override-tensor exps=CPU \
63
+ --parallel 1 \
64
+ --threads 32 \
65
+ --threads-batch 64 \
66
+ --host 127.0.0.1 \
67
+ --port 8090
68
+ ```
69
+
70
+ ## Quantization Approach
71
+ - Keep all the small `F32` tensors untouched
72
+ - Quantize all the **attention** and related tensors to `Q8_0`
73
+ - Quantize all the **ffn_down_exps** tensors to `Q6_K`
74
+ - Quantize all the **ffn_up_exps** and **ffn_gate_exps** tensors to `Q4_K`
75
+
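
As an illustration only (not the exact recipe used for this model), a mix along these lines can be expressed with ik_llama.cpp's `llama-quantize` and the per-tensor `--custom-q` overrides described in the guide linked above; the paths, regexes, and fallback type below are placeholder assumptions:

```bash
# Hypothetical sketch: the regex=type pairs mirror the bullet list above.
# Tensors not matched by any rule fall back to the positional type (Q4_K_M here).
./build/bin/llama-quantize \
    --custom-q "attn=q8_0,ffn_down_exps=q6_K,ffn_up_exps=q4_K,ffn_gate_exps=q4_K" \
    Kimi-K2-Instruct-0905-BF16.gguf \
    Kimi-K2-Instruct-0905-HQ4_K.gguf \
    Q4_K_M
```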

### No imatrix
Generally, an imatrix is not recommended for Q4 and larger quants. The problem with an imatrix is that it guides what the model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from a Wikipedia sample is likely to negatively affect tasks like coding. In other words, while an imatrix can improve specific benchmarks that resemble the imatrix input sample, it also skews the model's performance towards tasks similar to that sample at the expense of other tasks.
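
For context, an imatrix is typically collected with llama.cpp's `llama-imatrix` tool, and the calibration file passed via `-f` is exactly the text sample that introduces the bias described above (file names here are placeholders):

```bash
# Illustrative: importance statistics are accumulated from activations over
# the calibration text given with -f, which is where the sample bias enters.
./build/bin/llama-imatrix \
    -m Kimi-K2-Instruct-0905-BF16.gguf \
    -f calibration-sample.txt \
    -o imatrix.dat
```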