anikifoss committed · fd65c31 · verified · Parent(s): c40e77a

Update README.md

Files changed (1): README.md (+77 −3)

---
quantized_by: anikifoss
pipeline_tag: text-generation
base_model: moonshotai/Kimi-K2-Instruct-0905
license: other
license_name: modified-mit
license_link: LICENSE
base_model_relation: quantized
tags:
- mla
- conversational
---

# Model Card

High-quality quantization of **Kimi-K2-Instruct-0905** without using an imatrix.

## Run

### System Requirements
- 24GB VRAM
- 768GB RAM

### Run with ik_llama.cpp, 32GB VRAM

See [this detailed guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) for how to set up ik_llama.cpp and how to make custom quants.

```bash
./build/bin/llama-server \
--alias anikifoss/Kimi-K2-Instruct-0905-HQ4_K \
--model /mnt/data/Models/anikifoss/anikifoss/Kimi-K2-Instruct-0905-HQ4_K/Kimi-K2-Instruct-0905-HQ4_K-00001-of-00014.gguf \
--no-mmap -rtr \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
--ctx-size 75000 \
-ctk f16 \
-mla 3 -fa \
-amb 512 \
-b 1024 -ub 1024 \
-fmoe \
--n-gpu-layers 99 \
--override-tensor exps=CPU \
--parallel 1 \
--threads 32 \
--threads-batch 64 \
--host 127.0.0.1 \
--port 8090
```
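
Once the server is up, it exposes llama.cpp's OpenAI-compatible HTTP API, so a quick smoke test can look like this (the prompt is just a placeholder; the same request works for the llama.cpp command below):

```bash
# Minimal sanity check against the OpenAI-compatible chat endpoint of
# llama-server, using the host/port and alias from the command above.
curl http://127.0.0.1:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "anikifoss/Kimi-K2-Instruct-0905-HQ4_K",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```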
49
+
50
+ ### Run with llama, 32G VRAM
51
+
52
+ ```bash
53
+ ./build/bin/llama-server \
54
+ --alias anikifoss/Kimi-K2-Instruct-0905-HQ4_K \
55
+ --model /mnt/data/Models/anikifoss/anikifoss/Kimi-K2-Instruct-0905-HQ4_K/Kimi-K2-Instruct-0905-HQ4_K-00001-of-00014.gguf \
56
+ --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
57
+ --ctx-size 75000 \
58
+ -ctk f16 \
59
+ -fa on \
60
+ -b 1024 -ub 1024 \
61
+ --n-gpu-layers 99 \
62
+ --override-tensor exps=CPU \
63
+ --parallel 1 \
64
+ --threads 32 \
65
+ --threads-batch 64 \
66
+ --host 127.0.0.1 \
67
+ --port 8090
68
+ ```
69
+
70
+ ## Quantization Approach
71
+ - Keep all the small `F32` tensors untouched
72
+ - Quantize all the **attention** and related tensors to `Q8_0`
73
+ - Quantize all the **ffn_down_exps** tensors to `Q6_K`
74
+ - Quantize all the **ffn_up_exps** and **ffn_gate_exps** tensors to `Q4_K`
75
+
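
As an illustration only (not the exact recipe used for this model), a mix along these lines can be expressed with ik_llama.cpp's `llama-quantize` and the per-tensor `--custom-q` overrides described in the guide linked above; the paths, regexes, and fallback type below are placeholder assumptions:

```bash
# Hypothetical sketch: the regex=type pairs mirror the bullet list above.
# Tensors not matched by any rule fall back to the positional type (Q4_K_M here).
./build/bin/llama-quantize \
    --custom-q "attn=q8_0,ffn_down_exps=q6_K,ffn_up_exps=q4_K,ffn_gate_exps=q4_K" \
    Kimi-K2-Instruct-0905-BF16.gguf \
    Kimi-K2-Instruct-0905-HQ4_K.gguf \
    Q4_K_M
```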

### No imatrix
Generally, an imatrix is not recommended for Q4 and larger quants. The problem with an imatrix is that it guides what the model remembers, while anything not covered by the text sample used to generate the imatrix is more likely to be forgotten. For example, an imatrix derived from a Wikipedia sample is likely to negatively affect tasks like coding. In other words, while an imatrix can improve specific benchmarks that resemble the imatrix input sample, it also skews the model's performance towards tasks similar to that sample at the expense of other tasks.
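
For context, an imatrix is typically collected with llama.cpp's `llama-imatrix` tool, and the calibration file passed via `-f` is exactly the text sample that introduces the bias described above (file names here are placeholders):

```bash
# Illustrative: importance statistics are accumulated from activations over
# the calibration text given with -f, which is where the sample bias enters.
./build/bin/llama-imatrix \
    -m Kimi-K2-Instruct-0905-BF16.gguf \
    -f calibration-sample.txt \
    -o imatrix.dat
```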