Link to paper page

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +102 -100
README.md CHANGED

---
language:
- zh
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

<div align="center">
<img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em">
</div>

<p align="center">
<a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
<a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a> |
<a href="https://huggingface.co/papers/2506.07900" target="_blank">Paper</a>
</p>
<p align="center">
👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
</p>

## What's New
- [2025.06.06] The **MiniCPM4** series is released! The series achieves extreme efficiency improvements while maintaining strong performance at the same scale, delivering over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf).🔥🔥🔥

## MiniCPM4 Series
The MiniCPM4 series comprises highly efficient large language models (LLMs) designed explicitly for end-side devices. This efficiency comes from systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
- [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens.
- [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
- [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to achieve ultra acceleration for MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B compresses model parameters into ternary values, achieving a 90% reduction in bit width. (**<-- you are here**)
- [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
- [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
- [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.

## Introduction
BitCPM4 models are ternary-quantized models derived from the MiniCPM series through quantization-aware training (QAT), achieving significant improvements in both training efficiency and parameter efficiency.
- Improvements to the training method
  - Searching hyperparameters with wind-tunnel experiments on small models.
  - Using a two-stage training method: high-precision training first, then QAT, making the most of the trained high-precision checkpoints and significantly reducing the computational resources required for the QAT phase.
- High parameter efficiency
  - Achieving performance comparable to full-precision models of similar parameter count at a bit width of only 1.58 bits (see the sketch below), demonstrating high parameter efficiency.

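For intuition, here is a minimal sketch of what ternary fake quantization with QAT looks like. The 1.58-bit figure is log2(3) ≈ 1.58 bits per weight, roughly a tenth of a 16-bit baseline, which matches the "90% reduction in bit width" above. The sketch follows the BitNet-style absmean recipe that this figure suggests; the names are illustrative, and BitCPM4's actual quantizer may differ in details such as scale granularity.

```python
import torch

def ternary_fake_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Snap each weight to {-1, 0, +1} times a per-tensor scale. The result
    # stays in the float dtype ("fake" quantization) but carries only
    # log2(3) ~= 1.58 bits of information per parameter.
    scale = w.abs().mean().clamp(min=eps)          # absmean scale (an assumption)
    w_ternary = (w / scale).round().clamp_(-1, 1)  # ternary levels {-1, 0, +1}
    return w_ternary * scale                       # dequantize back to float

class TernarySTE(torch.autograd.Function):
    # Straight-through estimator: quantize in the forward pass, but let
    # gradients pass through unchanged to the latent full-precision weights
    # so the QAT phase can keep training them.
    @staticmethod
    def forward(ctx, w):
        return ternary_fake_quantize(w)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```

In the two-stage setup described above, a fully trained high-precision checkpoint initializes the latent weights, so only the much cheaper QAT phase needs to run with a quantizer like this in the loop.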
## Usage
### Inference with Transformers
BitCPM4's parameters are stored in a fake-quantized format (full-precision tensors holding ternary values), which supports direct inference within the Hugging Face Transformers framework.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "openbmb/BitCPM4-0.5B"
device = "cuda"

# Load the tokenizer and the fake-quantized model (ternary values stored in bfloat16)
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

messages = [
    {"role": "user", "content": "推荐5个北京的景点。"},  # "Recommend 5 attractions in Beijing."
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

model_outputs = model.generate(
    model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)

# Strip the prompt tokens so only the newly generated tokens are decoded
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]

response = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(response)
```
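Because the checkpoint is fake-quantized, you can sanity-check that the stored weights really are ternary, continuing from the example above. The snippet below assumes a Llama-style module path and a single per-tensor scale; both are guesses for illustration, and per-channel scaling would show more than three levels here.

```python
# Hypothetical check: pick one linear weight and count its distinct levels.
# The module path is an assumption about the model's internal layout.
w = model.model.layers[0].mlp.gate_proj.weight.float()
scale = w.abs().max()
levels = torch.unique((w / scale).round())
print(levels)  # expected to be close to tensor([-1., 0., 1.])
```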

## Evaluation Results
BitCPM4's performance is comparable with that of full-precision models of the same size.
![Benchmark of BitCPM4](https://raw.githubusercontent.com/OpenBMB/MiniCPM/refs/heads/minicpm-4/assets/minicpm4/bitcpm4-benchmark.png)

## Statement
- As a language model, MiniCPM generates content by learning from a vast amount of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
- Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.

## LICENSE
- This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.

## Citation
- Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.

```bibtex
@article{minicpm4,
  title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}
```