---
language:
- en
license: apache-2.0
tags:
- quantization
- sinq
- int4
- efficient-inference
- text-generation
- qwen
- llm
- compression
base_model:
- Qwen/Qwen3-32B
---

<p align="center">
<img src="logo.png" alt="Logo" style="max-width: 80%; height: auto;">
</p>

<p align="center">🐙 <a href="https://github.com/huawei-csl/SINQ">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp;📄 <a href="http://arxiv.org/abs/2509.22944">Paper</a></p>


# A-SINQ 4-bit Quantized Qwen3-32B Model

This repository contains the official **4-bit quantized** version of the [`Qwen3-32B`](https://huggingface.co/Qwen/Qwen3-32B) model, produced with the *calibrated* variant of the **SINQ (Sinkhorn-Normalized Quantization)** method.
SINQ is a novel, fast, high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.

To support the project, please star ⭐ the official [SINQ](https://github.com/huawei-csl/SINQ) GitHub repository.

## Model Details
- **Model Name:** `Qwen3-32B-4bit-ASINQ`
- **Base Model:** [`Qwen/Qwen3-32B`](https://huggingface.co/Qwen/Qwen3-32B)
- **Task:** Text Generation
- **Framework:** PyTorch / Transformers
- **License:** [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Quantized By:** *Huawei - Computing System Lab*


## Quantization Details

- **Quantization Method:** A-SINQ (the calibrated variant of Sinkhorn-Normalized Quantization)
- **Precision:** INT4
- **Group Size:** 64 (see the footprint sketch below)
- **Framework:** PyTorch
- **Quantization Library:** `sinq`

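As a rough illustration of what these settings mean for memory, the sketch below estimates the weight footprint implied by the bit-width and group size. The parameter count and the fp16 per-group scale/zero layout are illustrative assumptions, not values read from the `sinq` library:

```python
# Back-of-envelope weight-memory estimate for INT4 with group_size=64.
# Assumptions (illustrative only): ~32.8B quantized weights, one fp16
# scale and one fp16 zero point per group of 64 weights; embeddings and
# SINQ's additional per-axis scale vectors are ignored.
n_params = 32.8e9
bits_per_weight = 4 + 2 * 16 / 64  # 4-bit payload + per-group overhead
print(f"effective bits per weight: {bits_per_weight:.2f}")
print(f"quantized weights: ~{n_params * bits_per_weight / 8 / 1e9:.1f} GB")
print(f"fp16 weights:      ~{n_params * 16 / 8 / 1e9:.1f} GB")
```
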
---

# 🚀 Usage

## Prerequisite
Before running the quantization script, make sure the **SINQ** library is installed.
Installation instructions and setup details are available in the official [SINQ GitHub repository](https://github.com/huawei-csl/SINQ).
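
As a quick sanity check that the installation succeeded, you can verify the package is importable before running anything else (the `sinq` module name matches the imports used in the examples below):

```python
# Minimal installation check: fails fast if the sinq package is missing.
import importlib.util

if importlib.util.find_spec("sinq") is None:
    raise ImportError("sinq not installed; see https://github.com/huawei-csl/SINQ")
```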
## Usage example
You can load and use the model with our wrapper based on the 🤗 Transformers library:

```python
import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-cls/Qwen3-32B-4bit-ASINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device="cuda:0",
    compute_dtype=torch.bfloat16
)

prompt = "Explain neural network quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    out_ids = sinq_model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```
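
Qwen3 is an instruction-tuned chat model, so conversational prompts usually behave better when formatted with the tokenizer's chat template. Below is a minimal sketch reusing `tokenizer` and `sinq_model` from the example above; note that `enable_thinking` is a Qwen3 chat-template option (unrelated to SINQ), so drop it if your tokenizer version does not accept it:

```python
# Format a conversation with the Qwen3 chat template before generating.
messages = [
    {"role": "user", "content": "Explain neural network quantization in one sentence."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3-specific: skip the "thinking" trace
)
inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    out_ids = sinq_model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```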

<details>
<summary><span style="font-size:1.1em; font-weight:bold;">🧩 Quantization Process</span></summary>

The quantized model was obtained using the **SINQ** quantization library, following the steps below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load the base model
base_model_name = "Qwen/Qwen3-32B"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 4-bit A-SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="asinq"     # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0"
)
```
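
After quantization, a quick smoke test confirms the model still generates sensibly and reports how much GPU memory was actually used. This sketch assumes `qmodel` exposes the same `generate()` interface as the loaded model in the usage example:

```python
# Quick smoke test on the freshly quantized model.
prompt = "What is 2 + 2?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    out = qmodel.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```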

> **Reproducibility Note**: This model was quantized using the SINQ implementation from commit [`14ad847`](https://github.com/huawei-csl/SINQ/commit/14ad847d0ab25f1794b8820506f59b5c9c1fc979) of the [SINQ](https://github.com/huawei-csl/SINQ) repository.

</details>

<br/>

---

# 🧾 How to Cite This Work

If you find **SINQ** useful in your research or applications, please:
- Star ⭐ the official [SINQ](https://github.com/huawei-csl/SINQ) GitHub repository.
- Cite our <a href="http://arxiv.org/abs/2509.22944" target="_blank"><strong>paper</strong></a>:

```bibtex
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}
```