---
library_name: transformers
license: mit
---
# GPTQ 4bit quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
## Model Details
See details on the official model page: [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
Quantized with [GPTQModel](https://github.com/ModelCloud/GPTQModel) on the [wikitext2 dataset](https://github.com/ModelCloud/GPTQModel/blob/main/examples/quantization/basic_usage_wikitext2.py) with `nsamples=512` and `seqlen=2048`. Quantization config:
```
bits=4,
group_size=128,
desc_act=False,
damp_percent=0.01,
```
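For reference, the quantization step looks roughly like the sketch below. This is a minimal, hedged example assuming GPTQModel's `QuantizeConfig` / `GPTQModel.load` / `quantize` / `save` API and raw wikitext2 text as calibration data; the actual run followed the linked wikitext2 example script (which builds `nsamples=512` samples of `seqlen=2048` tokens), and the output path shown here is only a placeholder.
```python
# Sketch of the quantization step (assumed API usage; see the linked
# wikitext2 example in the GPTQModel repo for the actual calibration code).
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

base_model = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
quant_path = "DeepSeek-R1-Distill-Qwen-32B-GPTQ_4bit-128g"  # placeholder output directory

# Same settings as the config listed above
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    damp_percent=0.01,
)

# Calibration data: non-empty wikitext2 rows (the example script instead
# builds 512 token chunks of length 2048; recent GPTQModel versions also
# accept a plain list of strings and tokenize internally)
calibration = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration = [row["text"] for row in calibration if row["text"].strip()][:512]

model = GPTQModel.load(base_model, quant_config)
model.quantize(calibration)  # runs GPTQ layer by layer over the calibration set
model.save(quant_path)
```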
Minimum VRAM required: ~20GB
## How to use
Using the `transformers` library with integrated GPTQ support:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "avoroshilov/DeepSeek-R1-Distill-Qwen-32B-GPTQ_4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_name)
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda")

# Build a chat-formatted prompt and move it to the model's device
chat = [{"role": "user", "content": "Why is grass green?"}]
question_tokens = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(quantized_model.device)

# Generate the answer; the decoded output also contains the prompt tokens
answer_tokens = quantized_model.generate(question_tokens, generation_config=GenerationConfig(max_length=2048))[0]
print(tokenizer.decode(answer_tokens))
```