Qwen2.5-Math-7B-Instruct-4bit
Model Description
Qwen2.5-Math-7B-Instruct-4bit is a 4-bit quantized version of the Qwen/Qwen2.5-Math-7B-Instruct model using GPTQ quantization (W4A16 - 4-bit weights, 16-bit activations).
This model is optimized to:
- Reduce model size by ~75% compared to the original model (see the footprint check after this list)
- Reduce GPU memory requirements during inference
- Increase inference speed
- Maintain high accuracy for mathematical tasks
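As a quick check on the size and memory claims, the footprint of the loaded weights can be printed directly. This is a minimal sketch with a placeholder repository name; the exact figure depends on how the checkpoint is packed.

from transformers import AutoModelForCausalLM

# Placeholder repository name; replace with the actual model id.
model = AutoModelForCausalLM.from_pretrained(
    "your-username/qwen2.5-math-7b-instruct-4bit",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
# Reports the in-memory size of the loaded weights in gigabytes.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")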
Model Details
- Developed by: Community
- Model type: Causal Language Model (Quantized)
- Language(s): English, Mathematics
- License: MIT
- Quantized from model: Qwen/Qwen2.5-Math-7B-Instruct
- Quantization method: GPTQ (W4A16) via LLM Compressor
- Calibration dataset: GSM8K (256 samples)
Model Sources
- Base Model: Qwen/Qwen2.5-Math-7B-Instruct
- Quantization Tool: vLLM LLM Compressor
Uses
Direct Use
This model is designed for direct use in mathematical and reasoning tasks, including:
- Solving arithmetic, algebra, and geometry problems
- Mathematical reasoning and proofs
- Analyzing and explaining mathematical concepts
- Educational mathematics support
Example Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,  # Important for compressed models
)
# Create prompt
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"
# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
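Because the checkpoint is saved in LLM Compressor's compressed-tensors format, it can also be served with vLLM, which reads the quantization config automatically. The snippet below is a minimal sketch; the repository name is a placeholder and the sampling settings are illustrative.

from vllm import LLM, SamplingParams

# Load the quantized checkpoint (placeholder repository name).
llm = LLM(model="your-username/qwen2.5-math-7b-instruct-4bit")

prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"
params = SamplingParams(temperature=0.0, max_tokens=200)  # greedy decoding

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)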
Downstream Use
This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.
Out-of-Scope Use
This model is NOT designed for:
- Generating harmful or inappropriate content
- Use in applications requiring absolute accuracy (such as critical financial calculations)
- Tasks unrelated to mathematics or reasoning
Bias, Risks, and Limitations
Limitations
- The model has been quantized and may have slightly lower accuracy compared to the original model
- May produce incorrect answers on some complex problems or edge cases
- Model was primarily trained on English data
Recommendations
Users should:
- Verify results for important mathematical problems
- Use the original model (full precision) if maximum accuracy is required
- Understand that quantization may affect some tasks
How to Get Started with the Model
Installation
The checkpoint produced by LLM Compressor is saved in the compressed-tensors format, which Transformers needs the compressed-tensors package to load:
pip install transformers torch accelerate compressed-tensors
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)
# Use the model
prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
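For chain-of-thought answers, the Qwen team recommends a system prompt asking the model to reason step by step and put the final answer in \boxed{}. The sketch below builds the prompt with tokenizer.apply_chat_template instead of hand-written ChatML tags; the repository name is a placeholder.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="float16",
    trust_remote_code=True,
)

messages = [
    # Chain-of-thought system prompt recommended for Qwen2.5-Math
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": "Solve for x: 3x + 5 = 14"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))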
Training Details
Quantization Procedure
The model was quantized with the following setup (a sketch of the one-shot run appears after the hyperparameters below):
- Method: GPTQ (W4A16)
- Tool: vLLM LLM Compressor
- Calibration dataset: GSM8K (256 samples)
- Max sequence length: 2048 tokens
- Target layers: all Linear layers except lm_head
Quantization Hyperparameters
- Scheme: W4A16 (4-bit weights, 16-bit activations)
- Block size: 128
- Dampening fraction: 0.01
- Calibration samples: 256
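For reference, the following is a minimal sketch of what such a one-shot GPTQ run looks like with LLM Compressor. It is illustrative rather than the exact script used for this checkpoint: import paths and argument names follow recent llm-compressor examples and may differ between versions, and the GSM8K preprocessing is simplified.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older versions: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen2.5-Math-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 256 GSM8K samples, chat-formatted and tokenized for calibration.
ds = load_dataset("gsm8k", "main", split="train").shuffle(seed=42).select(range(256))

def preprocess(example):
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": example["question"]},
         {"role": "assistant", "content": example["answer"]}],
        tokenize=False,
    )
    return tokenizer(text, max_length=2048, truncation=True, padding=False, add_special_tokens=False)

ds = ds.map(preprocess, remove_columns=ds.column_names)

# W4A16 GPTQ: 4-bit weights, 16-bit activations, lm_head kept in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"], dampening_frac=0.01)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("qwen2.5-math-7b-instruct-4bit", save_compressed=True)
tokenizer.save_pretrained("qwen2.5-math-7b-instruct-4bit")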
Evaluation
Testing Data
The model was evaluated on the GSM8K test set.
Metrics
- Accuracy: Measured on GSM8K test set
- Model size: ~3.5 GB (compared with ~14 GB for the original model)
- Compression ratio: ~75% reduction
- Memory usage: Significantly reduced compared to the original model
Results
The compressed model maintains high accuracy for mathematical tasks while significantly reducing size and memory requirements.
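A minimal sketch of one way to measure GSM8K accuracy is shown below. It assumes greedy decoding, takes the last number in the model's output as its answer, and compares it against the number after "####" in the reference; the harness actually used for this card may differ, and the repository name is a placeholder.

import re
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype="float16", trust_remote_code=True
)

def last_number(text):
    # Use the last number in the text as the answer; this covers both
    # "#### 42"-style references and \boxed{42}-style model outputs.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

ds = load_dataset("gsm8k", "main", split="test").select(range(200))  # small subset for a quick check
correct = 0
for ex in ds:
    messages = [
        {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
        {"role": "user", "content": ex["question"]},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    pred = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    if last_number(pred) == last_number(ex["answer"]):
        correct += 1
print(f"GSM8K accuracy on {len(ds)} samples: {correct / len(ds):.3f}")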
Technical Specifications
Model Architecture
- Base Architecture: Qwen2.5 (Transformer-based)
- Parameters: 7B (quantized to 4-bit)
- Context Length: 4,096 tokens (as in the base model); quantization calibration used sequences of up to 2,048 tokens
- Quantization: GPTQ W4A16
Compute Infrastructure
Hardware
- Training/Quantization: NVIDIA RTX 3060 12GB (or equivalent)
- Minimum Inference: GPU with at least 8GB VRAM
Software
- Quantization Tool: vLLM LLM Compressor
- Framework: PyTorch, Transformers
- Python: >=3.12
Citation
If you use this model, please cite:
Base Model:
@article{qwen2.5-math,
  title={Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2409.12122},
  year={2024}
}
Quantization Method:
@article{gptq,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}
Model Card Contact
To report issues or ask questions, please open an issue on the repository.
Acknowledgments
- Qwen Team for the original Qwen2.5-Math-7B-Instruct model
- vLLM team for the LLM Compressor tool
- Hugging Face for infrastructure and support