Qwen2.5-Math-7B-Instruct-4bit

Model Description

Qwen2.5-Math-7B-Instruct-4bit is a 4-bit quantized version of the Qwen/Qwen2.5-Math-7B-Instruct model using GPTQ quantization (W4A16 - 4-bit weights, 16-bit activations).

This model is optimized to:

  • Reduce model size by ~75% compared to the original model
  • Reduce GPU memory requirements during inference
  • Increase inference speed
  • Maintain high accuracy for mathematical tasks

Model Details

  • Developed by: Community
  • Model type: Causal Language Model (Quantized)
  • Language(s): English, Mathematics
  • License: MIT
  • Finetuned from model: Qwen/Qwen2.5-Math-7B-Instruct
  • Quantization method: GPTQ (W4A16) via LLM Compressor
  • Calibration dataset: GSM8K (256 samples)

Model Sources

  • Base model: https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct

Uses

Direct Use

This model is designed for direct use in mathematical and reasoning tasks, including:

  • Solving arithmetic, algebra, and geometry problems
  • Mathematical reasoning and proofs
  • Analyzing and explaining mathematical concepts
  • Educational mathematics support

Example Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,  # Important for compressed models
)

# Create prompt
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
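
The Qwen2.5 tokenizer ships with a chat template, so the prompt can also be built with apply_chat_template instead of writing the special tokens by hand. A minimal sketch (the system prompt is the step-by-step instruction commonly used with Qwen2.5-Math; adjust it as needed):

messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": "Solve for x: 3x + 5 = 14"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)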

Downstream Use

This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.

Out-of-Scope Use

This model is NOT designed for:

  • Generating harmful or inappropriate content
  • Use in applications requiring absolute accuracy (such as critical financial calculations)
  • Tasks unrelated to mathematics or reasoning

Bias, Risks, and Limitations

Limitations

  • The model has been quantized and may have slightly lower accuracy compared to the original model
  • May encounter errors with some complex problems or edge cases
  • Model was primarily trained on English data

Recommendations

Users should:

  • Verify results for important mathematical problems
  • Use the original model (full precision) if maximum accuracy is required
  • Understand that quantization may affect some tasks

How to Get Started with the Model

Installation

pip install transformers torch accelerate

Depending on your transformers version, loading checkpoints produced with LLM Compressor may also require the compressed-tensors package (pip install compressed-tensors).

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Use the model
prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
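
For interactive use, tokens can be streamed to stdout as they are generated with transformers' TextStreamer. A small sketch reusing the model and tokenizer loaded above:

from transformers import TextStreamer

# Prints tokens as they are produced instead of waiting for the full completion
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "<|im_start|>user\nWhat is the derivative of x^2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=200, streamer=streamer)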

Training Details

Quantization Procedure

The model was quantized using the setup below; a code sketch of the recipe follows the hyperparameter list.

  • Method: GPTQ (W4A16)
  • Tool: vLLM LLM Compressor
  • Calibration dataset: GSM8K (256 samples)
  • Max sequence length: 2048 tokens
  • Target layers: All Linear layers except lm_head

Quantization Hyperparameters

  • Scheme: W4A16 (4-bit weights, 16-bit activations)
  • Block size: 128
  • Dampening fraction: 0.01
  • Calibration samples: 256
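
For reference, the one-shot GPTQ recipe looks roughly like the sketch below. It is based on the LLM Compressor examples; exact import paths and argument names vary between llmcompressor versions, and dataset preprocessing is left to the library:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # newer releases: from llmcompressor import oneshot

BASE = "Qwen/Qwen2.5-Math-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)

# 4-bit weights, 16-bit activations; lm_head is left unquantized
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    block_size=128,
    dampening_frac=0.01,
)

# One-shot calibration on 256 GSM8K samples, sequences truncated to 2048 tokens
oneshot(
    model=model,
    dataset="gsm8k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("qwen2.5-math-7b-instruct-4bit", save_compressed=True)
tokenizer.save_pretrained("qwen2.5-math-7b-instruct-4bit")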

Evaluation

Testing Data

The model was evaluated on the GSM8K test set.
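
A minimal sketch of such an evaluation, reusing the model and tokenizer loaded above (illustrative only: the answer-extraction regex and generation settings are simplifications of a proper harness):

import re
from datasets import load_dataset

# GSM8K gold answers end with "#### <number>"
def gold_answer(example):
    return example["answer"].split("####")[-1].strip().replace(",", "")

def last_number(text):
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

subset = load_dataset("gsm8k", "main", split="test").select(range(100))  # subset for speed
correct = 0
for example in subset:
    prompt = f"<|im_start|>user\n{example['question']}<|im_end|>\n<|im_start|>assistant\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    if last_number(completion) == gold_answer(example):
        correct += 1

print(f"Accuracy on {len(subset)} samples: {correct / len(subset):.1%}")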

Metrics

  • Accuracy: Measured on GSM8K test set
  • Model size: ~3.5 GB (vs. ~14 GB for the original FP16 model)
  • Compression ratio: ~75% reduction (see the arithmetic below)
  • Memory usage: Significantly reduced compared to the original model
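
The size figures follow directly from the bit widths: 4-bit weights need a quarter of the space of 16-bit weights (ignoring the small overhead for quantization scales and zero-points):

# Back-of-the-envelope weight footprint for a 7B-parameter model
params = 7e9
fp16_gb = params * 16 / 8 / 1e9   # ~14 GB
w4_gb = params * 4 / 8 / 1e9      # ~3.5 GB
print(f"FP16: {fp16_gb:.1f} GB, W4A16: {w4_gb:.1f} GB, reduction: {1 - w4_gb / fp16_gb:.0%}")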

Results

The compressed model maintains high accuracy for mathematical tasks while significantly reducing size and memory requirements.

Technical Specifications

Model Architecture

  • Base Architecture: Qwen2.5 (Transformer-based)
  • Parameters: 7B (quantized to 4-bit)
  • Context Length: 8192 tokens (original model); quantization calibration used sequences up to 2048 tokens
  • Quantization: GPTQ W4A16

Compute Infrastructure

Hardware

  • Training/Quantization: NVIDIA RTX 3060 12GB (or equivalent)
  • Minimum Inference: GPU with at least 8 GB VRAM (a quick way to measure actual usage is sketched below)
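
To check the actual footprint on your own hardware, you can measure peak GPU memory after a generation. A small sketch assuming a CUDA device and the model/tokenizer loaded as above:

import torch

torch.cuda.reset_peak_memory_stats()
prompt = "<|im_start|>user\nWhat is 12 * 34?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=200)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")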

Software

  • Quantization Tool: vLLM LLM Compressor
  • Framework: PyTorch, Transformers
  • Python: >=3.12

Citation

If you use this model, please cite:

Base Model:

@article{qwen2.5-math,
  title={Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement},
  author={Yang, An and others},
  journal={arXiv preprint arXiv:2409.12122},
  year={2024}
}

Quantization Method:

@article{gptq,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}

Model Card Contact

To report issues or ask questions, please open an issue on the repository.

Acknowledgments

  • Qwen Team for the original Qwen2.5-Math-7B-Instruct model
  • vLLM team for the LLM Compressor tool
  • Hugging Face for infrastructure and support