Qwen2.5-Math-7B-Instruct-4bit

Model Description

Qwen2.5-Math-7B-Instruct-4bit is a 4-bit quantized version of the Qwen/Qwen2.5-Math-7B-Instruct model using GPTQ quantization (W4A16 - 4-bit weights, 16-bit activations).

This model is optimized to:

  • Reduce model size by ~75% compared to the original model
  • Reduce GPU memory requirements during inference
  • Increase inference speed
  • Maintain high accuracy for mathematical tasks

Model Details

  • Developed by: Community
  • Model type: Causal Language Model (Quantized)
  • Language(s): English, Mathematics
  • License: MIT
  • Finetuned from model: Qwen/Qwen2.5-Math-7B-Instruct
  • Quantization method: GPTQ (W4A16) via LLM Compressor
  • Calibration dataset: GSM8K (256 samples)

Model Sources

  • Base model: https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct

Uses

Direct Use

This model is designed for direct use in mathematical and reasoning tasks, including:

  • Solving arithmetic, algebra, and geometry problems
  • Mathematical reasoning and proofs
  • Analyzing and explaining mathematical concepts
  • Educational mathematics support

Example Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,  # Important for compressed models
)

# Create prompt
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
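
The Qwen2.5 tokenizer ships with a chat template, so the prompt can also be built with apply_chat_template instead of writing the special tokens by hand. A minimal sketch (the system prompt is the step-by-step instruction commonly used with Qwen2.5-Math; adjust it as needed):

messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": "Solve for x: 3x + 5 = 14"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)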

Downstream Use

This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.

Out-of-Scope Use

This model is NOT designed for:

  • Generating harmful or inappropriate content
  • Use in applications requiring absolute accuracy (such as critical financial calculations)
  • Tasks unrelated to mathematics or reasoning

Bias, Risks, and Limitations

Limitations

  • The model has been quantized and may have slightly lower accuracy compared to the original model
  • May encounter errors with some complex problems or edge cases
  • Model was primarily trained on English data

Recommendations

Users should:

  • Verify results for important mathematical problems
  • Use the original model (full precision) if maximum accuracy is required
  • Understand that quantization may affect some tasks

How to Get Started with the Model

Installation

pip install transformers torch accelerate

Depending on your transformers version, loading checkpoints produced with LLM Compressor may also require the compressed-tensors package (pip install compressed-tensors).

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Use the model
prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
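
For interactive use, tokens can be streamed to stdout as they are generated with transformers' TextStreamer. A small sketch reusing the model and tokenizer loaded above:

from transformers import TextStreamer

# Prints tokens as they are produced instead of waiting for the full completion
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "<|im_start|>user\nWhat is the derivative of x^2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=200, streamer=streamer)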

Training Details

Quantization Procedure

The model was quantized using the setup below; a code sketch of the recipe follows the hyperparameter list.

  • Method: GPTQ (W4A16)
  • Tool: vLLM LLM Compressor
  • Calibration dataset: GSM8K (256 samples)
  • Max sequence length: 2048 tokens
  • Target layers: All Linear layers except lm_head

Quantization Hyperparameters

  • Scheme: W4A16 (4-bit weights, 16-bit activations)
  • Block size: 128
  • Dampening fraction: 0.01
  • Calibration samples: 256
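
For reference, the one-shot GPTQ recipe looks roughly like the sketch below. It is based on the LLM Compressor examples; exact import paths and argument names vary between llmcompressor versions, and dataset preprocessing is left to the library:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # newer releases: from llmcompressor import oneshot

BASE = "Qwen/Qwen2.5-Math-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)

# 4-bit weights, 16-bit activations; lm_head is left unquantized
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    block_size=128,
    dampening_frac=0.01,
)

# One-shot calibration on 256 GSM8K samples, sequences truncated to 2048 tokens
oneshot(
    model=model,
    dataset="gsm8k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("qwen2.5-math-7b-instruct-4bit", save_compressed=True)
tokenizer.save_pretrained("qwen2.5-math-7b-instruct-4bit")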

Evaluation

Testing Data

The model was evaluated on the GSM8K test set.
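
A minimal sketch of such an evaluation, reusing the model and tokenizer loaded above (illustrative only: the answer-extraction regex and generation settings are simplifications of a proper harness):

import re
from datasets import load_dataset

# GSM8K gold answers end with "#### <number>"
def gold_answer(example):
    return example["answer"].split("####")[-1].strip().replace(",", "")

def last_number(text):
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

subset = load_dataset("gsm8k", "main", split="test").select(range(100))  # subset for speed
correct = 0
for example in subset:
    prompt = f"<|im_start|>user\n{example['question']}<|im_end|>\n<|im_start|>assistant\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    if last_number(completion) == gold_answer(example):
        correct += 1

print(f"Accuracy on {len(subset)} samples: {correct / len(subset):.1%}")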

Metrics

  • Accuracy: Measured on GSM8K test set
  • Model size: ~3.5 GB (vs. ~14 GB for the original FP16 model)
  • Compression ratio: ~75% reduction (see the arithmetic below)
  • Memory usage: Significantly reduced compared to the original model
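
The size figures follow directly from the bit widths: 4-bit weights need a quarter of the space of 16-bit weights (ignoring the small overhead for quantization scales and zero-points):

# Back-of-the-envelope weight footprint for a 7B-parameter model
params = 7e9
fp16_gb = params * 16 / 8 / 1e9   # ~14 GB
w4_gb = params * 4 / 8 / 1e9      # ~3.5 GB
print(f"FP16: {fp16_gb:.1f} GB, W4A16: {w4_gb:.1f} GB, reduction: {1 - w4_gb / fp16_gb:.0%}")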

Results

The compressed model maintains high accuracy for mathematical tasks while significantly reducing size and memory requirements.

Technical Specifications

Model Architecture

  • Base Architecture: Qwen2.5 (Transformer-based)
  • Parameters: 7B (quantized to 4-bit)
  • Context Length: 8192 tokens (original model); quantization calibration used sequences up to 2048 tokens
  • Quantization: GPTQ W4A16

Compute Infrastructure

Hardware

  • Training/Quantization: NVIDIA RTX 3060 12GB (or equivalent)
  • Minimum Inference: GPU with at least 8 GB VRAM (a quick way to measure actual usage is sketched below)
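
To check the actual footprint on your own hardware, you can measure peak GPU memory after a generation. A small sketch assuming a CUDA device and the model/tokenizer loaded as above:

import torch

torch.cuda.reset_peak_memory_stats()
prompt = "<|im_start|>user\nWhat is 12 * 34?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=200)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")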

Software

  • Quantization Tool: vLLM LLM Compressor
  • Framework: PyTorch, Transformers
  • Python: >=3.12

Citation

If you use this model, please cite:

Base Model:

@article{qwen2.5-math,
  title={Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement},
  author={Yang, An and others},
  journal={arXiv preprint arXiv:2409.12122},
  year={2024}
}

Quantization Method:

@article{gptq,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}

Model Card Contact

To report issues or ask questions, please open an issue on the repository.

Acknowledgments

  • Qwen Team for the original Qwen2.5-Math-7B-Instruct model
  • vLLM team for the LLM Compressor tool
  • Hugging Face for infrastructure and support