# Model Card: Qwen2.5-3B Code Reasoning Fine-tuned

## Model Details

### Model Description

This model is a fine-tuned version of Qwen/Qwen2.5-3B, optimized for competitive programming and code generation tasks with step-by-step reasoning capabilities. The model was trained in two stages: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).
- Developed by: XXXXXX
- Model type: Causal Language Model
- Language(s): English (primary), Python code
- Finetuned from model: Qwen/Qwen2.5-3B
- Model size: 3B parameters + LoRA adapters (rank 32)
### Model Sources

- Base Model: Qwen/Qwen2.5-3B
- Training Dataset: nvidia/OpenCodeReasoning (split_0)
- Training Framework: Unsloth + TRL (Transformer Reinforcement Learning)
## Uses

### Direct Use

This model is designed for:
- Competitive programming problem solving
- Code generation with step-by-step reasoning
- Algorithm implementation and explanation
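A minimal inference sketch with Hugging Face Transformers is shown below. The repository id is a placeholder (use the actual merged-model repository, or load the LoRA adapters via PEFT), and the sampling temperature mirrors the GRPO training setting.

```python
# Minimal inference sketch. The repo id is a placeholder; replace it with the
# actual merged-model repository (or load the LoRA adapters via PEFT).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/qwen2.5-3b-code-reasoning"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Given an integer n, print the sum 1 + 2 + ... + n."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```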
## Training Details

### Training Data

- Primary Dataset: nvidia/OpenCodeReasoning (split_0)
- Training Samples:
  - SFT: 80 samples (reasoning length < 2000 tokens)
  - GRPO: 100 samples (reasoning length < 3000 tokens)
- Data Filtering: Samples were filtered by reasoning token length, as sketched below.
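The filter might look like the following; the `output` column name for the reasoning trace is an assumption about the dataset schema, so check the dataset card before running it.

```python
# Hypothetical sketch of the length-based filter. The "output" field name is
# an assumption about the nvidia/OpenCodeReasoning schema; adjust as needed.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
dataset = load_dataset("nvidia/OpenCodeReasoning", "split_0", split="split_0")

def reasoning_len(example):
    # Count tokens in the reasoning trace with the base model's tokenizer.
    return len(tokenizer(example["output"]).input_ids)

sft_data = dataset.filter(lambda ex: reasoning_len(ex) < 2000).select(range(80))
grpo_data = dataset.filter(lambda ex: reasoning_len(ex) < 3000).select(range(100))
```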
### Training Procedure

#### Stage 1: Supervised Fine-Tuning (SFT)
- Training objective: Next token prediction on formatted reasoning + code pairs
- Batch size: 1 (with gradient accumulation steps: 2)
- Learning rate: 2e-4
- Epochs: 2
- Optimizer: AdamW 8-bit
- Weight decay: 0.01
- Warmup steps: 5
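These hyper-parameters map directly onto a TRL `SFTConfig`; the sketch below is a plausible reconstruction rather than the exact training script (older TRL versions pass the tokenizer as `tokenizer=` instead of `processing_class=`).

```python
# SFT stage sketch with TRL, using the hyper-parameters listed above.
from trl import SFTConfig, SFTTrainer

sft_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=2,
    optim="adamw_8bit",
    weight_decay=0.01,
    warmup_steps=5,
    output_dir="outputs/sft",
)

trainer = SFTTrainer(
    model=model,               # LoRA-wrapped model (see Technical Specifications)
    processing_class=tokenizer,
    train_dataset=sft_data,    # the 80 filtered samples
    args=sft_args,
)
trainer.train()
```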
#### Stage 2: Group Relative Policy Optimization (GRPO)

- Training objective: Policy optimization using multiple reward functions
- Reward functions:
  - Format matching (exact and approximate)
  - Solution correctness evaluation (using Gemini-2.0-flash as a reward model)
- Learning rate: 5e-5
- Max steps: 100
- Temperature: 0.6
- Generations per step: 4
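A hedged sketch of this stage with TRL's `GRPOTrainer` follows. The exact-format reward is illustrative (it mirrors the response format described under Model Architecture & Reasoning Format below), the Gemini-based correctness reward is stubbed out because it calls an external API, and completions are assumed to arrive as plain strings.

```python
# GRPO stage sketch with TRL. The regex reward checks for the expected
# <think>...</think> + fenced-Python structure; the correctness reward that
# queried Gemini-2.0-flash during training is stubbed out here.
import re
from trl import GRPOConfig, GRPOTrainer

FORMAT_RE = re.compile(r"<think>.*?</think>\s*```python.*?```", re.DOTALL)

def format_reward(completions, **kwargs):
    # 1.0 for completions matching the expected structure, else 0.0.
    return [1.0 if FORMAT_RE.search(c) else 0.0 for c in completions]

def correctness_reward(completions, **kwargs):
    # Placeholder: in training, solutions were scored by Gemini-2.0-flash.
    return [0.0 for _ in completions]

grpo_args = GRPOConfig(
    learning_rate=5e-5,
    max_steps=100,
    temperature=0.6,
    num_generations=4,
    output_dir="outputs/grpo",
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
    args=grpo_args,
    train_dataset=grpo_data,  # the 100 filtered samples
)
trainer.train()
```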
### Technical Specifications

- Maximum sequence length: 8192 tokens
- LoRA configuration:
  - Rank: 32
  - Alpha: 64 (2 × rank)
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Precision: 16-bit training
- Hardware: NVIDIA A100 (40 GB) GPU
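The configuration above corresponds to the following Unsloth setup (a plausible reconstruction of the training-time model loading, not the verbatim script):

```python
# Reconstructed Unsloth setup matching the listed LoRA configuration.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=8192,
    dtype=None,        # auto-selects float16/bfloat16 for 16-bit training
    load_in_4bit=False,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,     # 2 x rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```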
## Evaluation

### Testing Data, Factors & Metrics
#### LiveCodeBench Evaluation
The model was evaluated on LiveCodeBench problem set v1, focusing on code generation tasks.
Performance Comparison:
| Model | Pass@1 | Pass@5 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
|---|---|---|---|---|---|
| Fine-tuned Model | 0.1885 (18.85%) | 0.2075 (20.75%) | 0.4239 (42.39%) | 0.0905 (9.05%) | 0.0 (0%) |
| Base Qwen2.5-3B | 0.1585 (15.85%) | 0.2175 (21.75%) | 0.3127 (31.27%) | 0.1131 (11.31%) | 0.0 (0%) |
| Difference (pp) | +3.00 | -1.00 | +11.12 | -2.26 | ±0.00 |
Key Improvements & Analysis:

- Pass@1 Performance: +3.00 pp overall, indicating more reliable first-attempt solutions
- Easy Problem Solving: +11.12 pp on easy problems, suggesting the structured reasoning format pays off on straightforward tasks
- Trade-offs: Pass@5 dropped by 1.00 pp and medium-difficulty Pass@1 by 2.26 pp, possibly because the more rigid reasoning patterns reduce solution diversity and are less effective on moderately complex tasks
- Consistency: Both models remain at 0% on hard problems, indicating the need for additional training data or techniques for the most challenging tasks
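The Pass@k figures above are presumably computed with the standard unbiased estimator of Chen et al. (2021), pass@k = 1 - C(n-c, k) / C(n, k), over n generations per problem with c passing; a minimal reference implementation:

```python
# Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
# where n = generations per problem and c = generations that pass all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations with 2 passing -> pass@1 = 0.2, pass@5 ~ 0.778
print(pass_at_k(10, 2, 1), pass_at_k(10, 2, 5))
```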
## Model Architecture & Reasoning Format

The model generates responses in a structured format:

````
<think>
[Step-by-step reasoning and problem analysis]
</think>
```python
[Python code solution]
```
````
This format encourages the model to:
- Think through the problem systematically
- Provide clear reasoning steps
- Generate clean, executable code solutions
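A small convenience helper for splitting responses into their reasoning and code parts (illustrative, not shipped with the model):

```python
# Split a model response into its reasoning trace and Python solution.
import re

RESPONSE_RE = re.compile(
    r"<think>(?P<reasoning>.*?)</think>\s*```python\s*(?P<code>.*?)```",
    re.DOTALL,
)

def parse_response(text: str):
    match = RESPONSE_RE.search(text)
    if match is None:
        return None  # response did not follow the expected format
    return match.group("reasoning").strip(), match.group("code").strip()
```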
## Technical Limitations and Biases

### Biases
- Dataset Bias: Inherits biases from the nvidia/OpenCodeReasoning dataset
- Problem Type Bias: Optimized for competitive programming style problems
- Language Bias: Strongly biased toward Python implementations
## Additional Information

### Not Recommended For
- Production code generation without review
- Complex software architecture decisions
- Security-critical code implementation
- Problems requiring extensive domain knowledge beyond basic algorithms
### Model Access
- Inference: Compatible with vLLM for fast inference
- Format: LoRA adapters can be merged with base model or used separately
- Hardware Requirements: Supports both CPU and GPU inference
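A hedged vLLM sketch with the adapters applied per request is shown below; the adapter path is a placeholder, and `max_lora_rank` is raised to accommodate the rank-32 adapters.

```python
# vLLM inference sketch with the LoRA adapters applied at request time.
# "path/to/lora_adapters" is a placeholder for the downloaded adapter folder.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen2.5-3B", enable_lora=True, max_lora_rank=32)
params = SamplingParams(temperature=0.6, max_tokens=1024)

outputs = llm.generate(
    ["Solve: given n, print the n-th Fibonacci number."],
    params,
    lora_request=LoRARequest("code-reasoning", 1, "path/to/lora_adapters"),
)
print(outputs[0].outputs[0].text)
```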
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{qwen25-3b-code-reasoning,
  title={Qwen2.5-3B Fine-tuned for Code Reasoning},
  author={[Your Name]},
  year={2025},
  howpublished={\url{[Your Model URL]}},
}
```
Model card created following the guidelines from Mitchell et al. (2019) and Hugging Face documentation.