Model Card: Qwen2.5-3B Code Reasoning Fine-tuned

Model Details

Model Description

This model is a fine-tuned version of Qwen/Qwen2.5-3B, optimized for competitive programming and code generation tasks that require step-by-step reasoning. The model was trained with a two-stage approach: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

  • Developed by: XXXXXX
  • Model type: Causal Language Model
  • Language(s): English (primary), Python code
  • Finetuned from model: Qwen/Qwen2.5-3B
  • Model size: 3B parameters + LoRA adapters (rank 32)

Model Sources

  • Base Model: Qwen/Qwen2.5-3B
  • Training Dataset: nvidia/OpenCodeReasoning (split_0)
  • Training Framework: Unsloth + TRL (Transformer Reinforcement Learning)

Uses

Direct Use

This model is designed for:

  • Competitive programming problem solving
  • Code generation with step-by-step reasoning
  • Algorithm implementation and explanation

Training Details

Training Data

  • Primary Dataset: nvidia/OpenCodeReasoning (split_0)
  • Training Samples:
    • SFT: 80 samples (reasoning length < 2000 tokens)
    • GRPO: 100 samples (reasoning length < 3000 tokens)
  • Data Filtering: Samples were filtered based on reasoning token length.
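
In practice, a length filter of this kind might look like the sketch below. This is an illustration, not the original preprocessing script: the column holding the reasoning trace (`"output"` here) and the config/split naming are assumptions about the dataset layout.

```python
# Sketch: keep only examples whose reasoning trace fits within a token budget.
# The "output" column name and the "split_0" config/split naming are assumptions;
# adjust them to the actual hub layout of nvidia/OpenCodeReasoning.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
ds = load_dataset("nvidia/OpenCodeReasoning", "split_0", split="split_0")

def reasoning_tokens(example):
    return len(tokenizer(example["output"], add_special_tokens=False)["input_ids"])

sft_pool = ds.filter(lambda ex: reasoning_tokens(ex) < 2000)   # pool the 80 SFT samples were drawn from
grpo_pool = ds.filter(lambda ex: reasoning_tokens(ex) < 3000)  # pool the 100 GRPO samples were drawn from
```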

Training Procedure

Stage 1: Supervised Fine-Tuning (SFT)

  • Training objective: Next token prediction on formatted reasoning + code pairs
  • Batch size: 1 (with gradient accumulation steps: 2)
  • Learning rate: 2e-4
  • Epochs: 2
  • Optimizer: AdamW 8-bit
  • Weight decay: 0.01
  • Warmup steps: 5
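
The exact training script is not published with this card. Under the hyperparameters above, an Unsloth + TRL SFT stage would look roughly like the following sketch; `sft_dataset` and the `"text"` column are assumptions about how the filtered examples were formatted, and argument names can vary slightly across TRL versions.

```python
# Sketch only: Stage 1 SFT with Unsloth + TRL using the hyperparameters listed above.
# `sft_dataset` is assumed to be the 80 filtered examples, already formatted into a
# single "text" column (prompt + <think> reasoning + ```python solution).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=8192,
    load_in_4bit=False,  # 16-bit training per the Technical Specifications below
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=sft_dataset,       # assumed: 80 formatted examples
    dataset_text_field="text",
    max_seq_length=8192,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=2,
        learning_rate=2e-4,
        num_train_epochs=2,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=5,
        bf16=True,
        output_dir="sft-output",
    ),
)
trainer.train()
```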

Stage 2: Group Relative Policy Optimization (GRPO)

  • Training objective: Policy optimization using multiple reward functions
  • Reward functions:
    • Format matching (exact and approximate)
    • Solution correctness evaluation (using Gemini-2.0-flash as reward model)
  • Learning rate: 5e-5
  • Max steps: 100
  • Temperature: 0.6
  • Generations per step: 4
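
A hedged sketch of how these settings and the two reward signals could be wired into TRL's GRPOTrainer is shown below. The reward-function bodies are simplified placeholders (the real correctness reward queried Gemini 2.0 Flash as a judge), `model` and `grpo_dataset` refer to the Stage 1 outputs, and argument names may differ slightly across TRL versions.

```python
# Sketch only: Stage 2 GRPO with TRL. `model` is the SFT-tuned model from Stage 1 and
# `grpo_dataset` the 100 filtered prompts; both are assumptions about the original script.
# Assumes a standard (non-conversational) prompt dataset, so completions arrive as plain strings.
import re
from trl import GRPOTrainer, GRPOConfig

FORMAT_RE = re.compile(r"<think>.*?</think>\s*```python.*?```", re.DOTALL)

def format_reward(completions, **kwargs):
    """Exact-format reward: 1.0 if the completion matches the <think> + ```python layout."""
    return [1.0 if FORMAT_RE.search(c) else 0.0 for c in completions]

def approx_format_reward(completions, **kwargs):
    """Approximate-format reward: partial credit for each required tag that appears."""
    tags = ["<think>", "</think>", "```python", "```"]
    return [sum(t in c for t in tags) / len(tags) for c in completions]

def correctness_reward(completions, **kwargs):
    """Placeholder for the LLM-judge reward (Gemini 2.0 Flash in the original setup)."""
    return [0.0 for _ in completions]  # replace with a real judge call before training

config = GRPOConfig(
    learning_rate=5e-5,
    max_steps=100,
    temperature=0.6,
    num_generations=4,
    output_dir="grpo-output",
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, approx_format_reward, correctness_reward],
    args=config,
    train_dataset=grpo_dataset,
)
trainer.train()
```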

Technical Specifications

  • Maximum sequence length: 8192 tokens
  • LoRA configuration:
    • Rank: 32
    • Alpha: 64 (2 × rank)
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Precision: 16-bit training
  • Hardware: NVIDIA A100 GPU (40 GB)

Evaluation

Testing Data, Factors & Metrics

LiveCodeBench Evaluation

The model was evaluated on LiveCodeBench problem set v1, focusing on code generation tasks.

Performance Comparison:

| Model | Pass@1 | Pass@5 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
|-------|--------|--------|-------------|---------------|-------------|
| Fine-tuned Model | 0.1885 (18.85%) | 0.2075 (20.75%) | 0.4239 (42.39%) | 0.0905 (9.05%) | 0.0 (0%) |
| Base Qwen2.5-3B | 0.1585 (15.85%) | 0.2175 (21.75%) | 0.3127 (31.27%) | 0.1131 (11.31%) | 0.0 (0%) |
| Difference (fine-tuned minus base, percentage points) | +3.00 | -1.00 | +11.12 | -2.26 | ±0.00 |

Key Improvements & Analysis:

  • Overall Pass@1: +3.00 percentage points over the base model, indicating more reliable single-attempt solutions
  • Easy Problems: +11.12 percentage points on easy problems, the largest gain, suggesting the structured reasoning format helps most on straightforward tasks
  • Trade-offs: small decreases in overall Pass@5 (-1.00 pp) and medium-difficulty Pass@1 (-2.26 pp), possibly because the learned reasoning pattern reduces solution diversity across samples
  • Hard Problems: both models remain at 0%, indicating the need for additional training data or techniques for the most challenging tasks
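
For reference, Pass@k figures of this kind are conventionally computed with the unbiased estimator of Chen et al. (2021). The snippet below is background on that metric, not the exact evaluation harness used for this card.

```python
# Unbiased pass@k estimator (Chen et al., 2021): probability that at least one of k
# samples drawn from n generated samples (of which c are correct) solves the problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer incorrect samples than the budget -> always solved
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 2 of them correct
print(round(pass_at_k(10, 2, 1), 4))  # 0.2
print(round(pass_at_k(10, 2, 5), 4))  # 0.7778
```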

Model Architecture & Reasoning Format

The model generates responses in a structured format:

<think>
[Step-by-step reasoning and problem analysis]
</think>
```python
[Python code solution]
```

This format encourages the model to:

  1. Think through the problem systematically
  2. Provide clear reasoning steps
  3. Generate clean, executable code solutions
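
A small helper like the one below (illustrative only, not part of any released code) can split a generation into its reasoning and code parts, for example before running the solution against test cases:

```python
# Illustrative parser for the structured output format above.
import re

def split_response(text: str):
    """Return (reasoning, code) from a generation; either part is None if missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    code = re.search(r"```python\s*(.*?)```", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        code.group(1).strip() if code else None,
    )

example = (
    "<think>Read two integers and print their sum.</think>\n"
    "```python\na, b = map(int, input().split())\nprint(a + b)\n```"
)
reasoning, code = split_response(example)
print(reasoning)  # Read two integers and print their sum.
print(code)       # a, b = map(int, input().split()) ...
```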

Technical Limitations and Biases

Biases

  • Dataset Bias: Inherits biases from the nvidia/OpenCodeReasoning dataset
  • Problem Type Bias: Optimized for competitive programming style problems
  • Language Bias: Strongly biased toward Python implementations

Additional Information

Not Recommended For

  • Production code generation without review
  • Complex software architecture decisions
  • Security-critical code implementation
  • Problems requiring extensive domain knowledge beyond basic algorithms

Model Access

  • Inference: Compatible with vLLM for fast inference
  • Format: LoRA adapters can be merged with base model or used separately
  • Hardware Requirements: Supports both CPU and GPU inference
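
As an illustration, the adapters could be merged with PEFT and the merged checkpoint served with vLLM roughly as follows. Paths, dtype, and sampling settings are placeholders, not a published repository layout.

```python
# Sketch: merge the LoRA adapters into the base model, then serve the merged model with vLLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "path/to/lora-adapters").merge_and_unload()
merged.save_pretrained("qwen2.5-3b-code-reasoning-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B").save_pretrained("qwen2.5-3b-code-reasoning-merged")

# Fast inference on the merged checkpoint with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="qwen2.5-3b-code-reasoning-merged", max_model_len=8192)
params = SamplingParams(temperature=0.6, max_tokens=2048)
prompt = "Solve the problem, answering in the <think>...</think> + ```python format.\n\nGiven n integers, print their sum."
print(llm.generate([prompt], params)[0].outputs[0].text)
```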

Citation

If you use this model in your research, please cite:

@misc{qwen25-3b-code-reasoning,
  title={Qwen2.5-3B Fine-tuned for Code Reasoning},
  author={[Your Name]},
  year={2025},
  howpublished={\url{[Your Model URL]}},
}

Model card created following the guidelines from Mitchell et al. (2019) and Hugging Face documentation.
