Model Card: Qwen2.5-3B Code Reasoning Fine-tuned

Model Details

Model Description

This model is a fine-tuned version of Qwen/Qwen2.5-3B, optimized for competitive programming and code generation tasks that require step-by-step reasoning. The model was trained with a two-stage approach: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

  • Developed by: XXXXXX
  • Model type: Causal Language Model
  • Language(s): English (primary), Python code
  • Finetuned from model: Qwen/Qwen2.5-3B
  • Model size: 3B parameters + LoRA adapters (rank 32)

Model Sources

  • Base Model: Qwen/Qwen2.5-3B
  • Training Dataset: nvidia/OpenCodeReasoning (split_0)
  • Training Framework: Unsloth + TRL (Transformer Reinforcement Learning)

Uses

Direct Use

This model is designed for:

  • Competitive programming problem solving
  • Code generation with step-by-step reasoning
  • Algorithm implementation and explanation

Training Details

Training Data

  • Primary Dataset: nvidia/OpenCodeReasoning (split_0)
  • Training Samples:
    • SFT: 80 samples (reasoning length < 2000 tokens)
    • GRPO: 100 samples (reasoning length < 3000 tokens)
  • Data Filtering: Samples were filtered based on reasoning token length.
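
In practice, a length filter of this kind might look like the sketch below. This is an illustration, not the original preprocessing script: the column holding the reasoning trace (`"output"` here) and the config/split naming are assumptions about the dataset layout.

```python
# Sketch: keep only examples whose reasoning trace fits within a token budget.
# The "output" column name and the "split_0" config/split naming are assumptions;
# adjust them to the actual hub layout of nvidia/OpenCodeReasoning.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
ds = load_dataset("nvidia/OpenCodeReasoning", "split_0", split="split_0")

def reasoning_tokens(example):
    return len(tokenizer(example["output"], add_special_tokens=False)["input_ids"])

sft_pool = ds.filter(lambda ex: reasoning_tokens(ex) < 2000)   # pool the 80 SFT samples were drawn from
grpo_pool = ds.filter(lambda ex: reasoning_tokens(ex) < 3000)  # pool the 100 GRPO samples were drawn from
```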

Training Procedure

Stage 1: Supervised Fine-Tuning (SFT)

  • Training objective: Next token prediction on formatted reasoning + code pairs
  • Batch size: 1 (with gradient accumulation steps: 2)
  • Learning rate: 2e-4
  • Epochs: 2
  • Optimizer: AdamW 8-bit
  • Weight decay: 0.01
  • Warmup steps: 5
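
The exact training script is not published with this card. Under the hyperparameters above, an Unsloth + TRL SFT stage would look roughly like the following sketch; `sft_dataset` and the `"text"` column are assumptions about how the filtered examples were formatted, and argument names can vary slightly across TRL versions.

```python
# Sketch only: Stage 1 SFT with Unsloth + TRL using the hyperparameters listed above.
# `sft_dataset` is assumed to be the 80 filtered examples, already formatted into a
# single "text" column (prompt + <think> reasoning + ```python solution).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=8192,
    load_in_4bit=False,  # 16-bit training per the Technical Specifications below
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=sft_dataset,       # assumed: 80 formatted examples
    dataset_text_field="text",
    max_seq_length=8192,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=2,
        learning_rate=2e-4,
        num_train_epochs=2,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=5,
        bf16=True,
        output_dir="sft-output",
    ),
)
trainer.train()
```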

Stage 2: Group Relative Policy Optimization (GRPO)

  • Training objective: Policy optimization using multiple reward functions
  • Reward functions:
    • Format matching (exact and approximate)
    • Solution correctness evaluation (using Gemini-2.0-flash as reward model)
  • Learning rate: 5e-5
  • Max steps: 100
  • Temperature: 0.6
  • Generations per step: 4
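
A hedged sketch of how these settings and the two reward signals could be wired into TRL's GRPOTrainer is shown below. The reward-function bodies are simplified placeholders (the real correctness reward queried Gemini 2.0 Flash as a judge), `model` and `grpo_dataset` refer to the Stage 1 outputs, and argument names may differ slightly across TRL versions.

```python
# Sketch only: Stage 2 GRPO with TRL. `model` is the SFT-tuned model from Stage 1 and
# `grpo_dataset` the 100 filtered prompts; both are assumptions about the original script.
# Assumes a standard (non-conversational) prompt dataset, so completions arrive as plain strings.
import re
from trl import GRPOTrainer, GRPOConfig

FORMAT_RE = re.compile(r"<think>.*?</think>\s*```python.*?```", re.DOTALL)

def format_reward(completions, **kwargs):
    """Exact-format reward: 1.0 if the completion matches the <think> + ```python layout."""
    return [1.0 if FORMAT_RE.search(c) else 0.0 for c in completions]

def approx_format_reward(completions, **kwargs):
    """Approximate-format reward: partial credit for each required tag that appears."""
    tags = ["<think>", "</think>", "```python", "```"]
    return [sum(t in c for t in tags) / len(tags) for c in completions]

def correctness_reward(completions, **kwargs):
    """Placeholder for the LLM-judge reward (Gemini 2.0 Flash in the original setup)."""
    return [0.0 for _ in completions]  # replace with a real judge call before training

config = GRPOConfig(
    learning_rate=5e-5,
    max_steps=100,
    temperature=0.6,
    num_generations=4,
    output_dir="grpo-output",
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, approx_format_reward, correctness_reward],
    args=config,
    train_dataset=grpo_dataset,
)
trainer.train()
```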

Technical Specifications

  • Maximum sequence length: 8192 tokens
  • LoRA configuration:
    • Rank: 32
    • Alpha: 64 (2 × rank)
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Precision: 16-bit training
  • Hardware: NVIDIA A100 GPU (40 GB)

Evaluation

Testing Data, Factors & Metrics

LiveCodeBench Evaluation

The model was evaluated on LiveCodeBench problem set v1, focusing on code generation tasks.

Performance Comparison:

| Model | Pass@1 | Pass@5 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
|-------|--------|--------|-------------|---------------|-------------|
| Fine-tuned Model | 0.1885 (18.85%) | 0.2075 (20.75%) | 0.4239 (42.39%) | 0.0905 (9.05%) | 0.0 (0%) |
| Base Qwen2.5-3B | 0.1585 (15.85%) | 0.2175 (21.75%) | 0.3127 (31.27%) | 0.1131 (11.31%) | 0.0 (0%) |
| Difference (fine-tuned minus base, percentage points) | +3.00 | -1.00 | +11.12 | -2.26 | ±0.00 |

Key Improvements & Analysis:

  • Overall Pass@1: +3.00 percentage points over the base model, indicating more reliable single-attempt solutions
  • Easy Problems: +11.12 percentage points on easy problems, the largest gain, suggesting the structured reasoning format helps most on straightforward tasks
  • Trade-offs: small decreases in overall Pass@5 (-1.00 pp) and medium-difficulty Pass@1 (-2.26 pp), possibly because the learned reasoning pattern reduces solution diversity across samples
  • Hard Problems: both models remain at 0%, indicating the need for additional training data or techniques for the most challenging tasks
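
For reference, Pass@k figures of this kind are conventionally computed with the unbiased estimator of Chen et al. (2021). The snippet below is background on that metric, not the exact evaluation harness used for this card.

```python
# Unbiased pass@k estimator (Chen et al., 2021): probability that at least one of k
# samples drawn from n generated samples (of which c are correct) solves the problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer incorrect samples than the budget -> always solved
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 2 of them correct
print(round(pass_at_k(10, 2, 1), 4))  # 0.2
print(round(pass_at_k(10, 2, 5), 4))  # 0.7778
```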

Model Architecture & Reasoning Format

The model generates responses in a structured format:

<think>
[Step-by-step reasoning and problem analysis]
</think>
```python
[Python code solution]
```

This format encourages the model to:

  1. Think through the problem systematically
  2. Provide clear reasoning steps
  3. Generate clean, executable code solutions
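
A small helper like the one below (illustrative only, not part of any released code) can split a generation into its reasoning and code parts, for example before running the solution against test cases:

```python
# Illustrative parser for the structured output format above.
import re

def split_response(text: str):
    """Return (reasoning, code) from a generation; either part is None if missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    code = re.search(r"```python\s*(.*?)```", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        code.group(1).strip() if code else None,
    )

example = (
    "<think>Read two integers and print their sum.</think>\n"
    "```python\na, b = map(int, input().split())\nprint(a + b)\n```"
)
reasoning, code = split_response(example)
print(reasoning)  # Read two integers and print their sum.
print(code)       # a, b = map(int, input().split()) ...
```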

Technical Limitations and Biases

Biases

  • Dataset Bias: Inherits biases from the nvidia/OpenCodeReasoning dataset
  • Problem Type Bias: Optimized for competitive programming style problems
  • Language Bias: Strongly biased toward Python implementations

Additional Information

Not Recommended For

  • Production code generation without review
  • Complex software architecture decisions
  • Security-critical code implementation
  • Problems requiring extensive domain knowledge beyond basic algorithms

Model Access

  • Inference: Compatible with vLLM for fast inference
  • Format: LoRA adapters can be merged with base model or used separately
  • Hardware Requirements: Supports both CPU and GPU inference
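
As an illustration, the adapters could be merged with PEFT and the merged checkpoint served with vLLM roughly as follows. Paths, dtype, and sampling settings are placeholders, not a published repository layout.

```python
# Sketch: merge the LoRA adapters into the base model, then serve the merged model with vLLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "path/to/lora-adapters").merge_and_unload()
merged.save_pretrained("qwen2.5-3b-code-reasoning-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B").save_pretrained("qwen2.5-3b-code-reasoning-merged")

# Fast inference on the merged checkpoint with vLLM
from vllm import LLM, SamplingParams

llm = LLM(model="qwen2.5-3b-code-reasoning-merged", max_model_len=8192)
params = SamplingParams(temperature=0.6, max_tokens=2048)
prompt = "Solve the problem, answering in the <think>...</think> + ```python format.\n\nGiven n integers, print their sum."
print(llm.generate([prompt], params)[0].outputs[0].text)
```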

Citation

If you use this model in your research, please cite:

@misc{qwen25-3b-code-reasoning,
  title={Qwen2.5-3B Fine-tuned for Code Reasoning},
  author={[Your Name]},
  year={2025},
  howpublished={\url{[Your Model URL]}},
}

Model card created following the guidelines from Mitchell et al. (2019) and Hugging Face documentation.
