|
|
--- |
|
|
base_model: Qwen/Qwen3-14B |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- math |
|
|
metrics: |
|
|
- accuracy |
|
|
pipeline_tag: text-generation |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
|
|
|
# Qwen/Qwen3-14B-GRPO-MATH-1EPOCH |
|
|
|
|
|
**Description:** |
|
|
|
|
|
A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset. |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{zhao2025learning, |
|
|
title = {Learning to Reason without External Rewards}, |
|
|
author = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn}, |
|
|
journal = {arXiv preprint arXiv:2505.19590}, |
|
|
year = {2025} |
|
|
} |
|
|
``` |
|
|
|