---
license: mit
language:
- en
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/654d784d71a30c4bca09a319/Q7MVJfIHDerQ24c1zwZwK.png)

<font size=3><div align='center' >
[[**🤗 Model & Dataset**](https://huggingface.co/collections/gaotang/rm-r1-681128cdab932701cad844c8)]
[[**📊 Code**](https://github.com/RM-R1-UIUC/RM-R1)]
[[**📖 Paper**](https://arxiv.org/abs/2505.02387)]
</div></font>

# 🚀 Can we cast reward modeling as a reasoning task?

**RM-R1** is a training framework for *Reasoning Reward Models* (ReasRMs), which judge two candidate answers by first **thinking out loud**—generating rubrics or reasoning traces—then emitting a preference.
Compared with prior scalar or vanilla generative reward models, RM-R1 delivers up to **+13.8%** absolute accuracy gains on public reward-model benchmarks while providing *fully interpretable* critiques.

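Below is a minimal inference sketch using 🤗 Transformers. The checkpoint name, prompt layout, and the `[[A]]`/`[[B]]` verdict convention are illustrative assumptions rather than the exact format used in the paper; see the code repository linked above for the official prompts.

```python
# Minimal sketch: querying RM-R1 as a generative judge with 🤗 Transformers.
# The checkpoint name, prompt layout, and "[[A]]"/"[[B]]" verdict convention
# below are assumptions for illustration; consult the RM-R1 repo for the exact format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-7B"  # placeholder; use the actual checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = "What is the capital of Australia?"
answer_a = "The capital of Australia is Sydney."
answer_b = "The capital of Australia is Canberra."

# Assumed judge prompt: ask for rubrics/reasoning first, then a final verdict tag.
prompt = (
    "Please act as an impartial judge and decide which answer is better.\n\n"
    f"[Question]\n{question}\n\n"
    f"[Answer A]\n{answer_a}\n\n"
    f"[Answer B]\n{answer_b}\n\n"
    "Reason step by step (rubrics first), then give your final verdict as [[A]] or [[B]]."
)

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
critique = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(critique)  # reasoning trace that ends with a preference, e.g. "[[B]]"
```
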
## TL;DR
* **Two-stage training**
  1. **Distillation** of ~8.7K high-quality reasoning traces (Chain-of-Rubrics).
  2. **Reinforcement Learning with Verifiable Rewards** (RLVR) on ~64K preference pairs; a minimal sketch of such a verifiable reward follows below.

* **Backbones** released: 7B / 14B / 32B Qwen-2.5-Instruct variants + DeepSeek-distilled checkpoints.

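To illustrate the "verifiable" part of stage 2, the sketch below scores a judge output with a simple rule-based check against the ground-truth preference. The `[[A]]`/`[[B]]` verdict parsing and the +1/−1 values are assumptions for illustration, not the paper's exact reward scheme.

```python
import re

def verifiable_preference_reward(judge_output: str, ground_truth: str) -> float:
    """Rule-based (verifiable) reward for RLVR on preference pairs.

    Assumed convention: the judge ends its critique with a final verdict
    "[[A]]" or "[[B]]"; `ground_truth` is the human-preferred side ("A" or "B").
    The +1 / -1 values are illustrative, not the paper's exact scheme.
    """
    verdicts = re.findall(r"\[\[([AB])\]\]", judge_output)
    if not verdicts:                      # no parseable verdict -> treat as incorrect
        return -1.0
    return 1.0 if verdicts[-1] == ground_truth else -1.0

# Example: the critique ends with "[[B]]" and B is the preferred answer.
print(verifiable_preference_reward("...therefore the verdict is [[B]]", "B"))  # 1.0
```
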
## Intended uses
* **RLHF / RLAIF**: plug-and-play reward function for policy optimisation.
* **Automated evaluation**: LLM-as-a-judge for open-domain QA, chat, and reasoning.
* **Research**: study process supervision, chain-of-thought verification, or rubric generation.