---
license: mit
language:
- en
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
---

<font size=3><div align='center'>

[[**🤗 Model & Dataset**](https://huggingface.co/collections/gaotang/rm-r1-681128cdab932701cad844c8)]
[[**📊 Code**](https://github.com/RM-R1-UIUC/RM-R1)]
[[**📖 Paper**](https://arxiv.org/abs/2505.02387)]

</div></font>

# 🚀 Can we cast reward modeling as a reasoning task?

**RM-R1** is a training framework for *Reasoning Reward Models* (ReasRMs), which judge two candidate answers by first **thinking out loud** (generating rubrics or reasoning traces) and then emitting a preference.

Compared with prior scalar or vanilla generative reward models, RM-R1 delivers up to **+13.8%** absolute accuracy gains on public reward-model benchmarks while providing *fully interpretable* critiques.

## TL;DR

* **Two-stage training**
  1. **Distillation** of ~8.7K high-quality reasoning traces (Chain-of-Rubrics).
  2. **Reinforcement Learning with Verifiable Rewards (RLVR)** on ~64K preference pairs (a minimal reward sketch follows this list).
* **Backbones released**: 7B / 14B / 32B Qwen-2.5-Instruct variants plus DeepSeek-distilled checkpoints.
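
In the RLVR stage, the reward is rule-checkable rather than learned: a rollout earns credit only when its final verdict matches the annotated preference. Below is a minimal Python sketch of that idea, assuming the judge ends its critique with a bracketed verdict like `[[A]]` or `[[B]]` (an illustrative format, not necessarily the exact one used in the paper).

```python
import re

def verifiable_reward(completion: str, gold_label: str) -> float:
    """Return 1.0 if the judge's final bracketed verdict matches the
    human preference label, else 0.0 (a binary, rule-checkable reward)."""
    verdicts = re.findall(r"\[\[([AB])\]\]", completion)
    if not verdicts:
        return 0.0  # malformed or missing verdict earns no reward
    return 1.0 if verdicts[-1] == gold_label else 0.0

# Gold label says response B is preferred:
print(verifiable_reward("Rubric: ... Critique: ... Verdict: [[B]]", "B"))  # 1.0
print(verifiable_reward("No verdict emitted", "B"))                        # 0.0
```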

## Intended uses
|
|
* **RLHF / RLAIF**: plug-and-play reward function for policy optimisation. |
|
|
* **Automated evaluation**: LLM-as-a-judge for open-domain QA, chat, and reasoning. |
|
|
* **Research**: study process supervision, chain-of-thought verification, or rubric generation. |
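
For judging, a minimal sketch with Hugging Face `transformers` is shown below. The checkpoint id and prompt template are assumptions for illustration; pick an actual checkpoint from the collection linked above and use its documented chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for illustration; see the model collection for real checkpoints.
model_id = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = "What is the capital of France?"
answer_a = "The capital of France is Paris."
answer_b = "France's capital is Lyon."

# Illustrative pairwise-judging prompt: rubric and critique first, then a verdict.
prompt = (
    "Compare the two responses to the question below. Write a short rubric and "
    "critique, then give your final verdict as [[A]] or [[B]].\n\n"
    f"Question: {question}\n\nResponse A: {answer_a}\n\nResponse B: {answer_b}"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```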