# OpenRubrics/RubricRM-8B-Judge
This is an 8B RubricRM-Judge model, fine-tuned from [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "OpenRubrics/RubricRM-8B-Judge"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```
To run the judge, build the message in the following format.
Here, `rubric` should be generated by a `RubricRM-Rubric` model.
```python
JUDGE_PROMPT_TEMPLATE = (
    "You are a fair and impartial judge. Your task is to evaluate 'Response A' and 'Response B' "
    "based on a given instruction and a rubric. You will conduct this evaluation in distinct "
    "phases as outlined below.\n\n"
    "### Phase 1: Compliance Check Instructions\n"
    "First, identify the single most important, objective 'Gatekeeper Criterion' from the rubric.\n"
    "- **A rule is objective (and likely a Gatekeeper) if it can be verified without opinion. "
    "Key examples are: word/paragraph limits, required output format (e.g., JSON validity), "
    "required/forbidden sections, or forbidden content.**\n"
    "- **Conversely, a rule is subjective if it requires interpretation or qualitative judgment. "
    "Subjective rules about quality are NOT Gatekeepers. Examples include criteria like \"be creative,\" "
    "\"write clearly,\" \"be engaging,\" or \"use a professional tone.\"**\n\n"
    "### Phase 2: Analyze Each Response\n"
    "Next, for each Gatekeeper Criterion and all other criteria in the rubric, evaluate each "
    "response item by item.\n\n"
    "### Phase 3: Final Judgment Instructions\n"
    "Based on the results from the previous phases, determine the winner using these simple rules. "
    "Provide a final justification explaining your decision first and then give your decision.\n\n"
    "---\n"
    "### REQUIRED OUTPUT FORMAT\n"
    "You must follow this exact output format below.\n\n"
    "--- Compliance Check ---\n"
    "Identified Gatekeeper Criterion: <e.g., Criterion 1: Must be under 50 words.>\n\n"
    "--- Analysis ---\n"
    "**Response A:**\n"
    "- Criterion 1 [Hard Rule]: Justification: <...>\n"
    "- Criterion 2 [Hard Rule]: Justification: <...>\n"
    "- Criterion 3 [Principle]: Justification: <...>\n"
    "- ... (and so on for all other criteria)\n\n"
    "**Response B:**\n"
    "- Criterion 1 [Hard Rule]: Justification: <...>\n"
    "- Criterion 2 [Hard Rule]: Justification: <...>\n"
    "- Criterion 3 [Principle]: Justification: <...>\n"
    "- ... (and so on for all other criteria)\n\n"
    "--- Final Judgment ---\n"
    "Justification: <...>\n"
    "Winner: <Response A / Response B>\n\n\n"
    "Task to Evaluate:\n"
    "Instruction:\n{instruction}\n\n"
    "Rubric:\n{rubric}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}"
)

# `instruction`, `rubric`, `response_a`, and `response_b` are strings you supply.
user_text = JUDGE_PROMPT_TEMPLATE.format(
    instruction=instruction,
    rubric=rubric,
    response_a=response_a,
    response_b=response_b,
)

messages_list = [
    {"role": "user", "content": user_text},
]

# Render the chat prompt as a string, with thinking mode disabled.
message = tok.apply_chat_template(
    messages_list,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Remaining step: Use either HF or vLLM for evaluation.
# ...
# ...
```
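The remaining step can be sketched roughly as follows with plain Hugging Face generation. This is a minimal illustration, not the authors' evaluation script: the helper names `generate_judgment` and `parse_winner`, the `max_new_tokens=2048` budget, and the choice of greedy decoding are all assumptions. The parser relies only on the `Winner:` line defined in the required output format above.

```python
import re
from typing import Optional


def generate_judgment(model, tok, message: str, max_new_tokens: int = 2048) -> str:
    """Greedy-decode the judge's reply for an already-templated prompt string.

    Hypothetical helper: `model` and `tok` are the objects loaded in Usage,
    and `max_new_tokens` is an assumed budget, not a documented setting.
    """
    # The chat template already inserted special tokens, so do not add more.
    inputs = tok(message, return_tensors="pt", add_special_tokens=False).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)


def parse_winner(judgment: str) -> Optional[str]:
    """Return "A" or "B" from the last `Winner:` line, or None if none is found."""
    # Tolerates optional bold markers or an opening angle bracket after "Winner:".
    matches = re.findall(r"Winner:\s*\**\s*<?\s*Response\s+([AB])\b", judgment)
    return matches[-1] if matches else None


# Hand-written text in the required output format (not real model output).
sample = (
    "--- Final Judgment ---\n"
    "Justification: Response B exceeds the word limit.\n"
    "Winner: Response A"
)
print(parse_winner(sample))  # A
```

With the model loaded, `parse_winner(generate_judgment(model, tok, message))` would yield the verdict. Treat a `None` result as an unparseable judgment and do not default it to either response.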
If you find our work helpful, please consider citing our paper:
```bibtex
@misc{liu2025openrubricsscalablesyntheticrubric,
title={OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment},
author={Tianci Liu and Ran Xu and Tony Yu and Ilgee Hong and Carl Yang and Tuo Zhao and Haoyu Wang},
year={2025},
eprint={2510.07743},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.07743},
}
```