---
datasets:
- OpenMMReasoner/OpenMMReasoner-SFT-874K
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

<div align="center">

[Hugging Face Collection](https://huggingface.co/collections/lmms-lab/openmmreasoner) | [arXiv](https://arxiv.org)

</div>

## Overview

Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research.

In this work, we introduce **OpenMMReasoner**, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.

## Model Card

This model is the cold-start (SFT-stage) version of OpenMMReasoner, trained on the [OpenMMReasoner-SFT-874K](https://huggingface.co/datasets/OpenMMReasoner/OpenMMReasoner-SFT-874K) dataset.

## Basic Usage

We present a basic inference example for our model below. The model can be used just like Qwen2.5-VL-7B-Instruct, including with vLLM. For more details on usage and evaluation, please visit our [GitHub](https://github.com/EvolvingLMMs-Lab/OpenMMReasoner) repository.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

SYSTEM_PROMPT = (
    "You are a helpful assistant. When the user asks a question, your response must include two parts: "
    "first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. "
    "Please provide a clear, concise response within <answer> </answer> tags that directly addresses the question."
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OpenMMReasoner/OpenMMReasoner-ColdStart", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": SYSTEM_PROMPT},
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
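Since the system prompt instructs the model to wrap its reasoning in `<think>...</think>` and the final answer in `<answer>...</answer>` tags, you will often want to extract just the answer from the decoded text. Below is a minimal sketch of one way to do this; the helper name `extract_answer` and the regex-based approach are our own illustration, not part of the official repository:

```python
import re

def extract_answer(response: str) -> str:
    """Return the text inside the last <answer>...</answer> block.

    Falls back to the full response if no answer tags are found.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else response.strip()

# Example on a mock response shaped like the model's expected output format:
mock = "<think>The image shows a dog on a beach.</think><answer>A dog on a beach.</answer>"
print(extract_answer(mock))  # A dog on a beach.
```

The `re.DOTALL` flag lets the match span multiple lines, since the reasoning-tuned model may emit multi-line answers.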