kolerk committed · verified
Commit da79f9a · 1 Parent(s): f9db444

Update README.md

Files changed (1)
  1. README.md +93 -1
README.md CHANGED
@@ -10,4 +10,96 @@ base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
  pipeline_tag: image-text-to-text
  ---
- This is the model cited in the paper: [Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models](https://arxiv.org/abs/2505.16854).
+
+ # TON-Math
+ TON is a series of large vision-language models, built on Qwen2.5-VL, trained with our efficient algorithm to decide automatically whether to think or not before answering.
+ We apply Group Relative Policy Optimization (GRPO) for reinforcement learning, with "thought dropout" supervised fine-tuning as a preliminary step.
+
+ ## Introduction
+
+ Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO), a prominent recent method, encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by the human thinking process, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide *when reasoning is necessary*. To realize this, we propose *TON*, a two-stage training strategy:
+
+ 1. **(i)** A supervised fine-tuning (SFT) stage with a simple yet effective "**thought dropout**" operation, in which reasoning traces are randomly replaced with empty thoughts (see the sketch after this list). This introduces a think-or-not format that serves as a cold start for selective reasoning.
+ 2. **(ii)** A GRPO stage that lets the model freely explore when to think or not, while maximizing task-aware outcome rewards.
+
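+ The thought-dropout operation in stage (i) is conceptually simple. Below is a minimal, hypothetical sketch (not the released training code) of how an SFT target could randomly replace its reasoning trace with an empty thought; the `<think>`/`<answer>` tag format and the `dropout_prob` value are illustrative assumptions.
+
+ ```python
+ import random
+
+ def thought_dropout(reasoning: str, answer: str, dropout_prob: float = 0.5) -> str:
+     """Build an SFT target that either keeps the full reasoning trace
+     or replaces it with an empty thought (illustrative sketch only)."""
+     if random.random() < dropout_prob:
+         thought = "\n\n"        # empty thought: skip reasoning entirely
+     else:
+         thought = reasoning     # keep the full reasoning trace
+     return f"<think>{thought}</think>\n<answer>{answer}</answer>"
+
+ # Example: roughly half of the SFT targets will contain an empty thought.
+ print(thought_dropout("Angle A + angle B = 140, so angle CED = 40 ...", "40 degrees"))
+ ```
+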
+ Experimental results show that *TON* can *reduce completion length by up to **90%** compared with vanilla GRPO, without sacrificing performance, and in some cases even improving it*. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the *model progressively learns to bypass unnecessary reasoning steps as training advances*. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches.
+
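+ For readers unfamiliar with GRPO, the core of stage (ii) is a group-relative advantage: several responses are sampled per prompt, and each response's outcome reward is normalized against its group. The snippet below is a simplified illustration of that standard GRPO computation, not our training code.
+
+ ```python
+ from statistics import mean, stdev
+
+ def group_relative_advantages(rewards: list[float]) -> list[float]:
+     """Normalize per-response outcome rewards within one sampled group."""
+     mu = mean(rewards)
+     sigma = stdev(rewards) if len(rewards) > 1 else 0.0
+     return [(r - mu) / (sigma + 1e-6) for r in rewards]
+
+ # Example: 4 sampled responses for one prompt, rewarded 1.0 when the answer is correct.
+ print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
+ ```
+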
+ ## Quickstart
+
+ ```python
+ from PIL import Image
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+
+ example = {
+     "image": "./Geo170K/images/test/0.png",  # your image path
+     "problem": "As shown in the figure, in triangle ABC, it is known that angle A = 80.0, angle B = 60.0, DE parallel BC, then the size of angle CED is ()",
+ }
+
+ def make_conversation_image(example):
+     # Build a chat-style message list: an image placeholder plus the question text.
+     return [{
+         'role': 'user',
+         'content': [
+             {'type': 'image'},
+             {'type': 'text', 'text': example['problem']}
+         ]
+     }]
+
+ model_name = "kolerk/TON-3B-AITZ"
+
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_name)
+
+ # Render the chat template, then pack the prompt text and the image into model inputs.
+ messages = make_conversation_image(example)
+ text = processor.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ image = Image.open(example["image"])
+ model_inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=4096,
+     do_sample=True,
+     top_p=0.95,
+     top_k=1,
+     temperature=0.6
+ )
+ # Keep only the newly generated tokens (drop the prompt).
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
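+
+ To inspect the selective-reasoning behavior, you can check whether the generated response contains an empty thought. The snippet below assumes the reasoning is wrapped in `<think>...</think>` tags (as in the illustrative sketch in the Introduction); adjust the tags if your checkpoint emits a different format.
+
+ ```python
+ import re
+
+ def used_reasoning(response: str) -> bool:
+     """Return True if the model produced a non-empty thought before answering."""
+     match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
+     return bool(match) and match.group(1).strip() != ""
+
+ # `response` is the decoded string from the Quickstart example above.
+ print("model chose to think" if used_reasoning(response) else "model skipped reasoning")
+ ```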
+
+ ## Evaluation
+
+ For evaluation details, run our test script in the [code repository](https://github.com/kokolerk/TON/blob/main/src/eval/test_qwen25vl_geoqa.py).
+
+ ## Citation
+
+ If you find our work helpful, please consider citing us:
+
+ ```bibtex
+ @misc{wang2025think,
+       title={Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models},
+       author={Jiaqi Wang and Kevin Qinghong Lin and James Cheng and Mike Zheng Shou},
+       year={2025},
+       eprint={2505.16854},
+       archivePrefix={arXiv},
+       primaryClass={cs.AI}
+ }
+ ```