zengw committed · verified
Commit 017de69 · Parent: 3202f35

Update README.md

Files changed (1): README.md (+208 -3)

README.md CHANGED
---
license: apache-2.0
---

### UI-Venus

This repository contains the UI-Venus model from the report [UI-Venus: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833). UI-Venus is a native UI agent built on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. More inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks, including ScreenSpot-v2/Pro and AndroidWorld.

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Report](https://img.shields.io/badge/Report-Technical%20Report-blueviolet?logo=notion)](http://arxiv.org/abs/2508.10833)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green?logo=github)](https://github.com/inclusionAI/UI-Venus)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-orange?logo=huggingface)](https://huggingface.co/inclusionAI/UI-Venus-Ground-7B)

---

<p align="center">
📈 UI-Venus Benchmark Performance
</p>

<p align="center">
<img src="performance_venus.png" alt="UI-Venus Performance Across Datasets" width="1200" />
<br>
</p>

> **Figure:** Performance of UI-Venus across multiple benchmark datasets. UI-Venus achieves **state-of-the-art (SOTA)** results on key UI understanding and interaction benchmarks, including **ScreenSpot-Pro**, **ScreenSpot-v2**, **OSWorld-G**, **UI-Vision**, and **AndroidWorld**. These results demonstrate its strong capabilities in visual grounding, UI navigation, cross-platform generalization, and complex task reasoning.

### Model Description

UI-Venus is a multimodal UI agent built on Qwen2.5-VL that performs accurate UI grounding and navigation using only screenshots as input. The 7B and 72B variants achieve 94.1%/50.8% and 95.3%/61.9% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, surpassing prior SOTA models such as GTA1 and UI-TARS-1.5. On the AndroidWorld navigation benchmark, they achieve 49.1% and 65.9% success rates, respectively, demonstrating strong planning and generalization capabilities.

Key innovations include:

- **SOTA Open-Source UI Agent**: Publicly released to advance research in autonomous UI interaction and agent-based systems.
- **Reinforcement Fine-Tuning (RFT)**: Utilizes carefully designed reward functions for both grounding and navigation tasks (see the illustrative sketch after this list).
- **Efficient Data Cleaning**: Trained on several hundred thousand high-quality samples to ensure robustness.
- **Self-Evolving Trajectory History Alignment & Sparse Action Enhancement**: Improves reasoning coherence and action distribution for better long-horizon planning.
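
The reward functions themselves are specified in the technical report rather than in this README, so the following is only a minimal illustrative sketch of what an RFT-style grounding reward could look like: an IoU term for box overlap plus a small bonus for emitting a parseable `[x1,y1,x2,y2]`. The function names and the 0.1/0.9 weights are assumptions for exposition, not the paper's actual design.

```python
# Illustrative sketch only: a hypothetical RFT-style grounding reward.
# The actual UI-Venus reward functions are described in the technical report.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box, gt_box):
    """Format bonus for a parseable box, plus an overlap term (weights assumed)."""
    if pred_box is None:  # model output could not be parsed into a box
        return 0.0
    return 0.1 + 0.9 * iou(pred_box, gt_box)
```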

---

## Installation

First, install the required dependencies:

```bash
pip install transformers==4.49.0 qwen-vl-utils
```
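
The quick-start snippet below loads the model with `device_map="auto"` and `attn_implementation="flash_attention_2"`, which rely on the optional `accelerate` and `flash-attn` packages. If that assumption matches your environment, also install:

```bash
# needed for device_map="auto" and attn_implementation="flash_attention_2"
pip install accelerate
pip install flash-attn --no-build-isolation
```

If `flash-attn` is unavailable on your hardware, drop the `attn_implementation` argument and Transformers will fall back to its default attention implementation.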

---

## Quick Start

Use the shell scripts in the GitHub repository to launch the evaluation. The evaluation setup follows the same protocol as **ScreenSpot**, including data format, annotation structure, and metric calculation.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import os
import ast

# model path
model_name = "inclusionAI/UI-Venus-Ground-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name)

# greedy decoding for deterministic grounding output
generation_config = {
    "max_new_tokens": 2048,
    "do_sample": False,
    "temperature": 0.0
}

def inference(instruction, image_path):
    assert os.path.exists(image_path) and os.path.isfile(image_path), "Invalid input image path."

    prompt_origin = 'Outline the position corresponding to the instruction: {}. The output should be only [x1,y1,x2,y2].'
    full_prompt = prompt_origin.format(instruction)

    # bounds on the total number of image pixels after resizing
    min_pixels = 2000000
    max_pixels = 4800000

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image_path,
                    "min_pixels": min_pixels,
                    "max_pixels": max_pixels
                },
                {"type": "text", "text": full_prompt},
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    model_inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(**model_inputs, **generation_config)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    # The model predicts pixel coordinates in the resized input image;
    # divide by the input size (grid cells x 14-pixel patches) to get
    # normalized [0, 1] coordinates.
    try:
        box = ast.literal_eval(output_text[0])  # safer than eval() on model output
        input_height = model_inputs['image_grid_thw'][0][1].item() * 14
        input_width = model_inputs['image_grid_thw'][0][2].item() * 14
        norm_x1 = float(box[0]) / input_width
        norm_y1 = float(box[1]) / input_height
        norm_x2 = float(box[2]) / input_width
        norm_y2 = float(box[3]) / input_height
        bbox = [norm_x1, norm_y1, norm_x2, norm_y2]
    except Exception:
        bbox = [0, 0, 0, 0]

    # center of the predicted box, e.g. as a click target
    point = [(bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2]
    result_dict = {
        "result": "positive",
        "format": "x1y1x2y2",
        "raw_response": output_text,
        "bbox": bbox,
        "point": point
    }

    return result_dict
```
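
For reference, a minimal usage sketch: `screenshot.png` and the instruction are placeholders, and the normalized `point` returned by `inference` is scaled back to pixel coordinates on the original screenshot, e.g. to dispatch a click.

```python
# Hypothetical usage: the screenshot path and instruction are placeholders.
from PIL import Image

image_path = "screenshot.png"
result = inference("click the search button", image_path)

# bbox and point are normalized to [0, 1]; scale by the original
# image size to recover pixel coordinates.
width, height = Image.open(image_path).size
click_x = result["point"][0] * width
click_y = result["point"][1] * height
print(f"Predicted click point: ({click_x:.0f}, {click_y:.0f})")
```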

---

### Results on ScreenSpot-v2

| **Model** | **Mobile Text** | **Mobile Icon** | **Desktop Text** | **Desktop Icon** | **Web Text** | **Web Icon** | **Avg.** |
|--------------------------|-----------------|-----------------|------------------|------------------|--------------|--------------|----------|
| UI-TARS-1.5 | - | - | - | - | - | - | 94.2 |
| Seed-1.5-VL | - | - | - | - | - | - | 95.2 |
| GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 |
| Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 |
| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
| LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 |
| **UI-Venus-Ground-7B (Ours)** | **99.0** | **90.0** | **97.0** | **90.7** | **96.2** | **88.7** | **94.1** |
| **UI-Venus-Ground-72B (Ours)** | **99.7** | **93.8** | **95.9** | **90.0** | **96.2** | **92.6** | **95.3** |
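
The table above reports click accuracy per category. As a point of reference, a common ScreenSpot-style scoring rule (an assumption here, since this README does not spell it out) counts a prediction as correct when the predicted point falls inside the ground-truth element box:

```python
# Sketch of a ScreenSpot-style metric, assuming point-in-box scoring.
def point_in_box(point, box):
    """point = [x, y]; box = [x1, y1, x2, y2]; same coordinate space."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def accuracy(pred_points, gt_boxes):
    """Fraction of predicted points that land inside their ground-truth boxes."""
    hits = sum(point_in_box(p, b) for p, b in zip(pred_points, gt_boxes))
    return hits / len(gt_boxes)
```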

---

### Results on ScreenSpot-Pro

Performance comparison of GUI agent models across six task categories on **ScreenSpot-Pro**.
Scores are in percentage (%). `T` = Text, `I` = Icon.
`*`: reproduced; `†`: trained from UI-TARS-1.5-7B.

| Model | CAD (T/I) | Dev (T/I) | Creative (T/I) | Scientific (T/I) | Office (T/I) | OS (T/I) | Avg T | Avg I | **Overall** | Type |
|-------|-----------|-----------|----------------|------------------|--------------|----------|-------|-------|-------------|------|
| GPT-4o | 2.0 / 0.0 | 1.3 / 0.0 | 1.0 / 0.0 | 2.1 / 0.0 | 1.1 / 0.0 | 0.0 / 0.0 | 1.3 | 0.0 | 0.8 | Closed |
| Claude Computer Use | 14.5 / 3.7 | 22.0 / 3.9 | 25.9 / 3.4 | 33.9 / 15.8 | 30.1 / 16.3 | 11.0 / 4.5 | 23.4 | 7.1 | 17.1 | Closed |
| UI-TARS-1.5 | – / – | – / – | – / – | – / – | – / – | – / – | – | – | **61.6** | Closed |
| Seed1.5-VL | – / – | – / – | – / – | – / – | – / – | – / – | – | – | 60.9 | Closed |
| Qwen2.5-VL-7B\* | 16.8 / 1.6 | 46.8 / 4.1 | 35.9 / 7.7 | 49.3 / 7.3 | 52.5 / 20.8 | 37.4 / 6.7 | 38.9 | 7.1 | 26.8 | SFT |
| Qwen2.5-VL-72B\* | 54.8 / 15.6 | 65.6 / 16.6 | 63.1 / 19.6 | 78.5 / 34.5 | 79.1 / 47.2 | 66.4 / 29.2 | 67.3 | 25.0 | 51.2 | SFT |
| UI-TARS-7B | 20.8 / 9.4 | 58.4 / 12.4 | 50.0 / 9.1 | 63.9 / 31.8 | 63.3 / 20.8 | 30.8 / 16.9 | 47.8 | 16.2 | 35.7 | SFT |
| UI-TARS-72B | 18.8 / 12.5 | 62.9 / 17.2 | 57.1 / 15.4 | 64.6 / 20.9 | 63.3 / 26.4 | 42.1 / 15.7 | 50.9 | 17.6 | 38.1 | SFT |
| Phi-Ground-7B | 26.9 / 17.2 | 70.8 / 16.7 | 56.6 / 13.3 | 58.0 / 29.1 | 76.4 / 44.0 | 55.1 / 25.8 | 56.4 | 21.8 | 43.2 | RL |
| UI-TARS-1.5-7B | – / – | – / – | – / – | – / – | – / – | – / – | – | – | 49.6 | RL |
| GTA1-7B† | 53.3 / 17.2 | 66.9 / 20.7 | 62.6 / 18.2 | 76.4 / 31.8 | 82.5 / 50.9 | 48.6 / 25.9 | 65.5 | 25.2 | 50.1 | RL |
| GTA1-72B | 56.9 / 28.1 | 79.9 / 33.1 | 73.2 / 20.3 | 81.9 / 38.2 | 85.3 / 49.1 | 73.8 / 39.1 | 74.5 | 32.5 | 58.4 | RL |
| **UI-Venus-Ground-7B** | 60.4 / 21.9 | 74.7 / 24.1 | 63.1 / 14.7 | 76.4 / 31.8 | 75.7 / 41.5 | 49.5 / 22.5 | 67.1 | 24.3 | **50.8** | Ours (RL) |
| **UI-Venus-Ground-72B** | 66.5 / 29.7 | 84.4 / 33.1 | 73.2 / 30.8 | 84.7 / 42.7 | 83.1 / 60.4 | 75.7 / 36.0 | 77.4 | 36.8 | **61.9** | Ours (RL) |

> **Experimental results show that UI-Venus-Ground-72B achieves state-of-the-art performance on ScreenSpot-Pro with an average score of 61.9, while also setting new benchmarks on ScreenSpot-v2 (95.3), OSWorld-G (69.8), AgentCPM (84.7), and UI-Vision (38.0), highlighting its effectiveness in complex visual grounding and action prediction tasks.**

## Citation

Please consider citing our work if you find it useful:

```bibtex
@misc{gu2025uivenustechnicalreportbuilding,
      title={UI-Venus Technical Report: Building High-performance UI Agents with RFT},
      author={Zhangxuan Gu and Zhengwen Zeng and Zhenyu Xu and Xingran Zhou and Shuheng Shen and Yunfei Liu and Beitong Zhou and Changhua Meng and Tianyu Xia and Weizhi Chen and Yue Wen and Jingya Dou and Fei Tang and Jinzhen Lin and Yulin Liu and Zhenlin Guo and Yichen Gong and Heng Jia and Changlong Gao and Yuan Guo and Yong Deng and Zhenyu Guo and Liang Chen and Weiqiang Wang},
      year={2025},
      eprint={2508.10833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.10833},
}
```