SiriusL committed
Commit e524043 (verified) · 1 Parent(s): 39d7903

Update README.md

Files changed (1)
  1. README.md +97 -3
README.md CHANGED
@@ -1,3 +1,97 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+ # InfiGUIAgent-2B-Stage1
+
+ This repository contains the **Stage 1 model** from the [InfiGUIAgent](https://arxiv.org/pdf/2501.04575) paper. The model is based on `Qwen2-VL-2B-Instruct` and enhanced with Supervised Fine-Tuning (SFT) on extensive GUI task data to improve fundamental GUI understanding capabilities.
+
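+ ## Quick Start
+
+ ### Installation
+ First, install the required dependencies. The example below also imports `torch`, `cv2` (from `opencv-python`), and `requests`; `device_map="auto"` needs `accelerate`, and `attn_implementation="flash_attention_2"` additionally needs the `flash-attn` package:
+ ```bash
+ pip install transformers qwen-vl-utils torch accelerate opencv-python requests
+ ```
+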
+ ### GUI Element Localization Example
+ ```python
+ import cv2
+ import json
+ import torch
+ import requests
+ from PIL import Image
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load model and processor
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     "Reallm-Labs/InfiGUIAgent-2B-Stage1",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",  # requires flash-attn; omit to use the default attention
+     device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained("Reallm-Labs/InfiGUIAgent-2B-Stage1")
+
+ # Prepare inputs
+ img_url = "https://raw.githubusercontent.com/Reallm-Labs/InfiGUIAgent/main/images/test_img.png"
+ # Literal braces are doubled so that str.format() only substitutes {instruction}
+ prompt_template = """Output the relative coordinates of the icon, widget, or text most closely related to "{instruction}" in this screenshot, in the format of "{{"x": x, "y": y}}", where x and y are in the positive directions of horizontal left and vertical down respectively, with the origin at the top left corner, and the range is 0-1000."""
+
+ # Download image
+ response = requests.get(img_url, timeout=30)
+ response.raise_for_status()
+ with open("test_img.png", "wb") as f:
+     f.write(response.content)
+
+ # Build message template
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "image": "test_img.png"},
+         {"type": "text", "text": prompt_template.format(instruction="View detailed storage space usage")},
+     ]
+ }]
+
+ # Process and generate
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ # Drop the prompt tokens so that only the newly generated tokens are decoded
+ output_text = processor.batch_decode(
+     [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False
+ )[0]
+
+ # Visualize results
+ try:
+     coords = json.loads(output_text)
+     img = cv2.imread("test_img.png")
+     height, width = img.shape[:2]
+     # The model outputs relative coordinates (0-1000); scale them to pixels
+     x = int(coords['x'] * width / 1000)
+     y = int(coords['y'] * height / 1000)
+
+     cv2.circle(img, (x, y), 10, (0, 0, 255), -1)
+     cv2.putText(img, f"({coords['x']}, {coords['y']})", (x+10, y-10),
+                 cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
+     cv2.imwrite("output.png", img)
+ except Exception as err:
+     print("Error: Failed to parse coordinates or process image:", err)
+
+ print("Predicted coordinates:", output_text)
+ ```
+
+ ## Limitations
+ This is a **Stage 1 model** focused on establishing fundamental GUI understanding capabilities. It may demonstrate suboptimal performance on:
+ - Complex reasoning tasks
+ - Multi-step operations
+ - Abstract instruction following
+
+ ## Citation
+ ```bibtex
+ @article{liu2025infiguiagent,
+   title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
+   author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
+   journal={arXiv preprint arXiv:2501.04575},
+   year={2025}
+ }
+ ```
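
The localization example in this README passes the raw model output straight to `json.loads` and scales the result from the 0-1000 range to pixels. A minimal sketch of how that step could be made more tolerant is shown below, assuming the model may sometimes wrap the JSON object in extra text. The helper names `parse_relative_point` and `to_pixels` are hypothetical and are not part of the InfiGUIAgent repository or of any library.

```python
import json
import re


def parse_relative_point(output_text):
    """Return (x, y) in the 0-1000 range parsed from model output, or None."""
    # Assumption: the answer contains one flat JSON object such as {"x": 1, "y": 2}.
    match = re.search(r"\{.*?\}", output_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
        x, y = float(data["x"]), float(data["y"])
    except (ValueError, KeyError, TypeError):
        return None
    # Reject values outside the range stated in the prompt.
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return None
    return x, y


def to_pixels(point, width, height):
    """Scale a 0-1000 relative point to integer pixel coordinates inside the image."""
    x, y = point
    px = min(int(x * width / 1000), width - 1)
    py = min(int(y * height / 1000), height - 1)
    return px, py
```

With these helpers, the visualization step could call `parse_relative_point(output_text)` and, if the result is not `None`, pass it to `to_pixels` together with the image width and height before drawing. Clamping to `width - 1` and `height - 1` keeps a prediction of exactly 1000 inside the image bounds.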