+---
+license: apache-2.0
+---
+# InfiGUIAgent-2B-Stage1
+This repository contains the **Stage 1 model** from the [InfiGUIAgent](https://arxiv.org/pdf/2501.04575) paper. The model is based on `Qwen2-VL-2B-Instruct` and was trained with Supervised Fine-Tuning (SFT) on extensive GUI task data to strengthen fundamental GUI understanding.
+## Quick Start
+### Installation
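+First, install the required dependencies. Besides `transformers` and `qwen-vl-utils`, the example below uses `torch`, `accelerate` (for `device_map="auto"`), `opencv-python`, and `requests`. The `flash-attn` package is needed only if you keep `attn_implementation="flash_attention_2"`:
+```bash
+pip install torch transformers accelerate qwen-vl-utils opencv-python requests
+```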
+### GUI Element Localization Example
+```python
+import cv2
+import json
+import torch
+import requests
+from PIL import Image
+from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+# Load model and processor
+model = Qwen2VLForConditionalGeneration.from_pretrained(
+    "Reallm-Labs/InfiGUIAgent-2B-Stage1",
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",  # requires flash-attn; remove this line to use the default attention
+    device_map="auto"
+)
+processor = AutoProcessor.from_pretrained("Reallm-Labs/InfiGUIAgent-2B-Stage1")
+# Prepare inputs
+img_url = "https://raw.githubusercontent.com/Reallm-Labs/InfiGUIAgent/main/images/test_img.png"
+# Literal braces are doubled so that str.format() only substitutes {instruction}
+prompt_template = """Output the relative coordinates of the icon, widget, or text most closely related to "{instruction}" in this screenshot, in the format of "{{"x": x, "y": y}}", where x and y are in the positive directions of horizontal left and vertical down respectively, with the origin at the top left corner, and the range is 0-1000."""
+# Download image
+response = requests.get(img_url, timeout=30)
+response.raise_for_status()  # fail early if the download did not succeed
+with open("test_img.png", "wb") as f:
+    f.write(response.content)
+# Build message template
+messages = [{
+    "role": "user",
+    "content": [
+        {"type": "image", "image": "test_img.png"},
+        {"type": "text", "text": prompt_template.format(instruction="View detailed storage space usage")},
+    ]
+}]
+# Process and generate
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+image_inputs, video_inputs = process_vision_info(messages)
+# Move inputs to the device chosen by device_map="auto" rather than hard-coding "cuda"
+inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+output_text = processor.batch_decode(
+    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
+    skip_special_tokens=True,
+    clean_up_tokenization_spaces=False
+)[0]
+# Visualize results
+try:
+    coords = json.loads(output_text)
+    img = cv2.imread("test_img.png")
+    if img is None:  # cv2.imread returns None instead of raising on failure
+        raise ValueError("could not read test_img.png")
+    height, width = img.shape[:2]
+    # Scale the 0-1000 relative coordinates to pixel coordinates
+    x = int(coords['x'] * width / 1000)
+    y = int(coords['y'] * height / 1000)
+    cv2.circle(img, (x, y), 10, (0, 0, 255), -1)
+    cv2.putText(img, f"({coords['x']}, {coords['y']})", (x+10, y-10),
+                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
+    cv2.imwrite("output.png", img)
+except (json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
+    print(f"Error: Failed to parse coordinates or process image: {e}")
+print("Predicted coordinates:", output_text)
+```
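The example above assumes the model returns a bare JSON object. If the output contains surrounding text, parsing fails. The following is a minimal sketch of a more tolerant parser, not part of the InfiGUIAgent code: the helper names `parse_relative_point` and `to_pixels` are hypothetical, and both the range check and the clamping to the image bounds are assumptions based on the 0-1000 range stated in the prompt.

```python
import json
import re


def parse_relative_point(output_text):
    """Extract the first {"x": ..., "y": ...} object from model output.

    Returns (x, y) in the 0-1000 relative range, or None if nothing parses.
    """
    # The output may wrap the JSON in extra text, so search for the first
    # brace-delimited span instead of parsing the whole string.
    match = re.search(r"\{[^{}]*\}", output_text)
    if match is None:
        return None
    try:
        coords = json.loads(match.group(0))
        x, y = float(coords["x"]), float(coords["y"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    # Assumption: values outside the prompt's 0-1000 range are treated as invalid.
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return None
    return x, y


def to_pixels(point, width, height):
    """Scale a 0-1000 relative point to integer pixel coordinates."""
    x, y = point
    # Assumption: clamp so that a value of exactly 1000 stays inside the image.
    px = min(int(x * width / 1000), width - 1)
    py = min(int(y * height / 1000), height - 1)
    return px, py
```

With these helpers, `parse_relative_point(output_text)` could replace the direct `json.loads` call, and `to_pixels(point, width, height)` could replace the two scaling lines before drawing.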
+## Limitations
+This is a **Stage 1 model** that focuses on fundamental GUI understanding. It may perform suboptimally on:
+- Complex reasoning tasks
+- Multi-step operations
+- Abstract instruction following
+## Citation
+```bibtex
+@article{liu2025infiguiagent,
+  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
+  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
+  journal={arXiv preprint arXiv:2501.04575},
+  year={2025}
+}
+```