Document & UI Intelligence - a johannhartmann Collection

Updated Jan 20, 2025

xlangai/Aguvis-7B-720P

8B • Updated Jan 7, 2025 • 45 • 9
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Paper • 2412.04454 • Published Dec 5, 2024 • 71
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Paper • 2401.10935 • Published Jan 17, 2024 • 5
cckevinn/SeeClick

Text Generation • 10B • Updated Jan 29, 2024 • 245 • 18
jadechoghari/Ferret-UI-Llama8b

Image-Text-to-Text • 8B • Updated Jan 8, 2025 • 216 • 68
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Paper • 2410.18967 • Published Oct 24, 2024 • 1
microsoft/OmniParser

Image-Text-to-Text • Updated Dec 2, 2024 • 339 • 1.7k
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Paper • 2501.04575 • Published Jan 8, 2025 • 25
showlab/ShowUI-2B

Updated Mar 11, 2025 • 2.5k • 269
AskUI/PTA-1

Image-Text-to-Text • 0.3B • Updated Nov 28, 2024 • 607 • 98