Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Paper • 2404.05719 • Published Apr 8, 2024 • 80
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments Paper • 2404.07972 • Published Apr 11, 2024 • 46
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published Oct 30, 2024 • 46
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale Paper • 2409.08264 • Published Sep 12, 2024 • 43
ScreenAI: A Vision-Language Model for UI and Infographics Understanding Paper • 2402.04615 • Published Feb 7, 2024 • 38
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents Paper • 2410.05243 • Published Oct 7, 2024 • 16
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents Paper • 2401.10935 • Published Jan 17, 2024 • 4
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Paper • 2210.03347 • Published Oct 7, 2022 • 3
ScreenAgent: A Vision Language Model-driven Computer Control Agent Paper • 2402.07945 • Published Feb 9, 2024
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces Paper • 2306.00245 • Published May 31, 2023
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents Paper • 2406.10819 • Published Jun 16, 2024
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots Paper • 2209.08199 • Published Sep 16, 2022
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms Paper • 2410.18967 • Published Oct 24, 2024 • 1