AndroidLab: First ever systematic benchmark for Android mobile agents shows that small, fine-tuned open models can power a JARVIS system on your smartphone 📱🔥
A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.
They show that fine-tuning small open-source models can significantly boost performance, matching that of much bigger closed models like GPT-4o.
The team built:
- A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically
- A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces (see the sketch after this list)
- An instruction dataset of 10.5k operation traces for training mobile agents
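To make the two observation modes concrete, here is a minimal sketch of one agent step. It assumes a standard `adb` setup; the helper names (`get_xml_observation`, `get_screenshot`, `parse_action`) and the exact action vocabulary are illustrative, not the paper's released API.

```python
# Sketch of a mobile-agent step in the spirit of AndroidLab's two observation
# modes. Helper names and the action vocabulary are assumptions for illustration.
import re
import subprocess

def get_xml_observation() -> str:
    """Text-only mode: dump the current UI hierarchy as XML via adb."""
    subprocess.run(
        ["adb", "shell", "uiautomator", "dump", "/sdcard/ui.xml"], check=True
    )
    result = subprocess.run(
        ["adb", "shell", "cat", "/sdcard/ui.xml"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def get_screenshot(path: str = "screen.png") -> str:
    """Multimodal mode: capture a screenshot, to be annotated with element marks."""
    with open(path, "wb") as f:
        subprocess.run(["adb", "exec-out", "screencap", "-p"], stdout=f, check=True)
    return path

# The model is prompted to reply with exactly one action per step, e.g. "Tap(42)".
ACTION_RE = re.compile(r"(Tap|Swipe|Type|Back|Home)\((.*?)\)")

def parse_action(model_reply: str) -> tuple[str, str]:
    """Extract the (action, argument) pair from the model's reply."""
    match = ACTION_RE.search(model_reply)
    if match is None:
        raise ValueError(f"no action found in {model_reply!r}")
    return match.group(1), match.group(2)
```

Either observation (XML dump or marked screenshot) gets packed into the prompt, and the loop repeats: observe, query the model, parse, execute.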
Key insights:
- Fine-tuning improves performance BY A LOT: the open-source model Llama-3.1-8B jumps from a 2% to a 24% success rate after training, nearly matching GPT-4o despite being much smaller (a fine-tuning data sketch follows this list)
- Text-only agents match multimodal ones: XML-based agents achieve performance similar to screenshot-based multimodal agents.
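For the fine-tuning result, the rough recipe is to flatten each operation trace into per-step instruction examples. A minimal sketch, assuming a hypothetical trace schema (`task`, `steps`, `xml`, `action`) rather than the paper's released dataset format:

```python
# Flatten one operation trace into per-step chat examples for supervised
# fine-tuning. The trace schema here is an assumption for illustration.
from typing import Iterator

def trace_to_examples(trace: dict) -> Iterator[dict]:
    """Pair the task + current observation with the ground-truth action."""
    for step in trace["steps"]:
        yield {
            "messages": [
                {"role": "system",
                 "content": "You are an Android agent. Reply with exactly one action."},
                {"role": "user",
                 "content": f"Task: {trace['task']}\nCurrent UI (XML):\n{step['xml']}"},
                # The ground-truth action recorded in the trace, e.g. "Tap(42)"
                {"role": "assistant", "content": step["action"]},
            ]
        }
```

Examples in this chat format can be fed to any standard SFT pipeline (e.g., TRL's SFTTrainer) to train a model like Llama-3.1-8B.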
Read their paper here 👉 AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents (2410.24024)