AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Abstract
Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.
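For readers who want to try the tasks themselves, here is a minimal sketch of loading the benchmark from the Hugging Face Hub. It assumes the dataset ID `AssistantBench/AssistantBench` linked in the community discussion below and the standard `datasets` library interface; split and column names are inspected rather than assumed.

```python
# Minimal sketch (not from the paper): load AssistantBench from the Hugging Face Hub.
# Assumes the dataset ID "AssistantBench/AssistantBench" linked in the community
# discussion below; splits and columns are printed rather than assumed.
from datasets import load_dataset

dataset = load_dataset("AssistantBench/AssistantBench")

# Print the available splits, their sizes, and their column names to discover the task schema.
for split_name, split in dataset.items():
    print(split_name, len(split), split.column_names)
```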
Community
Benchmark to evaluate the ability of web agents to solve realistic and time-consuming tasks.
Hi @Ori, cool work!
I opened a PR to link the dataset: https://huggingface.co/datasets/AssistantBench/AssistantBench/discussions/2.
Also, if there's a Space, it could likewise be linked to the paper :)
Cheers,
Niels
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Tree Search for Language Model Agents (2024)
- Large Language Models Can Self-Improve At Web Agent Tasks (2024)
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (2024)
- Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning (2024)
- LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments (2024)