arxiv:2407.15711

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Published on Jul 22 · Submitted by Ori on Jul 23

Abstract

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.
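The contrast between performing well overall and having low precision can be illustrated with a small sketch. It assumes, as a reading of the abstract rather than a confirmed definition, that accuracy averages per-task scores over all tasks with abstentions counted as zero, while precision averages only over tasks the system chose to answer. The function names and the example scores below are hypothetical and are not results from the paper.

```python
from typing import Optional, Sequence


def accuracy(scores: Sequence[Optional[float]]) -> float:
    """Mean score over ALL tasks; an abstention (None) counts as 0 (assumed definition)."""
    if not scores:
        return 0.0
    return sum(0.0 if s is None else s for s in scores) / len(scores)


def precision(scores: Sequence[Optional[float]]) -> float:
    """Mean score over ANSWERED tasks only (assumed definition)."""
    answered = [s for s in scores if s is not None]
    if not answered:
        return 0.0
    return sum(answered) / len(answered)


def answer_rate(scores: Sequence[Optional[float]]) -> float:
    """Fraction of tasks for which the system produced an answer."""
    if not scores:
        return 0.0
    return sum(s is not None for s in scores) / len(scores)


# Hypothetical per-task scores in [0, 1]; None marks an abstention.
always_answers = [1.0, 0.0, 0.0, 0.0]     # answers everything, often wrongly
often_abstains = [1.0, None, None, None]  # answers only when it found support

for name, scores in [("always_answers", always_answers),
                     ("often_abstains", often_abstains)]:
    print(name, accuracy(scores), precision(scores), answer_rate(scores))
```

Under these assumptions both hypothetical systems have the same accuracy, but the one that always answers has much lower precision, which is the pattern the abstract attributes to closed-book LMs that hallucinate facts.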

Community

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Benchmark to evaluate the ability of web agents to solve realistic and time-consuming tasks.

Hi @Ori cool work!

I opened a PR to link the dataset: https://huggingface.co/datasets/AssistantBench/AssistantBench/discussions/2.

Moreover, in case there's a Space, it could similarly be linked to the paper :)

Cheers,

Niels
