The FinBen: An Holistic Financial Benchmark for Large Language Models Paper • 2402.12659 • Published Feb 20, 2024 • 16
Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs Paper • 2312.17080 • Published Dec 28, 2023 • 1
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding Paper • 1804.07461 • Published Apr 20, 2018 • 4
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent Paper • 2404.03648 • Published Apr 4, 2024 • 24
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments Paper • 2404.07972 • Published Apr 11, 2024 • 44
DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation Paper • 2404.07917 • Published Apr 11, 2024 • 1
Introducing v0.5 of the AI Safety Benchmark from MLCommons Paper • 2404.12241 • Published Apr 18, 2024 • 10
When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes Paper • 2404.12365 • Published Apr 18, 2024 • 1
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25, 2024 • 7