𝗔𝗿𝗲 𝗔𝗴𝗲𝗻𝘁𝘀 𝗰𝗮𝗽𝗮𝗯𝗹𝗲 𝗲𝗻𝗼𝘂𝗴𝗵 𝗳𝗼𝗿 𝗗𝗮𝘁𝗮 𝗦𝗰𝗶𝗲𝗻𝗰𝗲? ⇒ 𝗠𝗲𝗮𝘀𝘂𝗿𝗲 𝘁𝗵𝗲𝗶𝗿 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝘄𝗶𝘁𝗵 𝗗𝗦𝗕𝗲𝗻𝗰𝗵 📊
A team from Tencent AI wanted to evaluate agentic systems on data science (DS) tasks, but they noticed that existing agentic benchmarks were severely limited in several ways: they were restricted to text and did not include tables or images, were specific to certain packages, and only performed exact-match evaluation…
➡️ So they set out to build a much more exhaustive benchmark, to finally make the definitive DS agent benchmark.
𝗧𝗵𝗲 𝗗𝗦𝗕𝗲𝗻𝗰𝗵 𝗱𝗮𝘁𝗮𝘀𝗲𝘁
▪️ DSBench has 466 data analysis tasks and 74 data modeling tasks
▪️ The tasks are sourced from ModelOff and Kaggle, the platforms hosting the most popular data science competitions
▪️ Differences from previous DS benchmarks:
❶ This benchmark leverages various modalities on top of text: images, Excel files, tables
❷ Complex tables: sometimes several tables must be combined to answer one question
❸ The context is richer, with longer task descriptions.
▪️ Evaluation: the benchmark is scored with an LLM as a judge, using a specific prompt (see the sketch right after this list).
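To make the LLM-as-a-judge idea concrete, here is a minimal sketch of how such a scorer could look. Everything in it (the judge prompt, the gpt-4o judge model, the judge() helper) is an illustrative assumption, not the paper's actual prompt or setup:

```python
# Minimal LLM-as-a-judge sketch (illustrative only, not DSBench's actual prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a data analysis task.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Answer with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, reference: str, candidate: str) -> bool:
    """Return True if the judge LLM deems the candidate answer correct."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")

# Task accuracy is then the fraction of samples the judge marks correct:
# accuracy = sum(judge(q, ref, pred) for q, ref, pred in samples) / len(samples)
```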
𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀 𝗳𝗿𝗼𝗺 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁𝘀
▪️ Their evaluation confirms that using LLMs in an agent setup, for instance by letting them run a single step of code execution (a rough sketch of this setup follows below), is more costly (especially with multi-turn frameworks like AutoGen) but also much more performant than the vanilla LLM.
▪️ The sets of tasks solved by different models (like GPT-3.5 vs Llama-3-8B) have quite low overlap, which suggests that different models tend to try very different approaches.
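For illustration, here is a rough sketch of the "single code-execution step" setup mentioned above: the LLM writes pandas code for a task, the harness runs it once, and the printed output becomes the answer. The model name, prompt, and solve_with_one_code_step() helper are assumptions for illustration, not the paper's actual framework, and a real harness would sandbox the exec call:

```python
# Rough sketch of an LLM + one code-execution step (illustrative only).
import contextlib
import io

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def solve_with_one_code_step(task_description: str, data_path: str) -> str:
    """Ask the LLM for a Python snippet, execute it once, and return its printed output."""
    prompt = (
        f"Write Python code that loads '{data_path}' with pandas and answers:\n"
        f"{task_description}\n"
        "Print only the final answer."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    code = completion.choices[0].message.content
    if code.startswith("```"):
        # strip markdown fences if the model wrapped its code in them
        code = code.split("\n", 1)[1].rsplit("```", 1)[0]
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # no sandboxing here; a real evaluation harness must isolate this
    return buffer.getvalue().strip()
```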
This new benchmark is really welcome, can't wait to try transformers agents on it! 🤗
Read their full paper 👉 DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (2409.07703)