singh96aman
posted an update Jun 24
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (2406.12624)

Can LLMs serve as reliable judges ⚖️?

We aim to identify the right metrics for evaluating Judge LLMs and to understand their sensitivity to prompt guidelines, engineering, and specificity. With this paper, we want to caution ⚠️ against blindly using LLMs as a proxy for human judgment.

Blog - https://huggingface.co/blog/singh96aman/judgingthejudges
Arxiv - https://arxiv.org/abs/2406.12624
Tweet - https://x.com/iamsingh96aman/status/1804148173008703509

@singh96aman @kartik727 @Srinik-1 @sankaranv @dieuwkehupkes