@singh96aman on Hugging Face: "𝗝𝘂𝗱𝗴𝗶𝗻𝗴 𝘁𝗵𝗲 𝗝𝘂𝗱𝗴𝗲𝘀: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁…"

Post

2086

𝗝𝘂𝗱𝗴𝗶𝗻𝗴 𝘁𝗵𝗲 𝗝𝘂𝗱𝗴𝗲𝘀: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗮𝗻𝗱 𝗩𝘂𝗹𝗻𝗲𝗿𝗮𝗯𝗶𝗹𝗶𝘁𝗶𝗲𝘀 𝗶𝗻 𝗟𝗟𝗠𝘀-𝗮𝘀-𝗝𝘂𝗱𝗴𝗲𝘀
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (2406.12624)

𝐂𝐚𝐧 𝐋𝐋𝐌𝐬 𝐬𝐞𝐫𝐯𝐞 𝐚𝐬 𝐫𝐞𝐥𝐢𝐚𝐛𝐥𝐞 𝐣𝐮𝐝𝐠𝐞𝐬 ⚖️?

We aim to identify the right metrics for evaluating Judge LLMs and understand their sensitivities to prompt guidelines, engineering, and specificity. With this paper, we want to raise caution ⚠️ to blindly using LLMs as human proxy.

Blog - https://huggingface.co/blog/singh96aman/judgingthejudges
Arxiv - https://arxiv.org/abs/2406.12624
Tweet - https://x.com/iamsingh96aman/status/1804148173008703509

@singh96aman @kartik727 @Srinik-1 @sankaranv @dieuwkehupkes