Please share with me the drugs you guys are smoking to make this leaderboard
Brother...
CodeLlama-13B-Base is at place 84 and Llama-3-8B-Instruct at place 90 below it, when Llama-3 is a 10x better coder.
And don't get me started on Phi-3-Mini-128K-Instruct being ranked better than both Codestral-22B-v0.1 and Gemma-2-27B-Instruct.
Like y'all have actually lost your ducking minds lol.
Tell me you don't test a single coding model by hand before putting it on a leaderboard without telling me you don't test a single model by hand before putting it on a leaderboard.
Hi,
I want to clarify that this benchmark focuses on (1) instruction following and (2) code function calling, in addition to general coding capability.
In addition, I want to highlight that:
- The dataset and evaluation setup are publicly available, and you can check out each task description yourself (see the sketch after this list). Let us know if you think they are appropriate.
- The dataset only assesses Python problem-solving skills.
- This leaderboard only presents the Hard subset for now. To help you understand the difference, I added the full results back again.
- The leaderboard is very up to date, covering many recent SOTA models; other leaderboards such as EvalPlus are not updated as frequently.
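For anyone who wants to browse the tasks directly, here is a minimal sketch, assuming the dataset is published as `bigcode/bigcodebench` on the Hugging Face Hub; the split and field names vary between releases, so the snippet simply prints whatever the first task contains rather than assuming a schema:

```python
# Minimal sketch for inspecting BigCodeBench tasks locally.
# Assumption: the dataset lives at "bigcode/bigcodebench" on the Hugging Face Hub;
# split names are versioned, so we just take the first split that is available.
from datasets import load_dataset

ds_dict = load_dataset("bigcode/bigcodebench")

# Pick the first available split and report its size.
split_name = next(iter(ds_dict))
tasks = ds_dict[split_name]
print(f"Split: {split_name}, number of tasks: {len(tasks)}")

# Print the fields of the first task with a short preview of each value,
# so you can see the task description and tests without guessing field names.
first = tasks[0]
for key, value in first.items():
    preview = str(value).replace("\n", " ")[:120]
    print(f"{key}: {preview}")
```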
Here is my response to your criticism:
- As shown on EvalPlus, CodeLlama-13B-Base scores 38.4 while Llama-3-8B-Instruct scores 56.7. On the full BigCodeBench-Complete, CodeLlama-13B-Base scores 32 and Llama-3-8B-Instruct 36.9.
- Phi-3-Mini-128K-Instruct was updated in June and performs much better on coding. Please refer to their model card.
- The old Phi-3-Mini-128K-Instruct is not better than either Codestral-22B-v0.1 or Gemma-2-27B-Instruct.
Cheers
Closing this issue, as there are no further updates.
Yeah, I just don't agree, and I didn't want to say anything mean. So that's fine.
Thank you for your understanding. IMHO, no single benchmark can completely satisfy everyone, whether it's for NLP or coding tasks. Please let us know if you have any ideas for improving the design.