ferran-espuna committed
Commit 5b61f57
1 Parent(s): 9b2dce0

Update README.md

Added LLM as a Judge section

Files changed (1)
  1. README.md +84 -0
README.md CHANGED
@@ -857,6 +857,90 @@ All results reported below are on a 0-shot setting.
</tbody>
</table>

### LLM-as-a-judge

We use [Prometheus-2 8x7B](https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0) as a judge to evaluate our model's responses. Tasks are created from existing multilingual evaluation datasets covering the same categories as those measured in our gold-standard benchmarks. For each language, we randomly select a subset of 250 instances from the `test` set of each source dataset. The responses are evaluated against task-specific criteria developed in-house for the _LLM-judge_; depending on the nature of the task and criterion, each criterion is scored either on a 5-point Likert scale or as a binary judgment.
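
For illustration only, a subset of this kind could be drawn with the Hugging Face `datasets` library; the snippet below is a sketch under assumed choices (the English MGSM configuration and a fixed seed), not the project's actual selection code.

```python
from datasets import load_dataset

# Hypothetical sketch: draw a 250-instance evaluation subset from a
# source dataset's test split (dataset name and seed are placeholders).
source = load_dataset("juletxara/mgsm", "en", split="test")
subset = source.shuffle(seed=42).select(range(min(250, len(source))))
print(len(subset))  # 250
```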

In addition to these criteria, we score the model's robustness by presenting the same source instance within three different prompts. We then compute the variance between the scores the _LLM-judge_ assigns to the model's responses to the three prompt styles and average it across all instances. Prompts are human-translated into all measured languages, and we do not provide the _LLM-judge_ with a reference answer.
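
A minimal sketch of this robustness computation, assuming population variance over each instance's three scores (the variance estimator is not specified here), could look as follows:

```python
from statistics import mean, pvariance

def robustness(judge_scores_per_instance):
    """Average, over all instances, of the variance of the judge's scores
    across the three prompt styles for one instance (population variance
    assumed); values closer to 0 mean more consistent responses."""
    return mean(pvariance(scores) for scores in judge_scores_per_instance)

# Example: three instances, each judged under the three prompt styles.
print(round(robustness([(4, 4, 5), (3, 3, 3), (2, 4, 3)]), 2))  # -> 0.3
```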

The _judge_ prompt we use during evaluation is the same one used to fine-tune the Prometheus-2 family. The _judge_ prompt and the criteria used to present the task prompts and model responses to the _LLM-judge_ are kept in English for evaluation across all languages. The _judge_ prompt is:

```python
"You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between {a} and {b}. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between {a} and {b})\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{input}

###Response to evaluate:
{prediction}

###Score Rubrics:
{criteria}

###Feedback:"
```
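
As a sketch of how such a template might be filled and its output parsed, the snippet below uses an abridged stand-in for the prompt above and hypothetical helper names; it is not the project's evaluation code.

```python
import re

# Abridged stand-in for the full judge prompt above, keeping only its
# placeholders: {a}, {b}, {input}, {prediction}, {criteria}.
JUDGE_TEMPLATE = (
    "...write a score that is an integer between {a} and {b}...\n\n"
    "###The instruction to evaluate:\n{input}\n\n"
    "###Response to evaluate:\n{prediction}\n\n"
    "###Score Rubrics:\n{criteria}\n\n"
    "###Feedback:"
)

def build_judge_prompt(task_prompt, model_response, criteria, a, b):
    """Fill the placeholders; (a, b) is (1, 5) for Likert criteria and (0, 1) for binary ones."""
    return JUDGE_TEMPLATE.format(a=a, b=b, input=task_prompt,
                                 prediction=model_response, criteria=criteria)

def parse_judge_score(judge_output):
    """Extract the integer that follows '[RESULT]' in the judge's reply."""
    match = re.search(r"\[RESULT\]\s*(\d+)", judge_output)
    return int(match.group(1)) if match else None

print(parse_judge_score("Feedback: Mostly sound reasoning. [RESULT] 4"))  # -> 4
```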

As an example, prompts for the Math task in English are based on instances from [MGSM](https://huggingface.co/datasets/juletxara/mgsm), and each instance is presented within these three prompt styles:

```python
"en": [
    ("I need help with this math problem: \"", "\" Give me the answer step by step and also the final result separately."),
    ("Can you please help me answer this? \"", "\" Explain the answer and give me the final result as well. Thanks."),
    ("Help me with this problem: \"", "\" I need the answer explained and the final result separately.")
]
```
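
For illustration, the sketch below wraps a made-up question (not an MGSM instance) in each of the (prefix, suffix) pairs above to produce the three prompt variants:

```python
# Illustrative only: wrap one question in each of the English prompt styles.
en_prompt_styles = [
    ("I need help with this math problem: \"",
     "\" Give me the answer step by step and also the final result separately."),
    ("Can you please help me answer this? \"",
     "\" Explain the answer and give me the final result as well. Thanks."),
    ("Help me with this problem: \"",
     "\" I need the answer explained and the final result separately."),
]

question = "A baker bakes 12 trays of 8 rolls each. How many rolls is that?"  # made-up question
prompts = [prefix + question + suffix for prefix, suffix in en_prompt_styles]
print(prompts[0])
```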

This task is then evaluated by the _LLM-judge_ using two criteria: reasoning capability (5-point Likert) and mathematical correctness (binary):

```python
reasoning_capability_criteria = {
    "reasoning_capability": """
[Does the model's answer demonstrate reasoning capability?]
Score 1: The answer demonstrates poor reasoning, with illogical arguments or conclusions that do not follow from the provided information.
Score 2: The answer shows weak reasoning, with some logical connections but also contains significant flaws or gaps in the argumentation.
Score 3: The answer demonstrates adequate reasoning, with generally logical arguments, but may have minor flaws or a lack of depth in the reasoning process.
Score 4: The answer shows strong reasoning, with well-structured arguments and conclusions that logically follow from the information provided.
Score 5: The answer demonstrates exceptional reasoning, with clear, coherent, and insightful arguments that are logically sound and well-supported by the information provided."""
}

mathematical_correctness_binary_criteria = {
    "mathematical_correctness_binary": """
[Is the model's answer mathematically correct?]
Score 0: The answer contains mathematical errors that render the solution incorrect or unreliable.
Score 1: The answer is mathematically correct, with accurate calculations and appropriate use of mathematical concepts."""
}
```
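
Our reading of the scoring bounds, sketched below, is that Likert criteria use `{a}=1, {b}=5` and binary criteria use `{a}=0, {b}=1` when filling the judge prompt; the mapping is an assumption for illustration, not code from the repository.

```python
# Assumed mapping (not taken from the repository) from criterion type to
# the {a}/{b} bounds substituted into the judge prompt shown earlier.
criterion_bounds = {
    "reasoning_capability": (1, 5),             # 5-point Likert scale
    "mathematical_correctness_binary": (0, 1),  # binary judgment
}

for name, (a, b) in criterion_bounds.items():
    print(f"{name}: the judge returns an integer between {a} and {b}")
```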

#### Multilingual results

Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion, and language. Binary criteria (marked `(0/1)` in the table) are scored from 0 to 1, where 1 is best; the remaining criteria are measured on a 5-point Likert scale, where 5 is best. In each cell, the first of the two numbers separated by `/` is the average score for the criterion (and language); the second is the robustness score, where values closer to 0 mean that the model generates similar responses across the three prompt varieties for a single instance.
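
To make the cell format explicit, the small sketch below reads one cell from the first table row:

```python
# Each cell is "average/robustness": the mean judge score for the criterion
# and language, and the mean per-instance variance across the three prompt
# styles (closer to 0 = more consistent). Example cell from the first row:
cell = "2.36/0.63"  # XStoryCloze, Ending Coherence, Spanish
average, robustness = (float(x) for x in cell.split("/"))
print(average, robustness)  # -> 2.36 0.63
```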

Further details on all tasks and criteria, a full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving during task implementation will soon be available in the technical report.

| **Category** | **Dataset** | **Criterion** | **es** | **ca** | **gl** | **eu** | **en** |
|---------|---------|-----------|-------|-------|-------|-------|-------|
| **Commonsense Reasoning** | **XStoryCloze** | Ending Coherence (1 to 5) | 2.36/0.63 | 2.49/0.51 | 2.45/0.59 | 2.30/0.52 | 3.06/0.50 |
| **Paraphrasing** | **PAWS** | Paraphrase Completeness (0/1) | 0.60/0.07 | 0.54/0.09 | 0.64/0.10 | ----/---- | 0.79/0.05 |
| | | Paraphrase Generation (1 to 5) | 2.89/0.54 | 2.71/0.55 | 2.80/0.57 | ----/---- | 3.64/0.37 |
| | | Paraphrase Grammatical Correctness (0/1) | 0.74/0.03 | 0.68/0.05 | 0.78/0.06 | ----/---- | 0.89/0.03 |
| **Reading Comprehension** | **Belebele** | Passage Comprehension (1 to 5) | 3.05/0.43 | 2.81/0.50 | 2.74/0.56 | 2.52/0.43 | 3.11/0.58 |
| | | Answer Relevance (0/1) | 0.74/0.05 | 0.66/0.05 | 0.65/0.08 | 0.59/0.11 | 0.75/0.06 |
| **Extreme Summarization** | **XLSum & caBreu & summarization_gl** | Extreme Summarization Informativeness (1 to 5) | 3.07/0.34 | 3.33/0.31 | 3.11/0.31 | ----/---- | 3.06/0.26 |
| | | Extreme Summarization Conciseness (1 to 5) | 2.92/0.34 | 2.67/0.50 | 2.93/0.38 | ----/---- | 3.13/0.22 |
| **Mathematics** | **MGSM** | Reasoning Capability (1 to 5) | 1.89/0.72 | 1.91/0.65 | 1.97/0.60 | 2.17/0.52 | 2.16/0.65 |
| | | Mathematical Correctness (0/1) | 0.24/0.12 | 0.28/0.13 | 0.27/0.11 | 0.44/0.13 | 0.27/0.12 |
| **Translation from Language** | **FLoRes** | Translation Fluency (1 to 5) | 3.74/0.11 | 3.69/0.15 | ----/---- | ----/---- | 3.69/0.14 |
| | | Translation Accuracy (1 to 5) | 4.01/0.15 | 3.98/0.21 | ----/---- | ----/---- | 3.98/0.23 |
| **Translation to Language** | **FLoRes** | Translation Fluency (1 to 5) | 3.75/0.11 | 3.69/0.14 | ----/---- | ----/---- | 4.09/0.14 |
| | | Translation Accuracy (1 to 5) | 4.08/0.16 | 3.98/0.20 | ----/---- | ----/---- | 4.47/0.15 |

---