ferran-espuna committed
Commit ba0d108
1 Parent(s): 5b61f57

Update README.md

Added the correct robustness values to the LLM-as-a-Judge table

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -927,20 +927,20 @@ Further details on all tasks and criteria, a full list of results compared to ot
 
  | **Category** | **Dataset** | **Metric** | **es** | **ca** | **gl** | **eu** | **en** |
  |---------|---------|-----------|-------|-------|-------|-------|-------|
- | **Commonsense Reasoning** | **XStoryCloze** | Ending Coherence (1 to 5) | 2.36/0.63 | 2.49/0.51 | 2.45/0.59 | 2.30/0.52 | 3.06/0.50 |
- | **Paraphrasing** | **PAWS** | Paraphrase Completeness (0/1) | 0.60/0.07 | 0.54/0.09 | 0.64/0.10 | ----/---- | 0.79/0.05 |
- | | | Paraphrase Generation (1 to 5) | 2.89/0.54 | 2.71/0.55 | 2.80/0.57 | ----/---- | 3.64/0.37 |
- | | | Paraphrase Grammatical Correctness (0/1) | 0.74/0.03 | 0.68/0.05 | 0.78/0.06 | ----/---- | 0.89/0.03 |
- | **Reading Comprehension** | **Belebele** | Passage Comprehension (1 to 5) | 3.05/0.43 | 2.81/0.50 | 2.74/0.56 | 2.52/0.43 | 3.11/0.58 |
- | | | Answer Relevance (0/1) | 0.74/0.05 | 0.66/0.05 | 0.65/0.08 | 0.59/0.11 | 0.75/0.06 |
- | **Extreme Summarization** | **XLSum & caBreu & summarization_gl** | Extreme Summarization Informativeness (1 to 5) | 3.07/0.34 | 3.33/0.31 | 3.11/0.31 | ----/---- | 3.06/0.26 |
- | | | Extreme Summarization Conciseness (1 to 5) | 2.92/0.34 | 2.67/0.50 | 2.93/0.38 | ----/---- | 3.13/0.22 |
- | **Mathematics** | **mgsm** | Reasoning Capability (1 to 5) | 1.89/0.72 | 1.91/0.65 | 1.97/0.60 | 2.17/0.52 | 2.16/0.65 |
- | | | Mathematical Correctness (0/1) | 0.24/0.12 | 0.28/0.13 | 0.27/0.11 | 0.44/0.13 | 0.27/0.12 |
- | **Translation form Language** | **FLoRes** | Translation Fluency (1 to 5) | 3.74/0.11 | 3.69/0.15 | ----/---- | ----/---- | 3.69/0.14 |
- | | | Translation Accuracy (1 to 5) | 4.01/0.15 | 3.98/0.21 | ----/---- | ----/---- | 3.98/0.23 |
- | **Translation to Language** | **FLoRes** | Translation Fluency (1 to 5) | 3.75/0.11 | 3.69/0.14 | ----/---- | ----/---- | 4.09/0.14 |
- | | | Translation Accuracy (1 to 5) | 4.08/0.16 | 3.98/0.20 | ----/---- | ----/---- | 4.47/0.15 |
+ | **Commonsense Reasoning** | **XStoryCloze** | Ending Coherence (1 to 5) | 2.36/0.66 | 2.49/0.76 | 2.45/0.68 | 2.30/0.67 | 3.06/0.77 |
+ | **Paraphrasing** | **PAWS** | Paraphrase Completeness (0/1) | 0.60/0.15 | 0.54/0.17 | 0.64/0.14 | ----/---- | 0.79/0.11 |
+ | | | Paraphrase Generation (1 to 5) | 2.89/1.46 | 2.71/1.70 | 2.80/1.21 | ----/---- | 3.64/0.80 |
+ | | | Paraphrase Grammatical Correctness (0/1) | 0.74/0.13 | 0.68/0.15 | 0.78/0.10 | ----/---- | 0.89/0.07 |
+ | **Reading Comprehension** | **Belebele** | Passage Comprehension (1 to 5) | 3.05/0.60 | 2.81/0.66 | 2.74/0.78 | 2.52/0.46 | 3.11/0.71 |
+ | | | Answer Relevance (0/1) | 0.74/0.09 | 0.66/0.11 | 0.65/0.12 | 0.59/0.12 | 0.75/0.09 |
+ | **Extreme Summarization** | **XLSum & caBreu & summarization_gl** | Extreme Summarization Informativeness (1 to 5) | 3.07/0.39 | 3.33/0.43 | 3.11/0.36 | ----/---- | 3.06/0.35 |
+ | | | Extreme Summarization Conciseness (1 to 5) | 2.92/0.42 | 2.67/0.54 | 2.93/0.39 | ----/---- | 3.13/0.31 |
+ | **Mathematics** | **mgsm** | Reasoning Capability (1 to 5) | 1.89/0.47 | 1.91/0.45 | 1.97/0.43 | 2.17/0.44 | 2.16/0.56 |
+ | | | Mathematical Correctness (0/1) | 0.24/0.10 | 0.28/0.11 | 0.27/0.11 | 0.44/0.13 | 0.27/0.10 |
+ | **Translation form Language** | **FLoRes** | Translation Fluency (1 to 5) | 3.74/0.15 | 3.69/0.22 | ----/---- | ----/---- | 3.69/0.18 |
+ | | | Translation Accuracy (1 to 5) | 4.01/0.24 | 3.98/0.31 | ----/---- | ----/---- | 3.98/0.25 |
+ | **Translation to Language** | **FLoRes** | Translation Fluency (1 to 5) | 3.75/0.14 | 3.69/0.17 | ----/---- | ----/---- | 4.09/0.16 |
+ | | | Translation Accuracy (1 to 5) | 4.08/0.22 | 3.98/0.24 | ----/---- | ----/---- | 4.47/0.18 |
 
  ---
 
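
For reviewers who want to check which half of each cell this commit touches, a minimal Python sketch follows. It assumes each language cell has the form `first/second`, where the second number is the robustness value mentioned in the commit message, and that `----/----` marks a missing entry. The helper names `parse_cell` and `row_cells` are hypothetical and not part of the repository.

```python
from typing import List, Optional, Tuple

Cell = Optional[Tuple[float, float]]


def parse_cell(cell: str) -> Cell:
    """Split a 'first/second' cell into two floats; None for placeholders."""
    left, _, right = cell.strip().partition("/")
    try:
        return float(left), float(right)
    except ValueError:
        # e.g. '----/----' for languages without results
        return None


def row_cells(row: str) -> List[Cell]:
    """Return the parsed language cells (columns 4 onward) of one table row."""
    parts = [p.strip() for p in row.strip().strip("|").split("|")]
    return [parse_cell(p) for p in parts[3:]]


# First removed and added rows of the hunk, without the -/+ diff prefix.
old_row = ("| **Commonsense Reasoning** | **XStoryCloze** | Ending Coherence (1 to 5) "
           "| 2.36/0.63 | 2.49/0.51 | 2.45/0.59 | 2.30/0.52 | 3.06/0.50 |")
new_row = ("| **Commonsense Reasoning** | **XStoryCloze** | Ending Coherence (1 to 5) "
           "| 2.36/0.66 | 2.49/0.76 | 2.45/0.68 | 2.30/0.67 | 3.06/0.77 |")

for old, new in zip(row_cells(old_row), row_cells(new_row)):
    if old is not None and new is not None:
        # Assumption under test: the first number is unchanged, only the second moves.
        print(f"first unchanged: {old[0] == new[0]}, second: {old[1]} -> {new[1]}")
```

Running the same comparison over every removed/added row pair would show whether the commit altered anything besides the second number in each cell.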