Commit 3f0ff07 • committed by evaluate-bot • 1 parent: eb023e7
Update Space (evaluate main: af3c3056)

Files changed:
- README.md (+38 -3)
- comet.py (+36 -10)
- requirements.txt (+1 -1)
README.md CHANGED

@@ -36,7 +36,11 @@ reference = ["They were able to control the fire.", "Schools and kindergartens o
 comet_score = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
 ```
 
-It has several configurations, named after the COMET model to be used.
+It has several configurations, named after the COMET model to be used. For versions below 2.0 it will default to `wmt20-comet-da` (previously known as `wmt-large-da-estimator-1719`) and for the latest versions (>= 2.0) it will default to `Unbabel/wmt22-comet-da`.
+
+Alternative models that can be chosen include `wmt20-comet-qe-da`, `wmt21-comet-mqm`, `wmt21-cometinho-da`, `wmt21-comet-qe-mqm` and `emnlp20-comet-rank`. Notably, a distilled model is also available, which is 80% smaller and 2.128x faster while performing close to non-distilled alternatives. You can use it with the identifier `eamt22-cometinho-da`. This version, called Cometinho, was elected as [the best paper](https://aclanthology.org/2022.eamt-1.9) at the annual European conference on Machine Translation.
+
+> NOTE: In `unbabel-comet>=2.0` all models were moved to the Hugging Face Hub and you need to add the prefix `Unbabel/` to be able to download and use them. For example, for the distilled version replace `eamt22-cometinho-da` with `Unbabel/eamt22-cometinho-da`.
 
 It also has several optional arguments:
 
@@ -44,7 +48,7 @@ It also has several optional arguments:
 
 `progress_bar`: a boolean -- if set to `True`, progress updates will be printed out. The default value is `False`.
 
-More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/
+More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/index.html).
 
 ## Output values
 
@@ -107,9 +111,40 @@ Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, B
 
 Thus, results for language pairs containing uncovered languages are unreliable, as per the [COMET website](https://github.com/Unbabel/COMET)
 
-Also, calculating the COMET metric involves downloading the model from which features are obtained -- the default model, `
+Also, calculating the COMET metric involves downloading the model from which features are obtained -- the default model, `wmt22-comet-da`, takes over 2.32GB of storage space and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance `eamt22-cometinho-da` is 344MB.
+
+### Interpreting Scores:
+
+When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.
+
+In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.
+
+However, for the latest COMET models like `Unbabel/wmt22-comet-da`, we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.
+
+It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run statistical significance measures to reliably compare scores between systems.
 
 ## Citation
+```bibtex
+@inproceedings{rei-etal-2022-comet,
+    title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
+    author = "Rei, Ricardo and
+      C. de Souza, Jos{\'e} G. and
+      Alves, Duarte and
+      Zerva, Chrysoula and
+      Farinha, Ana C and
+      Glushkova, Taisiya and
+      Lavie, Alon and
+      Coheur, Luisa and
+      Martins, Andr{\'e} F. T.",
+    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.wmt-1.52",
+    pages = "578--585",
+}
+```
 
 ```bibtex
 @inproceedings{rei-EtAl:2020:WMT,
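The updated README above walks through choosing a configuration by COMET model name and passing optional arguments such as `progress_bar`. Below is a minimal usage sketch of that workflow, assuming `evaluate` and `unbabel-comet>=2.0` are installed (so model identifiers carry the `Unbabel/` prefix); the distilled `Unbabel/eamt22-cometinho-da` identifier and the example sentences come from the README itself, the rest is illustrative.

```python
# Minimal sketch based on the updated README; assumes evaluate and
# unbabel-comet>=2.0 are installed. The distilled Cometinho checkpoint
# keeps the download small compared to the default wmt22-comet-da.
import evaluate

source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
reference = ["They were able to control the fire.", "Schools and kindergartens opened"]

comet_metric = evaluate.load("comet", "Unbabel/eamt22-cometinho-da")
results = comet_metric.compute(
    predictions=hypothesis,
    references=reference,
    sources=source,
    progress_bar=True,  # optional argument documented in the README
)
print(results["mean_score"])  # system-level score
print(results["scores"])      # one score per translation
```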
comet.py CHANGED

@@ -22,7 +22,7 @@ Usage:
 from evaluate import load
 comet_metric = load('metrics/comet/comet.py')
 #comet_metric = load('comet')
-#comet_metric = load('comet', '
+#comet_metric = load('comet', 'Unbabel/wmt20-comet-da')
 
 
 source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
@@ -37,6 +37,7 @@ predictions['scores']
 import comet  # From: unbabel-comet
 import datasets
 import torch
+from packaging import version
 
 import evaluate
 
@@ -44,6 +45,25 @@ import evaluate
 logger = evaluate.logging.get_logger(__name__)
 
 _CITATION = """\
+@inproceedings{rei-etal-2022-comet,
+    title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
+    author = "Rei, Ricardo and
+      C. de Souza, Jos{\'e} G. and
+      Alves, Duarte and
+      Zerva, Chrysoula and
+      Farinha, Ana C and
+      Glushkova, Taisiya and
+      Lavie, Alon and
+      Coheur, Luisa and
+      Martins, Andr{\'e} F. T.",
+    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.wmt-1.52",
+    pages = "578--585",
+}
 @inproceedings{rei-EtAl:2020:WMT,
     author = {Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon},
     title = {Unbabel's Participation in the WMT20 Metrics Shared Task},
@@ -85,13 +105,11 @@ Args:
     `sources` (list of str): Source sentences
     `predictions` (list of str): candidate translations
     `references` (list of str): reference translations
-    `
-    `
-    `model`: COMET model to be used. Will default to `wmt-large-da-estimator-1719` if None.
+    `gpus` (int): Number of GPUs to use. 0 for CPU
+    `progress_bar` (bool): Flag that turns on and off the predict progress bar. Defaults to True
 
 Returns:
-
-    `scores`: List of scores.
+    Dict with all sentence-level scores (`scores` key) and a system-level score (`mean_score` key).
 
 Examples:
 
@@ -101,8 +119,8 @@ Examples:
     >>> hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
     >>> reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
     >>> results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
-    >>> print([round(v,
-    [0.
+    >>> print([round(v, 3) for v in results["scores"]])
+    [0.839, 0.972]
 """
 
 
@@ -125,6 +143,7 @@ class COMET(evaluate.Metric):
             codebase_urls=["https://github.com/Unbabel/COMET"],
             reference_urls=[
                 "https://github.com/Unbabel/COMET",
+                "https://aclanthology.org/2022.wmt-1.52/",
                 "https://www.aclweb.org/anthology/2020.emnlp-main.213/",
                 "http://www.statmt.org/wmt20/pdf/2020.wmt-1.101.pdf6",
             ],
@@ -132,7 +151,10 @@ class COMET(evaluate.Metric):
 
     def _download_and_prepare(self, dl_manager):
         if self.config_name == "default":
-            self.scorer = comet.load_from_checkpoint(comet.download_model("wmt20-comet-da"))
+            if version.parse(comet.__version__) >= version.parse("2.0.0"):
+                self.scorer = comet.load_from_checkpoint(comet.download_model("Unbabel/wmt22-comet-da"))
+            else:
+                self.scorer = comet.load_from_checkpoint(comet.download_model("wmt20-comet-da"))
         else:
             self.scorer = comet.load_from_checkpoint(comet.download_model(self.config_name))
 
@@ -141,5 +163,9 @@ class COMET(evaluate.Metric):
         gpus = 1 if torch.cuda.is_available() else 0
         data = {"src": sources, "mt": predictions, "ref": references}
         data = [dict(zip(data, t)) for t in zip(*data.values())]
-        scores, mean_score = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
+        if version.parse(comet.__version__) >= version.parse("2.0.0"):
+            output = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
+            scores, mean_score = output.scores, output.system_score
+        else:
+            scores, mean_score = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
         return {"mean_score": mean_score, "scores": scores}
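The key change in comet.py is the version-gated handling of the `unbabel-comet` API: version 2.x returns a `Prediction` object from `predict`, while 1.x returns a `(scores, system_score)` tuple. For readers who want the same behaviour outside `evaluate`, here is a standalone sketch that mirrors `_download_and_prepare` and `_compute` above, assuming `unbabel-comet`, `torch` and `packaging` are installed; the helper name and example data are illustrative.

```python
# Standalone sketch of the version-gated scoring path added in this commit.
# Assumes unbabel-comet, torch and packaging are installed; the COMET calls
# mirror the diff above, the helper name is illustrative.
import comet  # From: unbabel-comet
import torch
from packaging import version


def comet_score_standalone(sources, predictions, references):
    # Pick the default checkpoint the same way _download_and_prepare does.
    if version.parse(comet.__version__) >= version.parse("2.0.0"):
        scorer = comet.load_from_checkpoint(comet.download_model("Unbabel/wmt22-comet-da"))
    else:
        scorer = comet.load_from_checkpoint(comet.download_model("wmt20-comet-da"))

    gpus = 1 if torch.cuda.is_available() else 0
    data = [{"src": s, "mt": p, "ref": r} for s, p, r in zip(sources, predictions, references)]

    if version.parse(comet.__version__) >= version.parse("2.0.0"):
        # comet>=2.0 returns a Prediction object with .scores and .system_score.
        output = scorer.predict(data, gpus=gpus, progress_bar=False)
        return {"mean_score": output.system_score, "scores": output.scores}
    # comet<2.0 returns a (scores, system_score) tuple.
    scores, mean_score = scorer.predict(data, gpus=gpus, progress_bar=False)
    return {"mean_score": mean_score, "scores": scores}
```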
requirements.txt CHANGED

@@ -1,3 +1,3 @@
-git+https://github.com/huggingface/evaluate@
+git+https://github.com/huggingface/evaluate@af3c30561d840b83e54fc5f7150ea58046d6af69
 unbabel-comet
 torch