evaluate-bot committed
Commit 3f0ff07
Parent: eb023e7

Update Space (evaluate main: af3c3056)

Files changed (3)
  1. README.md +38 -3
  2. comet.py +36 -10
  3. requirements.txt +1 -1
README.md CHANGED
@@ -36,7 +36,11 @@ reference = ["They were able to control the fire.", "Schools and kindergartens o
 comet_score = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
 ```
 
-It has several configurations, named after the COMET model to be used. It will default to `wmt20-comet-da` (previously known as `wmt-large-da-estimator-1719`). Alternate models that can be chosen include `wmt20-comet-qe-da`, `wmt21-comet-mqm`, `wmt21-cometinho-da`, `wmt21-comet-qe-mqm` and `emnlp20-comet-rank`. Notably, a distilled model is also available, which is 80% smaller and 2.128x faster while performing close to non-distilled alternatives. You can use it with the identifier `eamt22-cometinho-da`. This version, called Cometinho, was elected as [the best paper](https://aclanthology.org/2022.eamt-1.9) at the annual European conference on machine translation.
+It has several configurations, named after the COMET model to be used. For versions below 2.0 it defaults to `wmt20-comet-da` (previously known as `wmt-large-da-estimator-1719`); for the latest versions (>= 2.0) it defaults to `Unbabel/wmt22-comet-da`.
+
+Alternative models that can be chosen include `wmt20-comet-qe-da`, `wmt21-comet-mqm`, `wmt21-cometinho-da`, `wmt21-comet-qe-mqm` and `emnlp20-comet-rank`. Notably, a distilled model is also available, which is 80% smaller and 2.128x faster while performing close to non-distilled alternatives. You can use it with the identifier `eamt22-cometinho-da`. This version, called Cometinho, was elected as [the best paper](https://aclanthology.org/2022.eamt-1.9) at the annual European Conference on Machine Translation.
+
+> NOTE: In `unbabel-comet>=2.0` all models were moved to the Hugging Face Hub and you need to add the prefix `Unbabel/` to be able to download and use them. For example, for the distilled version replace `eamt22-cometinho-da` with `Unbabel/eamt22-cometinho-da`.
 
 It also has several optional arguments:
 
@@ -44,7 +48,7 @@ It also has several optional arguments:
 
 `progress_bar` a boolean -- if set to `True`, progress updates will be printed out. The default value is `False`.
 
-More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/models.html).
+More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/index.html).
 
 ## Output values
 
@@ -107,9 +111,40 @@ Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, B
 
 Thus, results for language pairs containing uncovered languages are unreliable, as per the [COMET website](https://github.com/Unbabel/COMET).
 
-Also, calculating the COMET metric involves downloading the model from which features are obtained -- the default model, `wmt20-comet-da`, takes over 1.79GB of storage space and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance `wmt21-cometinho-da` is 344MB.
+Also, calculating the COMET metric involves downloading the model from which features are obtained -- the default model, `wmt22-comet-da`, takes over 2.32GB of storage space and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance `eamt22-cometinho-da` is 344MB.
+
+### Interpreting Scores
+
+When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.
+
+In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.
+
+However, the latest COMET models like `Unbabel/wmt22-comet-da` are trained with a new approach that scales scores between 0 and 1, making them easier to interpret: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.
+
+When using COMET to compare the performance of two different translation systems, it is important to run statistical significance tests to reliably compare scores between systems.
 
 ## Citation
+```bibtex
+@inproceedings{rei-etal-2022-comet,
+    title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
+    author = "Rei, Ricardo and
+      C. de Souza, Jos{\'e} G. and
+      Alves, Duarte and
+      Zerva, Chrysoula and
+      Farinha, Ana C and
+      Glushkova, Taisiya and
+      Lavie, Alon and
+      Coheur, Luisa and
+      Martins, Andr{\'e} F. T.",
+    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.wmt-1.52",
+    pages = "578--585",
+}
+```
 
 ```bibtex
 @inproceedings{rei-EtAl:2020:WMT,
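The README changes above swap the default model and document the new 0-to-1 score range. As a minimal sketch of the documented usage (identifiers taken from the README itself; assumes `unbabel-comet>=2.0`, so model names carry the `Unbabel/` prefix):

```python
# Sketch of the usage documented in the README above (assumes evaluate and
# unbabel-comet>=2.0 are installed; identifiers come from the README).
import evaluate

# The distilled Cometinho model (~344MB) is a lighter alternative to the
# ~2.32GB default Unbabel/wmt22-comet-da.
comet_metric = evaluate.load("comet", "Unbabel/eamt22-cometinho-da")

source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
reference = ["They were able to control the fire.", "Schools and kindergartens opened"]

results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
print(results["scores"])      # sentence-level scores
print(results["mean_score"])  # system-level score
```

Note that the 0-to-1 interpretation applies to the newer models such as `Unbabel/wmt22-comet-da`; older models emit z-score-normalized values that are meaningful mainly for ranking.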
comet.py CHANGED
@@ -22,7 +22,7 @@ Usage:
 from evaluate import load
 comet_metric = load('metrics/comet/comet.py')
 #comet_metric = load('comet')
-#comet_metric = load('comet', 'wmt-large-hter-estimator')
+#comet_metric = load('comet', 'Unbabel/wmt20-comet-da')
 
 
 source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
@@ -37,6 +37,7 @@ predictions['scores']
 import comet  # From: unbabel-comet
 import datasets
 import torch
+from packaging import version
 
 import evaluate
 
@@ -44,6 +45,25 @@ import evaluate
 logger = evaluate.logging.get_logger(__name__)
 
 _CITATION = """\
+@inproceedings{rei-etal-2022-comet,
+    title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
+    author = "Rei, Ricardo and
+      C. de Souza, Jos{\'e} G. and
+      Alves, Duarte and
+      Zerva, Chrysoula and
+      Farinha, Ana C and
+      Glushkova, Taisiya and
+      Lavie, Alon and
+      Coheur, Luisa and
+      Martins, Andr{\'e} F. T.",
+    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.wmt-1.52",
+    pages = "578--585",
+}
 @inproceedings{rei-EtAl:2020:WMT,
     author = {Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon},
     title = {Unbabel's Participation in the WMT20 Metrics Shared Task},
@@ -85,13 +105,11 @@ Args:
 `sources` (list of str): Source sentences
 `predictions` (list of str): candidate translations
 `references` (list of str): reference translations
-`cuda` (bool): If set to True, runs COMET using GPU
-`show_progress` (bool): Shows progress
-`model`: COMET model to be used. Will default to `wmt-large-da-estimator-1719` if None.
+`gpus` (int): Number of GPUs to use. 0 for CPU.
+`progress_bar` (bool): Flag that turns the predict progress bar on and off. Defaults to False.
 
 Returns:
-`samples`: List of dictionaries with `src`, `mt`, `ref` and `score`.
-`scores`: List of scores.
+Dict with all sentence-level scores (`scores` key) and a system-level score (`mean_score` key).
 
 Examples:
@@ -101,8 +119,8 @@ Examples:
 >>> hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
 >>> reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
 >>> results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
->>> print([round(v, 2) for v in results["scores"]])
-[0.19, 0.92]
+>>> print([round(v, 3) for v in results["scores"]])
+[0.839, 0.972]
 """
 
 
@@ -125,6 +143,7 @@ class COMET(evaluate.Metric):
         codebase_urls=["https://github.com/Unbabel/COMET"],
         reference_urls=[
             "https://github.com/Unbabel/COMET",
+            "https://aclanthology.org/2022.wmt-1.52/",
             "https://www.aclweb.org/anthology/2020.emnlp-main.213/",
             "http://www.statmt.org/wmt20/pdf/2020.wmt-1.101.pdf",
         ],
@@ -132,7 +151,10 @@ class COMET(evaluate.Metric):
 
     def _download_and_prepare(self, dl_manager):
         if self.config_name == "default":
-            self.scorer = comet.load_from_checkpoint(comet.download_model("wmt20-comet-da"))
+            if version.parse(comet.__version__) >= version.parse("2.0.0"):
+                self.scorer = comet.load_from_checkpoint(comet.download_model("Unbabel/wmt22-comet-da"))
+            else:
+                self.scorer = comet.load_from_checkpoint(comet.download_model("wmt20-comet-da"))
         else:
             self.scorer = comet.load_from_checkpoint(comet.download_model(self.config_name))
 
@@ -141,5 +163,9 @@
         gpus = 1 if torch.cuda.is_available() else 0
         data = {"src": sources, "mt": predictions, "ref": references}
         data = [dict(zip(data, t)) for t in zip(*data.values())]
-        scores, mean_score = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
+        if version.parse(comet.__version__) >= version.parse("2.0.0"):
+            output = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
+            scores, mean_score = output.scores, output.system_score
+        else:
+            scores, mean_score = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
         return {"mean_score": mean_score, "scores": scores}
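The core change in `comet.py` is the version gate: `unbabel-comet` 2.0 moved models to the Hugging Face Hub and changed what `predict` returns. A condensed sketch of the same pattern used standalone (the `comet` and `packaging` calls mirror the diff; this is not a drop-in replacement for the metric class):

```python
# Condensed sketch of the version gate introduced above; calls mirror the diff.
import comet  # From: unbabel-comet
from packaging import version

IS_COMET_2 = version.parse(comet.__version__) >= version.parse("2.0.0")

# >=2.0 models live on the Hub under the Unbabel/ namespace.
model_id = "Unbabel/wmt22-comet-da" if IS_COMET_2 else "wmt20-comet-da"
scorer = comet.load_from_checkpoint(comet.download_model(model_id))

data = [{"src": "Dem Feuer konnte Einhalt geboten werden",
         "mt": "The fire could be stopped",
         "ref": "They were able to control the fire."}]

if IS_COMET_2:
    # 2.x: predict returns an object exposing .scores and .system_score.
    output = scorer.predict(data, gpus=0, progress_bar=False)
    scores, mean_score = output.scores, output.system_score
else:
    # 1.x: predict returns a (scores, mean_score) tuple.
    scores, mean_score = scorer.predict(data, gpus=0, progress_bar=False)
print(scores, mean_score)
```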
requirements.txt CHANGED
@@ -1,3 +1,3 @@
-git+https://github.com/huggingface/evaluate@7d7d81dd3ffec0812e2edb09f86b3b1e31d61118
+git+https://github.com/huggingface/evaluate@af3c30561d840b83e54fc5f7150ea58046d6af69
 unbabel-comet
 torch
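`requirements.txt` only bumps the pinned `evaluate` commit; `unbabel-comet` stays unpinned, which is exactly why `comet.py` above must branch on the installed version at runtime. A quick, hypothetical sanity check (not part of this commit) to see which side of the 2.0 gate an environment falls on:

```python
# Hypothetical sanity check (not part of this commit): print the installed
# versions of the three requirements; comet.py branches on unbabel-comet 2.0.
from importlib.metadata import version as pkg_version

for dist in ("evaluate", "unbabel-comet", "torch"):
    print(dist, pkg_version(dist))
```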