lvwerra (HF staff) committed on
Commit
735107c
1 Parent(s): 740c7d4

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +102 -4
  2. app.py +6 -0
  3. perplexity.py +189 -0
  4. requirements.txt +6 -0
README.md CHANGED
@@ -1,12 +1,110 @@
  ---
  title: Perplexity
- emoji: 🌍
- colorFrom: indigo
- colorTo: yellow
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+ # Metric Card for Perplexity
+
+ ## Metric Description
+ Given a model and an input text sequence, perplexity measures how likely the model is to generate that sequence. It is defined as the exponentiated average negative log-likelihood of the sequence under the model, and it can be used in two main ways:
+ 1. to evaluate how well the model has learned the distribution of the text it was trained on
+    - In this case, the model input should be the trained model to be evaluated, and the input texts should be the text that the model was trained on.
+ 2. to evaluate how well a selection of text matches the distribution of text that the input model was trained on
+    - In this case, the model input should be a trained model, and the input texts should be the text to be evaluated.
+
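+ For illustration, this definition can be reproduced by hand from per-token log-probabilities. The snippet below is a minimal sketch, not this module's implementation, and the `token_logprobs` values are made up:
+
+ ```python
+ import math
+
+ # made-up natural-log probabilities of each token given its prefix
+ token_logprobs = [-2.1, -0.7, -1.5, -0.3]
+
+ # perplexity = exp(average negative log-likelihood per token)
+ nll = -sum(token_logprobs) / len(token_logprobs)
+ perplexity = math.exp(nll)
+ print(round(perplexity, 2))  # 3.16
+ ```
+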
+ ## Intended Uses
+ Any language generation task.
+
+ ## How to Use
+
+ The metric takes a list of texts as input, as well as the name of the model used to compute the metric:
+
+ ```python
+ from evaluate import load
+ perplexity = load("perplexity")
+ results = perplexity.compute(input_texts=input_texts, model_id='gpt2')
+ ```
+
+ ### Inputs
+ - **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
+    - This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM).
+ - **input_texts** (list of str): input texts; each separate text snippet is one list entry.
+ - **batch_size** (int): the batch size used when running texts through the model. Defaults to 16.
+ - **add_start_token** (bool): whether to add the start token to the texts, so that the perplexity can include the probability of the first word. Defaults to True.
+ - **device** (str): device to run on; defaults to 'cuda' when available, otherwise 'cpu'.
+
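+ Taken together, a call that sets all of these inputs explicitly might look like the following sketch (the model and texts here are arbitrary placeholders, not recommendations):
+
+ ```python
+ from evaluate import load
+
+ perplexity = load("perplexity")
+ results = perplexity.compute(
+     model_id="gpt2",
+     input_texts=["The quick brown fox jumps over the lazy dog."],
+     batch_size=8,
+     add_start_token=True,
+     device="cuda",  # or "cpu" if no GPU is available
+ )
+ ```
+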
+ ### Output Values
+ This metric outputs a dictionary containing the perplexity score for each text in the input list, as well as the mean perplexity.
+ If one of the input texts is longer than the model's maximum input length, it is truncated to that maximum length for the perplexity computation.
+
+ ```
+ {'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
+ ```
+
+ This metric's range is 0 and up. A lower score is better.
+
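+ The reported `mean_perplexity` is the arithmetic mean of the per-text `perplexities`, so the example output above can be checked directly; a minimal sketch using the values shown:
+
+ ```python
+ import numpy as np
+
+ perplexities = [8.182524681091309, 33.42122268676758, 27.012239456176758]
+ print(round(np.mean(perplexities), 2))  # 22.87
+ ```
+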
+ #### Values from Popular Papers
+
+
+ ### Examples
+ Calculating perplexity on input_texts defined here:
+ ```python
+ import evaluate
+
+ perplexity = evaluate.load("perplexity")
+ input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
+ results = perplexity.compute(model_id='gpt2',
+                              add_start_token=False,
+                              input_texts=input_texts)
+ print(list(results.keys()))
+ >>>['perplexities', 'mean_perplexity']
+ print(round(results["mean_perplexity"], 2))
+ >>>78.22
+ print(round(results["perplexities"][0], 2))
+ >>>11.11
+ ```
+ Calculating perplexity on input_texts loaded in from a dataset:
+ ```python
+ import evaluate
+ import datasets
+
+ perplexity = evaluate.load("perplexity")
+ input_texts = datasets.load_dataset("wikitext",
+                                     "wikitext-2-raw-v1",
+                                     split="test")["text"][:50]
+ input_texts = [s for s in input_texts if s!='']
+ results = perplexity.compute(model_id='gpt2',
+                              input_texts=input_texts)
+ print(list(results.keys()))
+ >>>['perplexities', 'mean_perplexity']
+ print(round(results["mean_perplexity"], 2))
+ >>>60.35
+ print(round(results["perplexities"][0], 2))
+ >>>81.12
+ ```
+
+ ## Limitations and Bias
+ Note that the output value is based heavily on what text the model was trained on. This means that perplexity scores are not comparable between models or datasets.
+
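+ For example, the same input scored with two different models will generally produce very different perplexities, neither of which is "wrong"; only scores computed with the same model should be compared. A sketch (the model choices here are arbitrary):
+
+ ```python
+ import evaluate
+
+ perplexity = evaluate.load("perplexity")
+ text = ["The quick brown fox jumps over the lazy dog."]
+
+ # Each model defines its own distribution, so these two numbers are not comparable.
+ results_gpt2 = perplexity.compute(model_id="gpt2", input_texts=text)
+ results_distilgpt2 = perplexity.compute(model_id="distilgpt2", input_texts=text)
+ ```
+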
+ ## Citation
+
+ ```bibtex
+ @article{jelinek1977perplexity,
+   title={Perplexity—a measure of the difficulty of speech recognition tasks},
+   author={Jelinek, Fred and Mercer, Robert L and Bahl, Lalit R and Baker, James K},
+   journal={The Journal of the Acoustical Society of America},
+   volume={62},
+   number={S1},
+   pages={S63--S63},
+   year={1977},
+   publisher={Acoustical Society of America}
+ }
+ ```
+
+ ## Further References
+ - [Hugging Face Perplexity Blog Post](https://huggingface.co/docs/transformers/perplexity)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("perplexity")
+ launch_gradio_widget(module)
perplexity.py ADDED
@@ -0,0 +1,189 @@
+ # Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """Perplexity Metric."""
+
+ import datasets
+ import numpy as np
+ import torch
+ from torch.nn import CrossEntropyLoss
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ import evaluate
+ from evaluate import logging
+
+
+ _CITATION = """\
+
+ """
+
+ _DESCRIPTION = """
+ Perplexity (PPL) is one of the most common metrics for evaluating language models.
+ It is defined as the exponentiated average negative log-likelihood of a sequence.
+
+ For more information, see https://huggingface.co/docs/transformers/perplexity
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Args:
+     model_id (str): model used for calculating Perplexity
+             NOTE: Perplexity can only be calculated for causal language models.
+             This includes models such as gpt2, causal variations of bert,
+             causal versions of t5, and more (the full list can be found
+             in the AutoModelForCausalLM documentation here:
+             https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
+
+     input_texts (list of str): input text, each separate text snippet
+         is one list entry.
+     batch_size (int): the batch size to run texts through the model. Defaults to 16.
+     add_start_token (bool): whether to add the start token to the texts,
+         so the perplexity can include the probability of the first word. Defaults to True.
+     device (str): device to run on, defaults to 'cuda' when available
+ Returns:
+     perplexity: dictionary containing the perplexity scores for the texts
+         in the input list, as well as the mean perplexity. If one of the input texts is
+         longer than the max input length of the model, then it is truncated to the
+         max length for the perplexity computation.
+ Examples:
+     Example 1:
+         >>> perplexity = evaluate.load("perplexity")
+         >>> input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
+         >>> results = perplexity.compute(model_id='gpt2',
+         ...                              add_start_token=False,
+         ...                              input_texts=input_texts) # doctest:+ELLIPSIS
+         >>> print(list(results.keys()))
+         ['perplexities', 'mean_perplexity']
+         >>> print(round(results["mean_perplexity"], 2))
+         78.22
+         >>> print(round(results["perplexities"][0], 2))
+         11.11
+
+     Example 2:
+         >>> from datasets import load_dataset
+         >>> perplexity = evaluate.load("perplexity")
+         >>> input_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10] # doctest: +SKIP
+         >>> input_texts = [s for s in input_texts if s!='']
+         >>> results = perplexity.compute(model_id='gpt2',
+         ...                              input_texts=input_texts)
+         >>> print(list(results.keys()))
+         ['perplexities', 'mean_perplexity']
+         >>> print(round(results["mean_perplexity"], 2)) # doctest: +SKIP
+         60.35
+         >>> print(round(results["perplexities"][0], 2)) # doctest: +SKIP
+         81.12
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Perplexity(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "input_texts": datasets.Value("string"),
+                 }
+             ),
+             reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
+         )
+
+     def _compute(self, input_texts, model_id, batch_size: int = 16, add_start_token: bool = True, device=None):
+
+         if device is not None:
+             assert device in ["gpu", "cpu", "cuda"], "device should be either gpu or cpu."
+             if device == "gpu":
+                 device = "cuda"
+         else:
+             device = "cuda" if torch.cuda.is_available() else "cpu"
+
+         model = AutoModelForCausalLM.from_pretrained(model_id)
+         model = model.to(device)
+
+         tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+         # if batch_size > 1 (which generally leads to padding being required), and
+         # if there is not an already assigned pad_token, assign an existing
+         # special token to also be the padding token
+         if tokenizer.pad_token is None and batch_size > 1:
+             existing_special_tokens = list(tokenizer.special_tokens_map_extended.values())
+             # check that the model already has at least one special token defined
+             assert (
+                 len(existing_special_tokens) > 0
+             ), "If batch_size > 1, model must have at least one special token to use for padding. Please use a different model or set batch_size=1."
+             # assign one of the special tokens to also be the pad token
+             tokenizer.add_special_tokens({"pad_token": existing_special_tokens[0]})
+
+         if add_start_token:
+             # leave room for <BOS> token to be added:
+             assert (
+                 tokenizer.bos_token is not None
+             ), "Input model must already have a BOS token if using add_start_token=True. Please use a different model, or set add_start_token=False"
+             max_tokenized_len = model.config.max_length - 1
+         else:
+             max_tokenized_len = model.config.max_length
+
+         encodings = tokenizer(
+             input_texts,
+             add_special_tokens=False,
+             padding=True,
+             truncation=True,
+             max_length=max_tokenized_len,
+             return_tensors="pt",
+             return_attention_mask=True,
+         ).to(device)
+
+         encoded_texts = encodings["input_ids"]
+         attn_masks = encodings["attention_mask"]
+
+         # check that each input is long enough:
+         if add_start_token:
+             assert torch.all(torch.ge(attn_masks.sum(1), 1)), "Each input text must be at least one token long."
+         else:
+             assert torch.all(
+                 torch.ge(attn_masks.sum(1), 2)
+             ), "When add_start_token=False, each input text must be at least two tokens long. Run with add_start_token=True if inputting strings of only one token, and remove all empty input strings."
+
+         ppls = []
+         loss_fct = CrossEntropyLoss(reduction="none")
+
+         for start_index in logging.tqdm(range(0, len(encoded_texts), batch_size)):
+             end_index = min(start_index + batch_size, len(encoded_texts))
+             encoded_batch = encoded_texts[start_index:end_index]
+             attn_mask = attn_masks[start_index:end_index]
+
+             if add_start_token:
+                 bos_tokens_tensor = torch.tensor([[tokenizer.bos_token_id]] * encoded_batch.size(dim=0)).to(device)
+                 encoded_batch = torch.cat([bos_tokens_tensor, encoded_batch], dim=1)
+                 attn_mask = torch.cat(
+                     [torch.ones(bos_tokens_tensor.size(), dtype=torch.int64).to(device), attn_mask], dim=1
+                 )
+
+             labels = encoded_batch
+
+             with torch.no_grad():
+                 out_logits = model(encoded_batch, attention_mask=attn_mask).logits
+
+             shift_logits = out_logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             shift_attention_mask_batch = attn_mask[..., 1:].contiguous()
+
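+             # Per sequence: take the per-token cross-entropy, zero out padded positions
+             # with the shifted attention mask, average over the real tokens, and
+             # exponentiate that average loss to obtain the sequence's perplexity.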
+             perplexity_batch = torch.exp2(
+                 (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)
+                 / shift_attention_mask_batch.sum(1)
+             )
+
+             ppls += perplexity_batch.tolist()
+
+         return {"perplexities": ppls, "mean_perplexity": np.mean(ppls)}
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ torch
+ torch
+ transformers
+ transformers