Spaces: evaluate-metric/perplexity

lvwerra (HF staff) committed on May 20, 2022

Commit 735107c • 1 Parent(s): 740c7d4

Update Space (evaluate main: 828c6327)

Files changed (4)

README.md +102 -4
app.py +6 -0
perplexity.py +189 -0
requirements.txt +6 -0

README.md CHANGED

@@ -1,12 +1,110 @@
 ---
 title: Perplexity
-emoji: 🌍
-colorFrom: indigo
-colorTo: yellow
+emoji: 🤗
+colorFrom: blue
+colorTo: red
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+# Metric Card for Perplexity
+## Metric Description
+Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence. This can be used in two main ways:
+1. to evaluate how well the model has learned the distribution of the text it was trained on
+    - In this case, the model input should be the trained model to be evaluated, and the input texts should be the text that the model was trained on.
+2. to evaluate how well a selection of text matches the distribution of text that the input model was trained on
+    - In this case, the model input should be a trained model, and the input texts should be the text to be evaluated.
+## Intended Uses
+Any language generation task.
+## How to Use
+The metric takes a list of texts as input, as well as the name of the model used to compute the metric:
+```python
+from evaluate import load
+perplexity = load("perplexity")
+input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
+results = perplexity.compute(input_texts=input_texts, model_id='gpt2')
+```
+### Inputs
+- **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
+    - This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation here: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
+- **input_texts** (list of str): input text, each separate text snippet is one list entry.
+- **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
+- **add_start_token** (bool): whether to add the start token to the texts, so the perplexity can include the probability of the first word. Defaults to True.
+- **device** (str): device to run on; one of 'cpu', 'gpu', or 'cuda'. Defaults to 'cuda' when available, otherwise 'cpu'.
+### Output Values
+This metric outputs a dictionary with the perplexity scores for the text input in the list, and the average perplexity.
+If one of the input texts is longer than the max input length of the model, then it is truncated to the max length for the perplexity computation.
+```
+{'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
+```
+This metric's range is 1 and up, because it is an exponential of a non-negative average loss. A lower score is better.
+#### Values from Popular Papers
+### Examples
+Calculating perplexity on input_texts defined here:
+```python
+import evaluate
+perplexity = evaluate.load("perplexity")
+input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
+results = perplexity.compute(model_id='gpt2',
+                             add_start_token=False,
+                             input_texts=input_texts)
+print(list(results.keys()))
+>>>['perplexities', 'mean_perplexity']
+print(round(results["mean_perplexity"], 2))
+>>>78.22
+print(round(results["perplexities"][0], 2))
+>>>11.11
+```
+Calculating perplexity on input_texts loaded in from a dataset:
+```python
+import datasets
+import evaluate
+perplexity = evaluate.load("perplexity")
+input_texts = datasets.load_dataset("wikitext",
+                                    "wikitext-2-raw-v1",
+                                    split="test")["text"][:50]
+input_texts = [s for s in input_texts if s!='']
+results = perplexity.compute(model_id='gpt2',
+                             input_texts=input_texts)
+print(list(results.keys()))
+>>>['perplexities', 'mean_perplexity']
+print(round(results["mean_perplexity"], 2))
+>>>60.35
+print(round(results["perplexities"][0], 2))
+>>>81.12
+```
+## Limitations and Bias
+Note that the output value is based heavily on what text the model was trained on. This means that perplexity scores are not comparable between models or datasets.
+## Citation
+```bibtex
+@article{jelinek1977perplexity,
+title={Perplexity—a measure of the difficulty of speech recognition tasks},
+author={Jelinek, Fred and Mercer, Robert L and Bahl, Lalit R and Baker, James K},
+journal={The Journal of the Acoustical Society of America},
+volume={62},
+number={S1},
+pages={S63--S63},
+year={1977},
+publisher={Acoustical Society of America}
+}
+```
+## Further References
+- [Hugging Face Perplexity Blog Post](https://huggingface.co/docs/transformers/perplexity)
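
As a rough illustration of the definition the metric card relies on (perplexity as the exponentiated average negative log-likelihood of a sequence), the sketch below computes perplexity by hand from per-token probabilities. The probabilities are made up for illustration, and `perplexity_from_probs` is a hypothetical helper, not part of `evaluate`.

```python
import math


def perplexity_from_probs(token_probs):
    """Perplexity of one sequence, given the probability the model assigned to each token."""
    # Negative log-likelihood of each token, in nats.
    nll = [-math.log(p) for p in token_probs]
    # Exponentiate the average negative log-likelihood.
    return math.exp(sum(nll) / len(nll))


# Made-up probabilities for a three-token sequence.
probs = [0.25, 0.5, 0.125]
ppl = perplexity_from_probs(probs)
print(round(ppl, 2))  # 4.0: the inverse of the geometric mean probability (0.25)
```

A model that assigns probability 1 to every token reaches the minimum value of 1.0, and less confident predictions push the score up.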

app.py ADDED

@@ -0,0 +1,6 @@

+import evaluate
+from evaluate.utils import launch_gradio_widget
+module = evaluate.load("perplexity")
+launch_gradio_widget(module)

perplexity.py ADDED

@@ -0,0 +1,189 @@

+# Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Perplexity Metric."""
+import datasets
+import numpy as np
+import torch
+from torch.nn import CrossEntropyLoss
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import evaluate
+from evaluate import logging
+_CITATION = """\
+"""
+_DESCRIPTION = """
+Perplexity (PPL) is one of the most common metrics for evaluating language models.
+It is defined as the exponentiated average negative log-likelihood of a sequence.
+For more information, see https://huggingface.co/docs/transformers/perplexity
+"""
+_KWARGS_DESCRIPTION = """
+Args:
+    model_id (str): model used for calculating Perplexity
+            NOTE: Perplexity can only be calculated for causal language models.
+                    This includes models such as gpt2, causal variations of bert,
+                    causal versions of t5, and more (the full list can be found
+                    in the AutoModelForCausalLM documentation here:
+                    https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
+    input_texts (list of str): input text, each separate text snippet
+        is one list entry.
+    batch_size (int): the batch size to run texts through the model. Defaults to 16.
+    add_start_token (bool): whether to add the start token to the texts,
+        so the perplexity can include the probability of the first word. Defaults to True.
+    device (str): device to run on; one of 'cpu', 'gpu', or 'cuda'.
+        Defaults to 'cuda' when available, otherwise 'cpu'.
+Returns:
+    perplexity: dictionary containing the perplexity scores for the texts
+        in the input list, as well as the mean perplexity. If one of the input texts is
+        longer than the max input length of the model, then it is truncated to the
+        max length for the perplexity computation.
+Examples:
+    Example 1:
+        >>> perplexity = evaluate.load("perplexity")
+        >>> input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
+        >>> results = perplexity.compute(model_id='gpt2',
+        ...                              add_start_token=False,
+        ...                              input_texts=input_texts) # doctest:+ELLIPSIS
+        >>> print(list(results.keys()))
+        ['perplexities', 'mean_perplexity']
+        >>> print(round(results["mean_perplexity"], 2))
+        78.22
+        >>> print(round(results["perplexities"][0], 2))
+        11.11
+    Example 2:
+        >>> from datasets import load_dataset
+        >>> perplexity = evaluate.load("perplexity")
+        >>> input_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10] # doctest: +SKIP
+        >>> input_texts = [s for s in input_texts if s!='']
+        >>> results = perplexity.compute(model_id='gpt2',
+        ...                              input_texts=input_texts)
+        >>> print(list(results.keys()))
+        ['perplexities', 'mean_perplexity']
+        >>> print(round(results["mean_perplexity"], 2)) # doctest: +SKIP
+        60.35
+        >>> print(round(results["perplexities"][0], 2)) # doctest: +SKIP
+        81.12
+"""
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class Perplexity(evaluate.EvaluationModule):
+    def _info(self):
+        return evaluate.EvaluationModuleInfo(
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "input_texts": datasets.Value("string"),
+                }
+            ),
+            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
+        )
+    def _compute(self, input_texts, model_id, batch_size: int = 16, add_start_token: bool = True, device=None):
+        if device is not None:
+            assert device in ["gpu", "cpu", "cuda"], "device should be one of 'gpu', 'cuda', or 'cpu'."
+            if device == "gpu":
+                device = "cuda"
+        else:
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+        model = AutoModelForCausalLM.from_pretrained(model_id)
+        model = model.to(device)
+        tokenizer = AutoTokenizer.from_pretrained(model_id)
+        # if batch_size > 1 (which generally leads to padding being required), and
+        # if there is not an already assigned pad_token, assign an existing
+        # special token to also be the padding token
+        if tokenizer.pad_token is None and batch_size > 1:
+            existing_special_tokens = list(tokenizer.special_tokens_map_extended.values())
+            # check that the model already has at least one special token defined
+            assert (
+                len(existing_special_tokens) > 0
+            ), "If batch_size > 1, model must have at least one special token to use for padding. Please use a different model or set batch_size=1."
+            # assign one of the special tokens to also be the pad token
+            tokenizer.add_special_tokens({"pad_token": existing_special_tokens[0]})
+        if add_start_token:
+            # leave room for <BOS> token to be added:
+            assert (
+                tokenizer.bos_token is not None
+            ), "Input model must already have a BOS token if using add_start_token=True. Please use a different model, or set add_start_token=False"
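+            # NOTE: config.max_length is the default generation length (20 unless the model
+            # config overrides it), which may be shorter than the model's context window.
+            max_tokenized_len = model.config.max_length - 1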
+        else:
+            max_tokenized_len = model.config.max_length
+        encodings = tokenizer(
+            input_texts,
+            add_special_tokens=False,
+            padding=True,
+            truncation=True,
+            max_length=max_tokenized_len,
+            return_tensors="pt",
+            return_attention_mask=True,
+        ).to(device)
+        encoded_texts = encodings["input_ids"]
+        attn_masks = encodings["attention_mask"]
+        # check that each input is long enough:
+        if add_start_token:
+            assert torch.all(torch.ge(attn_masks.sum(1), 1)), "Each input text must be at least one token long."
+        else:
+            assert torch.all(
+                torch.ge(attn_masks.sum(1), 2)
+            ), "When add_start_token=False, each input text must be at least two tokens long. Run with add_start_token=True if inputting strings of only one token, and remove all empty input strings."
+        ppls = []
+        loss_fct = CrossEntropyLoss(reduction="none")
+        for start_index in logging.tqdm(range(0, len(encoded_texts), batch_size)):
+            end_index = min(start_index + batch_size, len(encoded_texts))
+            encoded_batch = encoded_texts[start_index:end_index]
+            attn_mask = attn_masks[start_index:end_index]
+            if add_start_token:
+                bos_tokens_tensor = torch.tensor([[tokenizer.bos_token_id]] * encoded_batch.size(dim=0)).to(device)
+                encoded_batch = torch.cat([bos_tokens_tensor, encoded_batch], dim=1)
+                attn_mask = torch.cat(
+                    [torch.ones(bos_tokens_tensor.size(), dtype=torch.int64).to(device), attn_mask], dim=1
+                )
+            labels = encoded_batch
+            with torch.no_grad():
+                out_logits = model(encoded_batch, attention_mask=attn_mask).logits
+            shift_logits = out_logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            shift_attention_mask_batch = attn_mask[..., 1:].contiguous()
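+            # Average the per-token losses over non-padded positions, then exponentiate.
+            # NOTE: CrossEntropyLoss returns losses in nats (natural log), so the base-2
+            # exponential here does not match the base-e definition in _DESCRIPTION;
+            # torch.exp would.
+            perplexity_batch = torch.exp2(
+                (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)
+                / shift_attention_mask_batch.sum(1)
+            )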
+            ppls += perplexity_batch.tolist()
+        return {"perplexities": ppls, "mean_perplexity": np.mean(ppls)}
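
The shifting and masking in `_compute` can be hard to follow in tensor form. Below is a minimal pure-Python sketch of the same idea for one padded batch, assuming the per-token losses have already been computed. The helper name `masked_perplexities` and all numbers are hypothetical, and the sketch uses the base-e exponential from the description rather than `torch.exp2`.

```python
import math


def masked_perplexities(token_losses, attention_mask):
    """Per-sequence perplexity from per-token losses, ignoring padded positions.

    token_losses[i][t] is the loss for predicting token t+1 of sequence i.
    attention_mask[i][t] is 1 for a real token and 0 for padding.
    """
    ppls = []
    for losses, mask in zip(token_losses, attention_mask):
        # The first token has no prediction target, so drop its mask entry
        # (this mirrors attn_mask[..., 1:] in the code above).
        shifted_mask = mask[1:]
        total = sum(loss * m for loss, m in zip(losses, shifted_mask))
        count = sum(shifted_mask)
        ppls.append(math.exp(total / count))
    return ppls


# Two sequences padded to length 4; the second has only 3 real tokens.
mask = [[1, 1, 1, 1], [1, 1, 1, 0]]
# One loss per predicted position (4 - 1 = 3); the padded slot holds a junk value.
losses = [[1.0, 2.0, 3.0], [0.5, 1.5, 99.0]]
ppls = masked_perplexities(losses, mask)
print([round(p, 3) for p in ppls])  # [7.389, 2.718]
```

The junk loss at the padded position does not affect the second score because its mask entry is 0. A sequence with fewer than two real tokens would leave nothing to average, which is why the code above asserts a minimum length.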

requirements.txt ADDED

@@ -0,0 +1,6 @@

+# TODO: fix github to release
+git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+datasets~=2.0
+torch
+torch
+transformers