Different values observed wrt HF PPL documentation

#3
by nferruz - opened

Dear all,

Thanks for your great work putting these tools together.

We're computing PPL for several sequences (biological ones, in this case) with a GPT2-like model.

For the example sequence:

ex = ['1.1.1.2<sep><start>MSIKLAYAAKQPMTFSRRMPTYEAGAIAPWTALPENVAFIGLGNMGRSQLETDEDAENTIAIAISRDWSEGNWIDPQKPMVIHGFSRSGGEAPGTWEDGDWSYDEWKAICDVGVIATSNGWFEDEVVPVSIKAGVDMATLPELMKPIVYAPEDLIPLVQKGVDNDEA<end>']

We obtain the following PPL value:

from evaluate import load

perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=ex, model_id='nferruz/ZymCTRL')
results
{'perplexities': [2262.685302734375], 'mean_perplexity': 2262.685302734375}

However, when we use the code from the Hugging Face documentation (https://huggingface.co/docs/transformers/perplexity), adapted to a single sequence as shown below:

import torch
from tqdm import tqdm

# Assumed to be defined already: `model` is the ZymCTRL model on `device`,
# and `output[0]` is the tensor of input_ids shown in the PS below.
max_length = model.config.n_positions
stride = 512
seq_len = len(output[0])

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = output[0][begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over input tokens.
        # Multiply it with trg_len to get the summation instead of average.
        # We will take average over all the tokens to get the true average
        # in the last step of this example.
        neg_log_likelihood = outputs.loss * trg_len

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)

We obtain a value that is orders of magnitude lower:

ppl
tensor(6.9366, device='cuda:0')

This value is in line with what we'd expect from the global PPL of the model.
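
As an additional cross-check (just a sketch, assuming `model` and `output` are the same objects used in the loop above and that the whole sequence fits in the context window), scoring the sequence in a single forward pass gives essentially the same quantity without the striding logic:

# Single forward pass; exponentiate the mean token-level negative log-likelihood.
input_ids = output[0].unsqueeze(0)   # shape (1, seq_len)
with torch.no_grad():
    loss = model(input_ids, labels=input_ids).loss
print(torch.exp(loss))               # expected to be close to the stride-based value above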

We would appreciate any insights regarding these differences.
Thanks for your time.
Best,

PS: In case it helps, the input_ids (as a PyTorch tensor) for this sequence are:

output[0]
tensor([  7, 431,   7, 431,   7, 431,   8,   2,   3, 443, 449, 440, 441, 442,
        432, 455, 432, 432, 441, 447, 446, 443, 450, 437, 449, 448, 448, 443,
        446, 450, 455, 436, 432, 438, 432, 440, 432, 446, 453, 450, 432, 442,
        446, 436, 444, 452, 432, 437, 440, 438, 442, 438, 444, 443, 438, 448,
        449, 447, 442, 436, 450, 435, 436, 435, 432, 436, 444, 450, 440, 432,
        440, 432, 440, 449, 448, 435, 453, 449, 436, 438, 444, 453, 440, 435,
        446, 447, 441, 446, 443, 452, 440, 439, 438, 437, 449, 448, 449, 438,
        438, 436, 432, 446, 438, 450, 453, 436, 435, 438, 435, 453, 449, 455,
        435, 436, 453, 441, 432, 440, 434, 435, 452, 438, 452, 440, 432, 450,
        449, 444, 438, 453, 437, 436, 435, 436, 452, 452, 446, 452, 449, 440,
        441, 432, 438, 452, 435, 443, 432, 450, 442, 446, 436, 442, 443, 441,
        446, 440, 452, 455, 432, 446, 436, 435, 442, 440, 446, 442, 452, 447,
        441, 438, 452, 435, 444, 435, 436, 432,   4,  13, 431,   7, 431,  15,
          2,   3, 440, 448, 453, 442, 449, 432, 438, 438, 452, 435, 438, 452,
        446, 455, 441, 440, 440, 449, 452, 449, 442, 440, 438, 437, 432, 452,
        432, 449, 448, 453, 442, 446, 446, 449, 435, 453, 447, 452, 432, 437,
        432, 432, 442, 452, 437, 448, 432, 442, 446, 438, 435, 432, 449, 446,
        448, 440, 436, 438, 441, 446, 432, 452, 452, 452, 446, 452, 438, 436,
        432, 455, 443, 442], device='cuda:0')
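
In case it also helps, here is a minimal sketch of how these ids could be compared against what the metric sees internally (assuming the metric tokenizes the raw string with the model's own tokenizer, loaded via AutoTokenizer):

from transformers import AutoTokenizer

# Encode the same string and compare against output[0] above.
tokenizer = AutoTokenizer.from_pretrained('nferruz/ZymCTRL')
metric_ids = tokenizer(ex[0], return_tensors='pt').input_ids[0]
print(metric_ids.shape)   # should match the length of output[0]
print(metric_ids[:20])    # first tokens, for a quick visual comparison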
Evaluate Metric org

cc @Tristan, who implemented this metric.

I'm running into the same issue. I think it may be that the lm_head of the LM model isn't initialized, because the 'gpt2' checkpoint only contains the parameters for GPT2Model.
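
One quick way to check this (a sketch, assuming a standard GPT2-style checkpoint where the LM head is tied to the token embeddings) is to load the model with AutoModelForCausalLM and verify that lm_head shares its storage with the input embedding matrix:

from transformers import AutoModelForCausalLM

m = AutoModelForCausalLM.from_pretrained('gpt2')
# If the checkpoint loaded correctly, the tied LM head points at the same
# storage as the token embedding matrix, so no weights are left randomly initialized.
print(m.lm_head.weight.data_ptr() == m.transformer.wte.weight.data_ptr())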
