Different values observed wrt HF PPL documentation
Dear all,
Thanks for your great work putting these tools together.
We're testing PPL for several sentences (biological ones in this case) on a GPT2-like model.
For the example sequence:
ex = ['1.1.1.2<sep><start>MSIKLAYAAKQPMTFSRRMPTYEAGAIAPWTALPENVAFIGLGNMGRSQLETDEDAENTIAIAISRDWSEGNWIDPQKPMVIHGFSRSGGEAPGTWEDGDWSYDEWKAICDVGVIATSNGWFEDEVVPVSIKAGVDMATLPELMKPIVYAPEDLIPLVQKGVDNDEA<end>']
We obtain the following PPL value:
from evaluate import load

perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=ex, model_id='nferruz/ZymCTRL')
results
{'perplexities': [2262.685302734375], 'mean_perplexity': 2262.685302734375}
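As a point of reference, we would expect the metric to reduce to exp of the mean token-level negative log-likelihood. A minimal sketch of that direct computation on the same string (assuming the tokenizer that ships with nferruz/ZymCTRL) would be:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'nferruz/ZymCTRL'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# tokenize the raw string and score it in a single forward pass
enc = tokenizer(ex[0], return_tensors='pt')
with torch.no_grad():
    out = model(enc.input_ids, labels=enc.input_ids)

# out.loss is the mean per-token NLL, so its exponential is the perplexity
print(torch.exp(out.loss))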
However, when using the code from the HuggingFace documentation (https://huggingface.co/docs/transformers/perplexity), adapted for a single sequence as seen below:
import torch
from tqdm import tqdm

# model, device and the tokenized sequence `output` are assumed to be defined beforehand
max_length = model.config.n_positions
stride = 512
seq_len = len(output[0])

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = output[0][begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over input tokens.
        # Multiply it with trg_len to get the summation instead of average.
        # We will take average over all the tokens to get the true average
        # in the last step of this example.
        neg_log_likelihood = outputs.loss * trg_len

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
We obtain a value roughly three orders of magnitude lower:
ppl
tensor(6.9366, device='cuda:0')
This value is in line with what we'd expect from the global PPL of the model.
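One difference we are not sure about: if we understand correctly, the evaluate metric tokenizes the raw strings with the model's own tokenizer (and, by default, prepends a start token), whereas the snippet above scores our pre-tokenized output[0]. A rough sanity check (again assuming the tokenizer shipped with nferruz/ZymCTRL) would be to compare the two token sequences:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('nferruz/ZymCTRL')

# token ids the metric would score vs. the ids scored by the snippet above
metric_ids = tokenizer(ex[0])['input_ids']
print(len(metric_ids), len(output[0]))
print(metric_ids[:10])
print(output[0][:10])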
We would appreciate any insights regarding these differences.
Thanks for your time.
Best,
PS: In case it helps, the input_ids (as a PyTorch tensor) for this sequence are:
output[0]
tensor([ 7, 431, 7, 431, 7, 431, 8, 2, 3, 443, 449, 440, 441, 442,
432, 455, 432, 432, 441, 447, 446, 443, 450, 437, 449, 448, 448, 443,
446, 450, 455, 436, 432, 438, 432, 440, 432, 446, 453, 450, 432, 442,
446, 436, 444, 452, 432, 437, 440, 438, 442, 438, 444, 443, 438, 448,
449, 447, 442, 436, 450, 435, 436, 435, 432, 436, 444, 450, 440, 432,
440, 432, 440, 449, 448, 435, 453, 449, 436, 438, 444, 453, 440, 435,
446, 447, 441, 446, 443, 452, 440, 439, 438, 437, 449, 448, 449, 438,
438, 436, 432, 446, 438, 450, 453, 436, 435, 438, 435, 453, 449, 455,
435, 436, 453, 441, 432, 440, 434, 435, 452, 438, 452, 440, 432, 450,
449, 444, 438, 453, 437, 436, 435, 436, 452, 452, 446, 452, 449, 440,
441, 432, 438, 452, 435, 443, 432, 450, 442, 446, 436, 442, 443, 441,
446, 440, 452, 455, 432, 446, 436, 435, 442, 440, 446, 442, 452, 447,
441, 438, 452, 435, 444, 435, 436, 432, 4, 13, 431, 7, 431, 15,
2, 3, 440, 448, 453, 442, 449, 432, 438, 438, 452, 435, 438, 452,
446, 455, 441, 440, 440, 449, 452, 449, 442, 440, 438, 437, 432, 452,
432, 449, 448, 453, 442, 446, 446, 449, 435, 453, 447, 452, 432, 437,
432, 432, 442, 452, 437, 448, 432, 442, 446, 438, 435, 432, 449, 446,
448, 440, 436, 438, 441, 446, 432, 452, 452, 452, 446, 452, 438, 436,
432, 455, 443, 442], device='cuda:0')
I'm running into the same issue. I suspect it may be that the lm_head of the LM model isn't initialized from the checkpoint, since the 'gpt2' checkpoint only contains the parameters for GPT2Model.
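A quick way to check that (just a sketch, using the stock 'gpt2' checkpoint) would be to load the LM-head model and see whether lm_head is weight-tied to the input embeddings, since a tied head does not need its own parameters in the checkpoint:

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2')

# with weight tying enabled (the default for GPT-2), lm_head.weight should share
# storage with transformer.wte.weight rather than being a separate, newly
# initialized parameter
print(model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr())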