All token embeddings end up identical.
#2 opened by drw-graphbook
from transformers import AutoModel, AutoTokenizer
model_id = "AIRI-Institute/gena-lm-bert-base-t2t"
tokenizer = AutoTokenizer.from_pretrained(model_id)
seq = 'CACCCAGAGAGAGTAACCAGAATGGATACATTTTGGCCAACATGATTCTAACCCAGTGAGACCCATTTTGGGCTTATG'
tokens = tokenizer.tokenize(seq, add_special_tokens=True)
print('tokens:', tokens)
print('n_tokens:', len(tokens))
model = AutoModel.from_pretrained(model_id)
print(model)
with torch.no_grad():
    output = model(**tokenizer(seq, return_tensors='pt'), output_hidden_states=True)
print(output.keys())
Nothing strictly forbids every row from having the same values, but it is very odd. Across the layers, I see the hidden states gradually converge to the same embedding for every token. Is there an explanation for that?
Hi!
I just ran your code (slightly modified to make it executable):
import torch
from transformers import AutoModel, AutoTokenizer
model_id = "AIRI-Institute/gena-lm-bert-base-t2t"
tokenizer = AutoTokenizer.from_pretrained(model_id)
seq = 'CACCCAGAGAGAGTAACCAGAATGGATACATTTTGGCCAACATGATTCTAACCCAGTGAGACCCATTTTGGGCTTATG'
tokens = tokenizer.tokenize(seq, add_special_tokens=True)
print('tokens:', tokens)
print('n_tokens:', len(tokens))
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
# print(model)
with torch.no_grad():
    output = model(**tokenizer(seq, return_tensors='pt'), output_hidden_states=True)
print(output.keys())
print(output.hidden_states[-1])
and it gives output:
tokens: ['[CLS]', 'CACCC', 'AGAGAGAG', 'TAACC', 'AGAATGG', 'ATACATT', 'TTGGCC', 'AACATG', 'ATTC', 'TAACCC', 'AGTGAGACCC', 'ATTTTGGGC', 'TTATG', '[SEP]']
n_tokens: 14
odict_keys(['logits', 'hidden_states'])
tensor([[[ 1.7935, 9.7426, -0.2816, ..., -12.1350, -4.8205, 2.5903],
[ -3.1385, -9.1969, 8.9219, ..., 6.8791, 3.4876, -4.8887],
[ -1.0978, -5.8843, 13.7623, ..., -3.3654, -4.0092, 1.4867],
...,
[ -0.9357, -6.7364, -3.1276, ..., -6.8377, -10.0209, -23.6296],
[ 2.4616, 2.8872, 1.8704, ..., -2.8146, -2.5070, -17.5171],
[ 2.5083, -1.9531, -4.1511, ..., 7.3335, 0.5013, -2.4499]]])
That looks good.
It is strange that your screenshot uses output.last_hidden_states, since the model outputs only "logits" by default. "hidden_states" appears in the output only if output_hidden_states is set to True.
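If it helps, here is a minimal sketch for checking whether the token rows of a layer have really collapsed to one vector. The helper names `max_row_spread` and `rows_collapsed` and the tolerance value are my own assumptions, not part of transformers; the helper works on plain nested lists, so a hidden state has to be converted with `.tolist()` first.

```python
def max_row_spread(rows):
    """Largest absolute per-dimension difference between any row and the first row."""
    first = rows[0]
    return max(
        abs(value - ref)
        for row in rows
        for value, ref in zip(row, first)
    )


def rows_collapsed(rows, tol=1e-4):
    """True if every row equals the first row within `tol` (tolerance is an assumption)."""
    return max_row_spread(rows) <= tol


# Usage with the `output` from the snippet above (not run here):
# each entry of output.hidden_states has shape (batch, n_tokens, hidden_size),
# so layer[0] selects the token rows of the single input sequence.
# for i, layer in enumerate(output.hidden_states):
#     print(i, max_row_spread(layer[0].tolist()))
```

A spread that shrinks toward zero from layer to layer would match the convergence described in the question, while the output above clearly has distinct rows.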
Setting trust_remote_code=True gave me the correct BERT modeling implementation. Thank you for the clarification!
Glad it helped!
yurakuratov changed discussion status to closed