Default tokenization differs between sentence_transformers and the transformers version
Using the provided example code for the sentence_transformers and transformers libraries leads to different embeddings for the same sentence. This is caused by different truncation of the inputs: sentence_transformers uses a maximum sequence length of 128 tokens, while the transformers version uses 512.
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
st_model_id = 'paraphrase-multilingual-mpnet-base-v2'
hf_model_id = 'sentence-transformers/' + st_model_id
sentences = [
    'S' * 10000  # dummy sentence
]
device = 'cuda:0'
st_model = SentenceTransformer(st_model_id, device=device)
st_tokenizer = st_model.tokenizer
print(st_model.get_max_seq_length()) # max seq length 128
st_tokens = st_model.tokenize(sentences)
print(st_tokens['input_ids'].shape) # seq length 128
# Load model from HuggingFace Hub
hf_tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
# Tokenize sentences
hf_tokens = hf_tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
print(hf_tokens['input_ids'].shape) # seq length 512 ?
# get embeddings with transforms and sentence transformers
st_embedding = st_model.encode(sentences)
hf_model = AutoModel.from_pretrained(hf_model_id)
# Compute token embeddings
with torch.no_grad():
    hf_out = hf_model(**hf_tokens)

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Perform pooling as described in the README
hf_embedding = mean_pooling(hf_out, hf_tokens['attention_mask'])
# compare embeddings
embedding_diff = torch.abs(torch.tensor(st_embedding).to(device) - hf_embedding.to(device))
print(embedding_diff)
print(torch.sum(embedding_diff > 1e-6)) # embeddings for same sentence are different!
# use sequence length of 128 explicitly with transformers
hf_tokens = hf_tokenizer(sentences, padding=True, truncation='longest_first', return_tensors='pt', max_length=128)  # get tokens with a max. sequence length of 128
# Compute token embeddings again with new tokens
with torch.no_grad():
    hf_out = hf_model(**hf_tokens)
hf_embedding = mean_pooling(hf_out, hf_tokens['attention_mask'])
# Compare embeddings again
print(torch.sum(torch.abs(torch.tensor(st_embedding).to(device) - hf_embedding.to(device)) > 1e-6))  # embeddings match!
Is this intentional, or is it an error in the tokenizer configuration for the transformers version? I would suggest at least updating the transformers example code in the README to give a hint about this. Getting different "default" results for the same model can cause some confusion.
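For reference, a minimal sketch of what the updated README example could look like (reusing the sentences list from above); passing max_length=128 explicitly is my suggestion, not part of the current official example:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Truncate to 128 tokens explicitly so the result matches the sentence_transformers default
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')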
Additional question: Why is the max. input sequence length set to 128 tokens by default in sentence_transformers at all, when the architecture can support longer sequences?
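Related to that question: if I understand the API correctly, the 128-token limit can be overridden per model instance via the max_seq_length attribute (up to the underlying transformer's 512-token limit), roughly like this:

st_model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2', device=device)
st_model.max_seq_length = 512  # override the 128-token default (assuming the attribute is settable as documented)
print(st_model.get_max_seq_length())  # should now report 512
longer_embedding = st_model.encode(sentences)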
PS: I think this problem affects other models as well, e.g. paraphrase-xlm-r-multilingual-v1.