This is a port of the multilingual SONAR text encoder (https://huggingface.co/facebook/SONAR) from the fairseq2 format to the transformers format.
Its embeddings are expected to be equal to those of the official implementation (https://github.com/facebookresearch/SONAR), but the latter remains the source of truth.
The encoder supports the same 202 languages as NLLB-200 (see also the source model card and FLORES-200 lang code mapping).
How to compute embeddings:
# !pip install transformers sentencepiece -q
import torch
from transformers import AutoTokenizer
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder
model_name = "cointegrated/SONAR_200_text_encoder"
encoder = M2M100Encoder.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
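def encode_mean_pool(texts, tokenizer, encoder, lang='eng_Latn', norm=False):
    # Set the source language (a FLORES-200 code) so the tokenizer adds the matching language tag
    tokenizer.src_lang = lang
    with torch.inference_mode():
        batch = tokenizer(texts, return_tensors='pt', padding=True)
        seq_embs = encoder(**batch).last_hidden_state
        # Average the token embeddings, using the attention mask to ignore padding positions
        mask = batch.attention_mask
        mean_emb = (seq_embs * mask.unsqueeze(-1)).sum(1) / mask.unsqueeze(-1).sum(1)
        if norm:
            # Optionally scale each sentence embedding to unit L2 norm
            mean_emb = torch.nn.functional.normalize(mean_emb)
    return mean_emb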
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
embs = encode_mean_pool(sentences, tokenizer, encoder, lang="eng_Latn")
print(embs.shape)
# torch.Size([2, 1024])
print(embs)
# tensor([[-0.0053,  0.0020, -0.0006,  ...,  0.0094, -0.0009,  0.0070],
#         [-0.0003, -0.0071,  0.0076,  ...,  0.0055,  0.0022, -0.0083]])
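The pooling step in encode_mean_pool can be easier to follow on a toy input. The sketch below does not load the model: seq_embs and mask are made-up stand-ins for the encoder output and the tokenizer's attention mask. It compares the masked mean with a naive mean over all positions, and shows what the norm=True option changes, namely that dot products between unit-length embeddings equal cosine similarities.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins (made up for illustration): 2 sentences, 3 token positions, 2 dimensions.
# The second sentence has a single real token; its other two positions are padding.
seq_embs = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 4.0], [100.0, 100.0], [100.0, 100.0]],
])
mask = torch.tensor([[1, 1, 1], [1, 0, 0]])

# Masked mean: zero out padded positions, then divide by the number of real tokens.
masked_mean = (seq_embs * mask.unsqueeze(-1)).sum(1) / mask.unsqueeze(-1).sum(1)

# Naive mean over all positions, for comparison: padding values leak into the result.
naive_mean = seq_embs.mean(1)

# Unit-length rows (the effect of norm=True): dot products become cosine similarities.
unit = F.normalize(masked_mean)
dot_sim = unit @ unit.T
cos_sim = F.cosine_similarity(masked_mean.unsqueeze(1), masked_mean.unsqueeze(0), dim=-1)

print(masked_mean)  # second row depends only on the single real token
print(naive_mean)   # second row is dominated by the padding positions
print(dot_sim)
```

With real embeddings, the same comparison would use the output of encode_mean_pool(..., norm=True) in place of unit.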
For more advanced usage examples, see the README at https://github.com/facebookresearch/SONAR.
The model was repacked in this notebook.