cointegrated/SONAR_200_text_encoder

This is a port of the multilingual SONAR text encoder (https://huggingface.co/facebook/SONAR) to the transformers format from fairseq2.

Its embeddings are expected be equal to those the official implementation (https://github.com/facebookresearch/SONAR), but the latter stays the source of truth.

The encoder supports the same 202 languages as NLLB-200 (see also the source model card and FLORES-200 lang code mapping).

How to compute embeddings:

# !pip install transformers sentencepiece -q

import torch
from transformers import AutoTokenizer
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder

model_name = "cointegrated/SONAR_200_text_encoder"
encoder = M2M100Encoder.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def encode_mean_pool(texts, tokenizer, encoder, lang='eng_Latn', norm=False):
    tokenizer.src_lang = lang
    with torch.inference_mode():
        batch = tokenizer(texts, return_tensors='pt', padding=True)
        seq_embs = encoder(**batch).last_hidden_state
        mask = batch.attention_mask
        mean_emb = (seq_embs * mask.unsqueeze(-1)).sum(1) / mask.unsqueeze(-1).sum(1)
        if norm:
            mean_emb = torch.nn.functional.normalize(mean_emb)
    return mean_emb

sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
embs = encode_mean_pool(sentences, tokenizer, encoder, lang="eng_Latn")
print(embs.shape)  
# torch.Size([2, 1024])
print(embs)
# tensor([[-0.0053,  0.0020, -0.0006,  ...,  0.0094, -0.0009,  0.0070],
#         [-0.0003, -0.0071,  0.0076,  ...,  0.0055,  0.0022, -0.0083]])

For advanced examples of usage, please take a look at the readme in https://github.com/facebookresearch/SONAR.

The model was repacked in this notebook.