ai-forever
/

ru-en-RoSBERTa

Feature Extraction

sentence-transformers

text-embeddings-inference

Inference Endpoints

Model card Files Files and versions Community

ai-forever commited on Jul 29

Commit

47d81ba

•

1 Parent(s): e48ff06

Update README.md

Files changed (1) hide show

README.md +52 -3

README.md CHANGED Viewed

@@ -1,3 +1,52 @@
----
-license: mit
----

+---
+license: mit
+language:
+- ru
+- en
+tags:
+- PyTorch
+- Transformers
+---
+# ru-en-RoBERTa-large model for Sentence Embeddings in Russian and English language.
+The model is described [in this article](<link of our arxiv>)
+Russian MTEB [metrics](<lin of our ruMTEB>)
+For better quality, use mean token embeddings.
+## Usage (HuggingFace Models Repository)
+You can use the model directly from the model repository to compute sentence embeddings:
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+#You might to use two mode for embeddings creation: CLS token embs or MEAN Pooling
+#Mean Pooling example - Take attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
+    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+    return sum_embeddings / sum_mask
+#Sentences we want sentence embeddings for
+sentences = ['Привет! Как твои дела?',
+             'А правда, что 42 твое любимое число?']
+#Load AutoModel from huggingface model repository
+tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")
+model = AutoModel.from_pretrained("ai-forever/ru-en-RoSBERTa")
+#Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors='pt')
+#Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+#In this case, mean pooling
+sentence_mean_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+#In this case, cls "pooling"
+last_hidden_states = model_output[0]
+sentence_cls_embeddings = last_hidden_states[:,0]
+```