SciRus-tiny is a model for obtaining embeddings of scientific texts in Russian and English. The model was trained on eLibrary data using the contrastive techniques described in a Habr post. It achieved high metric values on the ruSciBench benchmark.
How to get embeddings
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
import torch

tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda()  # if you want to use a GPU

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Join title and abstract with the separator token, then tokenize
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]

print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)
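The mean_pooling helper above averages token embeddings while ignoring padded positions. The following is a minimal NumPy sketch of the same idea on a hand-made toy batch. The function name masked_mean and the toy values are illustrative assumptions, not part of the model's code.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over the sequence axis, skipping padded tokens.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token count to avoid division by zero for fully padded rows
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch (hypothetical values): one sequence with three slots, the last is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padded vector [100, 100] does not affect the result, which is why the attention mask is passed to the pooling step.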
Or you can use sentence_transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['some title' + '</s>' + 'some abstract'])
print(embeddings[0].shape)
# (312,)
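Once embeddings are available, a common next step is to compare them with cosine similarity, for example to rank candidate papers against a query paper. The sketch below is one possible way to do this in plain Python. The helper names cosine_similarity and rank_by_similarity are hypothetical, and the three-dimensional vectors are stand-ins for real 312-dimensional model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (lists or 1-D arrays)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Stand-in vectors (hypothetical); real embeddings from the model have 312 dimensions
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print([i for i, _ in ranking])  # [1, 0, 2]
```

Because the function divides by both norms, it gives the same ranking whether or not the input embeddings were L2-normalized beforehand.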
Authors
Benchmark developed by the MLSA Lab of the Institute for AI, MSU.
Acknowledgement
The research is part of the project #23-Ш05-21 SES MSU "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We would like to thank eLibrary for providing the datasets.
Contacts
Nikolai Gerasimenko ([email protected]), Alexey Vatolin ([email protected])