NoInstruct small Embedding v0
NoInstruct Embedding: Asymmetric Pooling is All You Need
This model improves retrieval performance over the avsolatorio/GIST-small-Embedding-v0 model.
One area where the GIST family of models fell short is retrieval. We propose a method that improves retrieval performance without requiring hand-crafted instructions when encoding a query, a currently popular paradigm in embedding models for retrieval tasks.
Technical details of the model will be published shortly.
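Until those details are published, the asymmetry named in the title can be read off the usage code in this card: queries are mean-pooled over non-padding tokens, while sentences / documents use the [CLS] (first-token) vector. The following is a minimal sketch of that pooling step on toy tensors, so it runs without downloading the model. The helper name `pool` and the toy values are illustrative assumptions, not part of the released code.

```python
import torch


def pool(last_hidden_state: torch.Tensor,
         attention_mask: torch.Tensor,
         mode: str = "sentence") -> torch.Tensor:
    """Pool token vectors (batch, seq_len, hidden) into (batch, hidden).

    Hypothetical helper mirroring the pooling in the card's usage code.
    """
    if mode == "query":
        # Mean pooling: zero out padding positions, then divide by the
        # number of real tokens in each sequence.
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1)
        return summed / counts
    if mode == "sentence":
        # [CLS] pooling: take the first token's vector.
        return last_hidden_state[:, 0, :]
    raise ValueError(f"mode={mode!r} is not supported; use 'query' or 'sentence'.")


# Toy batch: 2 sequences, 3 tokens each, hidden size 2.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third token is padding
    [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

query_vecs = pool(hidden, mask, mode="query")    # padding is ignored
doc_vecs = pool(hidden, mask, mode="sentence")   # first-token vectors
print(query_vecs)
print(doc_vecs)
```

The same input therefore yields different vectors depending on its role, which is why the `mode` argument must match how the text is used.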
Usage
from typing import Union
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("avsolatorio/NoInstruct-small-Embedding-v0")
tokenizer = AutoTokenizer.from_pretrained("avsolatorio/NoInstruct-small-Embedding-v0")
def get_embedding(text: Union[str, list[str]], mode: str = "sentence"):
    model.eval()

    assert mode in ("query", "sentence"), f"mode={mode} was passed but only `query` and `sentence` are the supported modes."

    if isinstance(text, str):
        text = [text]

    inp = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        output = model(**inp)

    # The model is optimized to use the mean pooling for queries,
    # while the sentence / document embedding uses the [CLS] representation.
    if mode == "query":
        vectors = output.last_hidden_state * inp["attention_mask"].unsqueeze(2)
        vectors = vectors.sum(dim=1) / inp["attention_mask"].sum(dim=-1).view(-1, 1)
    else:
        vectors = output.last_hidden_state[:, 0, :]

    return vectors

texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]
# Compute embeddings
embeddings = get_embedding(texts, mode="sentence")
# Compute cosine-similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
# Test the retrieval performance.
query = get_embedding("Which sentence talks about concept on jobs?", mode="query")
scores = F.cosine_similarity(query, embeddings, dim=-1)
print(scores.cpu().numpy())
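
The retrieval check above prints raw similarity scores. To turn such scores into a ranked result list, something like the following sketch could be used. It runs on small hand-made vectors rather than real embeddings, and the helper name `rank_documents` and its `top_k` parameter are assumptions for illustration, not part of the model's API.

```python
import torch
import torch.nn.functional as F


def rank_documents(query_vec: torch.Tensor,
                   doc_vecs: torch.Tensor,
                   top_k: int = 3) -> list[tuple[int, float]]:
    """Return (document index, cosine score) pairs, best match first.

    Hypothetical helper; query_vec has shape (1, hidden) or (hidden,),
    doc_vecs has shape (num_docs, hidden).
    """
    scores = F.cosine_similarity(query_vec.view(1, -1), doc_vecs, dim=-1)
    k = min(top_k, doc_vecs.shape[0])
    top = torch.topk(scores, k=k)
    return [(int(i), float(s)) for s, i in zip(top.values, top.indices)]


# Toy stand-ins for get_embedding(..., mode="sentence") and mode="query".
docs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = torch.tensor([[2.0, 0.1]])

ranking = rank_documents(query, docs, top_k=2)
for idx, score in ranking:
    print(f"doc {idx}: {score:.4f}")
```

With real embeddings, `docs` would come from `get_embedding(texts, mode="sentence")` and `query` from `get_embedding(..., mode="query")`.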
Support for the Sentence Transformers library will follow soon.
Evaluation results
- accuracy on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 75.761
- ap on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 39.036
- f1 on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 69.859
- accuracy on MTEB AmazonPolarityClassification, test set (self-reported): 93.299
- ap on MTEB AmazonPolarityClassification, test set (self-reported): 90.035
- f1 on MTEB AmazonPolarityClassification, test set (self-reported): 93.286
- accuracy on MTEB AmazonReviewsClassification (en), test set (self-reported): 49.988
- f1 on MTEB AmazonReviewsClassification (en), test set (self-reported): 49.462
- map_at_1 on MTEB ArguAna, test set (self-reported): 31.935
- map_at_10 on MTEB ArguAna, test set (self-reported): 48.791