
SLX-v0.1

SLX-v0.1 is a sentence-embedding model developed by BRAHMAI Research. It maps sentences and paragraphs into a 384-dimensional dense vector space and is suited to tasks such as clustering, semantic search, and sentence similarity.

Usage with Sentence-Transformers

SLX-v0.1 can be used directly through the Sentence-Transformers library. Install the library, then load the model and encode your sentences:

Installation

pip install -U sentence-transformers

Example

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('brahmairesearch/slx-v0.1')
embeddings = model.encode(sentences)

print(embeddings)

Usage with Hugging Face Transformers

You can also use SLX-v0.1 through Hugging Face Transformers alone, without the Sentence-Transformers library. In that case, pass the input through the transformer model and then apply mean pooling over the token embeddings to obtain sentence embeddings:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - Takes attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # Token embeddings from the model output
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences for embedding
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('brahmairesearch/slx-v0.1')
model = AutoModel.from_pretrained('brahmairesearch/slx-v0.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)

Intended Use Cases

SLX-v0.1 is designed for encoding sentences and short paragraphs into dense vectors that capture their semantic information. These embeddings can be used for:

  • Information retrieval
  • Clustering
  • Sentence similarity tasks
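
As a minimal sketch of how such embeddings might be used for retrieval or similarity, the snippet below ranks corpus vectors by cosine similarity to a query vector. It uses only the Python standard library and toy 3-dimensional vectors so it runs without downloading the model; in practice the vectors would come from `model.encode(...)` and have 384 dimensions. The helper names `cosine_similarity` and `rank_by_similarity` are illustrative and are not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embeddings (assumption: real ones come from the model)
corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]
hits = rank_by_similarity(query, corpus, top_k=2)
print(hits)
```

If the embeddings are already L2-normalized, as in the Transformers example above, the cosine similarity reduces to a plain dot product.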

Note that input text longer than 256 word pieces is truncated by default, so any content beyond that limit does not contribute to the embedding.
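
One possible way to work around the truncation limit is to split long text into chunks and embed each chunk separately. The sketch below is an assumption about how that could be done, not something the model card prescribes. The function names are hypothetical, and the default counter treats whitespace-separated words as a rough proxy for word pieces, which usually undercounts.

```python
def whitespace_token_count(text):
    """Rough proxy: counts whitespace-separated words, not real word pieces."""
    return len(text.split())


def split_into_chunks(text, max_tokens=256, count_tokens=whitespace_token_count):
    """Greedily pack words into chunks whose token count stays within max_tokens.

    A single word that exceeds the budget on its own is kept as its own chunk.
    """
    chunks, current = [], []
    for word in text.split():
        candidate = current + [word]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            # Budget exceeded: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


long_text = " ".join(["word"] * 600)
chunks = split_into_chunks(long_text, max_tokens=256)
print([len(c.split()) for c in chunks])
```

For a closer match to the real limit, `count_tokens` could be replaced with a counter built on the model's tokenizer, for example the length of `tokenizer(text)["input_ids"]`, which also counts special tokens.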

Training Procedure

Pre-training

SLX-v0.1 is based on the pre-trained model nreimers/MiniLM-L6-H384-uncased. For more information on the pre-training process, please refer to the original model card.

Fine-tuning

The model was fine-tuned using a contrastive learning objective. Cosine similarity is computed between every possible sentence pair within a batch, and a cross-entropy loss is then applied with the true pairs as the targets.
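
The objective described above can be sketched as follows. This is a minimal illustration written with the standard library so that the arithmetic is visible; it is not the training code used for SLX-v0.1. The function name, the one-directional form of the loss, and the `scale` factor of 20 are assumptions, since the card does not state them.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities within a batch.

    anchors[i] and positives[i] form the true pair; every positives[j] with
    j != i serves as a negative for anchors[i]. `scale` is an assumed value.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Scaled cosine similarity of anchor i with every candidate in the batch
        logits = [scale * sum(x * y for x, y in zip(a_i, p_j)) for p_j in p]
        # Numerically stable log-sum-exp over the candidates
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the true pair (index i) as the target class
        total += log_sum_exp - logits[i]
    return total / len(a)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
mismatched = in_batch_contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is small when each anchor is most similar to its own pair and grows when another sentence in the batch scores higher, which is what pushes true pairs together and other pairs apart.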

After fine-tuning, the model underwent a further transfer-learning stage that used the dunzhang/stella_en_400M_v5 model together with an internally curated dataset, to optimize it for English-language tasks.

Contact

For inquiries or feedback, feel free to reach out to us at [email protected].

Best regards,
The BRAHMAI Team

Model size: 22.7M params (Safetensors)
Tensor type: F32