SentenceTransformer based on sentence-transformers/stsb-distilbert-base

This is a sentence-transformers model finetuned from sentence-transformers/stsb-distilbert-base on the mnrl and cl datasets. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: sentence-transformers/stsb-distilbert-base
Maximum Sequence Length: 128 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Datasets:
- mnrl
- cl
Language: en
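
As a rough illustration of the similarity function listed above, the sketch below computes cosine similarity in plain Python. The helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions for illustration only; the model's real embeddings have 768 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for 768-dimensional sentence embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 2.0, 0.0]))  # 0.0
```

Because the score depends only on direction, scaling either vector leaves it unchanged.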

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
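
The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence embedding is the average of the token embeddings. A minimal sketch of that idea follows, assuming padding tokens are excluded through an attention mask; the function name `mean_pool` and the toy 2-dimensional vectors are hypothetical, not the library's code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an input that consists only of padding.
    return [s / max(count, 1) for s in summed]

# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```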

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/stsb-distilbert-base-mnrl-cl-multi")
# Run inference
sentences = [
    'How fast is fast?',
    'How does light travel so fast?',
    'How do I copyright my books?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Binary Classification

Dataset: quora-duplicates
Evaluated with BinaryClassificationEvaluator

Metric	Value
cosine_accuracy	0.846
cosine_accuracy_threshold	0.7969
cosine_f1	0.7791
cosine_f1_threshold	0.714
cosine_precision	0.6978
cosine_recall	0.882
cosine_ap	0.823
dot_accuracy	0.843
dot_accuracy_threshold	151.2908
dot_f1	0.7661
dot_f1_threshold	143.7784
dot_precision	0.7238
dot_recall	0.8137
dot_ap	0.7946
manhattan_accuracy	0.838
manhattan_accuracy_threshold	194.9912
manhattan_f1	0.7704
manhattan_f1_threshold	247.4978
manhattan_precision	0.6537
manhattan_recall	0.9379
manhattan_ap	0.815
euclidean_accuracy	0.841
euclidean_accuracy_threshold	9.0223
euclidean_f1	0.7704
euclidean_f1_threshold	11.3852
euclidean_precision	0.6463
euclidean_recall	0.9534
euclidean_ap	0.8153
max_accuracy	0.846
max_accuracy_threshold	194.9912
max_f1	0.7791
max_f1_threshold	247.4978
max_precision	0.7238
max_recall	0.9534
max_ap	0.823
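
To make the threshold-based metrics above concrete, here is a minimal sketch of how accuracy, precision, recall, and F1 can be derived from similarity scores and a cutoff. The function name `binary_metrics` and the toy scores are assumptions, as is the choice of a non-strict `>=` comparison; for distance metrics such as Manhattan or Euclidean the comparison would be reversed.

```python
def binary_metrics(scores, labels, threshold):
    """Score pairs as duplicates when their similarity reaches the threshold."""
    predictions = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predictions, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y == 1)
    tn = sum(1 for p, y in zip(predictions, labels) if not p and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy cosine scores with gold duplicate labels (1 = duplicate).
print(binary_metrics([0.9, 0.8, 0.75, 0.3], [1, 0, 1, 0], threshold=0.7))
```

Plugging the reported cosine_precision (0.6978) and cosine_recall (0.882) into the same F1 formula gives roughly 0.779, in line with the reported cosine_f1.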

Paraphrase Mining

Dataset: quora-duplicates-dev
Evaluated with ParaphraseMiningEvaluator

Metric	Value
average_precision	0.5889
f1	0.5762
precision	0.5478
recall	0.6077
threshold	0.7729
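
Paraphrase mining scores every pair of sentences in a collection and keeps the most similar ones. The brute-force sketch below illustrates the idea on toy 2-dimensional vectors; `mine_paraphrases` and `cosine` are hypothetical helpers, and the library's own `util.paraphrase_mining` is the better choice for real corpora since this version compares all pairs directly.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_paraphrases(embeddings, threshold):
    """Return (score, i, j) for every pair at or above threshold, best first."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((score, i, j))
    return sorted(pairs, reverse=True)

# Toy embeddings: items 0 and 1 point in nearly the same direction.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for score, i, j in mine_paraphrases(toy, threshold=0.7729):
    print(i, j, round(score, 3))
```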

Information Retrieval

Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.963
cosine_accuracy@3	0.9906
cosine_accuracy@5	0.9944
cosine_accuracy@10	0.9982
cosine_precision@1	0.963
cosine_precision@3	0.4285
cosine_precision@5	0.2757
cosine_precision@10	0.1449
cosine_recall@1	0.83
cosine_recall@3	0.959
cosine_recall@5	0.9806
cosine_recall@10	0.9926
cosine_ndcg@10	0.9784
cosine_mrr@10	0.9772
cosine_map@100	0.9709
dot_accuracy@1	0.9514
dot_accuracy@3	0.9852
dot_accuracy@5	0.991
dot_accuracy@10	0.9968
dot_precision@1	0.9514
dot_precision@3	0.4247
dot_precision@5	0.2736
dot_precision@10	0.1446
dot_recall@1	0.8194
dot_recall@3	0.952
dot_recall@5	0.9756
dot_recall@10	0.9911
dot_ndcg@10	0.9715
dot_mrr@10	0.9693
dot_map@100	0.9617
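
The @k metrics above can be read per query as follows. This is a minimal sketch with a hypothetical `retrieval_metrics` helper and made-up document IDs; the evaluator averages such values over all queries.

```python
def retrieval_metrics(ranked_ids, relevant_ids, k):
    """Per-query accuracy@k, precision@k, recall@k, and reciprocal rank@k."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank  # only the first relevant hit counts
            break
    return {
        "accuracy": 1.0 if hits else 0.0,          # any relevant document in the top k
        "precision": len(hits) / k,                # share of the top k that is relevant
        "recall": len(hits) / len(relevant_ids),   # share of relevant documents found
        "reciprocal_rank": reciprocal_rank,
    }

# Toy ranking for one query with two relevant documents.
print(retrieval_metrics(["d3", "d1", "d7", "d2"], {"d1", "d2"}, k=3))
```

This also shows why precision@10 can be low while accuracy@10 is near 1: when a query has only one or two relevant documents, most of the ten slots are necessarily filled with non-relevant ones.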

Training Details

Training Datasets

mnrl

Dataset: mnrl at 451a485
Size: 100,000 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 6 tokens mean: 13.85 tokens max: 42 tokens	min: 6 tokens mean: 13.65 tokens max: 44 tokens	min: 4 tokens mean: 14.76 tokens max: 64 tokens

Samples:

anchor	positive	negative
`Why in India do we not have one on one political debate as in USA?`	`Why cant we have a public debate between politicians in India like the one in US?`	`Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?`
`What is OnePlus One?`	`How is oneplus one?`	`Why is OnePlus One so good?`
`Does our mind control our emotions?`	`How do smart and successful people control their emotions?`	`How can I control my positive emotions for the people whom I love but they don't care about me?`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
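
For intuition, the loss with these parameters can be sketched as a cross-entropy over scaled cosine similarities, where each anchor must pick its own positive out of all candidates in the batch. The code below is a plain-Python approximation under that assumption, not the library's implementation; `mnrl_loss` and `cosine` are hypothetical names, and the candidate list is assumed to hold the positives first, followed by any hard negatives.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnrl_loss(anchors, candidates, scale=20.0):
    """Mean cross-entropy where candidates[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

# Toy batch: two anchors, their two positives, and one extra hard negative.
anchors = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mnrl_loss(anchors, candidates))  # close to zero: each anchor matches its positive
```

The `scale` of 20.0 sharpens the softmax, so small differences in cosine similarity translate into large differences in the loss.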

cl

Dataset: cl at 451a485
Size: 100,000 training samples
Columns: sentence1, sentence2, and label
Approximate statistics based on the first 1000 samples:

	sentence1	sentence2	label
type	string	string	int
details	min: 6 tokens mean: 15.3 tokens max: 57 tokens	min: 6 tokens mean: 15.66 tokens max: 56 tokens	0: ~62.00% 1: ~38.00%

Samples:

sentence1	sentence2	label
`What is the step by step guide to invest in share market in india?`	`What is the step by step guide to invest in share market?`	`0`
`What is the story of Kohinoor (Koh-i-Noor) Diamond?`	`What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?`	`0`
`How can I increase the speed of my internet connection while using a VPN?`	`How can Internet speed be increased by hacking through DNS?`	`0`

Loss: ContrastiveLoss with these parameters:

{
    "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
    "margin": 0.5,
    "size_average": true
}
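
With these parameters, the loss pulls duplicate pairs (label 1) together and pushes non-duplicate pairs (label 0) apart until their cosine distance exceeds the margin. The sketch below follows the Hadsell et al. formulation and assumes the distance is 1 minus the cosine similarity; `contrastive_loss` is a hypothetical helper, not the library class.

```python
def contrastive_loss(distances, labels, margin=0.5, size_average=True):
    """Contrastive loss over precomputed cosine distances (1 - cosine similarity)."""
    losses = []
    for d, y in zip(distances, labels):
        pull = y * d ** 2                              # duplicates: penalise any distance
        push = (1 - y) * max(0.0, margin - d) ** 2     # non-duplicates: penalise only inside the margin
        losses.append(0.5 * (pull + push))
    total = sum(losses)
    return total / len(losses) if size_average else total

# Toy pairs: a close duplicate, a too-close non-duplicate, a distant non-duplicate.
print(contrastive_loss([0.1, 0.2, 0.9], [1, 0, 0]))
```

The third pair contributes nothing because its distance already exceeds the margin of 0.5.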

Evaluation Datasets

mnrl

Dataset: mnrl at 451a485
Size: 1,000 evaluation samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 7 tokens mean: 13.84 tokens max: 43 tokens	min: 6 tokens mean: 13.8 tokens max: 38 tokens	min: 6 tokens mean: 14.71 tokens max: 56 tokens

Samples:

anchor	positive	negative
`Which programming language is best for developing low-end games?`	`What coding language should I learn first for making games?`	`I am entering the world of video game programming and want to know what language I should learn? Because there are so many languages I do not know which one to start with. Can you recommend a language that's easy to learn and can be used with many platforms?`
`Was it appropriate for Meryl Streep to use her Golden Globes speech to attack Donald Trump?`	`Should Meryl Streep be using her position to attack the president?`	`Why did Kelly Ann Conway say that Meryl Streep incited peoples worst feelings?`
`Where can I found excellent commercial fridges in Sydney?`	`Where can I found impressive range of commercial fridges in Sydney?`	`What is the best grocery delivery service in Sydney?`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

cl

Dataset: cl at 451a485
Size: 1,000 evaluation samples
Columns: sentence1, sentence2, and label
Approximate statistics based on the first 1000 samples:

	sentence1	sentence2	label
type	string	string	int
details	min: 5 tokens mean: 15.59 tokens max: 59 tokens	min: 6 tokens mean: 15.65 tokens max: 76 tokens	0: ~63.40% 1: ~36.60%

Samples:

sentence1	sentence2	label
`What should I ask my friend to get from UK to India?`	`What is the process of getting a surgical residency in UK after completing MBBS from India?`	`0`
`How can I learn hacking for free?`	`How can I learn to hack seriously?`	`1`
`Which is the best website to learn programming language C++?`	`Which is the best website to learn C++ Programming language for free?`	`0`

Loss: ContrastiveLoss with these parameters:

{
    "distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
    "margin": 0.5,
    "size_average": true
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 64
per_device_eval_batch_size: 64
num_train_epochs: 1
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
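
The `warmup_ratio` of 0.1 combines with the default learning rate (5e-05) and the default linear scheduler. A minimal sketch of the resulting schedule follows, assuming about 3,125 optimizer steps (200,000 training samples divided by a batch size of 64); the exact step count may differ slightly because of the no_duplicates sampler, and `linear_schedule_lr` is a hypothetical helper rather than the trainer's code.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 3125  # assumption: 200,000 samples / batch size 64
for step in (0, 100, 313, 1500, 3125):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```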

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: False
per_device_train_batch_size: 64
per_device_eval_batch_size: 64
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: None
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	cl loss	mnrl loss	cosine_map@100	quora-duplicates-dev_average_precision	quora-duplicates_max_ap
0	0	-	-	-	0.9245	0.4200	0.6890
0.0320	100	0.1634	-	-	-	-	-
0.0640	200	0.1206	-	-	-	-	-
0.0800	250	-	0.0190	0.1469	0.9530	0.5068	0.7354
0.0960	300	0.1036	-	-	-	-	-
0.1280	400	0.0836	-	-	-	-	-
0.1599	500	0.0918	0.0180	0.1008	0.9553	0.5259	0.7643
0.1919	600	0.0784	-	-	-	-	-
0.2239	700	0.0656	-	-	-	-	-
0.2399	750	-	0.0177	0.0905	0.9593	0.5305	0.7686
0.2559	800	0.0593	-	-	-	-	-
0.2879	900	0.0534	-	-	-	-	-
0.3199	1000	0.0612	0.0161	0.0736	0.9642	0.5512	0.7881
0.3519	1100	0.0572	-	-	-	-	-
0.3839	1200	0.06	-	-	-	-	-
0.3999	1250	-	0.0158	0.0641	0.9649	0.5567	0.7983
0.4159	1300	0.0565	-	-	-	-	-
0.4479	1400	0.0565	-	-	-	-	-
0.4798	1500	0.0475	0.0154	0.0578	0.9645	0.5614	0.8062
0.5118	1600	0.0596	-	-	-	-	-
0.5438	1700	0.0509	-	-	-	-	-
0.5598	1750	-	0.0150	0.0525	0.9674	0.5762	0.8092
0.5758	1800	0.0403	-	-	-	-	-
0.6078	1900	0.0431	-	-	-	-	-
0.6398	2000	0.0481	0.0150	0.0531	0.9689	0.5824	0.8128
0.6718	2100	0.05	-	-	-	-	-
0.7038	2200	0.0468	-	-	-	-	-
0.7198	2250	-	0.0146	0.0486	0.9684	0.5756	0.8195
0.7358	2300	0.0436	-	-	-	-	-
0.7678	2400	0.0409	-	-	-	-	-
0.7997	2500	0.0391	0.0145	0.0454	0.9705	0.5822	0.8190
0.8317	2600	0.0412	-	-	-	-	-
0.8637	2700	0.0373	-	-	-	-	-
0.8797	2750	-	0.0143	0.0451	0.9705	0.5889	0.8229
0.8957	2800	0.0428	-	-	-	-	-
0.9277	2900	0.0419	-	-	-	-	-
0.9597	3000	0.0376	0.0143	0.0435	0.9709	0.5889	0.8230
0.9917	3100	0.0366	-	-	-	-	-

Environmental Impact

Carbon emissions were measured using CodeCarbon.

Energy Consumed: 0.084 kWh
Carbon Emitted: 0.033 kg of CO2
Hours Used: 0.399 hours
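
Two derived quantities follow directly from the figures above: the average power draw during training and the implied carbon intensity of the electricity. The short calculation below is only arithmetic on the reported values; the variable names are illustrative.

```python
energy_kwh = 0.084   # Energy Consumed
carbon_kg = 0.033    # Carbon Emitted
hours = 0.399        # Hours Used

average_power_w = energy_kwh / hours * 1000   # kWh per hour, converted to watts
carbon_intensity = carbon_kg / energy_kwh     # kg of CO2 per kWh

print(f"{average_power_w:.0f} W")             # 211 W
print(f"{carbon_intensity:.3f} kg CO2/kWh")   # 0.393 kg CO2/kWh
```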

Training Hardware

On Cloud: No
GPU Model: 1 x NVIDIA GeForce RTX 3090
CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
RAM Size: 31.78 GB

Framework Versions

Python: 3.11.6
Sentence Transformers: 3.0.0.dev0
Transformers: 4.41.0.dev0
PyTorch: 2.3.0+cu121
Accelerate: 0.26.1
Datasets: 2.18.0
Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

ContrastiveLoss

@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)}, 
    title={Dimensionality Reduction by Learning an Invariant Mapping}, 
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}
