Update README.md
README.md
CHANGED
@@ -5985,7 +5985,7 @@ batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=Tru
|
|
5985 |
outputs = model(**batch_dict)
|
5986 |
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
|
5987 |
|
5988 |
-
#
|
5989 |
embeddings = F.normalize(embeddings, p=2, dim=1)
|
5990 |
scores = (embeddings[:2] @ embeddings[2:].T) * 100
|
5991 |
print(scores.tolist())
|
@@ -6037,11 +6037,61 @@ For all labeled datasets, we only use its training set for fine-tuning.
|
|
6037 |
|
6038 |
For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).
|
6039 |
|
6040 |
-
## Benchmark
|
6041 |
|
6042 |
Check out [unilm/e5](https://github.com/microsoft/unilm/tree/master/e5) to reproduce evaluation results
|
6043 |
on the [BEIR](https://arxiv.org/abs/2104.08663) and [MTEB benchmark](https://arxiv.org/abs/2210.07316).
|
6044 |
|
6045 |
## Citation
|
6046 |
|
6047 |
If you find our paper or models helpful, please consider citing them as follows:
|
|
|
5985 |
outputs = model(**batch_dict)
|
5986 |
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
|
5987 |
|
5988 |
+
# normalize embeddings
|
5989 |
embeddings = F.normalize(embeddings, p=2, dim=1)
|
5990 |
scores = (embeddings[:2] @ embeddings[2:].T) * 100
|
5991 |
print(scores.tolist())
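
The lines in this hunk perform three steps: masked mean pooling, L2 normalization, and a scaled dot product. The dependency-free sketch below walks through those steps on toy numbers. It rests on assumptions: `average_pool` is defined outside this hunk, so the version here is an illustrative re-implementation on plain lists rather than tensors, and `l2_normalize` and `score` are hypothetical helper names.

```python
import math

def average_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping positions whose mask is 0."""
    pooled = []
    for tokens, mask in zip(hidden_states, attention_mask):
        total = [0.0] * len(tokens[0])
        for vec, keep in zip(tokens, mask):
            if keep:  # padding tokens contribute nothing
                for i, value in enumerate(vec):
                    total[i] += value
        count = sum(mask)  # number of real (non-padding) tokens
        pooled.append([t / count for t in total])
    return pooled

def l2_normalize(vec):
    """Scale a vector to unit length, as F.normalize(..., p=2) does per row."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def score(a, b):
    """Dot product of two unit vectors (cosine similarity), scaled by 100."""
    return 100 * sum(x * y for x, y in zip(a, b))

# Toy "last hidden states": 2 sequences, 3 tokens each, 2 dimensions.
hidden_states = [
    [[3.0, 4.0], [1.0, 0.0], [9.0, 9.0]],  # third token is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
]
attention_mask = [[1, 1, 0], [1, 1, 1]]

pooled = average_pool(hidden_states, attention_mask)
embeddings = [l2_normalize(vec) for vec in pooled]
print(score(embeddings[0], embeddings[1]))
```

Because the embeddings have unit length after normalization, the dot product equals cosine similarity, and the factor of 100 only rescales it.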
|
|
|
6037 |
|
6038 |
For other training details, please refer to our paper at [https://arxiv.org/pdf/2212.03533.pdf](https://arxiv.org/pdf/2212.03533.pdf).
|
6039 |
|
6040 |
+
## Benchmark Results on [Mr. TyDi](https://arxiv.org/abs/2108.08787)
|
6041 |
+
|
6042 |
+
| Model | Avg MRR@10 | | ar | bn | en | fi | id | ja | ko | ru | sw | te | th |
|
6043 |
+
|-----------------------|------------|-------|------| --- | --- | --- | --- | --- | --- | --- |------| --- | --- |
|
6044 |
+
| BM25 | 33.3 | | 36.7 | 41.3 | 15.1 | 28.8 | 38.2 | 21.7 | 28.1 | 32.9 | 39.6 | 42.4 | 41.7 |
|
6045 |
+
| mDPR | 16.7 | | 26.0 | 25.8 | 16.2 | 11.3 | 14.6 | 18.1 | 21.9 | 18.5 | 7.3 | 10.6 | 13.5 |
|
6046 |
+
| BM25 + mDPR | 41.7 | | 49.1 | 53.5 | 28.4 | 36.5 | 45.5 | 35.5 | 36.2 | 42.7 | 40.5 | 42.0 | 49.2 |
|
6047 |
+
| | |
|
6048 |
+
| multilingual-e5-small | 64.4 | | 71.5 | 66.3 | 54.5 | 57.7 | 63.2 | 55.4 | 54.3 | 60.8 | 65.4 | 89.1 | 70.1 |
|
6049 |
+
| multilingual-e5-base | 65.9 | | 72.3 | 65.0 | 58.5 | 60.8 | 64.9 | 56.6 | 55.8 | 62.7 | 69.0 | 86.6 | 72.7 |
|
6050 |
+
| multilingual-e5-large | **70.5** | | 77.5 | 73.2 | 60.8 | 66.8 | 68.5 | 62.5 | 61.6 | 65.8 | 72.7 | 90.2 | 76.2 |
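
For readers unfamiliar with the metric, here is a minimal sketch of how MRR@10 is commonly computed: for each query, take the reciprocal rank of the first relevant document within the top 10 results (0 if none appears), then average over queries. The function name `mrr_at_k` and the toy document IDs are hypothetical, and this is not the evaluation script used for the table. The last lines assume the "Avg MRR@10" column is an unweighted mean over the 11 languages, which is consistent with the BM25 row.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(ranked_ids_per_query)

# Toy rankings (hypothetical document IDs).
ranked = [
    ["d1", "d2", "d3"],        # relevant document at rank 1
    ["d9", "d8", "d7", "d4"],  # relevant document at rank 4
    ["d5", "d6"],              # no relevant document retrieved
]
relevant = [{"d1"}, {"d4"}, {"d0"}]

toy_mrr = mrr_at_k(ranked, relevant)
print(round(100 * toy_mrr, 1))  # the table reports MRR@10 scaled by 100

# Per-language BM25 scores copied from the table (ar ... th).
bm25 = [36.7, 41.3, 15.1, 28.8, 38.2, 21.7, 28.1, 32.9, 39.6, 42.4, 41.7]
bm25_avg = round(sum(bm25) / len(bm25), 1)
print(bm25_avg)
```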
|
6051 |
+
|
6052 |
+
## MTEB Benchmark Evaluation
|
6053 |
|
6054 |
Check out [unilm/e5](https://github.com/microsoft/unilm/tree/master/e5) to reproduce evaluation results
|
6055 |
on the [BEIR](https://arxiv.org/abs/2104.08663) and [MTEB benchmark](https://arxiv.org/abs/2210.07316).
|
6056 |
|
6057 |
+
## Support for Sentence Transformers
|
6058 |
+
|
6059 |
+
Below is an example for usage with sentence_transformers.
|
6060 |
+
```python
|
6061 |
+
from sentence_transformers import SentenceTransformer
|
6062 |
+
model = SentenceTransformer('intfloat/multilingual-e5-large')
|
6063 |
+
input_texts = [
|
6064 |
+
'query: how much protein should a female eat',
|
6065 |
+
'query: 南瓜的家常做法',
|
6066 |
+
"passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
|
6067 |
+
"passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
|
6068 |
+
]
|
6069 |
+
embeddings = model.encode(input_texts, normalize_embeddings=True)
|
6070 |
+
```
|
6071 |
+
|
6072 |
+
Package requirements
|
6073 |
+
|
6074 |
+
`pip install sentence_transformers~=2.2.2`
|
6075 |
+
|
6076 |
+
Contributors: [michaelfeil](https://huggingface.co/michaelfeil)
|
6077 |
+
|
6078 |
+
## FAQ
|
6079 |
+
|
6080 |
+
**1. Do I need to add the prefixes "query: " and "passage: " to input texts?**
|
6081 |
+
|
6082 |
+
Yes. The model was trained with these prefixes, so omitting them degrades performance.
|
6083 |
+
|
6084 |
+
Here are some rules of thumb:
|
6085 |
+
- Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
|
6086 |
+
|
6087 |
+
- Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
|
6088 |
+
|
6089 |
+
- Use the "query: " prefix if you want to use embeddings as features, for example in linear probing classification or clustering.
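
The rules of thumb above can be sketched as a small helper that prepends the prefix before texts are passed to the tokenizer or to `model.encode`. The function name `add_e5_prefix` and its `role` parameter are hypothetical and not part of any library. Only the two prefix strings come from this README.

```python
def add_e5_prefix(texts, role="query"):
    """Prepend the E5 prefix ("query: " or "passage: ") to each input text."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    return [f"{role}: {text}" for text in texts]

# Asymmetric task (retrieval): queries and documents get different prefixes.
queries = add_e5_prefix(["how much protein should a female eat"], role="query")
passages = add_e5_prefix(["Protein needs vary with age and activity."], role="passage")

# Symmetric task or feature extraction: every text gets the "query: " prefix.
sentence_pairs = add_e5_prefix(["A man is eating.", "Someone is having a meal."])

print(queries)
print(passages)
print(sentence_pairs)
```

The prefixed lists can then be concatenated and passed as `input_texts` in the examples shown earlier in this README.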
|
6090 |
+
|
6091 |
+
**2. Why are my reproduced results slightly different from those reported in the model card?**
|
6092 |
+
|
6093 |
+
Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.
|
6094 |
+
|
6095 |
## Citation
|
6096 |
|
6097 |
If you find our paper or models helpful, please consider citing them as follows:
|