ldwang committed
Commit 4619734
1 Parent(s): 6c18dec

update readme
Files changed (1):
  1. README.md +70 -33

README.md (updated):

- sentence-transformers
---


<h1 align="center">FlagEmbedding</h1>

 
  <a href="#usage">Usage</a> |
  <a href="#evaluation">Evaluation</a> |
  <a href="#train">Train</a> |
  <a href="#contact">Contact</a> |
  <a href="#license">License</a>
  <p>
</h4>

For more details, please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).

[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)

FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search.
It can also be used in vector databases for LLMs.

************* 🌟**Updates**🌟 *************
- 08/09/2023: BGE models are integrated into **LangChain**; you can use them like [**this**](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
- 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
- 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, **ranked 1st on the MTEB and C-MTEB benchmarks!**
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.

 

`bge` is short for `BAAI general embedding`.

| Model | Language | Description | query instruction for retrieval\* |
|:-------------------------------|:--------:|:--------:|:--------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | ranks **1st** in the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** in the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | ranks **1st** in the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | trained without instruction; ranks **2nd** in the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

\*: If you need to search for **long** relevant passages with a **short** query (the s2p retrieval task), you need to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
 
## Usage

Here are some examples of using `bge` models with
[FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [LangChain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).

#### Using FlagEmbedding
```
pip install -U FlagEmbedding
```
If it doesn't work for you, see [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for more installation methods.
 
```python
from FlagEmbedding import FlagModel
sentences = ["样例数据-1", "样例数据-2"]
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
embeddings_1 = model.encode(sentences)
embeddings_2 = model.encode(sentences)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# For the s2p (short query to long passage) retrieval task, use encode_queries(), which automatically adds the instruction to each query.
# The corpus can still be encoded with encode() or encode_corpus(), since passages do not need the instruction.
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
```
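
The resulting score matrix can be used directly to rank passages for each query. A small sketch, assuming the embeddings come back as NumPy arrays:

```python
import numpy as np

# Rank passages for each query by descending similarity score.
ranked = np.argsort(-scores, axis=1)
for qi, order in enumerate(ranked):
    print(queries[qi], "->", [passages[i] for i in order])
```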
 
FlagModel uses all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to choose specific GPUs.
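
For example, a minimal sketch that restricts encoding to a single GPU (the variable has to be set before the model is created):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use only GPU 0

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-large-zh',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
```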
 

#### Using Sentence-Transformers

Using this model is also easy if you have [sentence-transformers](https://www.SBERT.net) installed:
 
 
```python
from sentence_transformers import SentenceTransformer
sentences = ["样例数据-1", "样例数据-2"]
model = SentenceTransformer('BAAI/bge-large-zh')
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
For the s2p (short query to long passage) retrieval task,
each short query should start with an instruction (see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions).
The instruction is not needed for passages.
```python
from sentence_transformers import SentenceTransformer
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh')
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```
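
Optionally, sentence-transformers also ships a `util.semantic_search` helper that ranks passages by cosine similarity. A short sketch reusing the embeddings above:

```python
from sentence_transformers import util

# Returns, for each query, the top_k passages with their similarity scores.
hits = util.semantic_search(q_embeddings, p_embeddings, top_k=2)
for qi, query_hits in enumerate(hits):
    print(queries[qi], "->", [(passages[h['corpus_id']], round(h['score'], 4)) for h in query_hits])
```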

#### Using Langchain

You can use `bge` in LangChain like this:
```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}  # set True to compute cosine similarity
model_norm = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
```
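
As a quick usage sketch (assuming the standard LangChain `Embeddings` interface), the object above exposes `embed_query` and `embed_documents`, which return plain Python lists of floats:

```python
query_vector = model_norm.embed_query("sample query")
doc_vectors = model_norm.embed_documents(["样例文档-1", "样例文档-2"])
print(len(query_vector), len(doc_vectors))
```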


#### Using HuggingFace Transformers

With the transformers package, you can use the model like this: first, pass your input through the transformer model, then select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.

 
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
model = AutoModel.from_pretrained('BAAI/bge-large-zh')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For the s2p (short query to long passage) retrieval task, add an instruction to each query (do not add it to passages), e.g.:
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Take the last hidden state of the first token ([CLS]) as the sentence embedding
    sentence_embeddings = model_output[0][:, 0]
print("Sentence embeddings:", sentence_embeddings)
```
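
To make inner products between these embeddings behave like the cosine similarities used in the examples above, you can additionally L2-normalize them. A small sketch, assuming the `sentence_embeddings` tensor from the block above:

```python
import torch

normalized = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
similarity = normalized @ normalized.T
print(similarity)
```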
 


- **C-MTEB**:
We create a benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets from 6 tasks.
Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
 
We pre-train the model following the [RetroMAE](https://github.com/staoxiao/RetroMAE) method,
which shows promising improvement on retrieval tasks ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
The pre-training was conducted on 24 A100 (40G) GPUs with a batch size of 720.
In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively.
We used the AdamW optimizer with a learning rate of 2e-5.
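
For reference, the pre-training setup described above can be summarized in a small configuration sketch (the field names are illustrative, not taken from the repository):

```python
# Illustrative summary of the stated pre-training hyperparameters; names are hypothetical.
retromae_pretrain_config = {
    "gpus": 24,                  # A100 40G
    "batch_size": 720,
    "encoder_mask_ratio": 0.3,
    "decoder_mask_ratio": 0.5,
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
}
```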
 
**Pre-training data**:
- English:
    - [wikipedia](https://huggingface.co/datasets/wikipedia)
    - [msmarco](https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus)
- Chinese:
    - [wudao](https://github.com/BAAI-WuDao/Data)


**2. Finetune**
 
We used the AdamW optimizer with a learning rate of 1e-5.
The temperature for the contrastive loss is 0.01.
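
The exact loss implementation is not spelled out here; a common temperature-scaled, in-batch contrastive formulation consistent with this description is sketched below as an illustration (the actual training code may differ):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q, p, temperature=0.01):
    """q, p: (batch, dim) L2-normalized query/passage embeddings; positives are paired by index."""
    scores = q @ p.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # matching passage sits on the diagonal
    return F.cross_entropy(scores, labels)
```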
 
Besides, for the s2p (short query to long passage) retrieval task, we add the instruction to the query during training (and add nothing to passages).
For English, the instruction is `Represent this sentence for searching relevant passages: `;
for Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
In evaluation, the instruction should be added to queries for the retrieval task and should not be added for other tasks.
Note that the instruction is not needed for passages.
 
The finetuning script is available in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
You can easily finetune your model with it.
 

**The data collection is to be released in the future.**

We will continually update the embedding models and training code,
hoping to promote the development of the embedding model community.


## License
FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.