---
license: cc-by-nc-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- ColBERT
- passage-retrieval
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Jina AI logo" width="150px">
</p>

<p align="center">
<b>Trained by <a href="https://jina.ai/">Jina AI</a>.</b>
</p>

<p align="center">
<b>JinaColBERT V2: your multilingual late interaction retriever!</b>
</p>

JinaColBERT V2 (`jina-colbert-v2`) builds on [JinaColBERT V1](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/), expanding the capabilities and performance of the [`jina-colbert-v1-en`](https://huggingface.co/jinaai/jina-colbert-v1-en) model. Like the previous release, it offers Jina AI's 8192-token input context along with the [improved efficiency and performance](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/) and [explainability](https://jina.ai/news/ai-explainability-made-easy-how-late-interaction-makes-jina-colbert-transparent/) of token-level embeddings and late interaction.

This release adds new functionality and improves performance:

- Multilingual support for dozens of languages, with strong performance on major global languages.
- [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), which let users flexibly trade efficiency for precision.
- Superior retrieval performance compared to the English-only [`jina-colbert-v1-en`](https://huggingface.co/jinaai/jina-colbert-v1-en).

JinaColBERT V2 is available in three versions with different embedding dimensions:

- [`jinaai/jina-colbert-v2`](https://huggingface.co/jinaai/jina-colbert-v2): 128-dimensional embeddings
- [`jinaai/jina-colbert-v2-96`](https://huggingface.co/jinaai/jina-colbert-v2-96): 96-dimensional embeddings
- [`jinaai/jina-colbert-v2-64`](https://huggingface.co/jinaai/jina-colbert-v2-64): 64-dimensional embeddings
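
Matryoshka embeddings can be shortened by keeping only the leading components of each token vector and re-normalizing, so similarity scores remain comparable at the smaller dimension. The function and shapes below are an illustrative sketch, not part of the model's API:

```python
import numpy as np

def truncate_embeddings(token_embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each token embedding,
    then L2-normalize so cosine/MaxSim scores stay comparable."""
    truncated = token_embs[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# A hypothetical 128-dimensional token embedding matrix (5 tokens)
full = np.random.randn(5, 128).astype(np.float32)
small = truncate_embeddings(full, 64)  # shape (5, 64), unit-norm rows
```

This is the trade the 96- and 64-dimensional variants make: smaller index footprint and faster scoring in exchange for a modest drop in precision.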


## Usage

### Installation

`jina-colbert-v2` is trained with flash attention and therefore requires `einops` and `flash_attn` to be installed.

To use the model, you can use either the Stanford ColBERT library or the `ragatouille` package.

```bash
pip install -U einops flash_attn
pip install -U ragatouille
pip install -U colbert-ai
```

### RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")
docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT that supports an 8k context length along with fast and accurate retrieval.",
]
RAG.index(docs, index_name="demo")
query = "What does ColBERT do?"
results = RAG.search(query)
```

### Stanford ColBERT

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT that supports an 8k context length along with fast and accurate retrieval.",
]
# Encode documents and the query into token-level embeddings
doc_vectors = ckpt.docFromText(docs, bsize=2)
query_vectors = ckpt.queryFromText(["What does ColBERT do?"])
```
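
With token-level embeddings in hand, late interaction scores a document by taking, for each query token, the maximum similarity to any document token, and summing over query tokens (MaxSim). The sketch below runs on synthetic unit-normalized vectors; the shapes are illustrative, and in practice the embeddings come from the checkpoint above:

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Late interaction: for each query token, take the max cosine
    similarity over all document tokens, then sum over query tokens."""
    sim = query_embs @ doc_embs.T  # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = normalize(rng.standard_normal((4, 128)))    # 4 query tokens
d1 = normalize(rng.standard_normal((10, 128)))  # 10 unrelated doc tokens
d2 = normalize(np.vstack([q, rng.standard_normal((6, 128))]))  # doc containing the query's tokens

# The document that contains the query's token vectors scores higher
scores = [maxsim_score(q, d) for d in (d1, d2)]
```

Because every query token matches its own copy in `d2` with similarity 1.0, `d2` scores strictly higher than the unrelated `d1`.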

## Evaluation Results

### Retrieval Benchmarks

#### BEIR

| **NDCG@10**        | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|--------------------|---------------------|---------------------|-----------------|----------|
| **avg**            | 0.531               | 0.502               | 0.496           | 0.440    |
| **nfcorpus**       | 0.346               | 0.338               | 0.337           | 0.325    |
| **fiqa**           | 0.408               | 0.368               | 0.354           | 0.236    |
| **trec-covid**     | 0.834               | 0.750               | 0.726           | 0.656    |
| **arguana**        | 0.366               | 0.494               | 0.465           | 0.315    |
| **quora**          | 0.887               | 0.823               | 0.855           | 0.789    |
| **scidocs**        | 0.186               | 0.169               | 0.154           | 0.158    |
| **scifact**        | 0.678               | 0.701               | 0.689           | 0.665    |
| **webis-touche**   | 0.274               | 0.270               | 0.260           | 0.367    |
| **dbpedia-entity** | 0.471               | 0.413               | 0.452           | 0.313    |
| **fever**          | 0.805               | 0.795               | 0.785           | 0.753    |
| **climate-fever**  | 0.239               | 0.196               | 0.176           | 0.213    |
| **hotpotqa**       | 0.766               | 0.656               | 0.675           | 0.603    |
| **nq**             | 0.640               | 0.549               | 0.524           | 0.329    |

#### MS MARCO Passage Retrieval

| **MRR@10**  | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|-------------|---------------------|---------------------|-----------------|----------|
| **MSMARCO** | 0.396               | 0.390               | 0.397           | 0.187    |

### Multilingual Benchmarks

#### MIRACL

| **NDCG@10** | **jina-colbert-v2** | **mDPR (zero-shot)** |
|---------|---------------------|----------------------|
| **avg** | 0.627               | 0.427                |
| **ar**  | 0.753               | 0.499                |
| **bn**  | 0.750               | 0.443                |
| **de**  | 0.504               | 0.490                |
| **es**  | 0.538               | 0.478                |
| **en**  | 0.570               | 0.394                |
| **fa**  | 0.563               | 0.480                |
| **fi**  | 0.740               | 0.472                |
| **fr**  | 0.541               | 0.435                |
| **hi**  | 0.600               | 0.383                |
| **id**  | 0.547               | 0.272                |
| **ja**  | 0.632               | 0.439                |
| **ko**  | 0.671               | 0.419                |
| **ru**  | 0.643               | 0.407                |
| **sw**  | 0.499               | 0.299                |
| **te**  | 0.742               | 0.356                |
| **th**  | 0.772               | 0.358                |
| **yo**  | 0.623               | 0.396                |
| **zh**  | 0.523               | 0.512                |

#### mMARCO

| **MRR@10** | **jina-colbert-v2** | **BM25** | **ColBERT-XM** |
|------------|---------------------|----------|----------------|
| **avg**    | 0.313               | 0.141    | 0.254          |
| **ar**     | 0.272               | 0.111    | 0.195          |
| **de**     | 0.331               | 0.136    | 0.270          |
| **nl**     | 0.330               | 0.140    | 0.275          |
| **es**     | 0.341               | 0.158    | 0.285          |
| **fr**     | 0.335               | 0.155    | 0.269          |
| **hi**     | 0.309               | 0.134    | 0.238          |
| **id**     | 0.319               | 0.149    | 0.263          |
| **it**     | 0.337               | 0.153    | 0.265          |
| **ja**     | 0.276               | 0.141    | 0.241          |
| **pt**     | 0.337               | 0.152    | 0.276          |
| **ru**     | 0.298               | 0.124    | 0.251          |
| **vi**     | 0.287               | 0.136    | 0.226          |
| **zh**     | 0.302               | 0.116    | 0.246          |

### Matryoshka Representation Benchmarks

#### BEIR (subset)

| **NDCG@10**    | **dim=128** | **dim=96** | **dim=64** |
|----------------|-------------|------------|------------|
| **avg**        | 0.599       | 0.591      | 0.589      |
| **nfcorpus**   | 0.346       | 0.340      | 0.347      |
| **fiqa**       | 0.408       | 0.404      | 0.404      |
| **trec-covid** | 0.834       | 0.808      | 0.805      |
| **hotpotqa**   | 0.766       | 0.764      | 0.756      |
| **nq**         | 0.640       | 0.640      | 0.635      |

#### MSMARCO

| **MRR@10**  | **dim=128** | **dim=96** | **dim=64** |
|-------------|-------------|------------|------------|
| **msmarco** | 0.396       | 0.391      | 0.388      |

## Other Models

Additionally, we provide the following embedding models, which you can also use for retrieval:

- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameter Chinese-English bilingual model.
- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameter German-English bilingual model.
- [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): 161 million parameter Spanish-English bilingual model.
- [`jina-reranker-v2`](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual): multilingual reranker model.
- [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1): English multimodal (text-image) embedding model.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.