Update README.md
README.md
CHANGED
---
tags:
- feature-extraction
- sentence-similarity
- mteb
language: en
inference: false
license: apache-2.0
---

<!-- TODO: add evaluation results here -->

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>

<p align="center">
<b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

## Quick Start

The easiest way to start using `jina-clip-v1` is via Jina AI's [Embedding API](https://jina.ai/embeddings/).
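
For reference, here is a minimal sketch of calling the Embedding API over HTTP with Python's `requests`. The endpoint, the payload shape (separate `text` and `image` entries), and the `JINA_API_KEY` environment variable are assumptions for illustration only; consult the Embedding API documentation for the authoritative schema.

```python
import os
import requests

# Assumed endpoint and request schema; verify against the Embedding API docs
url = 'https://api.jina.ai/v1/embeddings'
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {os.environ["JINA_API_KEY"]}',  # hypothetical env var holding your API key
}
payload = {
    'model': 'jina-clip-v1',
    'input': [
        {'text': 'How is the weather today?'},          # text entry
        {'image': 'https://example.com/raindrop.png'},  # image entry (placeholder URL)
    ],
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
embeddings = [item['embedding'] for item in response.json()['data']]
print(len(embeddings), len(embeddings[0]))  # number of inputs, embedding dimensionality
```
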
## Intended Usage & Model Info

### `jina-clip-v1` Overview

`jina-clip-v1` is an English, monolingual **multimodal (text-image) embedding model**.

Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en),
excel in text-to-text retrieval but lack cross-modal retrieval capabilities.
Conversely, CLIP-like models, such as [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context-length limitations.

`jina-clip-v1` is an innovative **multimodal embedding model**.
Its text component achieves performance comparable to `jina-embeddings-v2-base-en` in text-to-text retrieval,
while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks.
This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications,
allowing for both text-to-text and text-to-image search with a single model.

## Data & Parameters

Jina CLIP V1 [technical report]() coming soon.

## Usage

You can use Jina CLIP directly via the `transformers` package:

```python
# Install the required package first:
#   pip install transformers

from numpy.linalg import norm
from transformers import AutoModel

# Cosine similarity between two embedding vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

# trust_remote_code=True is needed because the model ships its own encoding code
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# Text and images are encoded into the same embedding space
text_embeddings = model.encode_text(['How is the weather today?', 'What is the current weather like today?'])
image_embeddings = model.encode_image(['raindrop.png'])  # path to an image file

print(cos_sim(text_embeddings[0], text_embeddings[1]))  # text-to-text similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-to-image cross-modal similarity
```
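
Because text and image embeddings live in the same space, cross-modal retrieval reduces to ranking candidates by cosine similarity. Below is a minimal sketch that reuses the `model` object from the snippet above; the candidate file names are placeholders, and it assumes `encode_text`/`encode_image` return NumPy arrays, as implied by the snippet above.

```python
import numpy as np
from numpy.linalg import norm

# Placeholder image files; replace with your own paths
candidate_images = ['raindrop.png', 'sunny_beach.png', 'snowstorm.png']

query_embedding = model.encode_text(['What is the weather like today?'])[0]
image_embeddings = model.encode_image(candidate_images)

# Cosine similarity of the query against every candidate image
scores = image_embeddings @ query_embedding / (norm(image_embeddings, axis=1) * norm(query_embedding))

# Print candidates from most to least similar
for idx in np.argsort(-scores):
    print(f'{candidate_images[idx]}: {scores[idx]:.4f}')
```
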
## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find Jina CLIP useful in your research, please cite the following paper:

```console
TBD
```