---
tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0
---
<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The embedding model trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

## Quick Start

The easiest way to start using `jina-clip-v1` is via Jina AI's [Embedding API](https://jina.ai/embeddings/).
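
For reference, below is a minimal sketch of calling the Embedding API over HTTP with `requests`. The endpoint path, the request schema (the `model`/`input` fields and the `{"text": ...}` / `{"image": ...}` items), and the `JINA_API_KEY` environment variable are assumptions for illustration; consult the Embedding API documentation for the exact interface.

```python
# Minimal sketch of an Embedding API call (request schema assumed; see https://jina.ai/embeddings/).
import os
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},  # API key from an env var (assumption)
    json={
        "model": "jina-clip-v1",
        "input": [
            {"text": "How is the weather today?"},          # text input
            {"image": "https://example.com/raindrop.jpg"},  # image input by URL (assumed format)
        ],
    },
)
resp.raise_for_status()
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(len(embeddings), len(embeddings[0]))
```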

## Intended Usage & Model Info

`jina-clip-v1` is an English, monolingual **multimodal (text-image) embedding model**.

Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en),
excel in text-to-text retrieval but lack cross-modal retrieval capabilities.
Conversely, CLIP-like models, such as [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

`jina-clip-v1` is an innovative **multimodal embedding model**.
Its text component achieves comparable performance to `jina-embeddings-v2-base-en` in text-to-text retrieval,
while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks.
This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications,
allowing for both text-to-text and text-to-image searches with a single model.


## Data & Parameters

The `jina-clip-v1` technical report is coming soon.

## Usage

You can use Jina CLIP directly via the `transformers` package.

```python
# Install dependencies first: pip install transformers einops timm pillow
from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two embedding vectors
def cos_sim(a, b):
    return (a @ b.T) / (norm(a) * norm(b))

# trust_remote_code=True is required because the model ships custom modeling code
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

sentences = ['How is the weather today?', 'What is the current weather like today?']
images = ['raindrop.jpg', 'sunny.jpg']  # local image files

text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(images)

print(cos_sim(text_embeddings[0], text_embeddings[1]))   # text-text similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-image cross-modal similarity
```
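
For cross-modal retrieval, the same embeddings can be used to rank a small set of images against a text query. The sketch below reuses `model` from the snippet above; the image paths and the query string are placeholders, and it assumes the encode methods return NumPy-compatible arrays, as the similarity computation above already does.

```python
import numpy as np

# Reuses `model` from the snippet above; image paths and the query are placeholders.
image_paths = ['raindrop.jpg', 'sunny.jpg']
query_embedding = np.asarray(model.encode_text(['rain falling on a window'])[0])
image_matrix = np.asarray(model.encode_image(image_paths))

# L2-normalize so the dot product equals cosine similarity
image_matrix = image_matrix / np.linalg.norm(image_matrix, axis=1, keepdims=True)
query_embedding = query_embedding / np.linalg.norm(query_embedding)

scores = image_matrix @ query_embedding          # one similarity score per image
for idx in np.argsort(-scores):                  # highest score first
    print(image_paths[idx], float(scores[idx]))
```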

**Note: our empirical study shows that text-text cosine similarity is typically higher than text-image cosine similarity!**
If you want to merge the two scores, we recommend two approaches:

1. Take a weighted average of the text-text and text-image similarities:

```python
# Weighted average; alpha and beta are illustrative and should be tuned for your data
alpha = 0.6  # weight for the text-text (query-document) similarity
beta = 0.4   # weight for the text-image similarity

# e.g. using the embeddings computed in the usage example above
text_text_sim = cos_sim(text_embeddings[0], text_embeddings[1])
text_image_sim = cos_sim(text_embeddings[0], image_embeddings[0])
combined_score = alpha * text_text_sim + beta * text_image_sim
```

2. Apply z-score normalization to each set of similarity scores before merging them:

```python
import numpy as np

# cos_sim_query_documents and cos_sim_text_images are arrays of similarity
# scores collected over a batch of candidates
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
```
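
After normalization, the two score distributions are on a comparable scale and can be merged, for example with the same weighted sum as in the first approach (weights are illustrative):

```python
# Merge the z-score-normalized similarities; alpha and beta as in the first approach
combined_scores = alpha * query_document_sim_normalized + beta * text_image_sim_normalized
```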

## Performance

### Text-Image Retrieval

| Name             | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
| ViT-B-32         | 0.597                   | 0.8398                  | 0.781                 | 0.938                 |
| ViT-B-16         | 0.6216                  | 0.8572                  | 0.822                 | 0.966                 |
| jina-clip        | 0.6748                  | 0.8902                  | 0.811                 | 0.965                 |


| Name             | MSCOCO Image Retr. R@1  | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
| ViT-B-32         | 0.342                   | 0.6001                  | 0.5234                | 0.7634                |
| ViT-B-16         | 0.3309                  | 0.5842                  | 0.5242                | 0.767                 |
| jina-clip        | 0.4111                  | 0.6644                  | 0.5544                | 0.7904                |

### Text-Text Retrieval

| Name                  | STS12  | STS13  | STS14  | STS15  | STS16  | STS17  | STS22  | STSBenchmark | SummEval |
|-----------------------|--------|--------|--------|--------|--------|--------|--------|--------------|----------|
| jina-embeddings-v2    | 0.7427 | 0.833  | 0.7917 | 0.8755 | 0.836  | 0.8888 | 0.6346 | 0.8404       | 0.3056   |
| jina-clip             | 0.7352 | 0.8323 | 0.7868 | 0.8746 | 0.8377 | 0.8976 | 0.6583 | 0.8493       | 0.3048   |


| Name               | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
|--------------------|---------|----------|----------|-------|---------|---------|-----------|
| jina-embeddings-v2 | 0.4418  | 0.4158   | 0.3245   | 0.882 | 0.1986  | 0.6668  | 0.6591    |
| jina-clip          | 0.4933  | 0.3827   | 0.3352   | 0.8789| 0.2024  | 0.6734  | 0.7161    |

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find `jina-clip-v1` useful in your research, please cite the following paper:

```console
TBD
```