---
tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0
---
<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The embedding model trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

## Quick Start

The easiest way to start using `jina-clip-v1` is via Jina AI's [Embedding API](https://jina.ai/embeddings/).
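
For reference, below is a minimal sketch of calling the Embedding API over HTTP with `requests`. The endpoint path, the request schema (the `model`/`input` fields and the `{"text": ...}` / `{"image": ...}` items), and the `JINA_API_KEY` environment variable are assumptions for illustration; consult the Embedding API documentation for the exact interface.

```python
# Minimal sketch of an Embedding API call (request schema assumed; see https://jina.ai/embeddings/).
import os
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},  # API key from an env var (assumption)
    json={
        "model": "jina-clip-v1",
        "input": [
            {"text": "How is the weather today?"},          # text input
            {"image": "https://example.com/raindrop.jpg"},  # image input by URL (assumed format)
        ],
    },
)
resp.raise_for_status()
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(len(embeddings), len(embeddings[0]))
```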

## Intended Usage & Model Info

`jina-clip-v1` is an English, monolingual **multimodal (text-image) embedding model**.

Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en),
excel in text-to-text retrieval but lack cross-modal retrieval capabilities.
Conversely, CLIP-like models, such as [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

`jina-clip-v1` is an innovative **multimodal embedding model**.
Its text component achieves comparable performance to `jina-embeddings-v2-base-en` in text-to-text retrieval,
while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks.
This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications,
allowing for both text-to-text and text-to-image searches with a single model.


## Data & Parameters

The `jina-clip-v1` technical report is coming soon.

## Usage

You can use Jina CLIP directly via the `transformers` package.

```python
# Install dependencies first: pip install transformers einops timm pillow
from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two embedding vectors
def cos_sim(a, b):
    return (a @ b.T) / (norm(a) * norm(b))

# trust_remote_code=True is required because the model ships custom modeling code
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

sentences = ['How is the weather today?', 'What is the current weather like today?']
images = ['raindrop.jpg', 'sunny.jpg']  # local image files

text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(images)

print(cos_sim(text_embeddings[0], text_embeddings[1]))   # text-text similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-image cross-modal similarity
```
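
For cross-modal retrieval, the same embeddings can be used to rank a small set of images against a text query. The sketch below reuses `model` from the snippet above; the image paths and the query string are placeholders, and it assumes the encode methods return NumPy-compatible arrays, as the similarity computation above already does.

```python
import numpy as np

# Reuses `model` from the snippet above; image paths and the query are placeholders.
image_paths = ['raindrop.jpg', 'sunny.jpg']
query_embedding = np.asarray(model.encode_text(['rain falling on a window'])[0])
image_matrix = np.asarray(model.encode_image(image_paths))

# L2-normalize so the dot product equals cosine similarity
image_matrix = image_matrix / np.linalg.norm(image_matrix, axis=1, keepdims=True)
query_embedding = query_embedding / np.linalg.norm(query_embedding)

scores = image_matrix @ query_embedding          # one similarity score per image
for idx in np.argsort(-scores):                  # highest score first
    print(image_paths[idx], float(scores[idx]))
```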

**Note: our empirical study shows that text-text cosine similarity is typically higher than text-image cosine similarity!**
If you want to merge the two scores, we recommend two approaches:

1. Take a weighted average of the text-text and text-image similarities:

```python
# Weighted average; alpha and beta are illustrative and should be tuned for your data
alpha = 0.6  # weight for the text-text (query-document) similarity
beta = 0.4   # weight for the text-image similarity

# e.g. using the embeddings computed in the usage example above
text_text_sim = cos_sim(text_embeddings[0], text_embeddings[1])
text_image_sim = cos_sim(text_embeddings[0], image_embeddings[0])
combined_score = alpha * text_text_sim + beta * text_image_sim
```

2. Apply z-score normalization to each set of similarity scores before merging them:

```python
import numpy as np

# cos_sim_query_documents and cos_sim_text_images are arrays of similarity
# scores collected over a batch of candidates
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
```
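
After normalization, the two score distributions are on a comparable scale and can be merged, for example with the same weighted sum as in the first approach (weights are illustrative):

```python
# Merge the z-score-normalized similarities; alpha and beta as in the first approach
combined_scores = alpha * query_document_sim_normalized + beta * text_image_sim_normalized
```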

## Performance

### Text-Image Retrieval

| Name             | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
| ViT-B-32         | 0.597                   | 0.8398                  | 0.781                 | 0.938                 |
| ViT-B-16         | 0.6216                  | 0.8572                  | 0.822                 | 0.966                 |
| jina-clip        | 0.6748                  | 0.8902                  | 0.811                 | 0.965                 |


| Name             | MSCOCO Image Retr. R@1  | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
| ViT-B-32         | 0.342                   | 0.6001                  | 0.5234                | 0.7634                |
| ViT-B-16         | 0.3309                  | 0.5842                  | 0.5242                | 0.767                 |
| jina-clip        | 0.4111                  | 0.6644                  | 0.5544                | 0.7904                |

### Text-Text Retrieval

| Name                  | STS12  | STS13  | STS14  | STS15  | STS16  | STS17  | STS22  | STSBenchmark | SummEval |
|-----------------------|--------|--------|--------|--------|--------|--------|--------|--------------|----------|
| jina-embeddings-v2    | 0.7427 | 0.833  | 0.7917 | 0.8755 | 0.836  | 0.8888 | 0.6346 | 0.8404       | 0.3056   |
| jina-clip             | 0.7352 | 0.8323 | 0.7868 | 0.8746 | 0.8377 | 0.8976 | 0.6583 | 0.8493       | 0.3048   |


| Name               | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
|--------------------|---------|----------|----------|-------|---------|---------|-----------|
| jina-embeddings-v2 | 0.4418  | 0.4158   | 0.3245   | 0.882 | 0.1986  | 0.6668  | 0.6591    |
| jina-clip          | 0.4933  | 0.3827   | 0.3352   | 0.8789| 0.2024  | 0.6734  | 0.7161    |

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find `jina-clip-v1` useful in your research, please cite the following paper:

```console
TBD
```