---
tags:
- feature-extraction
- sentence-similarity
- mteb
language: en
inference: false
license: apache-2.0
---
<!-- TODO: add evaluation results here -->
<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

## Quick Start

The easiest way to start using `jina-clip-v1` is through Jina AI's [Embedding API](https://jina.ai/embeddings/).

## Intended Usage & Model Info

`jina-clip-v1` is an English, monolingual **multimodal (text-image) embedding model**.

Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en),
excel in text-to-text retrieval but lack cross-modal retrieval capabilities.
Conversely, CLIP-like models, such as [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

`jina-clip-v1` is a **multimodal embedding model** designed to close this gap.
Its text component achieves performance comparable to `jina-embeddings-v2-base-en` in text-to-text retrieval,
while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks.
This makes it well suited to multimodal retrieval-augmented generation (M-RAG) applications,
allowing both text-to-text and text-to-image search with a single model.

## Data & Parameters

The `jina-clip-v1` technical report is coming soon.

## Usage

You can use Jina CLIP directly from the `transformers` package.

```python
# Install the dependencies first:
#   pip install transformers einops timm pillow
from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two 1-D embedding vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

# trust_remote_code=True lets transformers load the custom model code from the Hub repository
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

sentences = ['How is the weather today?', 'What is the current weather like today?']
images = ['raindrop.jpg', 'sunny.jpg']  # image files to embed

text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(images)

print(cos_sim(text_embeddings[0], text_embeddings[1]))  # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-image cross-modal similarity
```
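
When you have several queries and several candidates, it can help to compute every pairwise similarity at once instead of calling `cos_sim` pair by pair. The following is a minimal sketch, assuming `encode_text` and `encode_image` return 2-D arrays with one embedding per row (an assumption, not confirmed by this card). Random stand-in vectors are used so the snippet runs without downloading the model, and `cosine_similarity_matrix` and `rank_candidates` are hypothetical helper names, not part of the model's API.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return cosine similarities between every row of a and every row of b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalize each row to unit length so the dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def rank_candidates(query_embeddings, candidate_embeddings):
    """For each query, return candidate indices sorted from most to least similar."""
    scores = cosine_similarity_matrix(query_embeddings, candidate_embeddings)
    return np.argsort(-scores, axis=1), scores


# Stand-in embeddings (assumption): replace with model.encode_text(...) and
# model.encode_image(...) outputs when running against the real model.
rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(2, 8))   # 2 text queries
image_embeddings = rng.normal(size=(3, 8))  # 3 candidate images

ranking, scores = rank_candidates(text_embeddings, image_embeddings)
print(ranking[0])  # image indices for the first query, best match first
```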

**Note: our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity!**
If you want to merge the two scores, we recommend two approaches:

1. Weighted average of the text-text and text-image similarities:

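```python
# Pseudo code: sim(...) stands for the cosine similarity between two embeddings.
# Example weights for the text-text and text-image scores.
alpha = 0.6
beta = 0.4

combined_scores = alpha * sim(query, document) + beta * sim(text, image)
```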

2. Apply z-score normalization before merging the scores:

```python
# Pseudo code: cos_sim_query_documents and cos_sim_text_images are arrays of
# cosine similarities that you have computed beforehand.
import numpy as np

query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

# Shift and scale each score set to zero mean and unit variance
query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
```
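
The two approaches above can be sketched end to end as follows. This is an illustration under stated assumptions, not a prescribed implementation: the score arrays are made-up values, the helper names (`weighted_average`, `z_score`, `merge_normalized`) are hypothetical, and combining the normalized scores with equal weights is an assumption, since the snippets above stop at the normalization step.

```python
import numpy as np


def weighted_average(text_scores, image_scores, alpha=0.6, beta=0.4):
    """Approach 1: weighted average of the raw similarity scores."""
    return alpha * np.asarray(text_scores) + beta * np.asarray(image_scores)


def z_score(scores):
    """Shift and scale scores to zero mean and unit standard deviation."""
    scores = np.asarray(scores, dtype=np.float64)
    std = scores.std()
    if std == 0:
        # All scores are identical, so none stands out after normalization.
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std


def merge_normalized(text_scores, image_scores, alpha=0.5, beta=0.5):
    """Approach 2: z-score each score set, then combine (equal weights assumed)."""
    return alpha * z_score(text_scores) + beta * z_score(image_scores)


# Made-up scores for three candidates, for illustration only.
text_scores = np.array([0.80, 0.70, 0.60])   # text-text similarities
image_scores = np.array([0.20, 0.30, 0.25])  # text-image similarities

print(weighted_average(text_scores, image_scores))
print(merge_normalized(text_scores, image_scores))
```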

## Performance

### Text-Image Retrieval

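
The R@1 and R@5 columns in the tables below are read here as Recall@K: the fraction of queries for which a correct match appears among the K highest-scoring candidates. This card does not describe the exact evaluation protocol, so the following is only a sketch of one common way to compute the metric. The function name `recall_at_k` and the toy similarity matrix are illustrative assumptions.

```python
import numpy as np


def recall_at_k(scores, relevant, k):
    """Fraction of queries with at least one relevant candidate in the top k.

    scores:   (n_queries, n_candidates) similarity matrix
    relevant: list of sets; relevant[i] holds the correct candidate indices for query i
    """
    scores = np.asarray(scores)
    # Indices of the k highest-scoring candidates for each query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [bool(relevant[i] & set(top_k[i].tolist())) for i in range(len(relevant))]
    return float(np.mean(hits))


# Toy similarity matrix for illustration: 3 queries, 3 candidates.
toy_scores = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
])
toy_relevant = [{0}, {1}, {0}]

print(recall_at_k(toy_scores, toy_relevant, k=1))
print(recall_at_k(toy_scores, toy_relevant, k=2))
```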
| Name      | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
|-----------|------------------------|------------------------|-----------------------|-----------------------|
| ViT-B-32  | 0.597                  | 0.8398                 | 0.781                 | 0.938                 |
| ViT-B-16  | 0.6216                 | 0.8572                 | 0.822                 | 0.966                 |
| jina-clip | 0.6748                 | 0.8902                 | 0.811                 | 0.965                 |

| Name      | MSCOCO Image Retr. R@1 | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
|-----------|------------------------|------------------------|-----------------------|-----------------------|
| ViT-B-32  | 0.342                  | 0.6001                 | 0.5234                | 0.7634                |
| ViT-B-16  | 0.3309                 | 0.5842                 | 0.5242                | 0.767                 |
| jina-clip | 0.4111                 | 0.6644                 | 0.5544                | 0.7904                |

### Text-Text Retrieval

| Name               | STS12  | STS15  | STS17  | STS13  | STS14  | STS16  | STS22  | STSBenchmark | SummEval |
|--------------------|--------|--------|--------|--------|--------|--------|--------|--------------|----------|
| jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833  | 0.7917 | 0.836  | 0.6346 | 0.8404       | 0.3056   |
| jina-clip          | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493       | 0.3048   |

| Name               | ArguAna | FiQA2018 | NFCorpus | Quora  | SCIDOCS | SciFact | TRECCOVID |
|--------------------|---------|----------|----------|--------|---------|---------|-----------|
| jina-embeddings-v2 | 0.4418  | 0.4158   | 0.3245   | 0.882  | 0.1986  | 0.6668  | 0.6591    |
| jina-clip          | 0.4933  | 0.3827   | 0.3352   | 0.8789 | 0.2024  | 0.6734  | 0.7161    |

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find `jina-clip-v1` useful in your research, please cite the following paper:

```console
TBD
```