tags:
- feature-extraction
- sentence-similarity
- mteb
language: en
inference: false
license: apache-2.0
The text embedding set trained by Jina AI.
Quick Start
The easiest way to start using jina-clip-v1
is to use Jina AI's Embedding API.
Intended Usage & Model Info
jina-clip-v1
is an English, monolingual multimodal (text-image) embedding model.
Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but lack cross-modal retrieval capabilities. Conversely, CLIP-like models, such as openai/clip-vit-base-patch32, align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.
jina-clip-v1
is an innovative multimodal embedding model.
Its text component achieves comparable performance to jina-embeddings-v2-base-en
in text-to-text retrieval,
while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks.
This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications,
allowing for both text-to-text and text-to-image searches with a single model.
Data & Parameters
jina-clip-v1
technical report coming soon.
Usage
You can use Jina CLIP directly from the transformers package.
!pip install transformers einops timm pillow
from transformers import AutoModel
from numpy.linalg import norm
# cosine similarity between two 1-D embedding vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)
sentences = ['How is the weather today?', 'What is the current weather like today?']
images = ['raindrop.jpg', 'sunny.jpg']
text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(images)
print(cos_sim(text_embeddings[0], text_embeddings[1])) # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0])) # text-image cross-modal similarity
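The `cos_sim` helper above compares one pair of vectors at a time. When several texts are scored against several images, a full similarity matrix is often more convenient. The following is a minimal sketch, assuming the embeddings are NumPy arrays of shape (n, dim); the function name `cos_sim_matrix` and the random placeholder embeddings are illustrative and not part of the model's API.

```python
import numpy as np

def cos_sim_matrix(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalize each row to unit length so the dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Placeholder embeddings standing in for the text and image embeddings (assumption).
rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(2, 8))
image_embeddings = rng.normal(size=(3, 8))

sims = cos_sim_matrix(text_embeddings, image_embeddings)  # shape (2, 3)
best_image_per_text = sims.argmax(axis=1)  # index of the closest image for each text
print(sims.shape, best_image_per_text)
```

Row-wise normalization matters here: applying a single `norm` call to a whole matrix would return one Frobenius norm for all rows rather than one norm per embedding.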
Note: our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity. If you want to merge the two scores, we recommend two approaches:
- weighted average of text-text sim and text-image sim:
# pseudo code
alpha = 0.6
beta = 0.4
combined_scores = alpha * sim(query, document) + beta * sim(text, image)
- apply z-score normalization before merging scores:
# pseudo code
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)
query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
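The two merging strategies above can be sketched end to end as follows. This is a minimal illustration, assuming the per-candidate similarity scores are already available as NumPy arrays. The helper names `weighted_merge`, `zscore`, and `zscore_merge`, the equal-weight sum after normalization, and the sample scores are assumptions made for illustration, not a procedure confirmed by the model authors.

```python
import numpy as np

def weighted_merge(text_text_sim, text_image_sim, alpha=0.6, beta=0.4):
    """Weighted average of the two score arrays (default weights from the pseudo code)."""
    return alpha * np.asarray(text_text_sim) + beta * np.asarray(text_image_sim)

def zscore(scores):
    """Shift scores to zero mean and scale them to unit standard deviation."""
    scores = np.asarray(scores, dtype=np.float64)
    std = scores.std()
    # Guard against identical scores, where the standard deviation is zero.
    return (scores - scores.mean()) / std if std > 0 else np.zeros_like(scores)

def zscore_merge(text_text_sim, text_image_sim):
    """Normalize each score array separately, then add them (assumed equal weighting)."""
    return zscore(text_text_sim) + zscore(text_image_sim)

# Made-up scores for three candidates, for illustration only.
cos_sim_query_documents = np.array([0.82, 0.75, 0.90])
cos_sim_text_images = np.array([0.21, 0.30, 0.18])

print(weighted_merge(cos_sim_query_documents, cos_sim_text_images))
print(zscore_merge(cos_sim_query_documents, cos_sim_text_images))
```

Normalizing first puts both score types on the same scale, so the typically larger text-text similarities do not dominate the combined ranking.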
Performance
Text-Image Retrieval
Name | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
---|---|---|---|---|
ViT-B-32 | 0.597 | 0.8398 | 0.781 | 0.938 |
ViT-B-16 | 0.6216 | 0.8572 | 0.822 | 0.966 |
jina-clip | 0.6748 | 0.8902 | 0.811 | 0.965 |
Name | MSCOCO Image Retr. R@1 | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
---|---|---|---|---|
ViT-B-32 | 0.342 | 0.6001 | 0.5234 | 0.7634 |
ViT-B-16 | 0.3309 | 0.5842 | 0.5242 | 0.767 |
jina-clip | 0.4111 | 0.6644 | 0.5544 | 0.7904 |
Text-Text Retrieval
Name | STS12 | STS15 | STS17 | STS13 | STS14 | STS16 | STS22 | STSBenchmark | SummEval |
---|---|---|---|---|---|---|---|---|---|
jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833 | 0.7917 | 0.836 | 0.6346 | 0.8404 | 0.3056 |
jina-clip | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493 | 0.3048 |
Name | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
---|---|---|---|---|---|---|---|
jina-embeddings-v2 | 0.4418 | 0.4158 | 0.3245 | 0.882 | 0.1986 | 0.6668 | 0.6591 |
jina-clip | 0.4933 | 0.3827 | 0.3352 | 0.8789 | 0.2024 | 0.6734 | 0.7161 |
Contact
Join our Discord community and chat with other community members about ideas.
Citation
If you find jina-clip-v1
useful in your research, please cite the following paper:
TBD