tags:
- feature-extraction
- sentence-similarity
- mteb
language: en
inference: false
license: apache-2.0
The text embedding set trained by Jina AI.
Quick Start
The easiest way to start using jina-clip-v1 is to use Jina AI's Embedding API.
Intended Usage & Model Info
jina-clip-v1 is an English, monolingual multimodal (text-image) embedding model.
Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but lack cross-modal retrieval capabilities. Conversely, CLIP-like models, such as openai/clip-vit-base-patch32, align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.
jina-clip-v1 closes this gap. Its text component achieves performance comparable to jina-embeddings-v2-base-en in text-to-text retrieval, while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks. This makes it well suited to multimodal retrieval-augmented generation (M-RAG) applications, because a single model supports both text-to-text and text-to-image search.
Data & Parameters
The Jina CLIP V1 technical report is coming soon.
Usage
You can use Jina CLIP directly from the transformers package.
!pip install transformers einops timm
from numpy.linalg import norm
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
def cos_sim(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return (a @ b.T) / (norm(a) * norm(b))
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)  # trust_remote_code=True runs the custom model code shipped in the repository
text_embeddings = model.encode_text(['How is the weather today?', 'What is the current weather like today?'])
image_embeddings = model.encode_image(['raindrop.png'])
print(cos_sim(text_embeddings[0], text_embeddings[1]))  # text-text similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-image cross-modal similarity
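The `cos_sim` helper above compares one pair of vectors at a time. For search over several items, it can help to score every query against every candidate in one step. The following is a minimal sketch that uses NumPy only. `similarity_matrix` and `rank_candidates` are hypothetical helper names, and the small arrays are made-up stand-ins for the output of `encode_text` and `encode_image`, not real embeddings.

```python
import numpy as np

def similarity_matrix(queries, candidates):
    """Cosine similarity between every query row and every candidate row."""
    q = np.asarray(queries, dtype=float)
    c = np.asarray(candidates, dtype=float)
    # L2-normalize each row so that a dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return q @ c.T

def rank_candidates(queries, candidates):
    """For each query, candidate indices ordered from most to least similar."""
    return np.argsort(-similarity_matrix(queries, candidates), axis=1)

# Made-up 3-dimensional vectors standing in for real embeddings (assumption).
text_queries = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
image_candidates = np.array([[0.0, 2.0, 2.0], [3.0, 0.1, 0.0], [1.0, 1.0, 1.0]])

ranking = rank_candidates(text_queries, image_candidates)
print(ranking)
```

Because each row is normalized before the matrix product, the same two functions can rank text candidates or image candidates against a text query. The sketch does not guard against all-zero vectors, which would cause a division by zero.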
Performance
Text-Image Retrieval
Name | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
---|---|---|---|---|
ViT-B-32 | 0.597 | 0.8398 | 0.781 | 0.938 |
ViT-B-16 | 0.6216 | 0.8572 | 0.822 | 0.966 |
jina-clip | 0.6748 | 0.8902 | 0.811 | 0.965 |
Name | MSCOCO Image Retr. R@1 | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
---|---|---|---|---|
ViT-B-32 | 0.342 | 0.6001 | 0.5234 | 0.7634 |
ViT-B-16 | 0.3309 | 0.5842 | 0.5242 | 0.767 |
jina-clip | 0.4111 | 0.6644 | 0.5544 | 0.7904 |
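The R@1 and R@5 columns report recall at K, which for retrieval benchmarks of this kind is commonly defined as the share of queries for which at least one correct match appears among the top K retrieved items. A minimal sketch of that computation follows, assuming each query comes with a ranked list of retrieved ids and a set of relevant ids. `recall_at_k` and the toy data are illustrative only and are not the evaluation code used to produce the tables.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item in the top k results."""
    if not ranked_ids or len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one relevant set per query")
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranking[:k])
    )
    return hits / len(ranked_ids)

# Toy data (assumption): three queries, each with a ranked list of item ids.
ranked = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
relevant = [{"a"}, {"a"}, {"a", "d"}]

print(recall_at_k(ranked, relevant, k=1))
print(recall_at_k(ranked, relevant, k=2))
```

In this toy setup the value can only stay the same or rise as K grows, which matches the pattern in the tables, where every R@5 entry is higher than the corresponding R@1 entry.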
Text-Text Retrieval
Name | STS12 | STS15 | STS17 | STS13 | STS14 | STS16 | STS22 | STSBenchmark | SummEval |
---|---|---|---|---|---|---|---|---|---|
jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833 | 0.7917 | 0.836 | 0.6346 | 0.8404 | 0.3056 |
jina-clip | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493 | 0.3048 |
Name | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
---|---|---|---|---|---|---|---|
jina-embeddings-v2 | 0.4418 | 0.4158 | 0.3245 | 0.882 | 0.1986 | 0.6668 | 0.6591 |
jina-clip | 0.4933 | 0.3827 | 0.3352 | 0.8789 | 0.2024 | 0.6734 | 0.7161 |
Contact
Join our Discord community and chat with other community members about ideas.
Citation
If you find Jina CLIP useful in your research, please cite the following paper:
TBD