metadata

tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0

The text embedding set trained by Jina AI.

Quick Start

The easiest way to start using jina-clip-v1 is through Jina AI's Embedding API.

Intended Usage & Model Info

jina-clip-v1 is an English, monolingual multimodal (text-image) embedding model.

Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but lack cross-modal retrieval capabilities. Conversely, CLIP-like models, such as openai/clip-vit-base-patch32, align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

jina-clip-v1 is an innovative multimodal embedding model. Its text component achieves comparable performance to jina-embeddings-v2-base-en in text-to-text retrieval, while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks. This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications, allowing for both text-to-text and text-to-image searches with a single model.
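To make the single-model search idea concrete, here is a minimal sketch of one index that holds both text and image embeddings and is queried with a single vector. The vectors, the ids, and the `normalize` and `search` helpers are hypothetical stand-ins invented for illustration; in practice the vectors would come from the model's text and image encoders.

```python
import numpy as np

def normalize(vectors):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    vectors = np.asarray(vectors, dtype=np.float64)
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)

def search(query_vec, index_vecs, index_ids, k=2):
    """Return the k (id, score) pairs most similar to the query (hypothetical helper)."""
    scores = normalize(index_vecs) @ normalize(query_vec)
    top = np.argsort(-scores)[:k]
    return [(index_ids[i], float(scores[i])) for i in top]

# Stand-in 3-dimensional vectors (assumption); real embeddings are much larger.
# Text and image items share one index because they live in the same vector space.
index_ids = ["doc:weather-report", "img:raindrop.png", "doc:tax-form"]
index_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])
query_vec = np.array([1.0, 0.0, 0.0])  # stand-in for an encoded text query

results = search(query_vec, index_vecs, index_ids, k=2)
for item_id, score in results:
    print(item_id, round(score, 3))
```

Because both modalities share one embedding space, the same query can rank documents and images side by side without a second model or a second index.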

Data & Parameters

Jina CLIP V1 technical report coming soon.

Usage

You can use Jina CLIP directly from the transformers package.

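# Install the dependency first: pip install transformers (prefix with ! inside a notebook)
from numpy.linalg import norm
from transformers import AutoModel

# Cosine similarity between two 1-D embedding vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

# The checkpoint ships custom modeling code, so trust_remote_code=True is needed to load it
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# Encode two sentences and one image (given as a local file path)
text_embeddings = model.encode_text(['How is the weather today?', 'What is the current weather like today?'])
image_embeddings = model.encode_image(['raindrop.png'])

print(cos_sim(text_embeddings[0], text_embeddings[1])) # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0])) # text-image cross-modal similarity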
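Note that the `cos_sim` helper above is only a cosine similarity for single vectors: `numpy.linalg.norm` applied to a 2-D array returns the Frobenius norm of the whole matrix. When comparing whole batches, one option is to normalize each row first. The sketch below uses small stand-in arrays instead of real model output, and `cosine_similarity_matrix` is a hypothetical helper name, not part of the model's API.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a = np.atleast_2d(np.asarray(a, dtype=np.float64))
    b = np.atleast_2d(np.asarray(b, dtype=np.float64))
    # Normalize each row independently, then a plain matrix product gives cosines
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Stand-in arrays (assumption): 2 "text" vectors and 3 "image" vectors
text_vecs = np.array([[1.0, 0.0], [0.0, 2.0]])
image_vecs = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, 5.0]])

sims = cosine_similarity_matrix(text_vecs, image_vecs)  # shape (2, 3)
best_image_per_text = sims.argmax(axis=1)               # closest image for each text
print(sims.shape, best_image_per_text)
```

Entry `sims[i, j]` is the similarity between text `i` and image `j`, so a row-wise `argmax` picks the best-matching image for each text.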

Performance

Text-Image Retrieval

Flickr

Name	Flickr Image Retr. R@1	Flickr Image Retr. R@5	Flickr Text Retr. R@1	Flickr Text Retr. R@5
ViT-B-32	0.597	0.8398	0.781	0.938
ViT-B-16	0.6216	0.8572	0.822	0.966
jina-clip	0.6748	0.8902	0.811	0.965

MSCOCO

Name	MSCOCO Image Retr. R@1	MSCOCO Image Retr. R@5	MSCOCO Text Retr. R@1	MSCOCO Text Retr. R@5
ViT-B-32	0.342	0.6001	0.5234	0.7634
ViT-B-16	0.3309	0.5842	0.5242	0.767
jina-clip	0.4111	0.6644	0.5544	0.7904
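For readers unfamiliar with the R@1 and R@5 columns: assuming they denote recall at k as it is commonly defined for retrieval benchmarks, the score is the fraction of queries for which at least one correct item appears among the top k results. The sketch below computes it on a toy similarity matrix; the data and the `recall_at_k` helper are made up for illustration and are not the evaluation code behind these tables.

```python
import numpy as np

def recall_at_k(similarity, relevant, k):
    """Fraction of queries with at least one relevant item among the top k.

    similarity: (n_queries, n_items) score matrix, higher means more similar.
    relevant:   list of sets; relevant[i] holds the correct item indices for query i.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = [bool(relevant[i] & set(top_k[i].tolist())) for i in range(len(relevant))]
    return sum(hits) / len(hits)

# Toy scores (assumption): 3 queries against 3 candidate items
similarity = np.array([
    [0.9, 0.2, 0.1],  # query 0: correct item 0 is ranked first
    [0.3, 0.4, 0.8],  # query 1: correct item 1 is ranked second
    [0.5, 0.6, 0.1],  # query 2: correct item 2 is ranked last
])
relevant = [{0}, {1}, {2}]

r1 = recall_at_k(similarity, relevant, k=1)
r2 = recall_at_k(similarity, relevant, k=2)
print(r1, r2)
```

Using a set of relevant indices per query covers datasets where one image has several matching captions: the query counts as a hit if any of them is retrieved.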

Text-Text Retrieval

STS

Name	STS12	STS15	STS17	STS13	STS14	STS16	STS22	STSBenchmark	SummEval
jina-embeddings-v2	0.7427	0.8755	0.8888	0.833	0.7917	0.836	0.6346	0.8404	0.3056
jina-clip	0.7352	0.8746	0.8976	0.8323	0.7868	0.8377	0.6583	0.8493	0.3048

BEIR

Name	ArguAna	FiQA2018	NFCorpus	Quora	SCIDOCS	SciFact	TRECCOVID
jina-embeddings-v2	0.4418	0.4158	0.3245	0.882	0.1986	0.6668	0.6591
jina-clip	0.4933	0.3827	0.3352	0.8789	0.2024	0.6734	0.7161

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina CLIP useful in your research, please cite the following paper:

TBD