
metadata

tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0


The text embedding set trained by Jina AI.

Quick Start

The easiest way to start using jina-clip-v1 is through Jina AI's Embedding API.

Intended Usage & Model Info

jina-clip-v1 is an English, monolingual multimodal (text-image) embedding model.

Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but lack cross-modal retrieval capabilities. Conversely, CLIP-like models, such as openai/clip-vit-base-patch32, align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

jina-clip-v1 is an innovative multimodal embedding model. Its text component achieves comparable performance to jina-embeddings-v2-base-en in text-to-text retrieval, while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks. This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications, allowing for both text-to-text and text-to-image searches with a single model.

Data & Parameters

jina-clip-v1 technical report coming soon.

Usage

You can use Jina CLIP directly from the transformers package.

!pip install transformers einops timm pillow
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))

model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

sentences = ['How is the weather today?', 'What is the current weather like today?']
images = ['raindrop.jpg', 'sunny.jpg']

text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(images)

print(cos_sim(text_embeddings[0], text_embeddings[1])) # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0])) # text-image cross-modal similarity
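The `cos_sim` helper above compares one pair of vectors at a time. For retrieval you usually want every text scored against every image in one step. The following is a minimal sketch of that, not part of the model's API: `cos_sim_matrix` is a hypothetical helper, and the random arrays stand in for the outputs of `encode_text` and `encode_image` so the snippet runs without downloading the model. The embedding size of 8 is arbitrary.

```python
import numpy as np

def cos_sim_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize each row to unit length, then a dot product gives the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Stand-in embeddings (assumption: replace with model.encode_text / model.encode_image output).
rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(2, 8))   # 2 texts
image_embeddings = rng.normal(size=(3, 8))  # 3 images

scores = cos_sim_matrix(text_embeddings, image_embeddings)  # shape (2, 3)
best_image_per_text = scores.argmax(axis=1)                 # index of the top image for each text
print(scores.shape, best_image_per_text)
```

Each row of `scores` holds one text's similarity to all images, so sorting a row gives a text-to-image ranking.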

Note: our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity. If you want to merge the two scores, we recommend two approaches:

Weighted average of the text-text similarity and the text-image similarity:

# pseudo code
alpha = 0.6
beta = 0.4

combined_scores = alpha * sim(query, document) + beta * sim(text, image)
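As a runnable illustration of the weighted average, the sketch below applies the same `alpha` and `beta` to arrays of precomputed similarities. The function name `combine_weighted` and the score values are made up for the example. In practice the two arrays would hold, for each candidate, the query-to-document-text similarity and the query-to-document-image similarity.

```python
import numpy as np

def combine_weighted(text_text_sim, text_image_sim, alpha=0.6, beta=0.4):
    """Weighted average of two similarity arrays of equal length."""
    text_text_sim = np.asarray(text_text_sim, dtype=float)
    text_image_sim = np.asarray(text_image_sim, dtype=float)
    return alpha * text_text_sim + beta * text_image_sim

# Hypothetical similarities for three candidates (illustrative values only).
text_text = [0.82, 0.64, 0.71]   # query vs. candidate text
text_image = [0.21, 0.33, 0.18]  # query vs. candidate image

combined = combine_weighted(text_text, text_image)
ranking = np.argsort(-combined)  # candidate indices, best first
print(combined, ranking)
```

The weights 0.6 and 0.4 are the values from the pseudocode above. They are a starting point, and tuning them on your own data is likely worthwhile.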

Apply z-score normalization before merging the scores:

# pseudo code
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
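The pseudocode stops after normalization, so here is one possible way to complete the step, offered as a sketch under assumptions rather than the recommended recipe. `zscore` and `merge_zscored` are hypothetical helper names, the equal weighting (`alpha=0.5`) is an assumption, and the small `eps` guards against division by zero when all scores are identical.

```python
import numpy as np

def zscore(scores, eps=1e-12):
    """Shift scores to zero mean and scale them to (approximately) unit variance."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + eps)

def merge_zscored(query_document_sims, text_image_sims, alpha=0.5):
    """Normalize each score list separately, then blend them (alpha is an assumed weight)."""
    return alpha * zscore(query_document_sims) + (1 - alpha) * zscore(text_image_sims)

# Hypothetical similarities for three candidates (illustrative values only).
cos_sim_query_documents = [0.82, 0.64, 0.71]
cos_sim_text_images = [0.21, 0.33, 0.18]

merged = merge_zscored(cos_sim_query_documents, cos_sim_text_images)
ranking = np.argsort(-merged)  # candidate indices, best first
print(merged, ranking)
```

Because both score lists are put on the same scale first, the systematically larger text-text similarities no longer dominate the merged score.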

Performance

Text-Image Retrieval

Name	Flickr Image Retr. R@1	Flickr Image Retr. R@5	Flickr Text Retr. R@1	Flickr Text Retr. R@5
ViT-B-32	0.597	0.8398	0.781	0.938
ViT-B-16	0.6216	0.8572	0.822	0.966
jina-clip	0.6748	0.8902	0.811	0.965

Name	MSCOCO Image Retr. R@1	MSCOCO Image Retr. R@5	MSCOCO Text Retr. R@1	MSCOCO Text Retr. R@5
ViT-B-32	0.342	0.6001	0.5234	0.7634
ViT-B-16	0.3309	0.5842	0.5242	0.767
jina-clip	0.4111	0.6644	0.5544	0.7904

Text-Text Retrieval

Name	STS12	STS15	STS17	STS13	STS14	STS16	STS22	STSBenchmark	SummEval
jina-embeddings-v2	0.7427	0.8755	0.8888	0.833	0.7917	0.836	0.6346	0.8404	0.3056
jina-clip	0.7352	0.8746	0.8976	0.8323	0.7868	0.8377	0.6583	0.8493	0.3048

Name	ArguAna	FiQA2018	NFCorpus	Quora	SCIDOCS	SciFact	TRECCOVID
jina-embeddings-v2	0.4418	0.4158	0.3245	0.882	0.1986	0.6668	0.6591
jina-clip	0.4933	0.3827	0.3352	0.8789	0.2024	0.6734	0.7161

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find jina-clip-v1 useful in your research, please cite the following paper:

TBD