tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0


jina-clip-v1

Jina CLIP: your CLIP model is also your text retriever!

Quick Start

The easiest way to start using jina-clip-v1 is through the Jina AI Embedding API.

Intended Usage & Model Info

jina-clip-v1 is a state-of-the-art English multimodal (text-image) embedding model.

Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but fall short in cross-modal tasks. In contrast, models like openai/clip-vit-base-patch32 effectively align image and text embeddings but are not optimized for text-to-text retrieval due to their training methodologies and context limitations.

jina-clip-v1 bridges this gap by offering robust performance in both domains. Its text component matches the retrieval efficiency of jina-embeddings-v2-base-en, while its overall architecture sets a new benchmark for cross-modal retrieval. This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (M-RAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.

Data & Parameters

jina-clip-v1 technical report coming soon.

Usage

You can use Jina CLIP directly via the transformers package.

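# Notebook syntax; from a shell, run the same command without the leading "!"
!pip install transformers einops timm pillow
from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two 1-D embedding vectors
cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))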

model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

sentences = ['How is the weather today?', 'What is the current weather like today?']
images = ['raindrop.jpg', 'sunny.jpg']

text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(images)

print(cos_sim(text_embeddings[0], text_embeddings[1])) # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0])) # text-image cross-modal similarity
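
The cos_sim helper above compares one pair of vectors at a time. When ranking several images against several queries, it can be convenient to compute all pairwise similarities at once. The following is a minimal sketch, assuming the embeddings are NumPy arrays of shape (n, d), as the snippet above suggests. `pairwise_cos_sim` is a hypothetical helper, not part of the model's API, and random vectors stand in for real embeddings so the example runs without downloading the model.

```python
import numpy as np

def pairwise_cos_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    # Normalize each row to unit length, then a dot product gives the cosine.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Stand-in embeddings (assumption): 2 texts and 3 images, 8 dimensions each.
rng = np.random.default_rng(0)
text_embeddings = rng.normal(size=(2, 8))
image_embeddings = rng.normal(size=(3, 8))

# scores[i, j] is the similarity between text i and image j.
scores = pairwise_cos_sim(text_embeddings, image_embeddings)

# Index of the best-matching image for each text.
best_image_per_text = scores.argmax(axis=1)
print(scores.shape, best_image_per_text)
```

With real embeddings, the same call would take the outputs of encode_text and encode_image in place of the random arrays.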

Note: our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity. If you want to merge the two scores, we recommend two approaches:

  1. Weighted average of the text-text and text-image similarities:
# pseudo code
alpha = 0.6
beta = 0.4

combined_scores = alpha * sim(query, document) + beta * sim(text, image)
  2. Apply z-score normalization before merging the scores:
# pseudo code
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
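
The two pseudo-code recipes above can be sketched as runnable NumPy code. This is an illustration under stated assumptions, not a reference implementation. The function names are hypothetical and the similarity values are made up. The equal-weight sum after z-score normalization is also an assumption, since the pseudo code stops at the normalization step.

```python
import numpy as np

def weighted_merge(text_text_sim, text_image_sim, alpha=0.6, beta=0.4):
    """Option 1: weighted average of the two raw similarity scores."""
    return alpha * np.asarray(text_text_sim) + beta * np.asarray(text_image_sim)

def zscore(scores):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # All scores identical: nothing to rank, return zeros instead of dividing by 0.
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std

def zscore_merge(text_text_sim, text_image_sim, alpha=0.5, beta=0.5):
    """Option 2: normalize each score set, then combine (weights are an assumption)."""
    return alpha * zscore(text_text_sim) + beta * zscore(text_image_sim)

# Made-up similarities for one query against four candidates (illustration only).
cos_sim_query_documents = [0.82, 0.75, 0.60, 0.91]  # text-text
cos_sim_text_images = [0.21, 0.30, 0.18, 0.25]      # text-image

weighted = weighted_merge(cos_sim_query_documents, cos_sim_text_images)
normalized = zscore_merge(cos_sim_query_documents, cos_sim_text_images)

# Candidate indices sorted from best to worst under each scheme.
print(np.argsort(-weighted), np.argsort(-normalized))
```

The z-score variant puts both score sets on a common scale first, so the typically larger text-text similarities do not dominate the merged ranking.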

Performance

Text-Image Retrieval

| Name | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
|---|---|---|---|---|
| ViT-B-32 | 0.597 | 0.8398 | 0.781 | 0.938 |
| ViT-B-16 | 0.6216 | 0.8572 | 0.822 | 0.966 |
| jina-clip | 0.6748 | 0.8902 | 0.811 | 0.965 |

| Name | MSCOCO Image Retr. R@1 | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
|---|---|---|---|---|
| ViT-B-32 | 0.342 | 0.6001 | 0.5234 | 0.7634 |
| ViT-B-16 | 0.3309 | 0.5842 | 0.5242 | 0.767 |
| jina-clip | 0.4111 | 0.6644 | 0.5544 | 0.7904 |
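
The R@1 and R@5 columns report recall at k: the fraction of queries for which a correct match appears among the k highest-scoring candidates. The sketch below shows how such a number can be computed from a similarity matrix. It simplifies by assuming exactly one correct candidate per query, which is not necessarily how the benchmarks above were scored. `recall_at_k` is a hypothetical helper and the scores are made up.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, targets: np.ndarray, k: int) -> float:
    """Fraction of queries whose target is among the k highest-scoring candidates."""
    # Indices of the k best candidates for each query (one row per query).
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # A query counts as a hit if its target index appears in its top-k row.
    hits = (top_k == targets[:, None]).any(axis=1)
    return float(hits.mean())

# Toy similarity matrix: 3 queries (rows) x 4 candidates (columns).
scores = np.array([
    [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked first
    [0.2, 0.4, 0.8, 0.5],  # target 1 is ranked third
    [0.1, 0.6, 0.5, 0.3],  # target 2 is ranked second
])
targets = np.array([0, 1, 2])

r1 = recall_at_k(scores, targets, k=1)
r2 = recall_at_k(scores, targets, k=2)
print(r1, r2)
```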

Text-Text Retrieval

| Name | STS12 | STS15 | STS17 | STS13 | STS14 | STS16 | STS22 | STSBenchmark | SummEval |
|---|---|---|---|---|---|---|---|---|---|
| jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833 | 0.7917 | 0.836 | 0.6346 | 0.8404 | 0.3056 |
| jina-clip | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493 | 0.3048 |

| Name | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
|---|---|---|---|---|---|---|---|
| jina-embeddings-v2 | 0.4418 | 0.4158 | 0.3245 | 0.882 | 0.1986 | 0.6668 | 0.6591 |
| jina-clip | 0.4933 | 0.3827 | 0.3352 | 0.8789 | 0.2024 | 0.6734 | 0.7161 |

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find jina-clip-v1 useful in your research, please cite the following paper:

TBD