tags:
  - feature-extraction
  - sentence-similarity
  - mteb
language: en
inference: false
license: apache-2.0




The text embedding set trained by Jina AI.

Quick Start

The easiest way to start using jina-clip-v1 is through Jina AI's Embedding API.

Intended Usage & Model Info

jina-clip-v1 is an English, monolingual multimodal (text-image) embedding model.

Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but lack cross-modal retrieval capabilities. Conversely, CLIP-like models, such as openai/clip-vit-base-patch32, align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

jina-clip-v1 is an innovative multimodal embedding model. Its text component achieves comparable performance to jina-embeddings-v2-base-en in text-to-text retrieval, while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks. This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications, allowing for both text-to-text and text-to-image searches with a single model.

Data & Parameters

Jina CLIP V1 technical report coming soon.

Usage

You can use Jina CLIP directly from the transformers package.

!pip install transformers
from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two 1-D embedding vectors
def cos_sim(a, b):
    return (a @ b.T) / (norm(a) * norm(b))

# trust_remote_code=True lets transformers load the custom model code
# (encode_text / encode_image) shipped in the model repository
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# Encode two sentences and one local image file
text_embeddings = model.encode_text(['How is the weather today?', 'What is the current weather like today?'])
image_embeddings = model.encode_image(['raindrop.png'])

print(cos_sim(text_embeddings[0], text_embeddings[1])) # text embedding similarity
print(cos_sim(text_embeddings[0], image_embeddings[0])) # text-image cross-modal similarity
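
Because text and image embeddings share one vector space, a single query can be ranked against a mixed pool of candidates. The following is a minimal sketch of that ranking step, not the model's own API: `rank_by_cosine` is a hypothetical helper, and the random vectors are stand-ins for the outputs of `encode_text` and `encode_image` so that the snippet runs without downloading the model.

```python
import numpy as np

def rank_by_cosine(query_vec, candidate_vecs):
    """Return candidate indices sorted by descending cosine similarity, plus the scores.

    Hypothetical helper for illustration; not part of jina-clip-v1.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity of each candidate to the query
    order = np.argsort(-scores)         # best match first
    return order, scores

# Stand-in vectors (assumption): real ones would come from the model's encoders
rng = np.random.default_rng(0)
query = rng.normal(size=8)
candidates = rng.normal(size=(4, 8))
candidates[2] = query + 0.01 * rng.normal(size=8)  # make candidate 2 nearly identical to the query

order, scores = rank_by_cosine(query, candidates)
print(order, scores)
```

Here candidate 2 is constructed to be close to the query, so it should come first in `order`. With real embeddings, the candidate matrix could stack text and image vectors together.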

Performance

Text-Image Retrieval

Flickr

| Name | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
|---|---|---|---|---|
| ViT-B-32 | 0.597 | 0.8398 | 0.781 | 0.938 |
| ViT-B-16 | 0.6216 | 0.8572 | 0.822 | 0.966 |
| jina-clip | 0.6748 | 0.8902 | 0.811 | 0.965 |

MSCOCO

| Name | MSCOCO Image Retr. R@1 | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
|---|---|---|---|---|
| ViT-B-32 | 0.342 | 0.6001 | 0.5234 | 0.7634 |
| ViT-B-16 | 0.3309 | 0.5842 | 0.5242 | 0.767 |
| jina-clip | 0.4111 | 0.6644 | 0.5544 | 0.7904 |
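
The R@1 and R@5 columns above are recall-at-K scores. As a rough illustration of what that metric measures, here is a small sketch that assumes exactly one correct candidate per query; the evaluation protocol behind the reported numbers is not described in this document, and datasets with several captions per image would count any matching caption as a hit. The function name `recall_at_k` and the toy similarity matrix are made up for this example.

```python
import numpy as np

def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose correct candidate is among the top-k scored candidates.

    similarity: (n_queries, n_candidates) score matrix
    gt_index:   (n_queries,) index of the single correct candidate per query (simplifying assumption)
    """
    topk = np.argsort(-similarity, axis=1)[:, :k]       # k best candidates per query
    hits = (topk == gt_index[:, None]).any(axis=1)      # True where the correct one is included
    return hits.mean()

# Toy scores (invented): rows are queries, columns are candidates
sim = np.array([
    [0.9, 0.1, 0.3],   # query 0: correct candidate 0 is ranked 1st
    [0.2, 0.4, 0.8],   # query 1: correct candidate 1 is ranked 2nd
    [0.5, 0.6, 0.1],   # query 2: correct candidate 2 is ranked 3rd
])
gt = np.array([0, 1, 2])

for k in (1, 2, 3):
    print(k, recall_at_k(sim, gt, k))
```

In this toy case one of three queries is answered at rank 1, two of three within the top 2, and all three within the top 3.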

Text-Text Retrieval

STS

| Name | STS12 | STS15 | STS17 | STS13 | STS14 | STS16 | STS22 | STSBenchmark | SummEval |
|---|---|---|---|---|---|---|---|---|---|
| jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833 | 0.7917 | 0.836 | 0.6346 | 0.8404 | 0.3056 |
| jina-clip | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493 | 0.3048 |

BEIR

| Name | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
|---|---|---|---|---|---|---|---|
| jina-embeddings-v2 | 0.4418 | 0.4158 | 0.3245 | 0.882 | 0.1986 | 0.6668 | 0.6591 |
| jina-clip | 0.4933 | 0.3827 | 0.3352 | 0.8789 | 0.2024 | 0.6734 | 0.7161 |
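
The table does not state which metric it reports. BEIR results are commonly given as nDCG@10, so, assuming that is the case here, the sketch below shows how nDCG@k can be computed for a single query. The function `ndcg_at_k` is illustrative only, uses linear gain, and computes the ideal ordering from the supplied list, which assumes every relevant document appears in that list.

```python
import numpy as np

def ndcg_at_k(ranked_relevance, k=10):
    """nDCG@k for one query (illustrative sketch, linear gain).

    ranked_relevance: relevance grades in the order the system ranked the documents.
    """
    rel = np.asarray(ranked_relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # 1/log2(rank + 1) for ranks 1..k

    top = rel[:k]
    dcg = np.sum(top * discounts[:top.size])

    ideal = np.sort(rel)[::-1][:k]                   # best possible ordering of the same grades
    idcg = np.sum(ideal * discounts[:ideal.size])

    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 0]))   # relevant document ranked first
print(ndcg_at_k([0, 1, 0]))   # relevant document ranked second
```

A per-dataset score would then be the mean of this value over all queries.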

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina CLIP useful in your research, please cite the following paper:

TBD