tags:
- feature-extraction
- sentence-similarity
- mteb
language: en
inference: false
license: apache-2.0
The embedding set trained by Jina AI.
Quick Start
The easiest way to start using jina-clip-v1 is via Jina AI's Embedding API.
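For illustration, a call to the Embedding API might look like the sketch below. The endpoint URL, payload shape, and the JINA_API_KEY environment variable are assumptions here rather than confirmed details, so consult the API documentation for the exact format.
# Minimal sketch of an Embedding API call (endpoint and payload shape are assumptions).
import os
import requests

response = requests.post(
    'https://api.jina.ai/v1/embeddings',                                 # assumed endpoint
    headers={'Authorization': f'Bearer {os.environ["JINA_API_KEY"]}'},   # assumed auth scheme
    json={
        'model': 'jina-clip-v1',
        'input': [
            {'text': 'How is the weather today?'},          # a text input
            {'image': 'https://example.com/raindrop.png'},  # an image input by URL (placeholder)
        ],
    },
)
embeddings = [item['embedding'] for item in response.json()['data']]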
Intended Usage & Model Info
jina-clip-v1 is an English, monolingual multimodal (text-image) embedding model.
Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but lack cross-modal retrieval capabilities. Conversely, CLIP-like models, such as openai/clip-vit-base-patch32, align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.
jina-clip-v1 is an innovative multimodal embedding model. Its text component achieves performance comparable to jina-embeddings-v2-base-en in text-to-text retrieval, while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks. This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications, allowing both text-to-text and text-to-image search with a single model.
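As a concrete illustration of this single-model setup, here is a minimal sketch that embeds a small mixed corpus of text passages and image files and ranks every item against a text query. It relies on the encode_text and encode_image methods shown in the Usage section below; the passages and image file names are placeholders.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# Text passages and image files end up in the same embedding space.
passages = ['A storm is expected this afternoon.', 'The recipe calls for two eggs.']
images = ['raindrop.png', 'omelette.png']  # placeholder file names
corpus = passages + images
corpus_vecs = np.vstack([model.encode_text(passages), model.encode_image(images)])

# Embed a text query once and rank every item, regardless of modality.
query_vec = model.encode_text(['What is the weather like?'])[0]
scores = corpus_vecs @ query_vec / (
    np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec)
)
for score, item in sorted(zip(scores, corpus), reverse=True):
    print(f'{score:.3f}  {item}')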
Data & Parameters
Jina CLIP V1 technical report coming soon.
Usage
You can use Jina CLIP directly via the transformers package.
!pip install transformers
from transformers import AutoModel
from numpy.linalg import norm

# cosine similarity between two embedding vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

# the repository ships custom modeling code, so trust_remote_code=True is required
# (its custom code may additionally require packages such as einops and timm)
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

text_embeddings = model.encode_text(['How is the weather today?', 'What is the current weather like today?'])
image_embeddings = model.encode_image(['raindrop.png'])  # path to a local image file

print(cos_sim(text_embeddings[0], text_embeddings[1]))   # text-to-text similarity
print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-to-image cross-modal similarity
Contact
Join our Discord community and chat with other community members about ideas.
Citation
If you find Jina CLIP useful in your research, please cite the following paper:
TBD