	---
	tags:
	- feature-extraction
	- sentence-similarity
	- mteb
	language: en
	inference: false
	license: apache-2.0
	---
	<!-- TODO: add evaluation results here -->
	<br><br>

	<p align="center">
	<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
	</p>


	<p align="center">
	<b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
	</p>

	## Quick Start

	The easiest way to start using `jina-clip-v1` is through Jina AI's [Embedding API](https://jina.ai/embeddings/).

	## Intended Usage & Model Info

	`jina-clip-v1` is an English, monolingual multimodal (text-image) embedding model.

	Traditional text embedding models, such as [jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en),
	excel in text-to-text retrieval but lack cross-modal retrieval capabilities.
	Conversely, CLIP-like models, such as [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
	align image embeddings with text embeddings but underperform in text-to-text retrieval due to their training methodology and context length limitations.

	`jina-clip-v1` closes this gap in a single multimodal embedding model.
	Its text component performs comparably to `jina-embeddings-v2-base-en` in text-to-text retrieval,
	while the overall model delivers state-of-the-art performance in cross-modal retrieval tasks.
	This makes it an ideal choice for multimodal retrieval-augmented generation (M-RAG) applications,
	allowing for both text-to-text and text-to-image searches with a single model.


	## Data & Parameters

	The `jina-clip-v1` technical report is coming soon.

	## Usage

	You can use `jina-clip-v1` directly from the `transformers` package.

	```python
	# Install the dependencies first (in a notebook, prefix the command with "!"):
	#   pip install transformers einops timm pillow
	from transformers import AutoModel
	from numpy.linalg import norm


	def cos_sim(a, b):
	    """Cosine similarity between two 1-D embedding vectors."""
	    return (a @ b.T) / (norm(a) * norm(b))


	# trust_remote_code=True lets transformers load the model code from the repository
	model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

	sentences = ['How is the weather today?', 'What is the current weather like today?']
	images = ['raindrop.jpg', 'sunny.jpg']

	text_embeddings = model.encode_text(sentences)
	image_embeddings = model.encode_image(images)

	print(cos_sim(text_embeddings[0], text_embeddings[1]))  # text-text similarity
	print(cos_sim(text_embeddings[0], image_embeddings[0]))  # text-image cross-modal similarity
	```
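
The `cos_sim` helper above compares a single pair of vectors. When ranking many candidates at once, a matrix form is convenient. The following is a minimal sketch, not part of the model's API: `cosine_similarity_matrix` and `rank_candidates` are hypothetical helper names, and the random arrays (with an arbitrary dimension of 8) only stand in for the outputs of `model.encode_text` and `model.encode_image`.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the (len(a), len(b)) matrix of pairwise cosine similarities."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalize each row to unit length so the dot product equals the cosine
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def rank_candidates(query_embeddings, candidate_embeddings):
    """For each query, return candidate indices sorted from most to least similar."""
    scores = cosine_similarity_matrix(query_embeddings, candidate_embeddings)
    return np.argsort(-scores, axis=1)


# Stand-in data (assumption): replace with model.encode_text(...) / model.encode_image(...)
rng = np.random.default_rng(0)
query_embeddings = rng.normal(size=(2, 8))      # e.g. two text queries
candidate_embeddings = rng.normal(size=(5, 8))  # e.g. five images or documents

ranking = rank_candidates(query_embeddings, candidate_embeddings)
print(ranking[0])  # candidate indices for the first query, best match first
```

Normalizing row by row matters here: applying the single-pair `cos_sim` to 2-D arrays would divide by the norm of the whole matrix rather than by each vector's norm.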

	Note: our empirical study shows that text-text cosine similarity is usually larger than text-image cosine similarity.
	If you want to merge the two scores, we recommend two approaches:

	1. Weighted average of the text-text and text-image similarities:

	```python
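	# Pseudocode: sim() stands for cosine similarity, e.g. cos_sim from the usage example
	alpha = 0.6  # weight of the text-text (query-document) score
	beta = 0.4   # weight of the text-image score

	combined_scores = alpha * sim(query, document) + beta * sim(text, image)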
	```

	2. Apply z-score normalization before merging the scores:

	```python
	# Pseudocode: cos_sim_query_documents and cos_sim_text_images are arrays of cosine similarities
	import numpy as np

	query_document_mean = np.mean(cos_sim_query_documents)
	query_document_std = np.std(cos_sim_query_documents)
	text_image_mean = np.mean(cos_sim_text_images)
	text_image_std = np.std(cos_sim_text_images)

	# Standardize each score list to zero mean and unit standard deviation
	query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
	text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
	```
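
Both approaches can be sketched end to end as follows. This is an illustrative sketch only: the score arrays are made-up numbers rather than model outputs, the function names are hypothetical, and combining the z-scored values with the same `alpha`/`beta` weights is an assumption, since the text above only describes the normalization step.

```python
import numpy as np


def weighted_merge(text_text_sim, text_image_sim, alpha=0.6, beta=0.4):
    """Approach 1: weighted average of the raw cosine similarities."""
    return alpha * np.asarray(text_text_sim) + beta * np.asarray(text_image_sim)


def z_score(scores):
    """Standardize scores to zero mean and unit standard deviation."""
    scores = np.asarray(scores, dtype=np.float64)
    return (scores - scores.mean()) / scores.std()


def z_score_merge(text_text_sim, text_image_sim, alpha=0.6, beta=0.4):
    """Approach 2: z-score each score list, then merge (the weights are an assumption)."""
    return alpha * z_score(text_text_sim) + beta * z_score(text_image_sim)


# Made-up similarities of one query against three candidates
cos_sim_query_documents = np.array([0.82, 0.75, 0.90])  # text-text
cos_sim_text_images = np.array([0.31, 0.22, 0.25])      # text-image

print(weighted_merge(cos_sim_query_documents, cos_sim_text_images))
print(z_score_merge(cos_sim_query_documents, cos_sim_text_images))
```

Note that `z_score` divides by the standard deviation, so a score list whose values are all identical would need a guard against division by zero.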

	## Performance

	### Text-Image Retrieval

	| Name | Flickr Image Retr. R@1 | Flickr Image Retr. R@5 | Flickr Text Retr. R@1 | Flickr Text Retr. R@5 |
	|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
	| ViT-B-32 | 0.597 | 0.8398 | 0.781 | 0.938 |
	| ViT-B-16 | 0.6216 | 0.8572 | 0.822 | 0.966 |
	| jina-clip | 0.6748 | 0.8902 | 0.811 | 0.965 |


	| Name | MSCOCO Image Retr. R@1 | MSCOCO Image Retr. R@5 | MSCOCO Text Retr. R@1 | MSCOCO Text Retr. R@5 |
	|------------------|-------------------------|-------------------------|-----------------------|-----------------------|
	| ViT-B-32 | 0.342 | 0.6001 | 0.5234 | 0.7634 |
	| ViT-B-16 | 0.3309 | 0.5842 | 0.5242 | 0.767 |
	| jina-clip | 0.4111 | 0.6644 | 0.5544 | 0.7904 |
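
For readers unfamiliar with the R@1 and R@5 columns, the sketch below shows one common way Recall@K is computed from a similarity matrix. It assumes exactly one correct candidate per query, which is a simplification, and it is not the evaluation code used to produce the tables above. `recall_at_k` is a hypothetical helper name and the scores are toy values.

```python
import numpy as np


def recall_at_k(similarity, correct_index, k):
    """Fraction of queries whose correct candidate is among the top-k scores.

    similarity: (num_queries, num_candidates) score matrix
    correct_index: (num_queries,) index of the single correct candidate per query
    """
    similarity = np.asarray(similarity)
    correct_index = np.asarray(correct_index)
    # Indices of the k highest-scoring candidates for each query
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == correct_index[:, None]).any(axis=1)
    return hits.mean()


# Toy example: three queries, three candidates
similarity = np.array([
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1],  # correct candidate 2 is ranked last
])
correct_index = np.array([0, 1, 2])

print(recall_at_k(similarity, correct_index, k=1))  # one of three queries hits
print(recall_at_k(similarity, correct_index, k=2))  # two of three queries hit
```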

	### Text-Text Retrieval

	| Name | STS12 | STS15 | STS17 | STS13 | STS14 | STS16 | STS22 | STSBenchmark | SummEval |
	|-----------------------|--------|--------|--------|--------|--------|--------|--------|--------------|----------|
	| jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833 | 0.7917 | 0.836 | 0.6346 | 0.8404 | 0.3056 |
	| jina-clip | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493 | 0.3048 |


	| Name | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
	|--------------------|---------|----------|----------|-------|---------|---------|-----------|
	| jina-embeddings-v2 | 0.4418 | 0.4158 | 0.3245 | 0.882 | 0.1986 | 0.6668 | 0.6591 |
	| jina-clip | 0.4933 | 0.3827 | 0.3352 | 0.8789 | 0.2024 | 0.6734 | 0.7161 |

	## Contact

	Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

	## Citation

	If you find `jina-clip-v1` useful in your research, please cite the following paper:

	```console
	TBD
	```