jina-embeddings-v3: Multilingual Embeddings With Task LoRA
Abstract
We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while surpassing multilingual-e5-large-instruct across all multilingual tasks.
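To make the two headline features concrete, here is a minimal usage sketch. The model ID, the custom `encode()` method, and the task identifiers are assumptions based on common Hugging Face conventions; the abstract does not specify the public API. The Matryoshka part is generic: keep the leading k dimensions and re-normalize.

```python
# Hedged sketch: model ID, encode() signature, and task names are
# assumptions, not confirmed by the paper abstract.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3",
                                  trust_remote_code=True)

# Select the LoRA adapter for the task at hand (assumed task names).
query_emb = np.asarray(model.encode(
    ["What is Matryoshka Representation Learning?"],
    task="retrieval.query"))
doc_emb = np.asarray(model.encode(
    ["Matryoshka Representation Learning trains nested sub-embeddings."],
    task="retrieval.passage"))

def truncate(emb: np.ndarray, k: int) -> np.ndarray:
    """Matryoshka truncation: keep the first k dimensions, re-normalize."""
    cut = emb[:, :k]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# Cosine similarity at a reduced dimension (e.g. 64 dims).
q64, d64 = truncate(query_emb, 64), truncate(doc_emb, 64)
print((q64 @ d64.T).item())
```

Truncating this way is safe precisely because Matryoshka Representation Learning optimizes the leading dimensions to be independently useful, so shorter prefixes of the embedding remain valid representations.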
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever (2024)
- mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval (2024)
- Ruri: Japanese General Text Embeddings (2024)
- NLLB-E5: A Scalable Multilingual Retrieval Model (2024)
- The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design (2024)