juliuslipp committed on
Commit 6e4fca3
1 Parent(s): d5e3a9a

Update README.md

Files changed (1)
  1. README.md +29 -11
README.md CHANGED
@@ -2617,7 +2617,7 @@ pipeline_tag: feature-extraction
# mxbai-embed-large-v1

- This is our base sentence embedding model. It was trained using [AnglE](https://arxiv.org/abs/2309.12871) loss on our high-quality large scale data. It achieves SOTA performance on BERT-large scale. Find out more in our [blog post](https://mixedbread.ai/blog/mxbai-embed-large-v1).

## Quickstart
@@ -2631,10 +2631,13 @@ python -m pip install -U sentence-transformers
```python
from sentence_transformers import SentenceTransformer
- from sentence_transformers.util import cos_sim

- # 1. load model
- model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# For retrieval you need to pass this prompt.
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'
@@ -2650,8 +2653,13 @@ docs = [
# 2. Encode
embeddings = model.encode(docs)

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)
```

### Transformers
@@ -2669,7 +2677,7 @@ def transform_query(query: str) -> str:
    """
    return f'Represent this sentence for searching relevant passages: {query}'

- # The model works really well with cls pooling (default) but also with mean poolin.
def pooling(outputs: torch.Tensor, inputs: Dict, strategy: str = 'cls') -> np.ndarray:
    if strategy == 'cls':
        outputs = outputs[:, 0]
@@ -2743,7 +2751,7 @@ console.log(similarities); // [0.7919578577247139, 0.6369278664248345, 0.1651201
You can use the model via our API as follows:

```python
- from mixedbread_ai.client import MixedbreadAI
from sklearn.metrics.pairwise import cosine_similarity
import os
@@ -2756,15 +2764,17 @@ english_sentences = [
res = mxbai.embeddings(
    input=english_sentences,
-     model="mixedbread-ai/mxbai-embed-large-v1"
)
- embeddings = [entry.embedding for entry in res.data]

- similarities = cosine_similarity([embeddings[0]], [embeddings[1]])
- print(similarities)
```

- The API comes with native INT8 and binary quantization support! Check out the [docs](https://mixedbread.ai/docs) for more information.

## Evaluation
As of March 2024, our model achieves SOTA performance for BERT-large-sized models on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard). It outperforms commercial models like OpenAI's text-embedding-3-large and matches the performance of models 20x its size, such as [echo-mistral-7b](https://huggingface.co/jspringer/echo-mistral-7b-instruct-lasttoken). Our model was trained with no overlap with the MTEB data, which indicates that it generalizes well across several domains, tasks, and text lengths. We know there are some limitations with this model, which will be fixed in v2.
@@ -2785,6 +2795,14 @@ As of March 2024, our model archives SOTA performance for Bert-large sized model
Please find more information in our [blog post](https://mixedbread.ai/blog/mxbai-embed-large-v1).

## Community
Please join our [Discord Community](https://discord.gg/jDfMHzAVfU) and share your feedback and thoughts! We are here to help and also always happy to chat.
 
 
# mxbai-embed-large-v1

+ Here, we provide several ways to produce sentence embeddings. Please note that you have to provide the prompt `Represent this sentence for searching relevant passages:` for the query if you want to use the model for retrieval; besides that, you don't need any prompt. Our model also supports Matryoshka Representation Learning and binary or int8 quantization. [Learn more](https://www.mixedbread.ai/blog/binary-mrl)

## Quickstart
 
```python
from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim, quantize_embeddings

+ # 1. Specify preferred dimensions (default is 1024)
+ dimensions = 512
+
+ # 2. Load the model
+ model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=dimensions)

# For retrieval you need to pass this prompt.
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'
 
# 2. Encode
embeddings = model.encode(docs)

+ # Optional: Quantize the embeddings
+ binary_embeddings = quantize_embeddings(embeddings, precision="ubinary")
+
similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)
```
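The Quickstart above creates `binary_embeddings` but does not use them further. As a rough sketch (not part of the model card), the packed `ubinary` vectors returned by `quantize_embeddings` can be ranked by Hamming distance with NumPy:

```python
import numpy as np

# The packed `ubinary` embeddings are uint8 arrays; with dimensions=512 each
# vector occupies 512 / 8 = 64 bytes. Hamming distance = number of differing bits.
query_bits, doc_bits = binary_embeddings[0], binary_embeddings[1:]
hamming = np.unpackbits(query_bits ^ doc_bits, axis=1).sum(axis=1)
print('hamming distances:', hamming)  # lower = more similar
```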
### Transformers
 
 
    """
    return f'Represent this sentence for searching relevant passages: {query}'

+ # The model works really well with cls pooling (default) but also with mean pooling.
def pooling(outputs: torch.Tensor, inputs: Dict, strategy: str = 'cls') -> np.ndarray:
    if strategy == 'cls':
        outputs = outputs[:, 0]
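The hunk cuts off before the mean-pooling branch that the new comment mentions. For illustration only, a pooling helper with such a branch could look like this (an assumed sketch, not the exact code from the README):

```python
from typing import Dict

import numpy as np
import torch

# Stand-alone sketch of the pooling helper with a mean-pooling branch (assumed for
# illustration; the full function is not shown in this diff).
def pooling(outputs: torch.Tensor, inputs: Dict, strategy: str = 'cls') -> np.ndarray:
    if strategy == 'cls':
        # Use the embedding of the [CLS] token (the model's default).
        outputs = outputs[:, 0]
    elif strategy == 'mean':
        # Average the token embeddings, ignoring padding via the attention mask.
        mask = inputs["attention_mask"][:, :, None]
        outputs = torch.sum(outputs * mask, dim=1) / torch.sum(inputs["attention_mask"], dim=1, keepdim=True)
    else:
        raise NotImplementedError
    return outputs.detach().cpu().numpy()
```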
 
You can use the model via our API as follows:

```python
+ from mixedbread_ai.client import MixedbreadAI, EncodingFormat
from sklearn.metrics.pairwise import cosine_similarity
import os
 
 
res = mxbai.embeddings(
    input=english_sentences,
+     model="mixedbread-ai/mxbai-embed-large-v1",
+     normalized=True,
+     encoding_format=[EncodingFormat.FLOAT, EncodingFormat.UBINARY, EncodingFormat.INT_8],
+     dimensions=512
)

+ encoded_embeddings = res.data[0].embedding
+ print(res.dimensions, encoded_embeddings.ubinary, encoded_embeddings.float_, encoded_embeddings.int_8)
```
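The updated example prints the returned encodings but no longer computes a similarity, even though `cosine_similarity` is still imported. A possible follow-up, assuming the response fields shown above, would be:

```python
# Collect the float embeddings for all inputs (field names as printed above).
float_embeddings = [entry.embedding.float_ for entry in res.data]

# Cosine similarity between the first sentence and the second one.
similarities = cosine_similarity([float_embeddings[0]], [float_embeddings[1]])
print(similarities)
```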
+ The API comes with native int8 and binary quantization support! Check out the [docs](https://mixedbread.ai/docs) for more information.

## Evaluation
As of March 2024, our model achieves SOTA performance for BERT-large-sized models on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard). It outperforms commercial models like OpenAI's text-embedding-3-large and matches the performance of models 20x its size, such as [echo-mistral-7b](https://huggingface.co/jspringer/echo-mistral-7b-instruct-lasttoken). Our model was trained with no overlap with the MTEB data, which indicates that it generalizes well across several domains, tasks, and text lengths. We know there are some limitations with this model, which will be fixed in v2.
 
Please find more information in our [blog post](https://mixedbread.ai/blog/mxbai-embed-large-v1).

+ ## Matryoshka and Binary Quantization
+
+ Embeddings in their commonly used form (float arrays) have a high memory footprint when used at scale. Two approaches to solve this problem are Matryoshka Representation Learning (MRL) and (binary) quantization.
+
+ While MRL reduces the number of dimensions of an embedding, binary quantization transforms the value of each dimension from a float32 into a lower precision (int8 or even binary). **The model supports both approaches!**
+
+ You can also take it one step further and combine the two. This combination of binary quantization and MRL allows you to reduce the memory usage of your embeddings significantly, which in turn leads to much lower costs when using a vector database in particular. You can read more about the technology and its advantages in our [blog post](https://www.mixedbread.ai/blog/binary-mrl).
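As a back-of-the-envelope sketch (not part of the model card), combining MRL truncation to 512 dimensions with binary quantization shrinks each vector from 4096 bytes (1024 float32 values) to 64 bytes (512 bits), roughly a 64x reduction:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Full precision: 1024 dims x 4 bytes (float32) = 4096 bytes per vector.
full = model.encode(["A man is eating a piece of bread"])

# MRL: keep only the first 512 dimensions, then pack them into bits: 512 / 8 = 64 bytes.
binary = quantize_embeddings(full[:, :512], precision="ubinary")

print(full.nbytes, "->", binary.nbytes)  # 4096 -> 64 bytes per embedding
```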
## Community
Please join our [Discord Community](https://discord.gg/jDfMHzAVfU) and share your feedback and thoughts! We are here to help and also always happy to chat.