Jinkin committed
Commit bc8ae6a
Parent: 9c7ac59

update readme zh

Files changed (2):
  1. README.md +3 -3
  2. README_zh.md +79 -0
README.md CHANGED
@@ -1073,7 +1073,7 @@ effectively harnessing textual data and labels from diverse downstream
  tasks. In addition, Piccolo2 scales up the embedding dimension and uses
  MRL training to support more flexible vector dimensions.
 
- ## Model Highlights
+ ## 💡 Model Highlights
  The main feature of piccolo2 is that it uses a multi-task hybrid loss during training.
  For retrieval/sorting tasks, we use the standard InfoNCE with in-batch negatives:
  <p align='left'>
@@ -1092,7 +1092,7 @@ it can easily lead to conflict training targets:
  <img src='assets/3.png' width='400' height='80'>
  </p>
 
- ## Experiments and Results
+ ## 📃 Experiments and Results
  Piccolo2 primarily focuses on the downstream general finetune paradigm. Our open-source model uses [stella-v3.5](https://huggingface.co/infgrad/stella-mrl-large-zh-v3.5-1792d) as initialization and was trained for about 2500 steps on 32 GPUs. For more implementation details, please refer to our [technical report](https://arxiv.org/abs/2405.06932).
 
  | Model Name | Model Size (GB) | Dimension | Sequence Length | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) | Average (35) |
@@ -1102,7 +1102,7 @@ Piccolo2 primarily focuses on the downstream general finetune paradigm. Our open
  | [acge-text-embedding](https://huggingface.co/aspire/acge_text_embedding) | 1.21 | 1792 | 512 | 72.75 | 58.7 | 87.84 | 67.98 | 72.93 | 62.09 | 69.07 |
 
 
- ## Usage
+ ## 🔨 Usage
  The piccolo model can be easily accessed in the sentence-transformers package:
  ```python
  # for s2s/s2p dataset, you can use piccolo as below
README_zh.md ADDED
@@ -0,0 +1,79 @@
[EN](README.md) | [简体中文](README_zh.md)

**News**

**[2024-05-14]**
We have released the model weights, training code, and technical report; welcome to follow our work.
Our training code is available on GitHub: https://github.com/hjq133/piccolo-embedding
For training details, please refer to our technical report: https://arxiv.org/abs/2405.06932


**[2024-04-22]**

piccolo-large-zh-v2 currently ranks first on the C-MTEB leaderboard, about 1.9 points ahead of the previous first-place BERT-based model.

## Piccolo-large-zh-v2

piccolo-large-zh-v2 is a Chinese embedding model developed by the General Model Group of SenseTime Research. This upgraded version of Piccolo focuses on a general downstream fine-tuning paradigm. Piccolo2 uses an efficient multi-task hybrid loss during training to effectively harness textual data and labels from diverse downstream tasks. In addition, Piccolo2 scales up the embedding dimension and uses MRL training to support more flexible vector dimensions.

## 💡 Model Highlights
The main feature of piccolo2 is that it uses a multi-task hybrid loss during training.
For retrieval/reranking tasks, we use the standard InfoNCE loss with in-batch negatives:
<p align='left'>
<img src='assets/1.png' width='400' height='80'>
</p>
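
In code, the in-batch-negative InfoNCE above corresponds roughly to the following PyTorch sketch (illustrative only: the tensor names, the cosine similarity, and the temperature of 0.05 are assumptions, not the exact training implementation; see the technical report for the precise formulation):

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb, pos_emb, temperature=0.05):
    """InfoNCE with in-batch negatives.

    query_emb, pos_emb: (batch, dim) embeddings of queries and their positives.
    Every other positive in the batch serves as a negative for a given query.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                       # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)    # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```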

For STS / pair-classification tasks, we use a ranking loss: the CoSENT loss. On datasets with fine-grained labels (e.g. similarity scores), ranking losses have generally been shown to perform better:
<p align='left'>
<img src='assets/2.png' width='450' height='90'>
</p>
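
Likewise, a common PyTorch formulation of the CoSENT ranking loss is sketched below (the scale factor of 20 and the pairwise construction are assumptions; the report gives the formulation actually used):

```python
import torch
import torch.nn.functional as F

def cosent_loss(emb_a, emb_b, scores, scale=20.0):
    """CoSENT-style ranking loss.

    emb_a, emb_b: (batch, dim) embeddings of the two sides of each sentence pair.
    scores: (batch,) fine-grained similarity labels; only their ordering matters.
    Penalizes any pair (i, j) where scores[i] < scores[j] but cosine sim of i exceeds that of j.
    """
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1) * scale   # (batch,)
    diff = cos[:, None] - cos[None, :]                        # cos[i] - cos[j] for all i, j
    mask = scores[:, None] < scores[None, :]                  # keep only mis-ordered candidates
    diff = diff[mask]
    # log(1 + sum(exp(diff))), computed stably via logsumexp over [0, diff]
    return torch.logsumexp(torch.cat([torch.zeros(1, device=diff.device), diff]), dim=0)
```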

For classification/clustering tasks, we convert the dataset into a triplet format by treating a text and its semantic labels as positive and negative pairs, and we again optimize with InfoNCE. However, in-batch negatives can no longer be used for these tasks, since they easily lead to conflicting training targets:
<p align='left'>
<img src='assets/3.png' width='400' height='80'>
</p>
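
A corresponding sketch for this triplet-style data, where only each sample's own label negatives are used and in-batch negatives are disabled (names and temperature are again assumptions):

```python
import torch
import torch.nn.functional as F

def info_nce_no_in_batch(text_emb, pos_label_emb, neg_label_emb, temperature=0.05):
    """InfoNCE restricted to each sample's own negatives.

    text_emb:      (batch, dim) embeddings of the input texts.
    pos_label_emb: (batch, dim) embeddings of the correct semantic labels.
    neg_label_emb: (batch, num_neg, dim) embeddings of the wrong labels for each text.
    Other samples in the batch are NOT used as negatives, which avoids conflicting
    targets when several texts in a batch share the same label.
    """
    t = F.normalize(text_emb, dim=-1)
    p = F.normalize(pos_label_emb, dim=-1)
    n = F.normalize(neg_label_emb, dim=-1)
    pos_logit = (t * p).sum(-1, keepdim=True)                 # (batch, 1)
    neg_logits = torch.einsum('bd,bkd->bk', t, n)             # (batch, num_neg)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(t.size(0), dtype=torch.long, device=t.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```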

## 📃 Experiments and Results
Piccolo2 primarily focuses on a general downstream fine-tuning paradigm. Our open-source model uses [stella-v3.5](https://huggingface.co/infgrad/stella-mrl-large-zh-v3.5-1792d) as initialization and was trained for 2500 steps on 32 A100 GPUs. For more implementation details, please refer to our [technical report](https://arxiv.org/abs/2405.06932) and [training code](https://github.com/hjq133/piccolo-embedding).

| Model Name | Model Size (GB) | Dimension | Sequence Length | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) | Average (35) |
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [**piccolo-large-zh-v2**](https://huggingface.co/sensenova/piccolo-large-zh-v2) | 1.21 | 1792 | 512 | 74.59 | 62.17 | 90.24 | 70 | 74.36 | 63.5 | 70.95 |
| [gte-Qwen1.5-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct) | 26.45 | 32768 | 4096 | 73.35 | 67.08 | 88.52 | 66.38 | 70.62 | 62.32 | 69.56 |
| [acge-text-embedding](https://huggingface.co/aspire/acge_text_embedding) | 1.21 | 1792 | 512 | 72.75 | 58.7 | 87.84 | 67.98 | 72.93 | 62.09 | 69.07 |


## 🔨 Usage
The piccolo model can be easily accessed via the sentence-transformers package:
```python
# for s2s/s2p datasets, you can use piccolo as below
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
matryoshka_dim = 1792  # supported dims: 256, 512, 768, 1024, 1280, 1536, 1792
model = SentenceTransformer('sensenova/piccolo-large-zh-v2')
embeddings_1 = model.encode(sentences, normalize_embeddings=False)
embeddings_2 = model.encode(sentences, normalize_embeddings=False)
# truncate to the desired MRL dimension, then L2-normalize
embeddings_1 = normalize(embeddings_1[..., :matryoshka_dim], norm="l2", axis=1)
embeddings_2 = normalize(embeddings_2[..., :matryoshka_dim], norm="l2", axis=1)
similarity = embeddings_1 @ embeddings_2.T
```
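
Because the model is trained with MRL, the embeddings can also be truncated to a smaller dimension before normalization. Below is a small, illustrative semantic-search sketch using a 512-dimensional slice (the query and corpus strings are made up for demonstration):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sensenova/piccolo-large-zh-v2')
matryoshka_dim = 512  # any of the supported MRL dimensions listed above

corpus = ["文档一的内容", "文档二的内容", "文档三的内容"]
query = ["检索问题"]

# encode, truncate to the chosen dimension, then L2-normalize
corpus_emb = normalize(model.encode(corpus)[..., :matryoshka_dim], norm="l2", axis=1)
query_emb = normalize(model.encode(query)[..., :matryoshka_dim], norm="l2", axis=1)

scores = query_emb @ corpus_emb.T            # cosine similarities, shape (1, len(corpus))
best = int(np.argmax(scores, axis=1)[0])
print(corpus[best], float(scores[0, best]))
```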

## 🤗 Model List
| Model | Language | Description | Prompt |
|:-|:-:|:-:|:--:|
| [sensenova/piccolo-large-zh-v2](https://huggingface.co/sensenova/piccolo-large-zh-v2) | Chinese | version 2: fine-tuned with multi-task hybrid-loss training | None |
| [sensenova/piccolo-large-zh](https://huggingface.co/sensenova/piccolo-large-zh) | Chinese | version 1: pretrained on 400 million Chinese text pairs | '查询'/'结果' |
| [sensenova/piccolo-base-zh](https://huggingface.co/sensenova/piccolo-base-zh) | Chinese | version 1: pretrained on 400 million Chinese text pairs | '查询'/'结果' |


## Citation
If our technical report, model, or training code is helpful to you, please cite our paper as below, or give us a star on GitHub or Hugging Face!
```bibtex
@misc{2405.06932,
  Author = {Junqin Huang and Zhongjie Hu and Zihao Jing and Mengya Gao and Yichao Wu},
  Title = {Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training},
  Year = {2024},
  Eprint = {arXiv:2405.06932},
}
```