---
license: apache-2.0
datasets:
- allenai/c4
- wikimedia/wikipedia
- Skywork/SkyPile-150B
- uonlp/CulturaX
language:
- af
- ar
- az
- be
- bg
- bn
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- gl
- gu
- he
- hi
- hr
- ht
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ky
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- pa
- pl
- pt
- qu
- ro
- ru
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- vi
- yo
- zh
pipeline_tag: fill-mask
---

## gte-multilingual-mlm-base


We introduce the `mGTE` series: new generalized text encoder, embedding, and reranking models that support 75 languages and context lengths of up to 8192 tokens.
The models are built on a transformer++ encoder backbone (BERT + RoPE + GLU; see the code at [Alibaba-NLP/new-impl](https://huggingface.co/Alibaba-NLP/new-impl))
and reuse the vocabulary of `XLM-R`.

This text encoder (`mGTE-MLM-8192` in our [paper](https://arxiv.org/pdf/2407.19669)) outperforms the previous same-sized state of the art, [XLM-R-base](https://huggingface.co/FacebookAI/xlm-roberta-base),
on both [GLUE](https://aclanthology.org/W18-5446.pdf) and [XTREME-R](https://aclanthology.org/2021.emnlp-main.802.pdf).

- **Developed by**: Institute for Intelligent Computing, Alibaba Group
- **Model type**: Text Encoder
- **Paper**: [mGTE: Generalized Long-Context Text Representation and Reranking
Models for Multilingual Text Retrieval](https://arxiv.org/pdf/2407.19669).

### Model list
| Models | Language | Model Size | Max Seq. Length | GLUE | XTREME-R |
|:-----: | :-----: |:-----: |:-----: |:-----: | :-----: |
|[`gte-multilingual-mlm-base`](https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base)| Multiple | 306M | 8192 | 83.47 | 64.44 |
|[`gte-en-mlm-base`](https://huggingface.co/Alibaba-NLP/gte-en-mlm-base) | English | - | 8192 | 85.61 | - |
|[`gte-en-mlm-large`](https://huggingface.co/Alibaba-NLP/gte-en-mlm-large) | English | - | 8192 | 87.58| - |
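
Since the pipeline tag is `fill-mask`, the released checkpoint can be tried directly on masked-token prediction. The snippet below is a minimal sketch, not part of the original card: it assumes the checkpoint id `Alibaba-NLP/gte-multilingual-mlm-base`, passes `trust_remote_code=True` because the backbone uses the custom `new-impl` code, and reads the mask token from the tokenizer (the `XLM-R` vocabulary uses `<mask>`).

```python
# Minimal fill-mask sketch (illustrative, not from the original card).
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="Alibaba-NLP/gte-multilingual-mlm-base",
    trust_remote_code=True,  # backbone is implemented in the custom new-impl code
)

# Use the mask token reported by the tokenizer ("<mask>" for the XLM-R vocabulary).
mask = fill_mask.tokenizer.mask_token
print(fill_mask(f"Paris is the {mask} of France.", top_k=3))
```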


## Training Details

### Training Data

- Masked language modeling (MLM): `c4-en`, `mc4`, `skypile`, `Wikipedia`, `CulturaX`, etc. (see [Appendix A.1 of the paper](https://arxiv.org/pdf/2407.19669))

### Training Procedure 

To enable the backbone to support a context length of 8192, we adopt a multi-stage training strategy.
The model first undergoes preliminary MLM pre-training at a shorter sequence length.
We then resample the data to reduce the proportion of short texts and continue MLM pre-training at the full 8192 length.

The entire training process is as follows:
- MLM-2048: lr 2e-4, mlm_probability 0.3, batch_size 8192, num_steps 250k, rope_base 10000
- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 2048, num_steps 30k, rope_base 160000
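
For reference, the sketch below restates these two stages with the `transformers` MLM utilities. It is a hedged illustration, not the authors' training code: only the hyperparameters come from the list above, while data loading and the RoPE-base override per stage are placeholders.

```python
# Hedged sketch of the two-stage MLM schedule above (not the authors' code).
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-multilingual-mlm-base")

# 30% of the input tokens are masked in both stages (mlm_probability 0.3).
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.3
)

stages = [
    # Stage 1: short-context pre-training.
    dict(max_seq_len=2048, lr=2e-4, batch_size=8192,
         num_steps=250_000, rope_base=10_000),
    # Stage 2: continued pre-training on resampled, longer texts.
    dict(max_seq_len=8192, lr=5e-5, batch_size=2048,
         num_steps=30_000, rope_base=160_000),
]

for stage in stages:
    # Per stage: rebuild the dataloader at stage["max_seq_len"], set the RoPE
    # base in the model config to stage["rope_base"], and run
    # stage["num_steps"] optimizer steps at learning rate stage["lr"],
    # with mlm_collator producing the masked batches.
    ...
```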



## Evaluation

| Models | Language | Model Size | Max Seq. Length | GLUE | XTREME-R |
|:-----: | :-----: |:-----: |:-----: |:-----: | :-----: |
|**[`gte-multilingual-mlm-base`](https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base)**| Multiple | 306M | 8192 | 83.47 | 64.44 |
|**[`gte-en-mlm-base`](https://huggingface.co/Alibaba-NLP/gte-en-mlm-base)** | English | - | 8192 | 85.61 | - |
|**[`gte-en-mlm-large`](https://huggingface.co/Alibaba-NLP/gte-en-mlm-large)** | English | - | 8192 | 87.58| - |
|[`MosaicBERT-base`](https://huggingface.co/mosaicml/mosaic-bert-base)| English|  137M | 128 | 85.4 | - |
|[`MosaicBERT-base-2048`](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-2048)| English|  137M | 2048 | 85 | - |
|`JinaBERT-base`| English|  137M | 512 | 85 | - |
|[`nomic-bert-2048`](https://huggingface.co/nomic-ai/nomic-bert-2048)| English|  137M | 2048 | 84 | - |
|`MosaicBERT-large`| English | 434M | 128 | 86.1 | - |
|`JinaBERT-large`| English | 434M | 512 | 83.7 | - |
|[`XLM-R-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) | Multiple |  279M | 512 | 80.44 | 62.02 |
|[`RoBERTa-base`](https://huggingface.co/FacebookAI/roberta-base)| English | 125M | 512 | 86.4 | - |
|[`RoBERTa-large`](https://huggingface.co/FacebookAI/roberta-large)| English | 355M | 512 | 88.9 | - |


## Citation

If you find our paper or models helpful, please consider citing them as follows:

```bibtex
@misc{zhang2024mgtegeneralizedlongcontexttext,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval}, 
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669}, 
}
```