File size: 3,446 Bytes
613cdc7
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
30a1050
613cdc7
 
971d357
 
 
 
 
 
 
 
 
 
 
 
3382538
971d357
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3382538
971d357
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
language: 
  - en
  - ja
  - pt
  - es
  - ko
  - ar
  - tr
  - th
  - fr
  - id
  - ru
  - de
  - fa
  - it
  - zh
  - pl
  - hi
  - ur
  - nl
  - el
  - ms
  - ca
  - sr
  - sv
  - uk
  - he
  - fi
  - cs
  - ta
  - ne
  - vi
  - hu
  - eo
  - bn
  - mr
  - ml
  - hr
  - no
  - sw
  - sl
  - te
  - az
  - da
  - ro
  - gl
  - gu
  - ps
  - mk
  - kn
  - bg
  - lv
  - eu
  - pa
  - et
  - mn
  - sq
  - si
  - sd
  - la
  - is
  - jv
  - lt
  - ku
  - am
  - bs
  - hy
  - or
  - sk
  - uz
  - cy
  - my
  - su
  - br
  - as
  - af
  - be
  - fy
  - kk
  - ga
  - lo
  - ka
  - km
  - sa
  - mg
  - so
  - ug
  - ky
  - gd
  - yi
tags:
  - Twitter
  - Multilingual
license: "apache-2.0"
mask_token: "<mask>"
---

# TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-green.svg?style=flat-square)](http://makeapullrequest.com)
[![arXiv](https://img.shields.io/badge/arXiv-2203.15827-b31b1b.svg)](https://arxiv.org/abs/2209.07562)


This repo contains models, code and pointers to datasets from our paper: [TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations](https://arxiv.org/abs/2209.07562).
[[PDF]](https://arxiv.org/pdf/2209.07562.pdf)
[[HuggingFace Models]](https://huggingface.co/Twitter)

### Overview
TwHIN-BERT is a new multi-lingual Tweet language model that is trained on 7 billion Tweets from over 100 distinct languages. TwHIN-BERT differs from prior pre-trained language models as it is trained with not only text-based self-supervision (e.g., MLM), but also with a social objective based on the rich social engagements within a Twitter Heterogeneous Information Network (TwHIN).

TwHIN-BERT can be used as a drop-in replacement for BERT in a variety of NLP and recommendation tasks. It not only outperforms similar models semantic understanding tasks such text classification), but also **social recommendation** tasks such as predicting user to Tweet engagement.

## 1. Pretrained Models

We initially release two pretrained TwHIN-BERT models (base and large) that are compatible wit the [HuggingFace BERT models](https://github.com/huggingface/transformers).


| Model | Size | Download Link (🤗 HuggingFace) |
| ------------- | ------------- | --------- |
| TwHIN-BERT-base   | 280M parameters | [Twitter/TwHIN-BERT-base](https://huggingface.co/Twitter/twhin-bert-base) |
| TwHIN-BERT-large  | 550M parameters | [Twitter/TwHIN-BERT-large](https://huggingface.co/Twitter/twhin-bert-large) |


To use these models in 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('Twitter/twhin-bert-base')
model = AutoModel.from_pretrained('Twitter/twhin-bert-base')
inputs = tokenizer("I'm using TwHIN-BERT! #TwHIN-BERT #NLP", return_tensors="pt")
outputs = model(**inputs)
```



<!-- ## 2. Set up environment and data
### Environment
TBD


## 3. Fine-tune TwHIN-BERT

TBD -->


## Citation
If you use TwHIN-BERT or out datasets in your work, please cite the following:
```bib
@article{zhang2022twhin,
  title={TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations},
  author={Zhang, Xinyang and Malkov, Yury and Florez, Omar and Park, Serim and McWilliams, Brian and Han, Jiawei and El-Kishky, Ahmed},
  journal={arXiv preprint arXiv:2209.07562},
  year={2022}
}
```