Trofish committed
Commit e731e52 • 1 Parent(s): b1d0372

Update README.md

Files changed (1)
  1. README.md +36 -5
README.md CHANGED
@@ -1,3 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
  # RoBERTa-base Korean

  ## Model Description
@@ -9,8 +20,8 @@
  - **Architecture**: RobertaForMaskedLM
  - **Model size**: 128 hidden size, 8 hidden layers, 8 attention heads
  - **max_position_embeddings**: 514
- - **intermediate_size**: 2048
- - **vocab_size**: 1428

  ## Training Data
  The following datasets were used:
@@ -18,12 +29,12 @@
  - **AIHUB**: SNS, YouTube comments, book sentences
  - **Other**: Namuwiki, Korean Wikipedia

- The combined data totals about 11GB.

  ## Training Details
  - **BATCH_SIZE**: 112 (per GPU)
  - **ACCUMULATE**: 36
- - **Total_BATCH_SIZE**: 8064
  - **MAX_STEPS**: 12,500
  - **TRAIN_STEPS * BATCH_SIZE**: **100M**
  - **WARMUP_STEPS**: 2,400
@@ -34,9 +45,15 @@


  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/TPSI6kksBLzcbloDCUgwc.png)
-
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/z3_zVWsGsyT7YD9Zr9aeK.png)

  ## Usage
  ### The tokenizer works at the syllable level rather than wordpiece, so use SyllableTokenizer instead of AutoTokenizer.
  ### (Import and use the syllabletokenizer.py file provided in the repo.)
@@ -51,3 +68,17 @@ tokenizer = SyllableTokenizer(vocab_file='vocab.json',**tokenizer_kwargs)
  # Convert the text to tokens and run prediction
  inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
  outputs = model(**inputs)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ ---
+ license: apache-2.0
+ datasets:
+ - klue/klue
+ language:
+ - ko
+ metrics:
+ - f1
+ - accuracy
+ - pearsonr
+ ---
  # RoBERTa-base Korean

  ## Model Description
 
  - **Architecture**: RobertaForMaskedLM
  - **Model size**: 128 hidden size, 8 hidden layers, 8 attention heads
  - **max_position_embeddings**: 514
+ - **intermediate_size**: 2,048
+ - **vocab_size**: 1,428
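
The hyperparameters listed above can be gathered into a small configuration object. What follows is a minimal sketch in plain Python, not the repository's actual configuration file: the class name `TinyRobertaSpec` and its derived properties are hypothetical, and only the field names follow the Hugging Face `RobertaConfig` naming. It derives the per-head dimension and the sizes of the two embedding tables from the listed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TinyRobertaSpec:
    """Hypothetical container for the hyperparameters listed in the README."""

    hidden_size: int = 128
    num_hidden_layers: int = 8
    num_attention_heads: int = 8
    max_position_embeddings: int = 514
    intermediate_size: int = 2048
    vocab_size: int = 1428

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    @property
    def token_embedding_params(self) -> int:
        # One hidden_size-dimensional vector per vocabulary entry.
        return self.vocab_size * self.hidden_size

    @property
    def position_embedding_params(self) -> int:
        # One hidden_size-dimensional vector per position slot.
        return self.max_position_embeddings * self.hidden_size


spec = TinyRobertaSpec()
print(spec.head_dim, spec.token_embedding_params, spec.position_embedding_params)
```

Under these values each head has dimension 16, and the token embedding table holds 182,784 parameters.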

  ## Training Data
  The following datasets were used:
 
  - **AIHUB**: SNS, YouTube comments, book sentences
  - **Other**: Namuwiki, Korean Wikipedia

+ The combined data totals **about 11GB** **(4B tokens)**.

  ## Training Details
  - **BATCH_SIZE**: 112 (per GPU)
  - **ACCUMULATE**: 36
+ - **Total_BATCH_SIZE**: 8,064
  - **MAX_STEPS**: 12,500
  - **TRAIN_STEPS * BATCH_SIZE**: **100M**
  - **WARMUP_STEPS**: 2,400
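
The batch figures listed above are related by simple arithmetic, sketched here in Python. The README does not state the GPU count. The value computed below is an assumption inferred from the listed total batch size, and the variable names are illustrative only.

```python
# Values as listed in the README.
batch_size_per_gpu = 112
accumulate_steps = 36
total_batch_size = 8064
max_steps = 12_500

# Samples contributed by one GPU per optimizer step.
per_gpu_effective = batch_size_per_gpu * accumulate_steps

# Assumption: the GPU count is whatever makes the listed total work out.
inferred_gpus = total_batch_size // per_gpu_effective

# Total samples seen over training (the README rounds this to 100M).
samples_seen = max_steps * total_batch_size

print(per_gpu_effective, inferred_gpus, samples_seen)
```

This gives 4,032 samples per GPU per optimizer step, an inferred two GPUs, and 100,800,000 samples in total, which is consistent with the rounded 100M figure.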
 


  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/TPSI6kksBLzcbloDCUgwc.png)

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/z3_zVWsGsyT7YD9Zr9aeK.png)

+ ## Performance Evaluation
+ - **Performance was evaluated on the KLUE benchmark test.**
+ - Because the model is much smaller than klue-roberta-base, its scores are lower, but the variant with hidden size 512 performed well for its size.
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/I8e60cf9w-IQCHDgKiooq.png)
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/hkc5ko9Vo-pkKmtouN7xc.png)
+
  ## Usage
  ### The tokenizer works at the syllable level rather than wordpiece, so use SyllableTokenizer instead of AutoTokenizer.
  ### (Import and use the syllabletokenizer.py file provided in the repo.)
 
  # Convert the text to tokens and run prediction
  inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
  outputs = model(**inputs)
+ ```
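
To illustrate what syllable-level tokenization means in contrast to wordpiece, here is a minimal Python sketch. It is not the repository's `SyllableTokenizer`: the function names `split_syllables` and `encode`, the toy vocabulary, and the handling of whitespace and `[UNK]` are all assumptions made for illustration. It relies on the fact that each precomposed Hangul syllable is a single Unicode code point, so iterating over a string yields one syllable per step.

```python
from typing import Dict, List


def split_syllables(text: str) -> List[str]:
    """Return one token per non-whitespace character (one Hangul syllable each).

    Hypothetical helper: whitespace is simply dropped here, which may differ
    from how the repository's SyllableTokenizer treats spaces.
    """
    return [ch for ch in text if not ch.isspace()]


def encode(tokens: List[str], vocab: Dict[str, int], unk: str = "[UNK]") -> List[int]:
    # Map each syllable to its id, falling back to the unknown-token id.
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


# Toy vocabulary for this example only; the README lists a vocab_size of 1,428.
toy_vocab = {"[UNK]": 0, "한": 1, "국": 2, "어": 3}

tokens = split_syllables("한국어 문장")
ids = encode(tokens, toy_vocab)
print(tokens)
print(ids)
```

Because every syllable maps to its own id, the vocabulary can stay small, at the cost of longer token sequences than a wordpiece tokenizer would typically produce for the same text.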
+
+ ## Citation
+ **klue**
+ ```
+ @misc{park2021klue,
+ title={KLUE: Korean Language Understanding Evaluation},
+ author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
+ year={2021},
+ eprint={2105.09680},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```