myeongho-jeong committed
Commit 05009de • Parent(s): d7a7019
Update README.md

README.md CHANGED
@@ -30,7 +30,14 @@ This model is a Korean vocabulary-extended version of [upstage/SOLAR-10.7B-v1.0]
<p align="left">
<img src="https://huggingface.co/yanolja/EEVE-Korean-10.8B-v1.0/resolve/main/EEVE_figure.png" width="100%"/>
<p>
-
+
+To adapt foundational models from English to Korean, we use subword-based embedding with a seven-stage training process involving parameter freezing.
+This approach progressively trains from input embeddings to full parameters, efficiently extending the model's vocabulary to include Korean.
+Our method enhances the model's cross-linguistic applicability by carefully integrating new linguistic tokens, focusing on causal language modeling pre-training.
+We leverage the inherent capabilities of foundational models trained on English to efficiently transfer knowledge and reasoning to Korean, optimizing the adaptation process.
+For details, please refer to our technical report: [Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models](https://arxiv.org).
+
+Here's a simplified version of the code for our key approach:

```python
# number_of_old_tokens is the size of tokenizer before vocab extension. For example, in case of EEVE-Korean-10.8B-v1.0, number_of_old_tokens is 32000.
@@ -47,15 +54,6 @@ for name, param in model.named_parameters():
    param.requires_grad = False
```

-Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing the Korean language capabilities. Through our experiments, we discovered:
-
-1. Freezing the `embed_tokens` layer for existing tokens is crucial to maintain overall performance.
-2. Unfreezing the `lm_head` layer for existing tokens actually boosts performance.
-
-As a result, we froze the internal layers and the first 32,000 `embed_tokens`, directing our training efforts on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean, without compromising its original language capabilities.
-
-For detail, please refer our technical report - [Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models](https://arxiv.org).
-
### Usage and Limitations

Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.
@@ -93,11 +91,11 @@ This rigorous approach ensured a comprehensive and contextually rich Korean voca
## Citation

```
-@misc{
-title={
-author={
-year={
-eprint={
+@misc{Kim2024Efficient,
+title={Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models},
+author={Seungduk Kim and Seungtaek Choi and Myeongho Jeong},
+year={2024},
+eprint={2402.XXXXX},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
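
The README's Python snippet appears only in fragments in the hunks above. As a side illustration of the selective-freeze strategy described in the removed paragraphs of the second hunk (internal layers and the first 32,000 `embed_tokens` rows frozen, `lm_head` and the new Korean embeddings trainable), here is a minimal sketch assuming PyTorch and Hugging Face `transformers`; the gradient-hook mechanism and everything other than `number_of_old_tokens` are illustrative assumptions, not code from this commit.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative sketch only; not the authors' actual training script.
number_of_old_tokens = 32000  # tokenizer size before the Korean vocab extension

model = AutoModelForCausalLM.from_pretrained("yanolja/EEVE-Korean-10.8B-v1.0")

# Freeze the internal layers; keep only the embedding matrix and lm_head trainable.
for name, param in model.named_parameters():
    if "embed_tokens" not in name and "lm_head" not in name:
        param.requires_grad = False

# Keep the first 32,000 (pre-existing) embedding rows effectively frozen by masking
# their gradients, so only the newly added Korean token embeddings receive updates.
def mask_old_token_grads(grad):
    mask = torch.ones_like(grad)
    mask[:number_of_old_tokens] = 0
    return grad * mask

model.get_input_embeddings().weight.register_hook(mask_old_token_grads)

# lm_head rows for existing tokens stay trainable, matching the finding that
# unfreezing lm_head for existing tokens boosts performance.
```

Masking gradients rather than slicing the parameter keeps `embed_tokens` as a single matrix, so a standard optimizer can be used unchanged.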
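
The added description mentions a seven-stage process that progressively trains from the input embeddings out to the full parameter set. The schedule below is only a hypothetical sketch of how such staged unfreezing could be organized; the stage boundaries are assumptions for illustration, not the authors' actual seven stages, which are detailed in the technical report.

```python
# Hypothetical staged-unfreezing schedule, widening from the input embeddings to the full model.
# These stages are placeholders for illustration, not the seven stages used by the authors.
STAGES = [
    ("input embeddings",  ["model.embed_tokens"]),
    ("output embeddings", ["lm_head"]),
    ("all embeddings",    ["model.embed_tokens", "lm_head"]),
    ("full model",        [""]),  # the empty prefix matches every parameter
]

def configure_stage(model, prefixes):
    """Enable gradients only for parameters whose names start with one of `prefixes`."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in prefixes)

# At the start of each stage, one would call configure_stage(model, prefixes) and then
# continue causal language modeling pre-training on the Korean/multilingual corpus mix.
```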