SOLAR-KOEN-10.8B ⭐🇰🇷🇺🇸

Solar-KoEn is a continually pretrained iteration of the upstage/SOLAR-10.7B-v1.0 model, with an expanded vocabulary and additional pretraining on a Korean+English corpus.

Model Details

Model Developers: Junbum Lee (Beomi) & Taekyoon Choi (Taekyoon)

Variations: Solar-KoEn is available in a single parameter size, 10.8B, as a continually pretrained version.

Input: The model accepts only text input.

Output: The model produces text output exclusively.

Model Architecture:

SOLAR-KOEN-10.8B is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

Model	Training Data	Parameters	Context Length	GQA	Tokens	Learning Rate
SOLAR-KOEN-10.8B	A curated mix of Korean+English Corpora	10.8B	4k	O	>60B*	5e-5
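
As a rough illustration of the text-in, text-out interface described above, the sketch below loads the checkpoint with the Hugging Face `transformers` library. The repository name is taken from the citation URL; the `bfloat16` dtype, the `generate` helper name, and the `max_new_tokens` default are assumptions made for this example, not settings documented by the authors.

```python
MODEL_ID = "beomi/SOLAR-KOEN-10.8B"  # repository name taken from the citation URL


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Continue `prompt` as plain text with the pretrained (non-chat) model."""
    # Imported lazily so the heavy dependencies are only needed when called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # bfloat16 is an assumption to reduce memory use; adjust for your hardware.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The decoded string contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate("안녕하세요, 오늘은")` would download the weights on first use, so it needs network access and enough memory for a 10.8B-parameter model.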

Vocab Expansion

Model Name	Vocabulary Size	Description
Original Solar	32000	Sentencepiece BPE
Expanded SOLAR-KOEN-10.8B	46336	Sentencepiece BPE. Added Korean vocab and merges

Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."

SOLAR-10.7B: 26 tokens
SOLAR-KOEN-10.8B: 10 tokens

Model	Tokens
SOLAR-10.7B	`['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']`
SOLAR-KOEN-10.8B	`['▁안', '녕', '하세요', ',', '▁오늘', '은', '▁날', '씨가', '▁좋네요', '.']`
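
The `<0xEB>`, `<0x85>`, `<0x95>` style entries in the SOLAR-10.7B row appear to be byte-fallback pieces: a syllable missing from the original vocabulary is emitted as its raw UTF-8 bytes, one token per byte. The following sketch uses only the token lists quoted above and a hypothetical `detokenize` helper (not part of any library) to merge those bytes back into text and compare the two sequence lengths.

```python
import re

# Token lists copied from the table above.
solar_tokens = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
                '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날',
                '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']
koen_tokens = ['▁안', '녕', '하세요', ',', '▁오늘', '은', '▁날', '씨가', '▁좋네요', '.']

BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def detokenize(tokens):
    """Rebuild text from SentencePiece-style pieces, merging byte-fallback tokens."""
    buffer = bytearray()
    for token in tokens:
        match = BYTE_TOKEN.match(token)
        if match:
            # Byte-fallback piece: append the single raw byte it stands for.
            buffer.append(int(match.group(1), 16))
        else:
            # Regular piece: '▁' marks a word boundary (a space).
            buffer.extend(token.replace("▁", " ").encode("utf-8"))
    return buffer.decode("utf-8").lstrip(" ")


# Both sequences decode to the same sentence.
assert detokenize(solar_tokens) == detokenize(koen_tokens)

# Ratio of sequence lengths for this one sentence (26 tokens vs 10 tokens).
ratio = len(solar_tokens) / len(koen_tokens)
print(detokenize(koen_tokens), len(solar_tokens), len(koen_tokens), ratio)
```

For this sentence, the three syllables 녕, 늘, and 씨 each cost three byte tokens in the original vocabulary, which accounts for part of the gap between the two counts.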

Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"

SOLAR-10.7B: 22 tokens
SOLAR-KOEN-10.8B: 22 tokens

Model	Tokens
SOLAR-10.7B	`['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']`
SOLAR-KOEN-10.8B	`['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']`
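
To check comparisons like the two above on other sentences, a small helper along the following lines may be useful. It is a sketch, not the authors' script: `load_tokenize` and `token_counts` are hypothetical names, the base model ID comes from the introduction, and the SOLAR-KOEN ID comes from the citation URL. It assumes the Hugging Face `transformers` library and network access for the real tokenizers.

```python
from typing import Callable, Dict, List

BASE_ID = "upstage/SOLAR-10.7B-v1.0"  # base model named in the introduction
KOEN_ID = "beomi/SOLAR-KOEN-10.8B"    # repository name from the citation URL

Tokenize = Callable[[str], List[str]]


def load_tokenize(model_id: str) -> Tokenize:
    """Return the `tokenize` method of a Hub tokenizer (downloads files on first use)."""
    from transformers import AutoTokenizer  # lazy import: only needed when called

    return AutoTokenizer.from_pretrained(model_id).tokenize


def token_counts(text: str, tokenizers: Dict[str, Tokenize]) -> Dict[str, int]:
    """Count how many pieces each tokenizer produces for the same text."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}
```

Calling `token_counts(text, {"SOLAR-10.7B": load_tokenize(BASE_ID), "SOLAR-KOEN-10.8B": load_tokenize(KOEN_ID)})` on the two example sentences would be expected to reproduce the counts listed above, though exact results can depend on the tokenizer files and library version.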

LICENSE

CC-BY-NC-SA-4.0

Model Benchmark

LM Eval Harness - Korean (polyglot branch)

Evaluated with the polyglot branch of EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot
All scores are 5-shot.

Task	Version	Metric	Value		Stderr
klue_mrc	0	exact	50.2140
		f1	54.0330
		HasAns_exact	73.1786
		HasAns_f1	78.7442
		best_exact	56.9594
		best_f1	60.3743
korquad	1	exact_match	81.0530
		f1	87.6418
klue_nli	0	acc	0.4540	±	0.0091
klue_sts	0	acc	0.3410	±	0.0208
		f1	0.4896	±	0.0237
klue_ynat	0	acc	0.6308	±	0.0051
		macro_f1	0.6086	±	0.0057
kobest_boolq	0	acc	0.8711	±	0.0089
		macro_f1	0.8705	±	0.0090
kobest_copa	0	acc	0.8500	±	0.0113
		macro_f1	0.8498	±	0.0113
kobest_hellaswag	0	acc	0.5180	±	0.0224
		acc_norm	0.6180	±	0.0218
		macro_f1	0.5138	±	0.0224
kobest_sentineg	0	acc	0.9723	±	0.0082
		macro_f1	0.9723	±	0.0083
kobest_wic	0	acc	0.5825	±	0.0139
		macro_f1	0.4952	±	0.0140
kohatespeech_apeach	0	acc	0.7034	±	0.0074
		macro_f1	0.7033	±	0.0074
nsmc	0	acc	0.8738	±	0.0015
pawsx_ko	0	acc	0.5510	±	0.0111
kmmlu_direct	0	exact_match	0.4220	±	0.0909
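
One way to read the Stderr column is to turn it into an approximate interval around each score. The sketch below assumes a normal approximation, so that a 95% interval is roughly value ± 1.96 × stderr; the `interval` helper is a hypothetical name, and the two rows are copied from the table above purely as examples.

```python
from typing import Tuple

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def interval(value: float, stderr: float, z: float = Z_95) -> Tuple[float, float]:
    """Approximate confidence interval under a normal approximation."""
    return (value - z * stderr, value + z * stderr)


# (value, stderr) pairs copied from the table above.
scores = {
    "kobest_boolq acc": (0.8711, 0.0089),
    "kmmlu_direct exact_match": (0.4220, 0.0909),
}

for name, (value, stderr) in scores.items():
    low, high = interval(value, stderr)
    print(f"{name}: {value:.4f} [{low:.4f}, {high:.4f}]")
```

Under this approximation the kmmlu_direct interval is much wider than the kobest_boolq one, so that score is best treated as a coarse estimate.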

Citation

@misc {solar_koen_junbum_taekyoon_2024,
    author       = { {L. Junbum, Taekyoon Choi} },
    title        = { SOLAR-KOEN-10.8B },
    year         = 2024,
    url          = { https://huggingface.co/beomi/SOLAR-KOEN-10.8B },
    publisher    = { Hugging Face }
}

Acknowledgements

Training support was provided by the TPU Research Cloud program.