---
license: apache-2.0
tags: [gpt2]
language: ko
---
# KoGPT2-small
| Model | Batch Size | Tokenizer | Vocab Size | Max Length | Parameter Size |
|:---: | :------: | :-----: | :------: | :----: | :------: |
|GPT2 | 64 | BPE | 30,000 | 1024 | 108M |
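The figures above can be verified directly from the released checkpoint. A minimal sketch, assuming the `Datascience-Lab/GPT2-small` repository id used in the inference example below:

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the tokenizer and model, then print the specs listed in the table above.
tokenizer = AutoTokenizer.from_pretrained('Datascience-Lab/GPT2-small')
model = GPT2LMHeadModel.from_pretrained('Datascience-Lab/GPT2-small')

print(len(tokenizer))                               # BPE vocab size, expected 30,000
print(model.config.n_positions)                     # max sequence length, expected 1024
print(sum(p.numel() for p in model.parameters()))   # parameter count, expected ~108M
```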
# Dataset
- AIHub - Web data-based Korean corpus (4.8M)
- KoWiki dump 230701 (1.4M)
# Inference Example
```python
from transformers import AutoTokenizer, GPT2LMHeadModel

text = "์ถœ๊ทผ์ด ํž˜๋“ค๋ฉด"

tokenizer = AutoTokenizer.from_pretrained('Datascience-Lab/GPT2-small')
model = GPT2LMHeadModel.from_pretrained('Datascience-Lab/GPT2-small')

# Tokenize the prompt without adding special tokens
inputs = tokenizer.encode_plus(text, return_tensors='pt', add_special_tokens=False)

# Generate up to 128 tokens; repetition_penalty discourages repeated phrases.
# Note: temperature is ignored unless do_sample=True is also passed.
outputs = model.generate(inputs['input_ids'],
                         max_length=128,
                         repetition_penalty=2.0,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.eos_token_id,
                         bos_token_id=tokenizer.bos_token_id,
                         use_cache=True,
                         temperature=0.5)
outputs = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Example output: '์ถœ๊ทผ์ด ํž˜๋“ค๋ฉด ์ถœ๊ทผ์„ ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ด ์ข‹๋‹ค. ํ•˜์ง€๋งŒ ์ถœํ‡ด๊ทผ ์‹œ๊ฐ„์„ ๋Šฆ์ถ”๋Š” ๊ฒƒ์€ ์˜คํžˆ๋ ค ๊ฑด๊ฐ•์— ์ข‹์ง€ ์•Š๋‹ค.. ํŠนํžˆ๋‚˜ ์žฅ์‹œ๊ฐ„์˜ ์—…๋ฌด๋กœ ์ธํ•ด ํ”ผ๋กœ๊ฐ€ ์Œ“์ด๊ณ  ๋ฉด์—ญ๋ ฅ์ด ๋–จ์–ด์ง€๋ฉด, ํ”ผ๋กœ๊ฐ์ด ์‹ฌํ•ด์ ธ์„œ ์ž ๋“ค๊ธฐ ์–ด๋ ค์šด ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค. ์ด๋Ÿฐ ๊ฒฝ์šฐ๋ผ๋ฉด ํ‰์†Œ๋ณด๋‹ค ๋” ๋งŽ์€ ์–‘์œผ๋กœ ๊ณผ์‹์„ ํ•˜๊ฑฐ๋‚˜ ๋ฌด๋ฆฌํ•œ ๋‹ค์ด์–ดํŠธ๋ฅผ ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์‹๋‹จ ์กฐ์ ˆ๊ณผ ํ•จ๊ป˜ ์˜์–‘ ๋ณด์ถฉ์— ์‹ ๊ฒฝ ์จ์•ผ ํ•œ๋‹ค. ๋˜ํ•œ ๊ณผ๋„ํ•œ ์Œ์‹์ด ์ฒด์ค‘ ๊ฐ๋Ÿ‰์— ๋„์›€์„ ์ฃผ๋ฏ€๋กœ ์ ์ ˆํ•œ ์šด๋™๋Ÿ‰์„ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ๋„ ์ค‘์š”ํ•˜๋‹ค.'
```
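For quick experiments, the model can also be driven through the `text-generation` pipeline. A minimal sketch, assuming the same repository id and generation settings as above (not part of the original example):

```python
from transformers import pipeline

# Build a text-generation pipeline around the same checkpoint.
generator = pipeline('text-generation', model='Datascience-Lab/GPT2-small')

# Generation kwargs are forwarded to model.generate().
result = generator("์ถœ๊ทผ์ด ํž˜๋“ค๋ฉด", max_length=128, repetition_penalty=2.0)
print(result[0]['generated_text'])
```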