|
---
inference: false
language:
- bg
license: mit
datasets:
- oscar
- chitanka
- wikipedia
tags:
- torch
---
|
|
|
# GPT-2 |
|
|
|
Pretrained model on the Bulgarian language using a causal language modeling (CLM) objective. It was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).
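
Under the CLM objective, the model is trained to maximize the likelihood of each token given its left context:

$$\mathcal{L}(\theta) = \sum_{t} \log P_\theta(x_t \mid x_{<t})$$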
|
|
|
## Model description |
|
|
|
This is the **SMALL** version, obtained by compressing the base model via [progressive module replacing](https://arxiv.org/abs/2002.02925).
|
|
|
The compression was performed on Bulgarian text from [OSCAR](https://oscar-corpus.com/post/oscar-2019/), [Chitanka](https://chitanka.info/), and [Wikipedia](https://bg.wikipedia.org/).
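
For intuition, the sketch below illustrates the core idea of progressive module replacing: during compression training, each group of the full model's transformer blocks is stochastically swapped for a smaller successor block, so the successors learn to stand in for the originals in place. This is a minimal, hypothetical illustration, not this repository's training code; the names `predecessor_blocks`, `successor_blocks`, and `replace_prob` are illustrative.

```python
import torch

def theseus_forward(predecessor_blocks, successor_blocks, hidden, replace_prob):
    # Each successor block stands in for a contiguous group of predecessor
    # blocks. On every forward pass, each group is independently replaced
    # with probability `replace_prob`, which is annealed towards 1.0 so the
    # successors eventually carry the whole computation.
    group = len(predecessor_blocks) // len(successor_blocks)
    for i, successor in enumerate(successor_blocks):
        if torch.rand(()).item() < replace_prob:
            hidden = successor(hidden)  # compressed path (trained)
        else:
            for block in predecessor_blocks[i * group:(i + 1) * group]:
                hidden = block(hidden)  # original path (frozen)
    return hidden
```

After training, only the successor blocks are kept, yielding the smaller model published here.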
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for:
- text generation
- auto-complete
- spelling correction

You can also fine-tune it on a downstream task.
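
As a starting point for fine-tuning, here is a minimal, hypothetical sketch using the Hugging Face `Trainer`. It assumes the checkpoint can be loaded with a causal-LM head via `AutoModelForCausalLM` (the usage snippet below loads it with `AutoModel` and custom code), and the toy corpus is a placeholder for your own data.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "rmihaylov/gpt2-small-theseus-bg"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# GPT-2-style tokenizers often have no pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus: replace with your own Bulgarian texts.
train_texts = ["Здравей, как си?", "Това е примерно изречение за дообучение."]
encodings = tokenizer(train_texts, truncation=True, max_length=128)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# mlm=False makes the collator produce causal-LM labels from input_ids.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-bg-finetuned",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```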
|
|
|
### How to use |
|
|
|
Here is how to use this model in PyTorch: |
|
|
|
```python
>>> from transformers import AutoModel, AutoTokenizer
>>>
>>> model_id = "rmihaylov/gpt2-small-theseus-bg"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
>>>
>>> input_ids = tokenizer.encode(
...     "Здравей,",
...     add_special_tokens=False,
...     return_tensors='pt')
>>>
>>> # Nucleus (top-p) sampling; top_k=0 disables top-k filtering.
>>> output_ids = model.generate(
...     input_ids,
...     do_sample=True,
...     max_length=50,
...     top_p=0.92,
...     pad_token_id=2,
...     top_k=0)
>>>
>>> output = tokenizer.decode(output_ids[0])
>>>
>>> # Map the model's special tokens back to plain text.
>>> output = output.replace('<|endoftext|>', '\n\n\n')
>>> output = output.replace('<|unknown|>', '')
>>> output = output.replace('▁', ' ')
>>> output = output.replace('<|n|>', '\n')
>>>
>>> print(output)

Здравей, извинявай, но не мога да заспя.
Джини се обърна и забеляза колко са прегърнати.
— Почакай, Джини. Не мога да повярвам, че е възможно! Толкова искам да те видя.
— Обеща
```
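
The call above uses pure nucleus sampling: with `top_k=0`, tokens are drawn from the smallest set whose cumulative probability exceeds `top_p=0.92`. For more repeatable output you could switch to beam search instead, assuming the model's `generate` supports the standard arguments, e.g.:

```python
>>> output_ids = model.generate(
...     input_ids,
...     do_sample=False,  # deterministic decoding
...     num_beams=5,
...     max_length=50,
...     pad_token_id=2)
```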
|
|
|
### Limitations and bias |
|
|
|
As the OpenAI team themselves point out in their
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):
|
|
|
> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases
> that require the generated text to be true.
>
> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
> not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a
> study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
> and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
> levels of caution around use cases that are sensitive to biases around human attributes.