radi-cho committed
Commit e70c7bb
Parent: 4d1e6de

Update README.md

Files changed (1)
README.md +65 -0
README.md CHANGED
@@ -1,3 +1,68 @@
  ---
  license: apache-2.0
+ language:
+ - bg
+ datasets:
+ - chitanka
+ tags:
+ - torch
  ---
+
+ # Bulgarian language poetry generation
+
+ A model pretrained with a causal language modeling (CLM) objective, based on [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). <br/>
+ Developed by [Radostin Cholakov](https://www.linkedin.com/in/radostin-cholakov-bb4422146/) as part of the [AzBuki.ML](https://azbuki-ml.com) initiatives.
+
+ # How to use?
+
+ ```python
+ >>> from transformers import AutoModel, AutoTokenizer
+ >>>
+ >>> model_id = "radi-cho/poetry-bg"
+ >>> tokenizer = AutoTokenizer.from_pretrained(model_id)
+ >>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
+ >>>
+ >>> # The prompt wraps the title in [HED]...[NEL] and opens the poem body with [BDY].
+ >>> input_ids = tokenizer.encode(
+ ...     "[HED]Суетата на живота[NEL][BDY]",
+ ...     add_special_tokens=False,
+ ...     return_tensors='pt')
+ >>>
+ >>> output_ids = model.generate(
+ ...     input_ids,
+ ...     do_sample=True,
+ ...     max_length=250,
+ ...     top_p=0.98,
+ ...     top_k=0,
+ ...     pad_token_id=2,
+ ...     eos_token_id=50258)  # 50258 is the id of the [SEP] end-of-sequence token
+ >>>
+ >>> output = tokenizer.decode(output_ids[0])
+ >>>
+ >>> # Replace or strip the custom tokens for display (see "Custom Tokens" below).
+ >>> output = output.replace('[NEL]', '\n')
+ >>> output = output.replace('[BDY]', '\n')
+ >>> output = output.replace('[HED]', '')
+ >>> output = output.replace('[SEP]', '')
+ >>>
+ >>> print(output)
+ Суетата на живота
+
+ Да страдам ли?
+ Да страдам ли за това?
+ Не, не за това, че умирам...
+ Но само за това,
+ че миговете ми са рани.
+
+ Аз съм сам и търся утеха.
+ ```
+
+ # Custom Tokens
+ We introduced three custom tokens in the tokenizer: `[NEL]`, `[BDY]`, and `[HED]`.
+ - `[HED]` denotes where the title of the poem begins;
+ - `[BDY]` denotes where the body of the poem begins;
+ - `[NEL]` marks the end of a verse and should be decoded as a new line.
+
+ `[SEP]` (with id 50258) is the *end of sequence* token.
+
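+ A minimal sketch of how a prompt can be assembled from these tokens and how generated text can be cleaned up for display, mirroring the usage example above; the helper names `build_prompt` and `clean_output` are illustrative, not part of the model's API.
+
+ ```python
+ def build_prompt(title: str) -> str:
+     """Wrap a poem title in the special tokens the model expects."""
+     return f"[HED]{title}[NEL][BDY]"
+
+ def clean_output(text: str) -> str:
+     """Turn generated text with special tokens into plain, printable verse."""
+     text = text.replace("[NEL]", "\n")  # end of a verse -> new line
+     text = text.replace("[BDY]", "\n")  # start of the poem body -> new line
+     for token in ("[HED]", "[SEP]"):    # title marker and end-of-sequence token
+         text = text.replace(token, "")
+     return text.strip()
+
+ # Example: build_prompt("Суетата на живота") reproduces the prompt used above.
+ ```
+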
+ # Credits
+ - Inspired by [rmihaylov/gpt2-medium-bg](https://huggingface.co/rmihaylov/gpt2-medium-bg).
+ - Data: [https://chitanka.info/texts/type/poetry](https://chitanka.info/texts/type/poetry).