radi-cho committed
Commit e70c7bb
Parent: 4d1e6de

Update README.md

Files changed (1)
README.md +65 -0
README.md CHANGED
@@ -1,3 +1,68 @@
  ---
  license: apache-2.0
+ language:
+ - bg
+ datasets:
+ - chitanka
+ tags:
+ - torch
  ---
+
+ # Bulgarian language poetry generation
+
+ A model pretrained with a causal language modeling (CLM) objective, based on [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). <br/>
+ Developed by [Radostin Cholakov](https://www.linkedin.com/in/radostin-cholakov-bb4422146/) as part of the [AzBuki.ML](https://azbuki-ml.com) initiatives.
+
+ # How to use?
+
+ ```python
+ >>> from transformers import AutoModel, AutoTokenizer
+ >>>
+ >>> model_id = "radi-cho/poetry-bg"
+ >>> tokenizer = AutoTokenizer.from_pretrained(model_id)
+ >>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
+ >>>
+ >>> # The prompt wraps the title in [HED]...[NEL] and opens the poem body with [BDY].
+ >>> input_ids = tokenizer.encode(
+ ...     "[HED]Суетата на живота[NEL][BDY]",
+ ...     add_special_tokens=False,
+ ...     return_tensors='pt')
+ >>>
+ >>> output_ids = model.generate(
+ ...     input_ids,
+ ...     do_sample=True,
+ ...     max_length=250,
+ ...     top_p=0.98,
+ ...     top_k=0,
+ ...     pad_token_id=2,
+ ...     eos_token_id=50258)  # 50258 is the id of the [SEP] end-of-sequence token
+ >>>
+ >>> output = tokenizer.decode(output_ids[0])
+ >>>
+ >>> # Replace or strip the custom tokens for display (see "Custom Tokens" below).
+ >>> output = output.replace('[NEL]', '\n')
+ >>> output = output.replace('[BDY]', '\n')
+ >>> output = output.replace('[HED]', '')
+ >>> output = output.replace('[SEP]', '')
+ >>>
+ >>> print(output)
+ Суетата на живота
+
+ Да страдам ли?
+ Да страдам ли за това?
+ Не, не за това, че умирам...
+ Но само за това,
+ че миговете ми са рани.
+
+ Аз съм сам и търся утеха.
+ ```
+
+ # Custom Tokens
+ We introduced three custom tokens in the tokenizer: `[NEL]`, `[BDY]`, and `[HED]`.
+ - `[HED]` denotes where the title of the poem begins;
+ - `[BDY]` denotes where the body of the poem begins;
+ - `[NEL]` marks the end of a verse and should be decoded as a new line.
+
+ `[SEP]` (with id 50258) is the *end of sequence* token.
+
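+ A minimal sketch of how a prompt can be assembled from these tokens and how generated text can be cleaned up for display, mirroring the usage example above; the helper names `build_prompt` and `clean_output` are illustrative, not part of the model's API.
+
+ ```python
+ def build_prompt(title: str) -> str:
+     """Wrap a poem title in the special tokens the model expects."""
+     return f"[HED]{title}[NEL][BDY]"
+
+ def clean_output(text: str) -> str:
+     """Turn generated text with special tokens into plain, printable verse."""
+     text = text.replace("[NEL]", "\n")  # end of a verse -> new line
+     text = text.replace("[BDY]", "\n")  # start of the poem body -> new line
+     for token in ("[HED]", "[SEP]"):    # title marker and end-of-sequence token
+         text = text.replace(token, "")
+     return text.strip()
+
+ # Example: build_prompt("Суетата на живота") reproduces the prompt used above.
+ ```
+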
+ # Credits
+ - Inspired by [rmihaylov/gpt2-medium-bg](https://huggingface.co/rmihaylov/gpt2-medium-bg).
+ - Data: [https://chitanka.info/texts/type/poetry](https://chitanka.info/texts/type/poetry).