balazik committed on
Commit
cf339ae
1 Parent(s): 3e7ac6e

Initial commit

Files changed (8)
  1. README.md +111 -0
  2. config.json +23 -0
  3. merges.txt +0 -0
  4. pytorch_model.bin +3 -0
  5. special_tokens_map.json +1 -0
  6. tf_model.h5 +3 -0
  7. tokenizer_config.json +1 -0
  8. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,111 @@
---
language: sk
tags:
- SlovakBERT
license: mit
datasets:
- wikipedia
- opensubtitles
- oscar
- gerulatawebcrawl
- gerulatamonitoring
- blbec.online
---

# SlovakBERT (base-sized model)
SlovakBERT is a model pretrained on the Slovak language with a masked language modeling (MLM) objective. The model is case-sensitive: it makes a difference between slovensko and Slovensko.

## Intended uses & limitations
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.
**IMPORTANT**: The model was not trained on the “ and ” (curly quote) characters, so before tokenizing a text it is advised to replace every “ and ” with a plain " (straight double quote).

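For example, one minimal way to normalize the quotes before tokenization (this snippet is an illustrative sketch, not part of the original model card):

```python
# Replace curly quotes with straight double quotes before tokenizing.
text = "Povedal: “Dobrý deň.”"
text = text.replace("“", '"').replace("”", '"')
print(text)  # Povedal: "Dobrý deň."
```
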
### How to use
You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
unmasker("Deti sa <mask> na ihrisku.")

[{'sequence': 'Deti sa hrali na ihrisku.',
'score': 0.6355380415916443,
'token': 5949,
'token_str': ' hrali'},
{'sequence': 'Deti sa hrajú na ihrisku.',
'score': 0.14731724560260773,
'token': 9081,
'token_str': ' hrajú'},
{'sequence': 'Deti sa zahrali na ihrisku.',
'score': 0.05016357824206352,
'token': 32553,
'token_str': ' zahrali'},
{'sequence': 'Deti sa stretli na ihrisku.',
'score': 0.041727423667907715,
'token': 5964,
'token_str': ' stretli'},
{'sequence': 'Deti sa učia na ihrisku.',
'score': 0.01886524073779583,
'token': 18099,
'token_str': ' učia'}]
```

Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = RobertaModel.from_pretrained('gerulata/slovakbert')
text = "Text ktorý sa má embedovať."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = TFRobertaModel.from_pretrained('gerulata/slovakbert')
text = "Text ktorý sa má embedovať."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
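
The `output` objects above hold per-token hidden states; the card does not prescribe how to pool them into a single text embedding. Below is a minimal, self-contained sketch of one common choice (attention-mask-weighted mean pooling in PyTorch); the pooling step is our illustration, not something the model card specifies:

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = RobertaModel.from_pretrained('gerulata/slovakbert')
encoded_input = tokenizer("Text ktorý sa má embedovať.", return_tensors='pt')

with torch.no_grad():
    output = model(**encoded_input)

# last_hidden_state has shape (batch, seq_len, 768) for this base-sized model.
# Mean-pool over real tokens only, weighting by the attention mask.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```
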
Or extract information from the model like this:
```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
unmasker("Slovenské národne povstanie sa uskutočnilo v roku <mask>.")

[{'sequence': 'Slovenske narodne povstanie sa uskutočnilo v roku 1944.',
'score': 0.7383289933204651,
'token': 16621,
'token_str': ' 1944'},...]
```

# Training data
The SlovakBERT model was pretrained on these datasets:

- Wikipedia (326 MB of text),
- OpenSubtitles (415 MB of text),
- OSCAR (4.6 GB of text),
- Gerulata WebCrawl (12.7 GB of text),
- Gerulata Monitoring (214 MB of text),
- blbec.online (4.5 GB of text).

The text was then processed with the following steps (a rough sketch of this cleaning follows the list):
- URL and email addresses were replaced with special tokens ("url", "email").
- Elongated punctuation was reduced (e.g. -- to -).
- Markdown syntax was deleted.
- All text content in braces, e.g. {...}, was eliminated to reduce the amount of markup and programming-language text.

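The exact cleaning rules are not published in this card, so the following is only an illustrative sketch, assuming simple regular-expression rules that approximate the steps listed above (the function name and patterns are our own):

```python
import re

def clean_text(text):
    # Replace email addresses and URLs with the special tokens mentioned above.
    text = re.sub(r'[\w.+-]+@[\w-]+\.[\w.-]+', 'email', text)
    text = re.sub(r'https?://\S+|www\.\S+', 'url', text)
    # Reduce elongated punctuation (e.g. "--" to "-", "!!!" to "!").
    text = re.sub(r'([.,!?;:-])\1+', r'\1', text)
    # Drop content enclosed in braces to limit markup and code fragments.
    text = re.sub(r'\{[^{}]*\}', '', text)
    # Collapse any leftover double spaces.
    return re.sub(r' {2,}', ' ', text).strip()

print(clean_text("Pozri www.example.sk {color:red} -- napíš na info@example.sk!!!"))
# Pozri url - napíš na email!
```
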
We segmented the resulting corpus into sentences and removed duplicates to get 181.6M unique sentences. In total, the final corpus has 19.35 GB of text.

# Pretraining
The model was trained in **fairseq** on 4 x Nvidia A100 GPUs for 300K steps with a batch size of 512 and a sequence length of 512. The optimizer used is Adam with a learning rate of 5e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), a weight decay of 0.01, a dropout rate of 0.1, learning rate warmup for 10k steps and linear decay of the learning rate afterwards. We used 16-bit float precision.

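Concretely, this schedule corresponds to the following (assuming, as is typical for linear decay, that the learning rate reaches zero at the final update; the card does not state the end value): \\(\eta(t) = 5\cdot10^{-4}\cdot\tfrac{t}{10^{4}}\\) for \\(t \le 10^{4}\\), and \\(\eta(t) = 5\cdot10^{-4}\cdot\tfrac{3\cdot10^{5} - t}{2.9\cdot10^{5}}\\) for \\(10^{4} < t \le 3\cdot10^{5}\\).
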
## About us
<a href="https://www.gerulata.com/">
<img width="300px" src="https://www.gerulata.com/images/gerulata-logo-blue.png">
</a>

Gerulata uses near real-time monitoring, advanced analytics and machine learning to help create a safer, more productive and enjoyable online environment for everyone.

### BibTeX entry and citation info
- to be completed
config.json ADDED
@@ -0,0 +1,23 @@
{
  "_name_or_path": "gerulata/slovakbert",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50264
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:71bf910b56cca82b2b9bf79b4ed7212cfba711fb3b90cfb79181e97f495ab130
size 499040675
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8c5a18b0c0c0e42251e20f3d5ccfd7ccd87752ee560d326ff0faa31eb4546474
size 657427592
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"errors": "replace", "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": false, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "special_tokens_map_file": null, "tokenizer_file": null, "model_max_length": 512, "name_or_path": "sk-roberta-base-300k-voc50264-20gb"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff