catherinearnett committed
Commit 7f15411
1 Parent(s): 6045476

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +32 -33

README.md CHANGED
@@ -12,58 +12,57 @@ library_name: transformers

# B-GPT_en_nl_simultaneous

- The B-GPT Models are bilingual GPT-2 style models. For the first half of training, this model was trained only on English data. In the second half of training, the model was trained on a 50%-50% mix of English and Dutch data. At the end of training, 75% of the training data seen by the model is English and 25% is Dutch. The tokenizer was trained on the same proportions of English and Dutch data.
+ This is a bilingual GPT-2 style model. For the first half of training, it was trained only on English data; in the second half, it was trained on a 50%-50% mix of English and Dutch data, so by the end of training 75% of the data the model has seen is English and 25% is Dutch. The tokenizer was trained on the same proportions of English and Dutch data.

## Model details:

All models are trained with a [CLS] (same as [BOS]) token prepended and a [SEP] (same as [EOS]) token separating sequences.
For best results, make sure [CLS] is prepended to your input sequence (see the sample usage below).
Details for this model specifically:

* Architecture: gpt2
* Parameters: 124770816
* Maximum sequence length: 512 tokens
* Training text data (raw): [XXXX]
* Training tokens: 12B
* Vocabulary size: 50000
* Compute cost: ~9 NVIDIA A6000 GPU hours
* CO2 Emission: 1.17 kg
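
A few of these numbers can be sanity-checked against the published checkpoint itself. The sketch below is an illustration, not part of the original card; it assumes the repository id resolves as written (prefix it with its owner namespace if needed) and uses the standard `AutoConfig`/`AutoModel` accessors:

```
from transformers import AutoConfig, AutoModel

# Repo id as given in this card; prepend the owner namespace if required.
repo_id = "B-GPT_en_nl_simultaneous"

config = AutoConfig.from_pretrained(repo_id)
print(config.model_type)    # expected: gpt2
print(config.n_positions)   # maximum sequence length, expected: 512
print(config.vocab_size)    # vocabulary size, expected: 50000

model = AutoModel.from_pretrained(repo_id)
print(model.num_parameters())  # total parameter count
```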

Training datasets (percentages prior to deduplication):
* 100.00000%: [OSCAR 2021/09](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109)

Checkpoints are taken at training steps: 0, 10000, 20000, 30000, 40000, 50000, 64000, 64010, 64020, 64030, 64040, 64050, 64060, 64070, 64080, 64090, 64100, 64110, 64120, 64130, 64140, 64150, 64160, 64170, 64180, 64190, 64200, 64300, 64400, 64500, 64600, 64700, 64800, 64900, 65000, 66000, 67000, 68000, 69000, 70000, 80000, 90000, 100000, 110000, 120000, and 128000.
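
Intermediate checkpoints should be loadable by passing a `revision` to `from_pretrained`. A minimal sketch, assuming each checkpoint is published as a branch named after its training step (check the repository's branch list for the exact revision names):

```
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical revision name: a branch named after one of the training steps above.
# Verify the actual branch names on the model repository before relying on this.
step = "64000"

tokenizer = AutoTokenizer.from_pretrained("B-GPT_en_nl_simultaneous", revision=step)
model = AutoModelForCausalLM.from_pretrained("B-GPT_en_nl_simultaneous", revision=step)
```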

## Use This Model

Load the model:

```
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("B-GPT_en_nl_simultaneous")
model = AutoModel.from_pretrained("B-GPT_en_nl_simultaneous")
```
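
As noted under Model details, inputs should start with the [CLS] token. Below is a minimal generation sketch under that assumption; it loads the checkpoint with `AutoModelForCausalLM` (an assumption, not taken from the original card) so the language-modeling head needed by `generate` is available:

```
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("B-GPT_en_nl_simultaneous")
model = AutoModelForCausalLM.from_pretrained("B-GPT_en_nl_simultaneous")

# Prepend [CLS] to the prompt, assuming the tokenizer exposes it as cls_token.
prompt = tokenizer.cls_token + "I am a"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```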

Text Generation:

```
from transformers import pipeline

pipe = pipeline("text-generation", model="B-GPT_en_nl_simultaneous")

pipe("I am a")
```
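
If generations look off, it may help to prepend the [CLS] token to the pipeline prompt as well, in line with the note under Model details, e.g. `pipe(pipe.tokenizer.cls_token + "I am a")` (assuming the tokenizer exposes [CLS] as its `cls_token`).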

## Citation

If you use this model, please cite:

```

```