barbaroo committed on
Commit
7ff0c39
1 Parent(s): 5fd4c52

Upload 8 files

README.md ADDED
---
base_model: AI-Sweden-Models/gpt-sw3-6.7b-v2
library_name: peft
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
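No official snippet is filled in above. As a placeholder, here is a minimal sketch of loading this LoRA adapter with `peft` on top of the base model named in the config; the adapter repo id and the Swedish prompt are assumptions, not taken from this commit:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "AI-Sweden-Models/gpt-sw3-6.7b-v2"    # from adapter_config.json
adapter_id = "your-username/your-adapter-repo"  # hypothetical: this repo's Hub id

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(model, adapter_id)  # applies the LoRA weights

prompt = "Träd är fina för att"  # example Swedish prompt (assumed)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

Note that the base model is several gigabytes; pass `device_map` / a reduced dtype to `from_pretrained` as needed for your hardware.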
## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]

### Framework versions

- PEFT 0.11.1
adapter_config.json ADDED
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "AI-Sweden-Models/gpt-sw3-6.7b-v2",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 4,
  "lora_dropout": 0.1,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 4,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "c_attn",
    "c_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
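With `r = 4` on `c_attn` and `c_proj`, the adapter's size can be sanity-checked with simple arithmetic. The sketch below assumes the GPT-3-style 6.7B shape (32 transformer blocks, hidden size 4096), so `c_attn` maps 4096→12288 and `c_proj` appears twice per block: in attention (4096→4096) and in the MLP (16384→4096). These dimensions are an assumption, not stated in this commit; the resulting byte count does, however, line up with the ~23 MB `adapter_model.safetensors` pointer below.

```python
# LoRA adds r * (in_features + out_features) parameters per wrapped linear layer.
R = 4        # "r" in adapter_config.json
H = 4096     # hidden size of gpt-sw3-6.7b (assumed, GPT-3 6.7B shape)
LAYERS = 32  # number of transformer blocks (assumed)

per_layer = (
    R * (H + 3 * H)    # c_attn: fused QKV projection, H -> 3H
    + R * (H + H)      # attention c_proj: H -> H
    + R * (4 * H + H)  # MLP c_proj: 4H -> H (c_fc is not targeted)
)
total = per_layer * LAYERS
print(total, total * 4)  # trainable params, and bytes at fp32
# ~5.77M params, ~23.07 MB -- close to the 23,093,424-byte file
# (the small difference is the safetensors header).
```

Note that PEFT matches `target_modules` by module-name suffix, so the single `"c_proj"` entry wraps both the attention and MLP projections in GPT-2-style architectures.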
adapter_model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:9cd696e217478d0fd2598209d39fb1a98794e14440ff92a6e3eb44f0cdc3e1c9
size 23093424
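The binary files in this commit are stored as Git LFS pointer files like the one above (a `version` line, the payload's SHA-256 `oid`, and its `size` in bytes), not as the payloads themselves. A small stdlib-only sketch of parsing such a pointer, plus the check one could run after `git lfs pull` (the on-disk path in the comment is an assumption):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer's 'key value' lines into a dict."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9cd696e217478d0fd2598209d39fb1a98794e14440ff92a6e3eb44f0cdc3e1c9
size 23093424
"""
info = parse_lfs_pointer(pointer)
algo, digest = info["oid"].split(":", 1)

# After `git lfs pull`, verify the downloaded file against the pointer:
# import hashlib
# data = open("adapter_model.safetensors", "rb").read()
# assert len(data) == int(info["size"])
# assert hashlib.new(algo, data).hexdigest() == digest
```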
optimizer.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:4ac3e086bd43e4d2c77c39b96a5efc4ba30f382377f9c12bbb6e02c6a8ca8b59
size 46298682
rng_state.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5d46c56b2ff5f1d7e4350bd5a78a3c38071bcb0e540a8783b3d5dcf4123df2f0
size 14244
scheduler.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:ab30705be11a7f0d47ae24808b51c36fd3d9958a81ef53b71ea1841770f6e963
size 1064
trainer_state.json ADDED
{
  "best_metric": 2.909609317779541,
  "best_model_checkpoint": "outputs-6_7/checkpoint-48000",
  "epoch": 2.061041169297357,
  "eval_steps": 4000,
  "global_step": 48000,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {"epoch": 0.021469178846847466, "grad_norm": 0.39146578311920166, "learning_rate": 0.00029989693820586593, "loss": 2.4186, "step": 500},
    {"epoch": 0.04293835769369493, "grad_norm": 0.5122425556182861, "learning_rate": 0.0002997681109631983, "loss": 2.3709, "step": 1000},
    {"epoch": 0.0644075365405424, "grad_norm": 0.4298762083053589, "learning_rate": 0.0002996392837205307, "loss": 2.3735, "step": 1500},
    {"epoch": 0.08587671538738986, "grad_norm": 0.39066824316978455, "learning_rate": 0.00029951045647786317, "loss": 2.3524, "step": 2000},
    {"epoch": 0.10734589423423734, "grad_norm": 0.39771586656570435, "learning_rate": 0.00029938162923519556, "loss": 2.347, "step": 2500},
    {"epoch": 0.1288150730810848, "grad_norm": 0.47568196058273315, "learning_rate": 0.000299252801992528, "loss": 2.3336, "step": 3000},
    {"epoch": 0.15028425192793227, "grad_norm": 0.44162309169769287, "learning_rate": 0.0002991239747498604, "loss": 2.334, "step": 3500},
    {"epoch": 0.17175343077477973, "grad_norm": 0.4031461179256439, "learning_rate": 0.0002989951475071928, "loss": 2.3266, "step": 4000},
    {"epoch": 0.17175343077477973, "eval_loss": 3.0375378131866455, "eval_runtime": 174.8485, "eval_samples_per_second": 14.298, "eval_steps_per_second": 3.575, "step": 4000},
    {"epoch": 0.19322260962162718, "grad_norm": 0.384034126996994, "learning_rate": 0.00029886632026452525, "loss": 2.3344, "step": 4500},
    {"epoch": 0.21469178846847467, "grad_norm": 0.44177886843681335, "learning_rate": 0.00029873749302185764, "loss": 2.3157, "step": 5000},
    {"epoch": 0.23616096731532213, "grad_norm": 0.4425281286239624, "learning_rate": 0.0002986086657791901, "loss": 2.3232, "step": 5500},
    {"epoch": 0.2576301461621696, "grad_norm": 0.4302816390991211, "learning_rate": 0.0002984798385365225, "loss": 2.3141, "step": 6000},
    {"epoch": 0.2790993250090171, "grad_norm": 0.5806054472923279, "learning_rate": 0.00029835101129385493, "loss": 2.2975, "step": 6500},
    {"epoch": 0.30056850385586453, "grad_norm": 0.5654121041297913, "learning_rate": 0.00029822218405118733, "loss": 2.3141, "step": 7000},
    {"epoch": 0.322037682702712, "grad_norm": 0.5454065203666687, "learning_rate": 0.0002980933568085197, "loss": 2.3097, "step": 7500},
    {"epoch": 0.34350686154955945, "grad_norm": 0.43060022592544556, "learning_rate": 0.00029796452956585217, "loss": 2.308, "step": 8000},
    {"epoch": 0.34350686154955945, "eval_loss": 3.0060064792633057, "eval_runtime": 171.3325, "eval_samples_per_second": 14.592, "eval_steps_per_second": 3.648, "step": 8000},
    {"epoch": 0.3649760403964069, "grad_norm": 0.490461140871048, "learning_rate": 0.00029783570232318456, "loss": 2.2985, "step": 8500},
    {"epoch": 0.38644521924325437, "grad_norm": 0.5096587538719177, "learning_rate": 0.000297706875080517, "loss": 2.3041, "step": 9000},
    {"epoch": 0.4079143980901018, "grad_norm": 0.4906415343284607, "learning_rate": 0.0002975780478378494, "loss": 2.2903, "step": 9500},
    {"epoch": 0.42938357693694934, "grad_norm": 0.5885447263717651, "learning_rate": 0.00029744922059518186, "loss": 2.3069, "step": 10000},
    {"epoch": 0.4508527557837968, "grad_norm": 0.5200388431549072, "learning_rate": 0.00029732039335251425, "loss": 2.3025, "step": 10500},
    {"epoch": 0.47232193463064426, "grad_norm": 0.6331049799919128, "learning_rate": 0.00029719156610984664, "loss": 2.2957, "step": 11000},
    {"epoch": 0.4937911134774917, "grad_norm": 0.5442560315132141, "learning_rate": 0.0002970627388671791, "loss": 2.2878, "step": 11500},
    {"epoch": 0.5152602923243392, "grad_norm": 0.5305426120758057, "learning_rate": 0.0002969339116245115, "loss": 2.2903, "step": 12000},
    {"epoch": 0.5152602923243392, "eval_loss": 2.973823070526123, "eval_runtime": 170.1319, "eval_samples_per_second": 14.694, "eval_steps_per_second": 3.674, "step": 12000},
    {"epoch": 0.5367294711711866, "grad_norm": 0.5756106972694397, "learning_rate": 0.0002968050843818439, "loss": 2.2883, "step": 12500},
    {"epoch": 0.5581986500180341, "grad_norm": 0.5812390446662903, "learning_rate": 0.00029667625713917633, "loss": 2.2807, "step": 13000},
    {"epoch": 0.5796678288648816, "grad_norm": 0.4355560541152954, "learning_rate": 0.0002965474298965088, "loss": 2.2885, "step": 13500},
    {"epoch": 0.6011370077117291, "grad_norm": 0.41715824604034424, "learning_rate": 0.00029641860265384117, "loss": 2.2834, "step": 14000},
    {"epoch": 0.6226061865585765, "grad_norm": 0.4623817801475525, "learning_rate": 0.00029628977541117357, "loss": 2.2748, "step": 14500},
    {"epoch": 0.644075365405424, "grad_norm": 0.5191289186477661, "learning_rate": 0.000296160948168506, "loss": 2.2811, "step": 15000},
    {"epoch": 0.6655445442522714, "grad_norm": 0.6877865791320801, "learning_rate": 0.0002960321209258384, "loss": 2.2783, "step": 15500},
    {"epoch": 0.6870137230991189, "grad_norm": 0.49987566471099854, "learning_rate": 0.0002959032936831708, "loss": 2.2719, "step": 16000},
    {"epoch": 0.6870137230991189, "eval_loss": 2.964353561401367, "eval_runtime": 171.8492, "eval_samples_per_second": 14.548, "eval_steps_per_second": 3.637, "step": 16000},
    {"epoch": 0.7084829019459664, "grad_norm": 0.5470739006996155, "learning_rate": 0.00029577446644050325, "loss": 2.2832, "step": 16500},
    {"epoch": 0.7299520807928138, "grad_norm": 0.6002724766731262, "learning_rate": 0.0002956456391978357, "loss": 2.2838, "step": 17000},
    {"epoch": 0.7514212596396613, "grad_norm": 0.6674920320510864, "learning_rate": 0.0002955168119551681, "loss": 2.2686, "step": 17500},
    {"epoch": 0.7728904384865087, "grad_norm": 0.5728652477264404, "learning_rate": 0.0002953879847125005, "loss": 2.2725, "step": 18000},
    {"epoch": 0.7943596173333562, "grad_norm": 0.5590266585350037, "learning_rate": 0.00029525915746983294, "loss": 2.2794, "step": 18500},
    {"epoch": 0.8158287961802037, "grad_norm": 0.7446316480636597, "learning_rate": 0.00029513033022716533, "loss": 2.2676, "step": 19000},
    {"epoch": 0.8372979750270512, "grad_norm": 0.4322523772716522, "learning_rate": 0.0002950015029844977, "loss": 2.2832, "step": 19500},
    {"epoch": 0.8587671538738987, "grad_norm": 0.6566835045814514, "learning_rate": 0.00029487267574183017, "loss": 2.2636, "step": 20000},
    {"epoch": 0.8587671538738987, "eval_loss": 2.954716920852661, "eval_runtime": 171.5581, "eval_samples_per_second": 14.572, "eval_steps_per_second": 3.643, "step": 20000},
    {"epoch": 0.8802363327207461, "grad_norm": 0.5313192009925842, "learning_rate": 0.0002947438484991626, "loss": 2.2819, "step": 20500},
    {"epoch": 0.9017055115675936, "grad_norm": 0.689608633518219, "learning_rate": 0.000294615021256495, "loss": 2.2728, "step": 21000},
    {"epoch": 0.923174690414441, "grad_norm": 0.7024255394935608, "learning_rate": 0.0002944861940138274, "loss": 2.2746, "step": 21500},
    {"epoch": 0.9446438692612885, "grad_norm": 0.6012333035469055, "learning_rate": 0.00029435736677115986, "loss": 2.2658, "step": 22000},
    {"epoch": 0.9661130481081359, "grad_norm": 0.6304742693901062, "learning_rate": 0.0002942285395284923, "loss": 2.2718, "step": 22500},
    {"epoch": 0.9875822269549834, "grad_norm": 0.541362464427948, "learning_rate": 0.0002940997122858247, "loss": 2.272, "step": 23000},
    {"epoch": 1.009051405801831, "grad_norm": 0.5888085961341858, "learning_rate": 0.0002939708850431571, "loss": 2.2485, "step": 23500},
    {"epoch": 1.0305205846486785, "grad_norm": 0.5453173518180847, "learning_rate": 0.00029384205780048954, "loss": 2.2395, "step": 24000},
    {"epoch": 1.0305205846486785, "eval_loss": 2.940072774887085, "eval_runtime": 171.7472, "eval_samples_per_second": 14.556, "eval_steps_per_second": 3.639, "step": 24000},
    {"epoch": 1.0519897634955258, "grad_norm": 0.7155711054801941, "learning_rate": 0.00029371323055782194, "loss": 2.2476, "step": 24500},
    {"epoch": 1.0734589423423733, "grad_norm": 0.7307182550430298, "learning_rate": 0.00029358440331515433, "loss": 2.2479, "step": 25000},
    {"epoch": 1.0949281211892208, "grad_norm": 0.6849473714828491, "learning_rate": 0.0002934555760724868, "loss": 2.2407, "step": 25500},
    {"epoch": 1.1163973000360683, "grad_norm": 0.7161998152732849, "learning_rate": 0.00029332674882981923, "loss": 2.247, "step": 26000},
    {"epoch": 1.1378664788829158, "grad_norm": 0.723235011100769, "learning_rate": 0.0002931979215871516, "loss": 2.2382, "step": 26500},
    {"epoch": 1.159335657729763, "grad_norm": 0.4874274432659149, "learning_rate": 0.000293069094344484, "loss": 2.2483, "step": 27000},
    {"epoch": 1.1808048365766106, "grad_norm": 0.5381557941436768, "learning_rate": 0.00029294026710181646, "loss": 2.2423, "step": 27500},
    {"epoch": 1.2022740154234581, "grad_norm": 0.7897226214408875, "learning_rate": 0.00029281143985914886, "loss": 2.2538, "step": 28000},
    {"epoch": 1.2022740154234581, "eval_loss": 2.926734447479248, "eval_runtime": 172.6333, "eval_samples_per_second": 14.482, "eval_steps_per_second": 3.62, "step": 28000},
    {"epoch": 1.2237431942703056, "grad_norm": 0.5494747161865234, "learning_rate": 0.00029268261261648125, "loss": 2.2441, "step": 28500},
    {"epoch": 1.245212373117153, "grad_norm": 0.5955171585083008, "learning_rate": 0.0002925537853738137, "loss": 2.245, "step": 29000},
    {"epoch": 1.2666815519640005, "grad_norm": 0.7213128805160522, "learning_rate": 0.0002924249581311461, "loss": 2.2488, "step": 29500},
    {"epoch": 1.288150730810848, "grad_norm": 0.7488630414009094, "learning_rate": 0.00029229613088847854, "loss": 2.2412, "step": 30000},
    {"epoch": 1.3096199096576955, "grad_norm": 0.5948154330253601, "learning_rate": 0.00029216730364581094, "loss": 2.2378, "step": 30500},
    {"epoch": 1.3310890885045428, "grad_norm": 0.7915855050086975, "learning_rate": 0.0002920384764031434, "loss": 2.2464, "step": 31000},
    {"epoch": 1.3525582673513903, "grad_norm": 0.6043704152107239, "learning_rate": 0.0002919096491604758, "loss": 2.2421, "step": 31500},
    {"epoch": 1.3740274461982378, "grad_norm": 0.5474274158477783, "learning_rate": 0.0002917808219178082, "loss": 2.2507, "step": 32000},
    {"epoch": 1.3740274461982378, "eval_loss": 2.9269840717315674, "eval_runtime": 174.9446, "eval_samples_per_second": 14.29, "eval_steps_per_second": 3.573, "step": 32000},
    {"epoch": 1.3954966250450853, "grad_norm": 0.5420586466789246, "learning_rate": 0.0002916519946751406, "loss": 2.2405, "step": 32500},
    {"epoch": 1.4169658038919328, "grad_norm": 0.4751032888889313, "learning_rate": 0.000291523167432473, "loss": 2.2497, "step": 33000},
    {"epoch": 1.4384349827387801, "grad_norm": 0.5793635249137878, "learning_rate": 0.00029139434018980547, "loss": 2.2448, "step": 33500},
    {"epoch": 1.4599041615856276, "grad_norm": 0.6635434031486511, "learning_rate": 0.00029126551294713786, "loss": 2.2434, "step": 34000},
    {"epoch": 1.4813733404324751, "grad_norm": 0.5708619356155396, "learning_rate": 0.0002911366857044703, "loss": 2.2343, "step": 34500},
    {"epoch": 1.5028425192793224, "grad_norm": 0.5989744067192078, "learning_rate": 0.0002910078584618027, "loss": 2.2388, "step": 35000},
    {"epoch": 1.5243116981261702, "grad_norm": 0.746486246585846, "learning_rate": 0.0002908790312191351, "loss": 2.2484, "step": 35500},
    {"epoch": 1.5457808769730175, "grad_norm": 0.6059302687644958, "learning_rate": 0.00029075020397646755, "loss": 2.2409, "step": 36000},
    {"epoch": 1.5457808769730175, "eval_loss": 2.918299913406372, "eval_runtime": 171.3628, "eval_samples_per_second": 14.589, "eval_steps_per_second": 3.647, "step": 36000},
    {"epoch": 1.567250055819865, "grad_norm": 0.5767127871513367, "learning_rate": 0.00029062137673379994, "loss": 2.2459, "step": 36500},
    {"epoch": 1.5887192346667125, "grad_norm": 0.6815518736839294, "learning_rate": 0.0002904925494911324, "loss": 2.2565, "step": 37000},
    {"epoch": 1.6101884135135598, "grad_norm": 0.6565374732017517, "learning_rate": 0.0002903637222484648, "loss": 2.2388, "step": 37500},
    {"epoch": 1.6316575923604075, "grad_norm": 0.6622541546821594, "learning_rate": 0.0002902348950057972, "loss": 2.261, "step": 38000},
    {"epoch": 1.6531267712072548, "grad_norm": 0.8162985444068909, "learning_rate": 0.0002901060677631296, "loss": 2.2495, "step": 38500},
    {"epoch": 1.6745959500541023, "grad_norm": 0.5659546852111816, "learning_rate": 0.000289977240520462, "loss": 2.2385, "step": 39000},
    {"epoch": 1.6960651289009498, "grad_norm": 0.5625469088554382, "learning_rate": 0.00028984841327779447, "loss": 2.2372, "step": 39500},
    {"epoch": 1.7175343077477971, "grad_norm": 0.5423092842102051, "learning_rate": 0.00028971958603512686, "loss": 2.2424, "step": 40000},
    {"epoch": 1.7175343077477971, "eval_loss": 2.92061710357666, "eval_runtime": 172.1102, "eval_samples_per_second": 14.526, "eval_steps_per_second": 3.631, "step": 40000},
    {"epoch": 1.7390034865946449, "grad_norm": 0.7644880414009094, "learning_rate": 0.0002895907587924593, "loss": 2.2368, "step": 40500},
    {"epoch": 1.7604726654414922, "grad_norm": 0.8192068934440613, "learning_rate": 0.0002894619315497917, "loss": 2.2357, "step": 41000},
    {"epoch": 1.7819418442883397, "grad_norm": 0.6234991550445557, "learning_rate": 0.0002893331043071241, "loss": 2.2418, "step": 41500},
    {"epoch": 1.8034110231351872, "grad_norm": 0.5751623511314392, "learning_rate": 0.00028920427706445655, "loss": 2.2413, "step": 42000},
    {"epoch": 1.8248802019820345, "grad_norm": 0.8999291062355042, "learning_rate": 0.00028907544982178894, "loss": 2.2356, "step": 42500},
    {"epoch": 1.846349380828882, "grad_norm": 0.7696816325187683, "learning_rate": 0.00028894662257912133, "loss": 2.2427, "step": 43000},
    {"epoch": 1.8678185596757295, "grad_norm": 0.6660240292549133, "learning_rate": 0.0002888177953364538, "loss": 2.2507, "step": 43500},
    {"epoch": 1.889287738522577, "grad_norm": 0.6106180548667908, "learning_rate": 0.00028868896809378623, "loss": 2.2428, "step": 44000},
    {"epoch": 1.889287738522577, "eval_loss": 2.9116756916046143, "eval_runtime": 172.218, "eval_samples_per_second": 14.516, "eval_steps_per_second": 3.629, "step": 44000},
    {"epoch": 1.9107569173694245, "grad_norm": 0.6366661190986633, "learning_rate": 0.0002885601408511186, "loss": 2.2408, "step": 44500},
    {"epoch": 1.9322260962162718, "grad_norm": 0.7893187403678894, "learning_rate": 0.000288431313608451, "loss": 2.2369, "step": 45000},
    {"epoch": 1.9536952750631194, "grad_norm": 0.633651077747345, "learning_rate": 0.00028830248636578347, "loss": 2.2431, "step": 45500},
    {"epoch": 1.9751644539099669, "grad_norm": 0.7481298446655273, "learning_rate": 0.00028817365912311586, "loss": 2.2371, "step": 46000},
    {"epoch": 1.9966336327568142, "grad_norm": 0.596591055393219, "learning_rate": 0.00028804483188044826, "loss": 2.2358, "step": 46500},
    {"epoch": 2.018102811603662, "grad_norm": 0.7450771927833557, "learning_rate": 0.0002879160046377807, "loss": 2.2238, "step": 47000},
    {"epoch": 2.039571990450509, "grad_norm": 0.6886998414993286, "learning_rate": 0.00028778717739511315, "loss": 2.218, "step": 47500},
    {"epoch": 2.061041169297357, "grad_norm": 0.5555692911148071, "learning_rate": 0.00028765835015244555, "loss": 2.2151, "step": 48000},
    {"epoch": 2.061041169297357, "eval_loss": 2.909609317779541, "eval_runtime": 175.1711, "eval_samples_per_second": 14.272, "eval_steps_per_second": 3.568, "step": 48000}
  ],
  "logging_steps": 500,
  "max_steps": 1164450,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 50,
  "save_steps": 8000,
  "stateful_callbacks": {
    "TrainerControl": {
      "args": {"should_epoch_stop": false, "should_evaluate": false, "should_log": false, "should_save": true, "should_training_stop": false},
      "attributes": {}
    }
  },
  "total_flos": 5.807418710561096e+18,
  "train_batch_size": 4,
  "trial_name": null,
  "trial_params": null
}
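The `trainer_state.json` above logs `eval_loss` falling from 3.038 at step 4,000 to 2.910 at step 48,000, matching `best_metric` and `best_model_checkpoint`. A small stdlib-only sketch of pulling that curve back out of a Trainer state dict (the file path in the usage comment follows this commit; adjust it to your checkout):

```python
import json  # for loading trainer_state.json in the usage example below

def eval_curve(state: dict) -> list:
    """(step, eval_loss) pairs from a transformers Trainer state dict."""
    return [(entry["step"], entry["eval_loss"])
            for entry in state["log_history"]
            if "eval_loss" in entry]

def best_checkpoint(state: dict) -> tuple:
    """The step with the lowest evaluation loss logged so far."""
    return min(eval_curve(state), key=lambda pair: pair[1])

# Usage:
# with open("trainer_state.json") as f:
#     state = json.load(f)
# eval_curve(state)      # [(4000, 3.0375...), ..., (48000, 2.9096...)]
# best_checkpoint(state) # (48000, 2.909609317779541)
```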
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5d2488184e5b8ed492bf284f651fa9fb6b271935bbf360072ebdb3c6f92148c2
size 5112