pbevan11 committed
Commit b878357
1 Parent(s): 5eaa62e

Training in progress, step 57

README.md CHANGED
@@ -1,12 +1,12 @@
  ---
- base_model: meta-llama/Meta-Llama-3-8B
  library_name: peft
- license: llama3
  tags:
  - axolotl
  - generated_from_trainer
  model-index:
- - name: llama-3-8b-ocr-correction
    results: []
  ---
 
@@ -18,10 +18,9 @@ should probably proofread and complete it, then remove this comment. -->
  axolotl version: `0.4.1`
  ```yaml
- base_model: meta-llama/Meta-Llama-3-8B
  model_type: AutoModelForCausalLM
  tokenizer_type: AutoTokenizer
- is_mistral_derived_model: true

  load_in_8bit: false
  load_in_4bit: true
@@ -35,14 +34,14 @@ datasets:
  - path: ft_data/alpaca_data.jsonl
    type: alpaca
  dataset_prepared_path: last_run_prepared
- val_set_size: 0.1
  output_dir: ./qlora-alpaca-out
- hub_model_id: pbevan11/llama-3-8b-ocr-correction

  adapter: qlora
  lora_model_dir:

- sequence_len: 4096
  sample_packing: true
  pad_to_sequence_len: true
 
@@ -62,7 +61,7 @@ lora_target_modules:
  wandb_project: ocr-ft
  wandb_entity: sncds
- wandb_name: test

  gradient_accumulation_steps: 4
  micro_batch_size: 2 # was 16
@@ -104,86 +103,24 @@ special_tokens:
  </details><br>

- # llama-3-8b-ocr-correction

- This model is a qlora fine-tuned adapter for [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) on the [pbevan11/synthetic-ocr-correction-gpt4o](https://huggingface.co/datasets/pbevan11/synthetic-ocr-correction-gpt4o) dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.1742
 
- ## Usage
-
- First, download the model
-
- ```python
- from peft import AutoPeftModelForCausalLM
- from transformers import AutoTokenizer
- model_id='pbevan11/llama-3-8b-ocr-correction'
- model = AutoPeftModelForCausalLM.from_pretrained(model_id).cuda()
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- tokenizer.pad_token = tokenizer.eos_token
- ```
-
- Then, construct the prompt template like so:
-
- ```python
- def prompt(instruction, inp):
-     return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
-
- ### Instruction:
- {instruction}
-
- ### Input:
- {inp}
-
- ### Response:
- """
-
- def prompt_tok(instruction, inp, return_ids=False):
-     _p = prompt(instruction, inp)
-     input_ids = tokenizer(_p, return_tensors="pt", truncation=True).input_ids.cuda()
-     out_ids = model.generate(input_ids=input_ids, max_new_tokens=5000,
-                              do_sample=False)
-     ids = out_ids.detach().cpu().numpy()
-     if return_ids: return out_ids
-
-     full_output = tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
-     response_start = full_output.find("### Response:")
-     if response_start != -1:
-         return full_output[response_start + len("### Response:"):]
-     else:
-         return full_output[len(_p):]
- ```
-
- Finally, you can get predictions like this:
-
- ```python
- # model inputs
- instruction = "You are an assistant that takes a piece of text that has been corrupted during OCR digitisation, and produce a corrected version of the same text."
- inp = "Do Not Kule Oi't hy.er-l'rieed AjijqIi: imac - Analyst (fteuiers) Hcuiers - A | ) | ilf, <;/) in |) nter |iic . conic! deeiilf. l.o sell n lower-|)rieofl wersinn oi its Macintosh cornutor to nttinct ronsnnu-rs already euami'red ot its iPod music jiayo-r untl annoyoil. by sccnrit.y problems ivitJi Willtlows PCs , Piper.iaffray analyst. (Jcne Muster <aid on Tlinrtiday."
-
- # print prediction
- out = prompt_tok(instruction, inp)
- print(out.replace('\\', ' '))
- ```
-
- This will give you a prediction that looks like this:
-
- ```md
- "Do Not Rule Out Lower-Priced Mac - Analyst (Reuters) Reuters - Apple Inc. may be considering a lower-priced version of its Macintosh computer to attract consumers already enamored of its iPod music player and annoyed by security problems with Windows PCs, PiperJaffray analyst Gene Munster said on Thursday."
- ```
-
- Alternatively, you can play with this model on Replicate: [tbc](tbc)
 
 
  ## Intended uses & limitations

- Reconstructions should not be taken as the truth, the model is likely to make some things up to fill in the gaps, and so some things may not be perfectly histoically acurate.
-
- This model was intended to be used to restore historical documents that have been imperfectly digitalised using OCR.

  ## Training and evaluation data

- TBC: evaluating on the test set from [pbevan11/synthetic-ocr-correction-gpt4o](https://huggingface.co/pbevan11/synthetic-ocr-correction-gpt4o)

  ## Training procedure
 
@@ -205,21 +142,20 @@ The following hyperparameters were used during training:
  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:------:|:----:|:---------------:|
- | 0.6611 | 0.0165 | 1 | 0.6229 |
- | 0.3149 | 0.2469 | 15 | 0.2870 |
- | 0.2074 | 0.4938 | 30 | 0.2166 |
- | 0.2211 | 0.7407 | 45 | 0.1937 |
- | 0.195 | 0.9877 | 60 | 0.1825 |
- | 0.1411 | 1.2140 | 75 | 0.1787 |
- | 0.1348 | 1.4609 | 90 | 0.1760 |
- | 0.1479 | 1.7078 | 105 | 0.1743 |
- | 0.1413 | 1.9547 | 120 | 0.1742 |


  ### Framework versions

  - PEFT 0.11.1
- - Transformers 4.42.3
  - Pytorch 2.1.2+cu118
  - Datasets 2.19.1
  - Tokenizers 0.19.1
 
  ---
+ base_model: meta-llama/Meta-Llama-3.1-8B
  library_name: peft
+ license: llama3.1
  tags:
  - axolotl
  - generated_from_trainer
  model-index:
+ - name: llama-3.1-8b-ocr-correction
    results: []
  ---
 
 
  axolotl version: `0.4.1`
  ```yaml
+ base_model: meta-llama/Meta-Llama-3.1-8B
  model_type: AutoModelForCausalLM
  tokenizer_type: AutoTokenizer

  load_in_8bit: false
  load_in_4bit: true
 
  - path: ft_data/alpaca_data.jsonl
    type: alpaca
  dataset_prepared_path: last_run_prepared
+ val_set_size: 0.05
  output_dir: ./qlora-alpaca-out
+ hub_model_id: pbevan11/llama-3.1-8b-ocr-correction

  adapter: qlora
  lora_model_dir:

+ sequence_len: 8192
  sample_packing: true
  pad_to_sequence_len: true
 
 
  wandb_project: ocr-ft
  wandb_entity: sncds
+ wandb_name: llama31

  gradient_accumulation_steps: 4
  micro_batch_size: 2 # was 16
 
  </details><br>

+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/sncds/ocr-ft/runs/rotjhntf)
+ # llama-3.1-8b-ocr-correction

+ This model is a fine-tuned version of [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) on the None dataset.
  It achieves the following results on the evaluation set:
+ - Loss: 0.1901

+ ## Model description

+ More information needed

  ## Intended uses & limitations

+ More information needed

  ## Training and evaluation data

+ More information needed

  ## Training procedure
 
 
  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:------:|:----:|:---------------:|
+ | 0.61 | 0.0331 | 1 | 0.6018 |
+ | 0.4379 | 0.2645 | 8 | 0.4256 |
+ | 0.2531 | 0.5289 | 16 | 0.2714 |
+ | 0.2366 | 0.7934 | 24 | 0.2247 |
+ | 0.1839 | 1.0331 | 32 | 0.2053 |
+ | 0.1752 | 1.2975 | 40 | 0.1961 |
+ | 0.1629 | 1.5620 | 48 | 0.1909 |
+ | 0.163 | 1.8264 | 56 | 0.1901 |


  ### Framework versions

  - PEFT 0.11.1
+ Transformers 4.43.2
  - Pytorch 2.1.2+cu118
  - Datasets 2.19.1
  - Tokenizers 0.19.1
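
The Epoch and Step columns of the two training-results tables in the README diff imply how many optimizer steps make up one epoch in each run. A minimal sketch of that arithmetic follows, using only the last logged row of each table. The variable and function names are illustrative and not part of the repository.

```python
# Last logged row of each training-results table: (epoch, step).
old_last_row = (1.9547, 120)  # removed table (Meta-Llama-3-8B run)
new_last_row = (1.8264, 56)   # added table (Meta-Llama-3.1-8B run)


def steps_per_epoch(epoch: float, step: int) -> float:
    """Estimate optimizer steps per epoch from one (epoch, step) log row."""
    return step / epoch


old_steps = steps_per_epoch(*old_last_row)
new_steps = steps_per_epoch(*new_last_row)
ratio = old_steps / new_steps

print(f"old run: ~{old_steps:.1f} steps per epoch")
print(f"new run: ~{new_steps:.1f} steps per epoch")
print(f"old/new ratio: ~{ratio:.2f}")
```

The step count per epoch comes out at roughly half of the previous run's. That is consistent with `sequence_len` doubling from 4096 to 8192 while `sample_packing: true` stays on, although the commit itself does not state this, and the smaller `val_set_size` (0.1 to 0.05) would pull slightly in the other direction.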
adapter_config.json CHANGED
@@ -20,12 +20,12 @@
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
-   "up_proj",
-   "down_proj",
-   "gate_proj",
-   "k_proj",
    "v_proj",
    "q_proj",
    "o_proj"
  ],
  "task_type": "CAUSAL_LM",
 
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "q_proj",
+   "k_proj",
+   "down_proj",
+   "up_proj",
+   "gate_proj",
    "o_proj"
  ],
  "task_type": "CAUSAL_LM",
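
The `adapter_config.json` hunk moves four entries of `target_modules` but does not appear to add or drop any. A small sketch to check that, with both lists copied from the diff (the variable names are illustrative):

```python
# target_modules before and after this commit, in the order shown in the diff.
old_target_modules = [
    "up_proj", "down_proj", "gate_proj", "k_proj",
    "v_proj", "q_proj", "o_proj",
]
new_target_modules = [
    "v_proj", "q_proj", "k_proj", "down_proj",
    "up_proj", "gate_proj", "o_proj",
]

# Compare as sets (membership) and as lists (order).
same_modules = set(old_target_modules) == set(new_target_modules)
same_order = old_target_modules == new_target_modules

print(f"same modules: {same_modules}")  # True
print(f"same order:   {same_order}")    # False
```

Since the same seven projection layers are targeted on both sides, the change is most likely only a different serialization order rather than a change to which layers receive LoRA adapters. That reading is an inference from the diff, not something the commit message confirms.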
adapter_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c64465bb2211b47808dc809512a591f6ada32a06c95e2e5ae6b3bef6b9622301
  size 167934026
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:befe7ee91cb8ab62450880c1dabf645b053b56d4e5b4cf5a4776e29329224eeb
  size 167934026
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:01ec29cca5cd60a3f5e350beff5a81a570434ae6b0102782e0b5ac9b40ebc71c
  size 167832688
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:094a56bdc5ddb4b0283610f269f8a14fe9b93e86c16ad75b348c378b9c7405f6
  size 167832688
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c519c403422bf69950c967ebaefca549327caa74bc80ee375ab0687e7ae81986
  size 6072
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:823c026c21ead0a0fcfbdb2b1d26d1596e5af7ebb2cff85f40a3fbb177930914
  size 6072
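
The three binary files are stored as Git LFS pointers, so their diffs show only a changed `oid` with an unchanged `size`. A minimal sketch of reading such a pointer and comparing two revisions, using the `adapter_model.bin` values from the diff above (the helper name `parse_lfs_pointer` is hypothetical):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer contents for adapter_model.bin before and after this commit.
old_pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:c64465bb2211b47808dc809512a591f6ada32a06c95e2e5ae6b3bef6b9622301\n"
    "size 167934026\n"
)
new_pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:befe7ee91cb8ab62450880c1dabf645b053b56d4e5b4cf5a4776e29329224eeb\n"
    "size 167934026\n"
)

content_changed = old_pointer["oid"] != new_pointer["oid"]
size_unchanged = int(old_pointer["size"]) == int(new_pointer["size"])

print(f"content changed: {content_changed}")
print(f"size unchanged:  {size_unchanged}")
```

An identical byte size with a different hash fits retrained weights of the same shape, which is what one would expect when the adapter configuration keeps the same target modules.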