Update README.md
README.md
CHANGED
@@ -65,8 +65,11 @@ The pre-processing operations used to produce the final training dataset were as
 4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
 5. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
 6. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
-7. Data is then augmented using sentence shuffle from the ```albumentations``` library and NLP-based insertions using ```nlpaug```.
-
+7. Data is then augmented using sentence shuffle from the ```albumentations``` library and NLP-based insertions using ```nlpaug```. This is done to increase the number of training samples available for the GHG class from 42 to 84. The end result is a more equal sample per class breakdown of:
+> - GHG: 84
+> - NOT-GHG: 191
+> - NEGATIVE: 190
+8. To address the remaining class imbalance, inverse frequency class weights are computed and passed to a custom single label trainer function which is used during hyperparameter tuning and final model training.
 
 ## Training procedure
 
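The added step 8 mentions inverse frequency class weights without giving a formula. The sketch below shows one common definition, `n_samples / (n_classes * class_count)`, applied to the per-class counts listed in step 7. The formula, the function name `inverse_frequency_weights`, and the way the label list is built are assumptions for illustration; the README does not say which variant the custom trainer uses.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, inversely proportional to its frequency.

    Assumed formula (not confirmed by the README):
        weight = n_samples / (n_classes * class_count)
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Class counts taken from step 7 of the updated README.
labels = ["GHG"] * 84 + ["NOT-GHG"] * 191 + ["NEGATIVE"] * 190

weights = inverse_frequency_weights(labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Under this definition the minority GHG class receives the largest weight, so its errors count for more in a weighted loss. In a PyTorch-based trainer, weights like these are typically supplied through the `weight` argument of `torch.nn.CrossEntropyLoss`; whether the custom trainer here does so is not stated in the source.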