mtyrrell committed on
Commit
bf62dbf
1 Parent(s): 4931280

Update README.md

Files changed (1)
  1. README.md +6 -5
README.md CHANGED
@@ -57,14 +57,15 @@ The combined dataset[GIZ/policy_qa_v0_1](https://huggingface.co/datasets/GIZ/pol
  The pre-processing operations used to produce the final training dataset were as follows:

  1. Dataset is filtered based on 'medium' value in 'strategy' column (sequence length = 85).
- 2. For ClimateWatch, all rows are removed as there was assessed to be no taxonomical alignment with the IKITracs labels inherent to the dataset. For IKITracs, labels are assigned based on the presence of certain substrings based on 'parameter' values which correspond to assessments of Net-Zero targets by human annotaters. The specific assignments are as follows:
  > - 'GHG': target_labels_ghg_yes = ['T_Transport_Unc','T_Transport_C']
  > - 'NOT_GHG': target_labels_ghg_no = ['T_Adaptation_Unc', 'T_Adaptation_C', 'T_Transport_O_Unc', 'T_Transport_O_C']
  > - 'NEGATIVE': random sample of other labeled data omitting above labels
- 3. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
- 4. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
- 5. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
- 6. Data is then augmented using sentence shuffle from the ```albumentations``` library and NLP-based insertions using ```nlpaug```.


  ## Training procedure
 
  The pre-processing operations used to produce the final training dataset were as follows:

  1. Dataset is filtered based on 'medium' value in 'strategy' column (sequence length = 85).
+ 2. For ClimateWatch, all rows are removed as there was assessed to be no taxonomical alignment with the IKITracs labels inherent to the dataset.
+ 3. For IKITracs, labels are assigned based on 'parameter' values which correspond to assessments of Transport-related GHG targets by human annotaters. The specific assignments are as follows:
  > - 'GHG': target_labels_ghg_yes = ['T_Transport_Unc','T_Transport_C']
  > - 'NOT_GHG': target_labels_ghg_no = ['T_Adaptation_Unc', 'T_Adaptation_C', 'T_Transport_O_Unc', 'T_Transport_O_C']
  > - 'NEGATIVE': random sample of other labeled data omitting above labels
+ 4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
+ 5. The dataset is "exploded" - i.e., the text samples in the 'context' column, which are lists, are converted into separate rows - and labels are merged to align with the associated samples.
+ 6. The 'match_onanswer' and 'answerWordcount' are used conditionally to select high quality samples (prefers high % of word matches in 'match_onanswer', but will take lower if there is a high 'answerWordcount')
+ 7. Data is then augmented using sentence shuffle from the ```albumentations``` library and NLP-based insertions using ```nlpaug```.


  ## Training procedure
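
Reviewer note on the hunk above: the label assignment, translation swap, and "explode" steps in the updated list (items 3 to 5) might look roughly like the sketch below. This is not the authors' pipeline. The two label lists are copied from the README, but the function names, the exact-match rule on 'parameter', the `'en'` language code, the list-valued `context_translated`, and the `n_negative` sample size are all assumptions made for illustration. The quality filter and augmentation steps are left out because the README does not give their thresholds or settings.

```python
import random

# Label lists as given in the README; everything else here is hypothetical.
TARGET_LABELS_GHG_YES = ['T_Transport_Unc', 'T_Transport_C']
TARGET_LABELS_GHG_NO = ['T_Adaptation_Unc', 'T_Adaptation_C',
                        'T_Transport_O_Unc', 'T_Transport_O_C']


def assign_label(parameter):
    """Map a 'parameter' value to 'GHG' or 'NOT_GHG' (assumed exact match).

    Returns None for any other value, marking the row as a candidate
    for the 'NEGATIVE' random sample.
    """
    if parameter in TARGET_LABELS_GHG_YES:
        return 'GHG'
    if parameter in TARGET_LABELS_GHG_NO:
        return 'NOT_GHG'
    return None


def preprocess(rows, n_negative, seed=0):
    """Apply the translation swap, labelling, and explode steps to dict rows."""
    labeled, others = [], []
    for row in rows:
        # Use the translation for non-English rows when one is available.
        # The 'en' code is an assumption about how 'language' is encoded.
        context = row['context']
        if row.get('language') != 'en' and row.get('context_translated'):
            context = row['context_translated']

        label = assign_label(row['parameter'])

        # "Explode": one output record per text sample, each carrying
        # the label of the row it came from.
        for text in context:
            if label is None:
                others.append({'text': text, 'label': 'NEGATIVE'})
            else:
                labeled.append({'text': text, 'label': label})

    # 'NEGATIVE': random sample of records with none of the listed labels.
    rng = random.Random(seed)
    negatives = rng.sample(others, min(n_negative, len(others)))
    return labeled + negatives
```

Exact matching is used here because 'T_Transport_O_C' contains 'T_Transport_' as a prefix, so a loose substring rule could put it in the wrong class. The earlier README wording mentioned substrings, so the real rule may differ.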