Loricanal committed

Commit 777bc14
1 Parent(s): 2459ff0

Update README.md

Files changed (1): README.md (+124 -11)
README.md CHANGED

RTE focuses on evaluating the support or refutation of claims within a given text.
## Inference API Usage

When using the Inference API, the input should be provided by pasting the text first, followed by the claim, without any spaces or separators. The model's tokenizer concatenates these inputs in the specified order. Interestingly, inverting the order of pasting (claim first, then text) seems to produce similar results, suggesting that the model generally captures coherence within a given text (label 0 indicates a coherent text, while label 1 indicates an incoherent text).
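As a rough illustration of that pairing, the sketch below uses only the `distilbert-base-multilingual-cased` tokenizer from the training code further down (the example strings are made up) to show how the two inputs end up in a single sequence:

```python
from transformers import DistilBertTokenizer

# The same multilingual tokenizer used in the training code below
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

text = "Soul Food is a 1997 American comedy-drama film released by Fox 2000 Pictures."
claim = "Fox 2000 Pictures released the film Soul Food."

# Encoding the pair (text, claim) produces one sequence: [CLS] text [SEP] claim [SEP]
encoding = tokenizer(text, claim, truncation=True, max_length=256)
print(tokenizer.decode(encoding['input_ids']))
```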
## Training procedure

The model was trained on Kaggle with a GPU T4 x2 accelerator. See the complete notebook here: <TODO>
```python
import json
import numpy as np
import os
import pickle
from IPython.display import clear_output
import pandas as pd
import tensorflow as tf
import transformers
from datasets import load_dataset
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import warnings

# Silence all warnings
warnings.filterwarnings("ignore")

# Create the output directories, ignoring them if they already exist
for directory in ("models", "results", "history"):
    os.makedirs(directory, exist_ok=True)

# Flag to determine if existing models and histories should be overwritten
overwrite = True

# Load the dataset for the first fold
data = load_dataset("raicrits/fever_folds", data_files="folds_en/1.json")['train']
test = data['test'][0]
val_set = data['val'][0]
train_set = data['train'][0]

# Define paths for the model weights, results, and training history
model_path = 'models/DistilFEVERen_weights_0.h5'
results_path = "results/DistilFEVERen_0.json"
history_path = 'history/DistilFEVERen_0.pickle'

# Load the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

# Preprocess the data: each example is encoded as a (text, claim) pair
test_encodings = tokenizer(test['text'], test['claim'], truncation=True, padding=True, max_length=256, return_tensors='tf')
test_labels = tf.convert_to_tensor(test['label'])

train_encodings = tokenizer(train_set['text'], train_set['claim'], truncation=True, padding=True, return_tensors='tf')
val_encodings = tokenizer(val_set['text'], val_set['claim'], truncation=True, padding=True, return_tensors='tf')

train_labels = tf.convert_to_tensor(train_set['label'])
val_labels = tf.convert_to_tensor(val_set['label'])

# Reuse saved weights for the first fold unless overwriting is enabled
if not overwrite and os.path.exists(model_path):
    print("Model and history already exist for fold {}. Loading...".format(0))
    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-multilingual-cased', num_labels=3)
    model.load_weights(model_path)
    # with open(history_path, 'rb') as file_pi:
    #     history = pickle.load(file_pi)
else:
    # Create a new model and define loss, optimizer, and callbacks
    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-multilingual-cased', num_labels=3)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
    model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
        model_path,
        monitor='val_loss',
        save_best_only=True,
        mode='min',
        save_weights_only=True
    )
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=1,
        mode='min',
        restore_best_weights=True
    )

    # Train the model for the first fold
    clear_output(wait=True)
    history = model.fit(
        [train_encodings['input_ids'], train_encodings['attention_mask']], train_labels,
        validation_data=([val_encodings['input_ids'], val_encodings['attention_mask']], val_labels),
        batch_size=10,
        epochs=100,
        callbacks=[early_stopping, model_checkpoint]
    )

    # Save the training history
    with open(history_path, 'wb') as file_pi:
        pickle.dump(history.history, file_pi)
```
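The snippet defines `results_path` but does not show how the held-out fold is scored; below is a minimal sketch of one way to do it (assumed, not taken from the original notebook), reusing the imports and variables above:

```python
# Sketch (assumed): score the test fold and dump a report to results_path
preds = model.predict([test_encodings['input_ids'], test_encodings['attention_mask']])
pred_labels = np.argmax(preds.logits, axis=-1)

print(classification_report(test_labels.numpy(), pred_labels))

# Store the textual report alongside the model weights
with open(results_path, "w") as f:
    json.dump({"classification_report": classification_report(test_labels.numpy(), pred_labels)}, f)
```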
## Inference procedure

```python
def getPrediction(model, tokenizer, claim, text):
    # Encode the (text, claim) pair in the same order used at training time
    encodings = tokenizer([text], [claim], truncation=True, padding=True, max_length=256, return_tensors='tf')
    preds = model.predict([encodings['input_ids'], encodings["attention_mask"]])
    return preds

text = "Soul Food is a 1997 American comedy-drama film produced by Kenneth `` Babyface '' Edmonds , Tracey Edmonds and Robert Teitel and released by Fox 2000 Pictures ."
claim = 'Fox 2000 Pictures released the film Soul Food .'
getPrediction(model, tokenizer, claim, text)
```
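`getPrediction` returns raw logits. The label names are not documented in this card, so the follow-up sketch below (assumed, not from the original notebook) only reports the integer class id:

```python
# Sketch (assumed): turn the logits into a predicted class id
preds = getPrediction(model, tokenizer, claim, text)
probs = tf.nn.softmax(preds.logits, axis=-1).numpy()
print("predicted class:", int(np.argmax(probs, axis=-1)[0]))
print("class probabilities:", probs[0])
```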
 
### Evaluation results

It achieves the following results on the evaluation set:
### Framework versions

- Transformers 4.35.0
- TensorFlow 2.13.0
- Datasets 2.1.0
- Tokenizers 0.14.1
- NumPy 1.24.3