abhi227070 committed on
Commit 9e84c78
1 Parent(s): fe442bb

Update README.md

Files changed (1)
  1. README.md +9 -11
README.md CHANGED
@@ -11,6 +11,8 @@ model-index:
 results: []
 library_name: adapter-transformers
 pipeline_tag: text-classification
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -58,21 +60,17 @@ The evaluation was performed on a separate validation set derived from the IMDB

 ### Training Procedure

- The IMDB dataset was first loaded and preprocessed using Pandas and Scikit-learn. The dataset, which originally contained columns for movie reviews and sentiments, was transformed to include a numerical label for sentiment, with positive and negative sentiments encoded as 1 and 0 respectively. This processed dataset was then split into three subsets: training (60%), validation (20%), and test (20%), ensuring that the indices were reset for proper formatting and usage.
-
- Next, the datasets were converted into the `DatasetDict` format, which is compatible with the Hugging Face Transformers library. This step facilitated the seamless integration of the data with the tokenization and model training pipeline.
-
- The tokenization process was handled by the `AutoTokenizer` from the Hugging Face library, specifically using the DistilBERT tokenizer. A preprocessing function was defined to tokenize the text data, truncating where necessary to fit the model's input requirements. This function was applied to the entire dataset, transforming the text into tokenized sequences ready for model ingestion.

- To handle the variability in sequence lengths, a `DataCollatorWithPadding` was employed. This ensured that during batching, the input sequences were dynamically padded to the same length, making the training process more efficient and standardized.

- For the model setup, the `AutoModelForSequenceClassification` was used, initializing it with DistilBERT as the base model. This model was configured for binary classification, with labels mapped to sentiment classes, positive and negative. This setup provided the structural foundation for the sentiment analysis task.

- Training arguments were meticulously defined, including critical parameters such as learning rate, batch size, number of training epochs, evaluation strategy, and logging steps. These parameters were tailored to optimize the training process, ensuring that the model learned effectively from the data while also being regularly evaluated on the validation set.

- Evaluation metrics were established to assess the model's performance. The metrics included accuracy and F1 score, which are crucial for evaluating classification models. A dedicated function was implemented to compute these metrics, comparing the predicted labels with the true labels to provide a quantitative measure of model performance.

- Finally, the training was conducted using the `Trainer` class from the Transformers library. This class orchestrated the training process, integrating the model, training arguments, tokenized datasets, data collator, and evaluation metrics. The training process was conducted over three epochs, with the model being evaluated at the end of each epoch to track its performance and adjust accordingly. This comprehensive training procedure ensured that the model was fine-tuned effectively for sentiment analysis on the IMDB dataset, achieving high accuracy and F1 scores on the evaluation set.

 ### Training hyperparameters
 
@@ -98,4 +96,4 @@ The following hyperparameters were used during training:
 - Transformers 4.42.4
 - Pytorch 2.3.0+cu121
 - Datasets 2.20.0
- - Tokenizers 0.19.1
 
 results: []
 library_name: adapter-transformers
 pipeline_tag: text-classification
+ datasets:
+ - proj-persona/PersonaHub
 ---

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
 ### Training Procedure

+ ### Training Procedure

+ The IMDB dataset was loaded and preprocessed to include numerical labels for sentiments (positive and negative). It was then split into training (60%), validation (20%), and test (20%) sets, and the indices were reset for proper formatting.
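
As a rough illustration of this step (the CSV file name and the `review`/`sentiment` column names are assumptions, not taken from this card), the preprocessing and 60/20/20 split could look like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw IMDB reviews (file name is assumed for illustration).
df = pd.read_csv("IMDB Dataset.csv")

# Encode sentiment numerically: positive -> 1, negative -> 0.
df["label"] = (df["sentiment"] == "positive").astype(int)

# 60% train, 20% validation, 20% test, with indices reset afterwards.
train_df, temp_df = train_test_split(df, test_size=0.4, random_state=42)
valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
train_df, valid_df, test_df = [
    split.reset_index(drop=True) for split in (train_df, valid_df, test_df)
]
```
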
+ The data was converted into the `DatasetDict` format compatible with the Hugging Face Transformers library. The `AutoTokenizer` for DistilBERT was used to tokenize the text data, truncating where necessary. A preprocessing function applied tokenization to the entire dataset, preparing it for model training.
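
Continuing the sketch above, the `DatasetDict` conversion and tokenization might look like this (the `distilbert-base-uncased` checkpoint and the `review` column name are assumptions):

```python
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

# Wrap the pandas splits from the previous step in a DatasetDict.
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(valid_df),
    "test": Dataset.from_pandas(test_df),
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    # Truncate reviews that exceed the model's maximum input length.
    return tokenizer(batch["review"], truncation=True)

tokenized_dataset = dataset.map(preprocess, batched=True)
```
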
+ A `DataCollatorWithPadding` was used to handle the variability in sequence lengths during batching, ensuring efficiency and standardization. The `AutoModelForSequenceClassification` with DistilBERT as the base model was set up for binary classification, mapping labels to sentiment classes (positive and negative).
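
A minimal sketch of the collator and model setup described here (the label names are assumptions):

```python
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding

# Pad each batch dynamically to its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# DistilBERT with a two-class head mapped to the sentiment labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)
```
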
+ Training arguments included learning rate, batch size, number of epochs, evaluation strategy, and logging steps, optimized for effective training. Evaluation metrics, including accuracy and F1 score, were defined to assess model performance, with a function to compute these metrics by comparing predictions with true labels.
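
The arguments and metric function could be set up roughly as follows; the values shown here are placeholders, and the ones actually used are listed under "Training hyperparameters" below:

```python
import numpy as np
import evaluate
from transformers import TrainingArguments

# Placeholder values; see "Training hyperparameters" for the actual settings.
training_args = TrainingArguments(
    output_dir="distilbert-imdb-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",
    logging_steps=100,
)

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    # Compare predicted labels against the true labels.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        "f1": f1.compute(predictions=predictions, references=labels)["f1"],
    }
```
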
+ The `Trainer` class from the Transformers library was used to conduct the training over three epochs, integrating the model, training arguments, tokenized datasets, data collator, and evaluation metrics. This comprehensive approach ensured effective fine-tuning for sentiment analysis on the IMDB dataset, achieving high accuracy and F1 scores.
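
Putting the pieces together with the `Trainer` (a sketch continuing the snippets above):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Fine-tune for the configured number of epochs, evaluating each epoch.
trainer.train()

# Optionally score the held-out test split afterwards.
trainer.evaluate(tokenized_dataset["test"])
```
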
  ### Training hyperparameters
 
 
 - Transformers 4.42.4
 - Pytorch 2.3.0+cu121
 - Datasets 2.20.0
+ - Tokenizers 0.19.1