abhi227070 committed on
Commit fe442bb
1 Parent(s): 9405d51

Update README.md

Files changed (1)
  1. README.md +41 -6
README.md CHANGED
@@ -18,7 +18,7 @@ should probably proofread and complete it, then remove this comment. -->
 
  # distilbert-base-uncased-finetuned-imdb
 
- This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on the [IMDB dataset]().
  It achieves the following results on the evaluation set:
  - Loss: 0.2069
  - Accuracy: 0.9257
@@ -26,18 +26,54 @@ It achieves the following results on the evaluation set:
 
  ## Model description
 
- More information needed
 
  ## Intended uses & limitations
 
- More information needed
 
  ## Training and evaluation data
 
- More information needed
 
  ## Training procedure
 
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
@@ -57,10 +93,9 @@ The following hyperparameters were used during training:
  | 0.1411 | 2.0 | 3750 | 0.2442 | 0.932 | 0.9320 |
  | 0.079 | 3.0 | 5625 | 0.2882 | 0.9347 | 0.9347 |
 
-
  ### Framework versions
 
  - Transformers 4.42.4
  - Pytorch 2.3.0+cu121
  - Datasets 2.20.0
- - Tokenizers 0.19.1
 
 
  # distilbert-base-uncased-finetuned-imdb
 
+ This model is a fine-tuned version of [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on the [IMDB dataset](https://huggingface.co/datasets/abhi227070/imdb-dataset).
  It achieves the following results on the evaluation set:
  - Loss: 0.2069
  - Accuracy: 0.9257
 
 
  ## Model description
 
+ This model is based on DistilBERT, a smaller, faster, and lighter distilled version of BERT developed by Hugging Face. According to its authors, DistilBERT retains about 97% of BERT's language-understanding performance while being 60% faster and 40% smaller. This checkpoint has been fine-tuned for sentiment analysis on the IMDB movie reviews dataset.
+
+ The model classifies English text as positive or negative in sentiment, making it suitable for applications that need to understand user opinions or reviews.
 
  ## Intended uses & limitations
 
+ ### Intended Uses
+
+ - **Sentiment Analysis:** This model can be used to classify the sentiment of movie reviews as positive or negative.
+ - **Customer Feedback Analysis:** It can be adapted to analyze the sentiment of customer feedback on products or services.
+ - **Social Media Monitoring:** It can be used to track sentiment in social media posts or comments.
+
+ ### Limitations
+
+ - **Domain Specificity:** The model is fine-tuned on movie reviews and may not perform as well on other types of text.
+ - **Binary Classification:** The model only distinguishes between positive and negative sentiment and has no neutral class.
+ - **Language:** The model is trained on English text and may not perform well on text in other languages.
 
  ## Training and evaluation data
 
+ ### Training Data
+
+ The model is trained on the IMDB dataset, which consists of 50,000 highly polar movie reviews labeled as either positive or negative. The dataset is balanced, with an equal number of positive and negative reviews. Only the training split described under "Training procedure" (60% of the reviews, i.e. 30,000) is used to update the model weights.
+
+ ### Evaluation Data
+
+ Evaluation was performed on a held-out validation split of the IMDB dataset, so the reported metrics are based on reviews the model did not see during training.
 
  ## Training procedure
 
+ ### Training Procedure
+
+ The IMDB dataset was first loaded and preprocessed using Pandas and Scikit-learn. The original data contained one column of movie reviews and one of sentiments; a numerical label was added, with positive encoded as 1 and negative as 0. The processed dataset was then split into three subsets, training (60%), validation (20%), and test (20%), and the index of each subset was reset.
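
The label encoding and 60/20/20 split described above can be illustrated with a small standard-library sketch. The card states that Pandas and Scikit-learn were used, so this is not the author's code: the column names `review` and `sentiment`, the seed, the toy rows, and the helper name `encode_and_split` are all assumptions made for illustration, and whether the original split was stratified is not stated.

```python
import random

# Encoding stated in the card: positive -> 1, negative -> 0.
LABELS = {"negative": 0, "positive": 1}

def encode_and_split(rows, seed=42):
    """Add a numeric label and split rows 60/20/20 (train/validation/test)."""
    encoded = [
        {"review": r["review"], "label": LABELS[r["sentiment"]]}
        for r in rows
    ]
    # Shuffle so the split does not depend on the input order.
    random.Random(seed).shuffle(encoded)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(encoded) * 6 // 10
    n_val = len(encoded) * 2 // 10
    return {
        "train": encoded[:n_train],
        "validation": encoded[n_train:n_train + n_val],
        "test": encoded[n_train + n_val:],
    }

# Hypothetical toy data standing in for the 50,000 IMDB reviews.
rows = [
    {"review": f"review {i}", "sentiment": "positive" if i % 2 else "negative"}
    for i in range(10)
]
splits = encode_and_split(rows)
print({name: len(part) for name, part in splits.items()})
# {'train': 6, 'validation': 2, 'test': 2}
```

Applied to the full 50,000 reviews, the same proportions give 30,000 training, 10,000 validation, and 10,000 test examples.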
+
+ Next, the splits were converted into a `DatasetDict` from the Hugging Face Datasets library, which integrates directly with the Transformers tokenization and training pipeline.
+
+ Tokenization was handled by `AutoTokenizer`, loading the DistilBERT tokenizer. A preprocessing function tokenized the text, truncating reviews that exceed the model's maximum input length, and was applied to every split.
+
+ To handle variable sequence lengths, `DataCollatorWithPadding` was used. It pads the sequences at batch time, by default to the longest sequence in each batch, which avoids padding every example to a single global length.
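
To illustrate what dynamic padding does, here is a minimal pure-Python sketch. It is not the `DataCollatorWithPadding` implementation: the function name `pad_batch`, the token ids, and the pad id of 0 are assumptions chosen for illustration.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to that batch's longest length."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two batches with different longest sequences (illustrative token ids).
short_batch = pad_batch([[101, 7592, 102], [101, 102]])
long_batch = pad_batch([[101, 2023, 2003, 2307, 102], [101, 2919, 102]])

print(short_batch["input_ids"])         # [[101, 7592, 102], [101, 102, 0]]
print(len(long_batch["input_ids"][0]))  # 5
```

Each batch is padded only to its own longest sequence (3 and 5 tokens here), so batches of short reviews carry less padding than they would under a fixed length.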
+
+ For the model, `AutoModelForSequenceClassification` was initialized from the DistilBERT base checkpoint and configured for binary classification, with the two label ids mapped to the negative and positive sentiment classes.
+
+ The training arguments set the learning rate, batch size, number of training epochs, evaluation strategy, and logging steps, so that the model was evaluated on the validation set at regular intervals during training.
+
+ Accuracy and F1 score were used as evaluation metrics. A dedicated function computed them by comparing the predicted labels with the true labels.
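
As a rough sketch of what such a metric function computes, the snippet below derives accuracy and F1 from label lists in plain Python. The card does not say which library or which F1 averaging was used, so the binary F1 with label 1 as the positive class, the function name `accuracy_and_f1`, and the toy labels are assumptions.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return accuracy and binary F1 for two equal-length, non-empty label lists."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "f1": f1}

# Toy example: three of four predictions are correct.
metrics = accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 0])
print(metrics["accuracy"])  # 0.75
```

In this toy example precision is 1.0 and recall is 0.5, so F1 is their harmonic mean, 2/3, which shows how F1 can differ from accuracy when errors are concentrated in one class.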
+
+ Finally, training was run with the `Trainer` class from the Transformers library, which combined the model, training arguments, tokenized datasets, data collator, and metric function. The model was trained for three epochs and evaluated on the validation set at the end of each epoch to track its performance. The fine-tuned model achieved high accuracy and F1 scores on the evaluation set.
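
The step counts reported in the training results table of this card are consistent with the split sizes described above. The short calculation below is only a sanity check, assuming a single device and no gradient accumulation (neither is stated in the visible part of this card); the implied batch size is an inference, not a reported hyperparameter.

```python
total_reviews = 50_000
train_reviews = total_reviews * 60 // 100   # 60% training split -> 30,000

final_step, epochs = 5625, 3                # last row of the results table

steps_per_epoch = final_step // epochs                  # optimizer steps per epoch
implied_batch_size = train_reviews // steps_per_epoch   # examples per step

print(steps_per_epoch, implied_batch_size)  # 1875 16
```

The epoch 2 row reports step 3750, which matches 2 x 1875 under the same assumptions.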
+
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
 
  | 0.1411 | 2.0 | 3750 | 0.2442 | 0.932 | 0.9320 |
  | 0.079 | 3.0 | 5625 | 0.2882 | 0.9347 | 0.9347 |
 
  ### Framework versions
 
  - Transformers 4.42.4
  - Pytorch 2.3.0+cu121
  - Datasets 2.20.0
+ - Tokenizers 0.19.1