julien-c (HF staff) committed
Commit
30d8a53
1 Parent(s): ee7100a

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md

Files changed (1)
  1. README.md +101 -0
README.md ADDED
@@ -0,0 +1,101 @@
---
language:
- hi
- en
tags:
- hi
- en
- codemix
license: "apache-2.0"
datasets:
- SAIL 2017
metrics:
- fscore
- accuracy
- precision
- recall
---

# BERT codemixed base model for Hinglish (cased)

This model was built using [lingualytics](https://github.com/lingualytics/py-lingualytics), an open-source library that supports code-mixed analytics.

## Model description

- Input: any code-mixed Hinglish text
- Output: a sentiment label (0 - Negative, 1 - Neutral, 2 - Positive)

I took the [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) model from Hugging Face and fine-tuned it on the [SAIL 2017](http://www.dasdipankar.com/SAILCodeMixed.html) dataset.
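
For a quick end-to-end check of this input/output behaviour, the checkpoint can also be loaded through the `transformers` pipeline API. This is a minimal sketch; since the checkpoint presumably does not ship an `id2label` mapping, predictions show up as `LABEL_0`/`LABEL_1`/`LABEL_2`, corresponding to Negative/Neutral/Positive as described above. The example sentence is made up.

```python
from transformers import pipeline

# Load the hosted checkpoint as a text-classification pipeline.
classifier = pipeline("text-classification", model="rohanrajpal/bert-base-en-hi-codemix-cased")

# A made-up code-mixed Hinglish sentence.
print(classifier("yeh movie bahut achhi thi, I loved it"))
# -> e.g. [{'label': 'LABEL_2', 'score': ...}], i.e. 2 = Positive
```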

## Eval results

Performance of this model on the dataset:

| Metric     | Score    |
|------------|----------|
| acc        | 0.55873  |
| f1         | 0.558369 |
| acc_and_f1 | 0.558549 |
| precision  | 0.558075 |
| recall     | 0.55873  |
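
These scores can be recomputed from a set of predictions with scikit-learn; `acc_and_f1` is the mean of accuracy and F1 ((0.55873 + 0.558369) / 2 ≈ 0.558549). A minimal sketch, assuming `y_true` and `y_pred` hold the gold and predicted labels, and that precision, recall and F1 are weighted averages over the three classes (an assumption, not stated in the card):

```python
# Sketch of how the table above could be reproduced; weighted averaging
# over the three classes is an assumption.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="weighted")
    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,  # (0.55873 + 0.558369) / 2 ≈ 0.558549
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
    }
```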

#### How to use

Here is how to run this model for sentiment classification on a given text in *PyTorch*:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
model = AutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.logits holds one score per sentiment class
```

and in *TensorFlow*:

```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
# pass from_pt=True if the repository only ships PyTorch weights
model = TFAutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # output.logits holds one score per sentiment class
```
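
To map the raw logits to the 0/1/2 sentiment labels from the model description, here is a minimal sketch continuing from the PyTorch example above:

```python
import torch

# Map the class index with the highest score to a human-readable label.
labels = {0: "Negative", 1: "Neutral", 2: "Positive"}

probs = torch.softmax(output.logits, dim=-1)   # shape: (1, 3)
pred_id = int(probs.argmax(dim=-1).item())     # 0, 1 or 2
print(labels[pred_id], float(probs[0, pred_id]))
```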

#### Preprocessing

I followed standard preprocessing steps:

- removed digits
- removed punctuation
- removed stopwords
- removed excess whitespace

Here's the snippet:

```python
from pathlib import Path

import pandas as pd
from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords, en_stopwords
from texthero.preprocessing import remove_digits, remove_whitespace

root = Path('<path-to-data>')

# Clean each tab-separated split (text \t label) in place.
for file in ('test', 'train', 'validation'):
    tochange = root / f'{file}.txt'
    df = pd.read_csv(tochange, header=None, sep='\t', names=['text', 'label'])
    df['text'] = df['text'].pipe(remove_digits) \
                           .pipe(remove_punctuation) \
                           .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords)) \
                           .pipe(remove_whitespace)
    df.to_csv(tochange, index=None, header=None, sep='\t')
```

## Training data

The dataset and its annotations are of limited quality, but it is the best dataset I could find for code-mixed Hinglish sentiment. I am working on procuring my own dataset and will try to come up with a better model!

## Training procedure

I fine-tuned the [bert-base-multilingual-cased model](https://huggingface.co/bert-base-multilingual-cased) on this dataset.
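
The exact training script is not included in this card. Below is a minimal sketch of how the fine-tuning could be reproduced with the `transformers` `Trainer`, assuming the preprocessed tab-separated files from the snippet above with integer labels 0/1/2; the hyperparameters shown are placeholders, not the values actually used.

```python
# A reproduction sketch, not the original training script.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3  # 0 Negative, 1 Neutral, 2 Positive
)

class SailDataset(Dataset):
    """Loads a preprocessed SAIL 2017 split (text \t label) as model inputs."""
    def __init__(self, path):
        df = pd.read_csv(path, header=None, sep="\t", names=["text", "label"])
        self.encodings = tokenizer(list(df["text"]), truncation=True, padding=True)
        self.labels = list(df["label"])  # assumed to already be integers 0/1/2

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),  # placeholder values
    train_dataset=SailDataset("<path-to-data>/train.txt"),
    eval_dataset=SailDataset("<path-to-data>/validation.txt"),
)
trainer.train()
```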