julien-c (HF staff) committed
Commit
30d8a53
1 Parent(s): ee7100a

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/rohanrajpal/bert-base-en-hi-codemix-cased/README.md

Files changed (1)
  1. README.md +101 -0
README.md ADDED
@@ -0,0 +1,101 @@
---
language:
- hi
- en
tags:
- hi
- en
- codemix
license: "apache-2.0"
datasets:
- SAIL 2017
metrics:
- fscore
- accuracy
- precision
- recall
---

# BERT codemixed base model for Hinglish (cased)

This model was built using [lingualytics](https://github.com/lingualytics/py-lingualytics), an open-source library that supports code-mixed analytics.

## Model description

- Input: any code-mixed Hinglish text
- Output: a sentiment label (0 - Negative, 1 - Neutral, 2 - Positive)

I took the [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) model from Hugging Face and fine-tuned it on the [SAIL 2017](http://www.dasdipankar.com/SAILCodeMixed.html) dataset.
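
For a quick end-to-end check of this input/output behaviour, the checkpoint can also be loaded through the `transformers` pipeline API. This is a minimal sketch; since the checkpoint presumably does not ship an `id2label` mapping, predictions show up as `LABEL_0`/`LABEL_1`/`LABEL_2`, corresponding to Negative/Neutral/Positive as described above. The example sentence is made up.

```python
from transformers import pipeline

# Load the hosted checkpoint as a text-classification pipeline.
classifier = pipeline("text-classification", model="rohanrajpal/bert-base-en-hi-codemix-cased")

# A made-up code-mixed Hinglish sentence.
print(classifier("yeh movie bahut achhi thi, I loved it"))
# -> e.g. [{'label': 'LABEL_2', 'score': ...}], i.e. 2 = Positive
```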

## Eval results

Performance of this model on the dataset:

| Metric     | Score    |
|------------|----------|
| acc        | 0.55873  |
| f1         | 0.558369 |
| acc_and_f1 | 0.558549 |
| precision  | 0.558075 |
| recall     | 0.55873  |
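
These scores can be recomputed from a set of predictions with scikit-learn; `acc_and_f1` is the mean of accuracy and F1 ((0.55873 + 0.558369) / 2 ≈ 0.558549). A minimal sketch, assuming `y_true` and `y_pred` hold the gold and predicted labels, and that precision, recall and F1 are weighted averages over the three classes (an assumption, not stated in the card):

```python
# Sketch of how the table above could be reproduced; weighted averaging
# over the three classes is an assumption.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="weighted")
    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,  # (0.55873 + 0.558369) / 2 ≈ 0.558549
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
    }
```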

#### How to use

Here is how to run this model for sentiment classification on a given text in *PyTorch*:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
model = AutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.logits holds one score per sentiment class
```

and in *TensorFlow*:

```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')
# pass from_pt=True if the repository only ships PyTorch weights
model = TFAutoModelForSequenceClassification.from_pretrained('rohanrajpal/bert-base-en-hi-codemix-cased')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # output.logits holds one score per sentiment class
```
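
To map the raw logits to the 0/1/2 sentiment labels from the model description, here is a minimal sketch continuing from the PyTorch example above:

```python
import torch

# Map the class index with the highest score to a human-readable label.
labels = {0: "Negative", 1: "Neutral", 2: "Positive"}

probs = torch.softmax(output.logits, dim=-1)   # shape: (1, 3)
pred_id = int(probs.argmax(dim=-1).item())     # 0, 1 or 2
print(labels[pred_id], float(probs[0, pred_id]))
```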

#### Preprocessing

I followed standard preprocessing steps:

- removed digits
- removed punctuation
- removed stopwords
- removed excess whitespace

Here's the snippet:

```python
from pathlib import Path

import pandas as pd
from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords, en_stopwords
from texthero.preprocessing import remove_digits, remove_whitespace

root = Path('<path-to-data>')

# Clean each tab-separated split (text \t label) in place.
for file in ('test', 'train', 'validation'):
    tochange = root / f'{file}.txt'
    df = pd.read_csv(tochange, header=None, sep='\t', names=['text', 'label'])
    df['text'] = df['text'].pipe(remove_digits) \
                           .pipe(remove_punctuation) \
                           .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords)) \
                           .pipe(remove_whitespace)
    df.to_csv(tochange, index=None, header=None, sep='\t')
```

## Training data

The dataset and its annotations are of limited quality, but it is the best dataset I could find for code-mixed Hinglish sentiment. I am working on procuring my own dataset and will try to come up with a better model!

## Training procedure

I fine-tuned the [bert-base-multilingual-cased model](https://huggingface.co/bert-base-multilingual-cased) on this dataset.
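
The exact training script is not included in this card. Below is a minimal sketch of how the fine-tuning could be reproduced with the `transformers` `Trainer`, assuming the preprocessed tab-separated files from the snippet above with integer labels 0/1/2; the hyperparameters shown are placeholders, not the values actually used.

```python
# A reproduction sketch, not the original training script.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3  # 0 Negative, 1 Neutral, 2 Positive
)

class SailDataset(Dataset):
    """Loads a preprocessed SAIL 2017 split (text \t label) as model inputs."""
    def __init__(self, path):
        df = pd.read_csv(path, header=None, sep="\t", names=["text", "label"])
        self.encodings = tokenizer(list(df["text"]), truncation=True, padding=True)
        self.labels = list(df["label"])  # assumed to already be integers 0/1/2

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),  # placeholder values
    train_dataset=SailDataset("<path-to-data>/train.txt"),
    eval_dataset=SailDataset("<path-to-data>/validation.txt"),
)
trainer.train()
```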