bino282 committed on
Commit
9d3a6fe
2 Parent(s): 7d105aa 3f1d124

Merge branch 'main' of https://huggingface.co/NlpHUST/vibert4news-base-cased into main

Files changed (3)
  1. README.md +67 -12
  2. config.json +3 -0
  3. tokenizer_config.json +3 -0
README.md CHANGED
@@ -8,10 +8,6 @@ Apply for task sentiment analysis on using [AIViVN's comments dataset](https://w
8
  The model achieved 0.90268 on the public leaderboard, (winner's score is 0.90087)
9
  Bert4news is used for a toolkit Vietnames(segmentation and Named Entity Recognition) at ViNLPtoolkit(https://github.com/bino282/ViNLP)
10
 
11
- ***************New Mar 11 , 2020 ***************
12
-
13
- **[BERT](https://github.com/google-research/bert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
14
-
15
  We use word sentencepiece, use basic bert tokenization and same config with bert base with lowercase = False.
16
 
17
  You can download trained model:
@@ -21,9 +17,9 @@ You can download trained model:
21
  Use with huggingface/transformers
22
  ``` bash
23
  import torch
24
- from transformers import AutoTokenizer,AutoModel
25
- tokenizer= AutoTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
26
- bert_model = AutoModel.from_pretrained("NlpHUST/vibert4news-base-cased")
27
 
28
  line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
29
  input_id = tokenizer.encode(line,add_special_tokens = True)
@@ -37,15 +33,74 @@ print(features)
37
 
38
  ```
39
 
40
  Run training with base config
41
 
42
  ``` bash
43
 
44
- python train_pytorch.py \
45
- --model_path=bert4news.pytorch \
46
- --max_len=200 \
47
- --batch_size=16 \
48
- --epochs=6 \
49
  --lr=2e-5
50
 
51
  ```
 
8
  The model achieved 0.90268 on the public leaderboard, (winner's score is 0.90087)
9
  Bert4news is used for a toolkit Vietnames(segmentation and Named Entity Recognition) at ViNLPtoolkit(https://github.com/bino282/ViNLP)
10
 
 
 
 
 
11
  We use word sentencepiece, use basic bert tokenization and same config with bert base with lowercase = False.
12
 
13
  You can download trained model:
 
17
  Use with huggingface/transformers
18
  ``` bash
19
  import torch
20
+ from transformers import BertTokenizer,BertModel
21
+ tokenizer= BertTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
22
+ bert_model = BertModel.from_pretrained("NlpHUST/vibert4news-base-cased")
23
 
24
  line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
25
  input_id = tokenizer.encode(line,add_special_tokens = True)
 
33
 
34
  ```
35
 
36
+ # Vietnamese toolkit with bert
37
+ ViNLP is a system annotation for Vietnamese, it use pretrain [Bert4news](https://github.com/bino282/bert4news/) to fine-turning to NLP problems in Vietnamese components of wordsegmentation,Named entity recognition (NER) and achieve high accuravy.
38
+
39
+ ### Installation
40
+ ```bash
41
+ git clone https://github.com/bino282/ViNLP.git
42
+ cd ViNLP
43
+ python setup.py develop build
44
+ ```
45
+
46
+ ### Test Segmentation
47
+ The model achieved F1 score : 0.984 on VLSP 2013 dataset
48
+
49
+ |Model | F1 |
50
+ |--------|-----------|
51
+ | **BertVnTokenizer** | 98.40 |
52
+ | **DongDu** | 96.90 |
53
+ | **JvnSegmenter-Maxent** | 97.00 |
54
+ | **JvnSegmenter-CRFs** | 97.06 |
55
+ | **VnTokenizer** | 97.33 |
56
+ | **UETSegmenter** | 97.87 |
57
+ | **VnTokenizer** | 97.33 |
58
+ | **VnCoreNLP (i.e. RDRsegmenter)** | 97.90 |
59
+
60
+
61
+ ``` bash
62
+ from ViNLP import BertVnTokenizer
63
+ tokenizer = BertVnTokenizer()
64
+ sentences = tokenizer.split(["Tổng thống Donald Trump ký sắc lệnh cấm mọi giao dịch của Mỹ với ByteDance và Tecent - chủ sở hữu của 2 ứng dụng phổ biến TikTok và WeChat sau 45 ngày nữa."])
65
+ print(sentences[0])
66
+ ```
67
+ ``` bash
68
+ Tổng_thống Donald_Trump ký sắc_lệnh cấm mọi giao_dịch của Mỹ với ByteDance và Tecent - chủ_sở_hữu của 2 ứng_dụng phổ_biến TikTok và WeChat sau 45 ngày nữa .
69
+
70
+ ```
71
+
72
+ ### Test Named Entity Recognition
73
+ The model achieved F1 score VLSP 2018 for all named entities including nested entities : 0.786
74
+
75
+ |Model | F1 |
76
+ |--------|-----------|
77
+ | **BertVnNer** | 78.60 |
78
+ | **VNER Attentive Neural Network** | 77.52 |
79
+ | **vietner CRF (ngrams + word shapes + cluster + w2v)** | 76.63 |
80
+ | **ZA-NER BiLSTM** | 74.70 |
81
+
82
+ ``` bash
83
+ from ViNLP import BertVnNer
84
+ bert_ner_model = BertVnNer()
85
+ sentence = "Theo SCMP, báo cáo của CSIS với tên gọi Định hình Tương lai Chính sách của Mỹ với Trung Quốc cũng cho thấy sự ủng hộ tương đối rộng rãi của các chuyên gia về việc cấm Huawei, tập đoàn viễn thông khổng lồ của Trung Quốc"
86
+ entities = bert_ner_model.annotate([sentence])
87
+ print(entities)
88
+
89
+ ```
90
+ ``` bash
91
+ [{'ORGANIZATION': ['SCMP', 'CSIS', 'Huawei'], 'LOCATION': ['Mỹ', 'Trung Quốc']}]
92
+
93
+ ```
94
+
95
  Run training with base config
96
 
97
  ``` bash
98
 
99
+ python train_pytorch.py \\\\
100
+ --model_path=bert4news.pytorch \\\\
101
+ --max_len=200 \\\\
102
+ --batch_size=16 \\\\
103
+ --epochs=6 \\\\
104
  --lr=2e-5
105
 
106
  ```
config.json CHANGED
@@ -1,4 +1,7 @@
1
  {
 
 
 
2
  "attention_probs_dropout_prob": 0.1,
3
  "directionality": "bidi",
4
  "hidden_act": "gelu",
 
1
  {
2
+ "architectures": [
3
+ "BertForMaskedLM"
4
+ ],
5
  "attention_probs_dropout_prob": 0.1,
6
  "directionality": "bidi",
7
  "hidden_act": "gelu",
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "do_lower_case": false
3
+ }
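
Reviewer note on `tokenizer_config.json`: the added `"do_lower_case": false` matters for a cased Vietnamese model because BERT-style basic tokenization, when lowercasing is enabled, by default also strips combining accents, which removes Vietnamese tone and vowel marks. The sketch below only approximates that normalization step with the Python standard library; `lowercase_and_strip_accents` is a hypothetical helper written for illustration and is not the transformers implementation.

```python
import unicodedata


def lowercase_and_strip_accents(text: str) -> str:
    """Approximate the normalization applied when do_lower_case is true.

    Hypothetical helper for illustration only, not the transformers code.
    """
    # Lowercase first, then decompose each character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (Unicode category "Mn"), which carry Vietnamese tones and vowel marks.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Example sentence taken from the README in this commit.
line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
print(lowercase_and_strip_accents(line))
```

With lowercasing enabled, distinct words such as "là" and "la" would collapse to the same string before WordPiece lookup, so a vocabulary built from cased, accented text would no longer match the input. Setting `do_lower_case` to `false` in the tokenizer config avoids this without requiring each caller to pass the option explicitly.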