Gerwin committed on
Commit c6b99ba
1 Parent(s): 0b80012

final updates

Files changed (1)
1. README.md +3 -3
README.md CHANGED
@@ -13,7 +13,7 @@ metrics:
---

# Legal BERT model applicable for Dutch and English
- A BERT model further trained from [mBERT](https://huggingface.co/bert-base-multilingual-uncased) on legal documents. The thesis can be downloaded [here](https://www.ru.nl/publish/pages/769526/gerwin_de_kruijf.pdf)
+ A BERT model further trained from [mBERT](https://huggingface.co/bert-base-multilingual-uncased) on legal documents. The thesis can be downloaded [here](https://www.ru.nl/publish/pages/769526/gerwin_de_kruijf.pdf).

## Data
The model is further trained in the same way as [EurlexBERT](https://huggingface.co/nlpaueb/bert-base-uncased-eurlex): regulations, decisions, directives, and parliamentary questions were acquired in both Dutch and English. A total of 184k documents, around 295M words, was used to further train the model. This is less than 9% of the size of the original BERT pre-training corpus.
@@ -24,11 +24,11 @@ Further training was done for 60k steps, since it showed better results compared
from transformers import AutoTokenizer, AutoModel, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("Gerwin/legal-bert-dutch-english")
model = AutoModel.from_pretrained("Gerwin/legal-bert-dutch-english") # PyTorch
- model = TFAutoModel.from_pretrained("Gerwin/legal-bert-dutch-english") # Tensorflow
+ model = TFAutoModel.from_pretrained("Gerwin/legal-bert-dutch-english") # TensorFlow
```

## Benchmarks
- The thesis lists various benchmarks. Here are a couple of comparisons between popular BERT models and this model. The fine-tuning procedures for these benchmarks are identical for each pre-trained model, and are more explained in the thesis. You may be able to achieve higher scores for individual models by optimizing fine-tuning procedures. The table shows the weighted F1-scores.
+ The thesis lists various benchmarks. Here are a couple of comparisons between popular BERT models and this model. The fine-tuning procedures for these benchmarks are identical for each pre-trained model and are explained in more detail in the thesis. You may be able to achieve higher scores for individual models by optimizing the fine-tuning procedure. The table shows the weighted F1 scores.

### Legal topic classification
| Model | [Multi-EURLEX (NL)](https://huggingface.co/datasets/multi_eurlex) |
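For reference, "weighted F1" here means the per-class F1 scores averaged with class-support weights, so frequent classes contribute proportionally more. A minimal sketch of the computation, assuming scikit-learn (the evaluation library is not specified above) and made-up topic labels:

```python
# Minimal sketch of a weighted F1 score, assuming scikit-learn.
# The topic labels below are made up for illustration.
from sklearn.metrics import f1_score

y_true = ["trade", "tax", "trade", "health", "tax"]   # hypothetical gold labels
y_pred = ["trade", "tax", "health", "health", "tax"]  # hypothetical predictions

# average="weighted" computes per-class F1, then averages the scores
# weighted by each class's number of true instances (its support).
print(f1_score(y_true, y_pred, average="weighted"))
```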
 
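The usage snippet in the diff above only loads the tokenizer and model. A minimal sketch of a forward pass with the PyTorch variant, using a made-up Dutch sentence:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Gerwin/legal-bert-dutch-english")
model = AutoModel.from_pretrained("Gerwin/legal-bert-dutch-english")  # PyTorch

# Encode a (made-up) Dutch legal sentence and run it through the encoder.
inputs = tokenizer("De verordening is van toepassing op alle lidstaten.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```

These token-level embeddings are what a downstream classification head would consume when fine-tuning the model, as in the benchmarks above.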
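The Data section describes continuing pre-training from mBERT for 60k steps on the legal corpus. The exact setup is in the thesis; the sketch below only illustrates the general shape of such a masked-language-modeling continuation with the transformers Trainer API. The corpus file name, batch size, and sequence length are placeholders, not the thesis settings:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-uncased")

# "legal_corpus.txt" is a placeholder for the Dutch/English legal documents.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Standard BERT-style masking: 15% of tokens are masked for prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="legal-bert-dutch-english",
                         max_steps=60_000,               # 60k further steps
                         per_device_train_batch_size=8)  # placeholder value

Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```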