---

# Legal BERT model applicable for Dutch and English

A BERT model further trained from [mBERT](https://huggingface.co/bert-base-multilingual-uncased) on legal documents. The thesis can be downloaded [here](https://www.ru.nl/publish/pages/769526/gerwin_de_kruijf.pdf).

## Data

The model was further trained in the same way as [EurlexBERT](https://huggingface.co/nlpaueb/bert-base-uncased-eurlex): regulations, decisions, directives, and parliamentary questions were acquired in both Dutch and English. A total of 184k documents, around 295M words, was used to further train the model. This is less than 9% of the size of the data used to train the original BERT model.

Further training was done for 60k steps, since it showed better results.

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("Gerwin/legal-bert-dutch-english")
model = AutoModel.from_pretrained("Gerwin/legal-bert-dutch-english") # PyTorch
model = TFAutoModel.from_pretrained("Gerwin/legal-bert-dutch-english")  # TensorFlow
```
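A common way to turn the loaded model's token outputs into a single sentence embedding is mean pooling over the non-padding tokens. This is a minimal sketch of that pattern, not a method prescribed by the thesis; dummy tensors stand in for the model outputs so the snippet runs without downloading the weights.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1)
    return summed / counts

# In practice these come from the model loaded above:
#   inputs = tokenizer("Een juridische zin.", return_tensors="pt")
#   outputs = model(**inputs)
#   emb = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
# Dummy tensors with BERT-base's hidden size (768) are used here instead.
hidden = torch.randn(1, 4, 768)
mask = torch.tensor([[1, 1, 1, 0]])  # last token is padding
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([1, 768])
```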
## Benchmarks
The thesis lists various benchmarks. Here are a couple of comparisons between popular BERT models and this model. The fine-tuning procedure for these benchmarks is identical for each pre-trained model and is explained in more detail in the thesis. You may be able to achieve higher scores for individual models by optimizing the fine-tuning procedure. The table shows the weighted F1 scores.
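For reference, weighted F1 averages the per-class F1 scores with weights equal to each class's share of the true labels. A minimal pure-Python sketch (the toy labels are illustrative, not thesis data; `sklearn.metrics.f1_score(..., average="weighted")` computes the same quantity):

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        support = sum(1 for t in y_true if t == c)
        score += (support / total) * f1
    return score

# Toy labels (not from the thesis): three document classes
y_true = ["reg", "reg", "dir", "dec"]
y_pred = ["reg", "dir", "dir", "dec"]
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.75
```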
### Legal topic classification
| Model | [Multi-EURLEX (NL)](https://huggingface.co/datasets/multi_eurlex) |