- Corpora: bigbio/cas
- Embeddings & Sequence Labelling: DrBERT-7GB
- Number of Epochs: 200
DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized ones have emerged to handle specific domains more effectively. In this paper, we propose an original study of PLMs in the medical domain for the French language. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under free license on which these models are trained.
CAS: French Corpus with Clinical Cases
| | Train | Dev | Test |
|---|---|---|---|
| Documents | 5,306 | 1,137 | 1,137 |
The ESSAIS (Dalloux et al., 2021) and CAS (Grabar et al., 2018) corpora respectively contain 13,848 and 7,580 clinical cases in French. Some clinical cases are associated with discussions. A subset of the whole set of cases is enriched with morpho-syntactic (part-of-speech (POS) tagging, lemmatization) and semantic (UMLS concepts, negation, uncertainty) annotations. In our case, we focus only on the POS tagging task.
Model Metric
| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| ABR | 0.8683 | 0.8480 | 0.8580 | 171 |
| ADJ | 0.9634 | 0.9751 | 0.9692 | 4018 |
| ADV | 0.9935 | 0.9849 | 0.9892 | 926 |
| DET:ART | 0.9982 | 0.9997 | 0.9989 | 3308 |
| DET:POS | 1.0000 | 1.0000 | 1.0000 | 133 |
| INT | 1.0000 | 0.7000 | 0.8235 | 10 |
| KON | 0.9883 | 0.9976 | 0.9929 | 845 |
| NAM | 0.9144 | 0.9353 | 0.9247 | 834 |
| NOM | 0.9827 | 0.9803 | 0.9815 | 7980 |
| NUM | 0.9825 | 0.9845 | 0.9835 | 1422 |
| PRO:DEM | 0.9924 | 1.0000 | 0.9962 | 131 |
| PRO:IND | 0.9630 | 1.0000 | 0.9811 | 78 |
| PRO:PER | 0.9948 | 0.9931 | 0.9939 | 579 |
| PRO:REL | 1.0000 | 0.9908 | 0.9954 | 109 |
| PRP | 0.9989 | 0.9982 | 0.9985 | 3785 |
| PRP:det | 1.0000 | 0.9985 | 0.9993 | 681 |
| PUN | 0.9996 | 0.9958 | 0.9977 | 2376 |
| PUN:cit | 0.9756 | 0.9524 | 0.9639 | 84 |
| SENT | 1.0000 | 0.9974 | 0.9987 | 1174 |
| SYM | 0.9495 | 1.0000 | 0.9741 | 94 |
| VER:cond | 1.0000 | 1.0000 | 1.0000 | 11 |
| VER:futu | 1.0000 | 0.9444 | 0.9714 | 18 |
| VER:impf | 1.0000 | 0.9963 | 0.9981 | 804 |
| VER:infi | 1.0000 | 0.9585 | 0.9788 | 193 |
| VER:pper | 0.9742 | 0.9564 | 0.9652 | 1261 |
| VER:ppre | 0.9617 | 0.9901 | 0.9757 | 203 |
| VER:pres | 0.9833 | 0.9904 | 0.9868 | 830 |
| VER:simp | 0.9123 | 0.7761 | 0.8387 | 67 |
| VER:subi | 1.0000 | 0.7000 | 0.8235 | 10 |
| VER:subp | 1.0000 | 0.8333 | 0.9091 | 18 |
| accuracy | | | 0.9842 | 32153 |
| macro avg | 0.9799 | 0.9492 | 0.9623 | 32153 |
| weighted avg | 0.9843 | 0.9842 | 0.9842 | 32153 |
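The gap between the macro and weighted F1 rows comes from how the per-label scores are averaged: the macro average treats every label equally, so rare labels such as INT or VER:subi (10 tokens each) pull it down, while the weighted average scales each label by its support. The following is a minimal sketch of that arithmetic, using the F1 and support values copied from the report above. The names `ROWS`, `macro_average`, and `weighted_average` are illustrative and not part of the original evaluation code, and because the inputs are rounded to four decimals the results should only match the reported averages approximately.

```python
# (label, f1-score, support) rows copied from the report above.
ROWS = [
    ("ABR", 0.8580, 171),
    ("ADJ", 0.9692, 4018),
    ("ADV", 0.9892, 926),
    ("DET:ART", 0.9989, 3308),
    ("DET:POS", 1.0000, 133),
    ("INT", 0.8235, 10),
    ("KON", 0.9929, 845),
    ("NAM", 0.9247, 834),
    ("NOM", 0.9815, 7980),
    ("NUM", 0.9835, 1422),
    ("PRO:DEM", 0.9962, 131),
    ("PRO:IND", 0.9811, 78),
    ("PRO:PER", 0.9939, 579),
    ("PRO:REL", 0.9954, 109),
    ("PRP", 0.9985, 3785),
    ("PRP:det", 0.9993, 681),
    ("PUN", 0.9977, 2376),
    ("PUN:cit", 0.9639, 84),
    ("SENT", 0.9987, 1174),
    ("SYM", 0.9741, 94),
    ("VER:cond", 1.0000, 11),
    ("VER:futu", 0.9714, 18),
    ("VER:impf", 0.9981, 804),
    ("VER:infi", 0.9788, 193),
    ("VER:pper", 0.9652, 1261),
    ("VER:ppre", 0.9757, 203),
    ("VER:pres", 0.9868, 830),
    ("VER:simp", 0.8387, 67),
    ("VER:subi", 0.8235, 10),
    ("VER:subp", 0.9091, 18),
]


def macro_average(rows):
    """Unweighted mean of the per-label scores: every label counts equally."""
    return sum(score for _, score, _ in rows) / len(rows)


def weighted_average(rows):
    """Mean of the per-label scores, each weighted by its support."""
    total = sum(support for _, _, support in rows)
    return sum(score * support for _, score, support in rows) / total


total_support = sum(support for _, _, support in ROWS)
macro_f1 = macro_average(ROWS)
weighted_f1 = weighted_average(ROWS)

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.4f}")
print(f"weighted F1:   {weighted_f1:.4f}")
```

Because frequent labels such as NOM, ADJ, and PRP dominate the token count, the weighted average stays close to the overall accuracy, whereas the macro average is a better indicator of how the model handles infrequent tags.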
Citation BibTeX
@inproceedings{labrak2023drbert,
title = {{DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains}},
author = {Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Rouvier, Mickael and Morin, Emmanuel and Daille, Béatrice and Gourraud, Pierre-Antoine},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Long Paper},
month = jul,
year = 2023,
address = {Toronto, Canada},
publisher = {Association for Computational Linguistics}
}