---
license: mit
language:
- fr
library_name: transformers
tags:
- linformer
- medical
- RoBERTa
- pytorch
---

# Jargon-NACHOS-4096

[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder LM for French, combining the Linformer attention mechanism with the RoBERTa model architecture.

Jargon is available in several versions with different context sizes and types of pre-training corpora.

| **Model** | **Initialised from...** | **Training Data** |
|-----------|:-----------------------:|:-----------------:|
| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch | 8.5GB Web Corpus |
| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base | 5.4GB Medical Corpus |
| jargon-general-legal | jargon-general-base | 18GB Legal Corpus |
| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base | Medical+Legal Corpora |
| jargon-legal | scratch | 18GB Legal Corpus |
| [jargon-legal-4096](https://huggingface.co/PantagrueLLM/jargon-legal-4096) | scratch | 18GB Legal Corpus |
| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch | 5.4GB Medical Corpus |
| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch | 5.4GB Medical Corpus |
| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch | [NACHOS](https://drbert.univ-avignon.fr/) |
| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch | [NACHOS](https://drbert.univ-avignon.fr/) |

## Evaluation

The Jargon models were evaluated on a range of specialized downstream tasks.

## Biomedical Benchmark

Results are averaged across five runs with varying random seeds.


| | [**FrenchMedMCQA**](https://huggingface.co/datasets/qanastek/frenchmedmcqa) | [**MQC**](https://aclanthology.org/2020.lrec-1.72/) | [**CAS-POS**](https://clementdalloux.fr/?page_id=28) | [**ESSAI-POS**](https://clementdalloux.fr/?page_id=28) | [**CAS-SG**](https://aclanthology.org/W18-5614/) | [**MEDLINE**](https://huggingface.co/datasets/mnaguib/QuaeroFrenchMed) | [**EMEA**](https://huggingface.co/datasets/mnaguib/QuaeroFrenchMed) | [**E3C-NER**](https://live.european-language-grid.eu/catalogue/corpus/7618) | [**CLISTER**](https://aclanthology.org/2022.lrec-1.459/) |
|-------------------------|:-----------------------:|:-----------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| **Task Type**           | Sequence Classification | Sequence Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | Token Classification | STS |
| **Metric**              | EMR  | Accuracy | Macro-F1 | Macro-F1 | Weighted F1 | Weighted F1 | Weighted F1 | Weighted F1 | Spearman Correlation |
| jargon-general-base     | 12.9 | 76.7 | 96.6 | 96.0 | 69.4 | 81.7 | 96.5 | 91.9 | 78.0 |
| jargon-biomed           | 15.3 | 91.1 | 96.5 | 95.6 | 75.1 | 83.7 | 96.5 | 93.5 | 74.6 |
| jargon-biomed-4096      | 14.4 | 78.9 | 96.6 | 95.9 | 73.3 | 82.3 | 96.3 | 92.5 | 65.3 |
| jargon-general-biomed   | 16.1 | 69.7 | 95.1 | 95.1 | 67.8 | 78.2 | 96.6 | 91.3 | 59.7 |
| jargon-multidomain-base | 14.9 | 86.9 | 96.3 | 96.0 | 70.6 | 82.4 | 96.6 | 92.6 | 74.8 |
| jargon-NACHOS           | 13.3 | 90.7 | 96.3 | 96.2 | 75.0 | 83.4 | 96.8 | 93.1 | 70.9 |
| jargon-NACHOS-4096      | 18.4 | 93.2 | 96.2 | 95.9 | 74.9 | 83.8 | 96.8 | 93.2 | 74.9 |

For more details, see the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).

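
The **Metric** row of the table above lists EMR for FrenchMedMCQA. As a minimal sketch, assuming EMR denotes the exact match ratio (a multiple-answer question counts as correct only if the predicted answer set equals the gold set) and that scores are reported as percentages, it could be computed as shown below. The function name `exact_match_ratio` and the toy answer sets are illustrative and are not taken from the Jargon evaluation code.

```python
from typing import Iterable, Sequence


def exact_match_ratio(
    gold: Sequence[Iterable[str]], pred: Sequence[Iterable[str]]
) -> float:
    """Percentage of questions whose predicted answer set equals the gold set.

    Assumption: EMR is reported on a 0-100 scale, as the table values suggest.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of questions")
    if not gold:
        raise ValueError("at least one question is required")
    # A partially correct prediction (e.g. one of two answers) scores zero.
    matches = sum(set(g) == set(p) for g, p in zip(gold, pred))
    return 100.0 * matches / len(gold)


# Toy data (hypothetical): four questions with answer options labelled a-e.
gold_answers = [{"a"}, {"b", "c"}, {"d"}, {"a", "e"}]
pred_answers = [{"a"}, {"b"}, {"d"}, {"e", "a"}]

score = exact_match_ratio(gold_answers, pred_answers)
print(f"EMR = {score:.1f}")
```

Because partial matches earn no credit under this definition, EMR values tend to be much lower than the accuracy or F1 figures in the other columns.
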
## Using Jargon models with HuggingFace transformers

You can get started with `jargon-NACHOS-4096` using the code snippet below:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# The model relies on custom code hosted with the checkpoint, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-NACHOS-4096", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-NACHOS-4096", trust_remote_code=True)

jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# The fill-mask pipeline needs exactly one mask token in the input;
# tokenizer.mask_token avoids hard-coding its surface form.
output = jargon_maskfiller(f"Il est allé au {tokenizer.mask_token} hier")
```

You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.

- **Language(s):** French
- **License:** MIT
- **Developed by:** Vincent Segonne
- **Funded by**
  - GENCI-IDRIS (Grant 2022 A0131013801)
  - French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
  - MIAI@Grenoble Alpes ANR-19-P3IA-0003
  - PROPICTO ANR-20-CE93-0005
  - Lawbot ANR-20-CE38-0013
  - Swiss National Science Foundation (grant PROPICTO N°197864)
- **Authors**
  - Vincent Segonne
  - Aidan Mannion
  - Laura Cristina Alonzo Canul
  - Alexandre Audibert
  - Xingyu Liu
  - Cécile Macaire
  - Adrien Pupier
  - Yongxin Zhou
  - Mathilde Aguiar
  - Felix Herron
  - Magali Norré
  - Massih-Reza Amini
  - Pierrette Bouillon
  - Iris Eshkol-Taravella
  - Emmanuelle Esperança-Rodier
  - Thomas François
  - Lorraine Goeuriot
  - Jérôme Goulian
  - Mathieu Lafourcade
  - Benjamin Lecouteux
  - François Portet
  - Fabien Ringeval
  - Vincent Vandeghinste
  - Maximin Coavoux
  - Marco Dinarelli
  - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{segonne:hal-04535557,
  TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
  AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
  URL = {https://hal.science/hal-04535557},
  BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
  ADDRESS = {Turin, Italy},
  YEAR = {2024},
  MONTH = May,
  KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
  PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
  HAL_ID = {hal-04535557},
  HAL_VERSION = {v1},
}
```