---
language: fr
license: mit
datasets:
- Jean-Baptiste/wikiner_fr
widget:
- text: "Boulanger, habitant à Boulanger et travaillant dans le magasin Boulanger situé dans la ville de Boulanger."
---
DistilCamemBERT-NER
==================
We present DistilCamemBERT-NER, which is [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) fine-tuned for the NER (Named Entity Recognition) task in French. The work is inspired by [Jean-Baptiste/camembert-ner](https://huggingface.co/Jean-Baptiste/camembert-ner), which is based on the [CamemBERT](https://huggingface.co/camembert-base) model. The drawback of CamemBERT-based models appears at scaling time, for example in the production phase, where inference cost can become a technological issue. To counteract this effect, we propose this model, which **divides the inference time by 2** at the same power consumption thanks to [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base).
Dataset
----------
The dataset used is [wikiner_fr](https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr), which contains ~170k sentences labeled with 5 categories:
* PER: personality;
* LOC: location;
* ORG: organization;
* MISC: miscellaneous entities;
* O: background (other).
Evaluation results
------------------------
| class | precision (%) | recall (%) | f1 (%) | support (#sub-word) |
| :----: | :---------: | :-----------: | :-----: | :-----------------: |
| global | 98.35 | 98.36 | 98.35 | 492'243 |
| PER | 96.22 | 97.41 | 96.81 | 27'842 |
| LOC | 93.93 | 93.50 | 93.72 | 31'431 |
| ORG | 85.13 | 87.08 | 86.10 | 7'662 |
| MISC | 88.55 | 81.84 | 85.06 | 13'553 |
| O | 99.40 | 99.55 | 99.47 | 411'755 |
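
As a quick sanity check on the table, the F1 column can be recomputed from the precision and recall columns as their harmonic mean. The sketch below hard-codes the rounded values from the table, so discrepancies of about 0.01 are expected. The observation that the `global` row is consistent with a support-weighted average of the per-class rows is an inference from the numbers, not something the evaluation procedure states.

```python
# Per-class scores copied from the table above:
# (precision %, recall %, f1 %, support in sub-words).
scores = {
    "PER": (96.22, 97.41, 96.81, 27842),
    "LOC": (93.93, 93.50, 93.72, 31431),
    "ORG": (85.13, 87.08, 86.10, 7662),
    "MISC": (88.55, 81.84, 85.06, 13553),
    "O": (99.40, 99.55, 99.47, 411755),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recompute F1 per class from the rounded precision and recall.
recomputed_f1 = {label: f1_score(p, r) for label, (p, r, _, _) in scores.items()}

# Support-weighted averages (assumption: this is how the "global" row was obtained).
total_support = sum(s for _, _, _, s in scores.values())
weighted_precision = sum(p * s for p, _, _, s in scores.values()) / total_support
weighted_recall = sum(r * s for _, r, _, s in scores.values()) / total_support

for label, value in recomputed_f1.items():
    print(f"{label}: reported F1 = {scores[label][2]:.2f}, recomputed = {value:.2f}")
print(f"weighted precision = {weighted_precision:.2f}")
print(f"weighted recall = {weighted_recall:.2f}")
```
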
How to use DistilCamemBERT-NER
------------------------------------------------
```python
from transformers import pipeline

# Load the fine-tuned model and its tokenizer. The "simple" aggregation
# strategy merges sub-word tokens into whole entities.
ner = pipeline(
    'ner',
    model="cmarkea/distilcamembert-base-ner",
    tokenizer="cmarkea/distilcamembert-base-ner",
    aggregation_strategy="simple",
)
result = ner("Le Crédit Mutuel Arkéa est une banque Française, elle comprend le CMB qui est une banque située en Bretagne et le CMSO qui est une banque qui se situe principalement en Aquitaine. C'est sous la présidence de Louis Lichou, dans les années 1980 que différentes filiales sont créées au sein du CMB et forment les principales filiales du groupe qui existent encore aujourd'hui (Federal Finance, Suravenir, Financo, etc.).")
result
# Value of `result`, as displayed by a notebook or REPL:
[{'entity_group': 'ORG',
'score': 0.99327177,
'word': 'Crédit Mutuel Arkéa',
'start': 3,
'end': 22},
{'entity_group': 'LOC',
'score': 0.5869117,
'word': 'Française',
'start': 38,
'end': 47},
{'entity_group': 'ORG',
'score': 0.9728106,
'word': 'CMB',
'start': 66,
'end': 69},
{'entity_group': 'LOC',
'score': 0.9974824,
'word': 'Bretagne',
'start': 99,
'end': 107},
{'entity_group': 'ORG',
'score': 0.956406,
'word': 'CMSO',
'start': 114,
'end': 118},
{'entity_group': 'LOC',
'score': 0.99741644,
'word': 'Aquitaine',
'start': 169,
'end': 178},
{'entity_group': 'PER',
'score': 0.9988959,
'word': 'Louis Lichou',
'start': 208,
'end': 220},
{'entity_group': 'ORG',
'score': 0.93090177,
'word': 'CMB',
'start': 291,
'end': 294},
{'entity_group': 'ORG',
'score': 0.9965743,
'word': 'Federal Finance',
'start': 374,
'end': 389},
{'entity_group': 'ORG',
'score': 0.99655724,
'word': 'Suravenir',
'start': 391,
'end': 400},
{'entity_group': 'ORG',
'score': 0.99653435,
'word': 'Financo',
'start': 402,
'end': 409}]
```
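
The pipeline returns a list of dictionaries, one per detected entity, so downstream code typically filters and groups them. The sketch below works on a hard-coded excerpt of the output shown above instead of calling the model. The helper name `group_entities` and the `min_score` threshold of 0.9 are illustrative assumptions, not recommendations from the model authors.

```python
from collections import defaultdict

# Excerpt of the pipeline output shown above (hard-coded so the sketch runs offline).
result = [
    {"entity_group": "ORG", "score": 0.99327177, "word": "Crédit Mutuel Arkéa", "start": 3, "end": 22},
    {"entity_group": "LOC", "score": 0.5869117, "word": "Française", "start": 38, "end": 47},
    {"entity_group": "LOC", "score": 0.9974824, "word": "Bretagne", "start": 99, "end": 107},
    {"entity_group": "PER", "score": 0.9988959, "word": "Louis Lichou", "start": 208, "end": 220},
]


def group_entities(entities, min_score=0.9):
    """Group entity strings by label, dropping low-confidence predictions.

    `min_score` is a hypothetical threshold chosen for illustration.
    """
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


entities_by_label = group_entities(result)
print(entities_by_label)
```

With this threshold, the low-confidence `LOC` prediction for "Française" (score of about 0.59) is dropped while the other entities in the excerpt are kept.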