README.md · oliverguhr/fullstop-dutch-sonar-punctuation-prediction at main

metadata

language:
  - nl
tags:
  - punctuation prediction
  - punctuation
datasets: sonar
license: mit
widget:
  - text: Ondanks dat het nu bijna voorjaar is hebben we nog steds best koude dagen
    example_title: Dutch Sample
metrics:
  - f1

This model predicts the punctuation of Dutch texts. We developed it to restore the punctuation of transcribed spoken language.

This model was trained on the SoNaR Dataset.

The model restores the following punctuation markers: "." "," "?" "-" ":"

Sample Code

We provide a simple python package that allows you to process text of any length.

Install

To get started install the package from pypi:

pip install deepmultilingualpunctuation

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-dutch-sonar-punctuation-prediction")
text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
result = model.restore_punctuation(text)
print(result)

output

hervatting van de zitting. ik verklaar de zitting van het europees parlement, die op vrijdag 17 december werd onderbroken, te zijn hervat.

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-dutch-sonar-punctuation-prediction")
text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
clean_text = model.preprocess(text)
labled_words = model.predict(clean_text)
print(labled_words)

output

[['hervatting', '0', 0.99998724], ['van', '0', 0.9999784], ['de', '0', 0.99991274], ['zitting', '.', 0.6771242], ['ik', '0', 0.9999466], ['verklaar', '0', 0.9998566], ['de', '0', 0.9999783], ['zitting', '0', 0.9999809], ['van', '0', 0.99996245], ['het', '0', 0.99997795], ['europees', '0', 0.9999783], ['parlement', ',', 0.9908242], ['die', '0', 0.999985], ['op', '0', 0.99998224], ['vrijdag', '0', 0.9999831], ['17', '0', 0.99997985], ['december', '0', 0.9999827], ['werd', '0', 0.999982], ['onderbroken', ',', 0.9951485], ['te', '0', 0.9999677], ['zijn', '0', 0.99997723], ['hervat', '.', 0.9957053]]

Results

The performance differs for the single punctuation markers as hyphens and colons, in many cases, are optional and can be substituted by either a comma or a full stop. The model achieves the following F1 scores:

Label	F1 Score
0	0.985816
.	0.854380
?	0.684060
,	0.719308
:	0.696088
-	0.722000
macro average	0.776942
micro average	0.963427

Languages

Models

Languages	Model
English, Italian, French and German	oliverguhr/fullstop-punctuation-multilang-large
English, Italian, French, German and Dutch	oliverguhr/fullstop-punctuation-multilingual-sonar-base
Dutch	oliverguhr/fullstop-dutch-sonar-punctuation-prediction

Community Models

Languages	Model
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portugese, Slovak, Slovenian	kredor/punctuate-all
Catalan	softcatala/fullstop-catalan-punctuation-prediction

You can use different models by setting the model parameter:

model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")

How to cite us

@misc{https://doi.org/10.48550/arxiv.2301.03319,
  doi = {10.48550/ARXIV.2301.03319},
  url = {https://arxiv.org/abs/2301.03319},
  author = {Vandeghinste, Vincent and Guhr, Oliver},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7},
  title = {FullStop:Punctuation and Segmentation Prediction for Dutch with Transformers},
  publisher = {arXiv},
  year = {2023},  
  copyright = {Creative Commons Attribution Share Alike 4.0 International}
}