
Segmentext is a specialized language model for text segmentation. It has been trained to be resilient to broken and unstructured text, including digitization artifacts and poorly recognized layouts.

In contrast with most text-segmentation approaches, Segmentext is based on token classification. Editorial structures are reconstructed from the raw text alone, without any reference to the original layout.
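To make the token-classification idea concrete, here is a minimal sketch, not the PleIAs implementation: each token receives one label, and consecutive tokens sharing a label are merged into a segment. The tokens, the labels assigned to them, and the helper name `merge_labeled_tokens` are illustrative assumptions.

```python
from itertools import groupby


def merge_labeled_tokens(tokens, labels):
    """Merge runs of consecutive tokens that share a label into segments.

    Hypothetical helper for illustration only: it shows how per-token
    labels can be turned into document segments without layout information.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    segments = []
    # groupby yields one group per run of identical consecutive labels.
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        segments.append((label, " ".join(token for token, _ in run)))
    return segments


# Toy input: the labels are hand-written here, not model output.
tokens = ["ANNUAL", "REPORT", "Revenue", "grew", "this", "year", "."]
labels = ["Title", "Title", "Text", "Text", "Text", "Text", "Text"]
segments = merge_labeled_tokens(tokens, labels)
```

Because the segments come only from label runs over the token sequence, the same logic applies whether the text came from OCR, a PDF extraction, or plain input.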

Segmentext was trained using HPC resources from GENCI–IDRIS on Ad Astra with 3,500 examples of manually annotated texts, mostly drawn from three large-scale datasets collected by PleIAs: Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts) and the Science Pile (scientific publications under open licenses, to be released).

Given the diversity of the training data, Segmentext should work correctly on diverse document formats in the main European languages.

Segmentext can be tested on PleIAs-Bad-Data-Editor, a free demo, along with OCRonos, another model trained by PleIAs to correct OCR errors and other digitization artifacts.

Use

Segmentext supports the following segment types:

  • Text
  • Separator - a segment boundary, generally a newline (represented as ¶), with some variation depending on how the model interprets the segmentation.
  • Title
  • Table
  • Dialog - any kind of speaker-attributed turn.
  • Bibliography - statement of a specific bibliographic reference, either in a bibliography section or a footnote.
  • Contact - personal information, can be especially useful in the context of PII removal.
  • Paratext - any non-meaningful text included in standard documents, such as headers, page numbers, repeated section titles, etc.
  • Author - author names and signatures.
  • Date - statement of date and time, common in letters and newspaper articles.
  • Keyword - list of keywords, especially common in scientific publications.

Example
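The following is a minimal sketch of how the model might be called. It assumes that the checkpoint `PleIAs/Segmentext` loads through the standard `transformers` token-classification pipeline and that its label names match the list above; neither point is confirmed by this card. `run_segmentext` and `keep_segments` are hypothetical helper names.

```python
def run_segmentext(text, model_id="PleIAs/Segmentext"):
    """Return aggregated predictions for `text`.

    Assumption: the checkpoint is compatible with the generic pipeline.
    Requires the `transformers` package and a model download.
    """
    from transformers import pipeline  # imported lazily, only when called

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge adjacent tokens with the same label
    )
    return tagger(text)


def keep_segments(text, predictions, drop=("Paratext", "Contact")):
    """Slice `text` by predicted character offsets, skipping unwanted labels.

    `predictions` is a list of dicts with `entity_group`, `start` and `end`
    keys, as produced by an aggregated token-classification pipeline.
    The label names in `drop` are assumed to match the model's labels.
    """
    kept = []
    for pred in predictions:
        label = pred["entity_group"]
        if label in drop:
            continue
        kept.append((label, text[pred["start"]:pred["end"]]))
    return kept
```

Under these assumptions, `keep_segments(text, run_segmentext(text))` would return the labeled spans with paratext and contact details filtered out, in line with the PII-removal use mentioned for the Contact label.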

Model size: 278M params (Safetensors, tensor type F32)