Pierre-Carl Langlais

Pclanglais

Dorialexander

AI & ML interests

Open data & open LLMs

Recent Activity

updated a dataset about 12 hours ago

PleIAs/post-ocr

Articles

Releasing the largest multilingual open pretraining dataset

8 days ago

• 94

The case for specialized pre-training: ultra-fast foundation models for dedicated tasks

Aug 4

• 26

Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing

Jul 19

• 17

Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM

Apr 26

• 15

Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data

Apr 18

• 22

Releasing Common Corpus: the largest public domain dataset for training LLMs

Mar 20

• 17

Posts 6

Post

2492

Today we are releasing our first foundation model and experimenting with a new category: specialized pre-training.

OCRonos-Vintage is a 124M-parameter model trained end-to-end by PleIAs with llm.c on 18 billion tokens from cultural heritage archives. Despite its small size, it achieves near state-of-the-art results for OCR correction of historical English sources. OCRonos-Vintage is also a historical model with an unusual cut-off date: December 29th, 1955…

We look forward to replicating this approach very soon on other "hard" tasks commonly associated with generalist LLMs/SLMs: RAG, function calling, summarization, document segmentation…

OCRonos-Vintage: PleIAs/OCRonos-Vintage
CPU Demo: PleIAs/OCRonos-Vintage-CPU
GPU Demo: PleIAs/OCRonos-Vintage-GPU
Our announcement and call for specialized pre-training: https://huggingface.co/blog/Pclanglais/specialized-pre-training
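
For readers who want to try the model outside the demos, here is a minimal sketch using the `transformers` library. It rests on several assumptions that should be checked against the model card: that the checkpoint loads through the generic `AutoTokenizer` / `AutoModelForCausalLM` classes, and that the model expects the `### Text ###` / `### Correction ###` prompt template shown below. The helper names `build_prompt` and `correct_ocr` are hypothetical and not part of any PleIAs API.

```python
def build_prompt(ocr_text: str) -> str:
    """Wrap noisy OCR text in the prompt template (an assumption, see the model card)."""
    return f"### Text ###\n{ocr_text.strip()}\n\n\n### Correction ###\n"


def correct_ocr(
    ocr_text: str,
    model_id: str = "PleIAs/OCRonos-Vintage",
    max_new_tokens: int = 256,
) -> str:
    """Generate a corrected version of `ocr_text` with greedy decoding."""
    # Imported lazily so build_prompt stays usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(build_prompt(ocr_text), return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens and keep only the newly generated correction.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling `correct_ocr("Teh qnick brown fox")` downloads the weights on first use. At 124M parameters the model should be small enough to run on CPU, which is in line with the CPU demo listed above.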

Post

1076

Since it is release season, we at PleIAs are announcing our first suite of specialized language models for document processing tasks (OCR correction, text segmentation, bibliographic extraction) and releasing the largest multimodal dataset of financial documents, Finance Commons: https://huggingface.co/blog/Pclanglais/finance-commons-bad-data-toolbox

LLM research is currently focused on quality data. We went in the opposite direction and deliberately trained models on bad data. Far from degrading the models, this made them more resilient to the text sources commonly used in production.

Having a wider range of real-life data proved critical for this project. A few months after the release of Common Corpus, we expanded our pool of "training data commons" with a major multimodal resource: documents released as open financial data. Finance Commons comprises 17 billion tokens and 1.25 PDF corporate documents released by the SEC, WTO, AMF and EU Tenders, in multiple languages, with a large variety of document layouts and challenging sources for training more robust models.

With compute support from Hugging Face, we are releasing an entire pipeline to process bad data sources and make them usable in production for LLMOps or simply for retrieval: PleIAs/PleIAs-Editor

This approach is based on our new series of specialized models for document processing, the "bad data toolbox" comprising:
* OCRonos, the best available model to date for OCR correction. PleIAs/OCRonos
* Segmentext, a purely semantic small model for text segmentation, working without any visual reference. PleIAs/Segmentext
* BibTexer, a small model for bibliographic data extraction acting as a "reversed Zotero". PleIAs/BibTexer
