HTRflow - A tool for HTR and OCR
TL;DR: Riksarkivet has released HTRflow, an open-source tool for simplifying HTR and OCR. Check out our latest models supported by HTRflow in our Model collection and learn more in the HTRflow documentation.
Introduction
The National Archives of Sweden (Riksarkivet) preserves an extensive collection of archival records, encompassing a diverse range of data from various sources, with a significant portion in handwritten form. However, these digital materials are currently just images and unlikely to foster new research and knowledge unless transcribed into searchable text, thus limiting the potential for data-driven studies and large-scale analysis.
Furthermore, the challenge today is to facilitate scalable and accessible Handwritten Text Recognition (HTR) and Optical Character Recognition (OCR) across different types of materials. At Riksarkivet, we often find ourselves reinventing the wheel to meet these goals. To address this, the AI Lab at Riksarkivet (AIRA) has released HTRflow, an open-source package designed to simplify and streamline the use of HTR and OCR for everyone.
What is HTR?
HTR is a technology that enables the automatic conversion of handwritten text in images or scanned documents into machine-readable text. Unlike traditional OCR, which is optimized for printed or typewritten text, HTR focuses on interpreting the nuances and variations of human handwriting. This technology is essential for digitizing historical documents, making them searchable and accessible for modern research and analysis.
Below you can see an example of material that has been transcribed using HTR:
HTR models
You can perform HTR using Python and the Transformer-based Optical Character Recognition (TrOCR) model from Hugging Face 🤗. Specifically, the model Riksarkivet/trocr-base-handwritten-hist-swe-2
is trained for recognizing handwritten Swedish historical texts. See the example below on how to use this model with Transformers:
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests
# Load the processor and model
processor = TrOCRProcessor.from_pretrained('Riksarkivet/trocr-base-handwritten-hist-swe-2')
model = VisionEncoderDecoderModel.from_pretrained('Riksarkivet/trocr-base-handwritten-hist-swe-2')
# Load an image containing handwritten text, e.g local image or an image URL
image_url = 'https://example.com/your_handwritten_image.jpg' # Replace with your image URL
image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
# Preprocess the image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
# Generate transcription (you can adjust parameters like max_length and num_beams)
generated_ids = model.generate(pixel_values, max_length=512)
# Decode the generated IDs to text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
Check out our collection of new models released with HTRflow v0.1.2: Models Collection.
What is HTRflow
HTRflow is an open-source tool that simplifies the process of performing HTR and OCR tasks using pre-trained models in a "pipeline/blueprint" style. Key features include:
- Flexibility: Customize the HTR/OCR process for different kinds of materials.
- Compatibility: HTRflow supports all models trained by the AI lab - and more!
- YAML pipelines: HTRflow YAML pipelines are easy to create, modify and share.
- Export: Export results as Alto XML, Page XML, plain text or JSON.
- Evaluation: Compare results from different pipelines with ground truth.
At its core, HTRflow uses a pipeline pattern that operates on Collection instances, which act as the data structure. HTRflow's data handling is based on a hierarchical tree composed of interconnected nodes. This structure mirrors the physical layout of a document with pages, paragraphs, lines, marginalia and words, see below:
Collection # <-- Document
├── Page # <-- A page from the document
│ ├── Node # <-- A paragraph from the page
│ │ ├── Node # <-- A text line with transcription
│ │ └── Node
│ └── Node
│ ├── Node
│ └── Node
│
└── Page (Parent)
├── Node (Child)
└── Node (Child)
Each node in this hierarchical tree knows its parent and its children, effectively modeling the document hierarchy (e.g. Document > Page > Paragraph > Line > Word). This parent-child relationship allows for efficient organization and processing of the document's elements.
Processing results, such as recognized text, are stored at the appropriate node level. For example, text recognized from a text line in the image is stored in that line's SegmentNode, as illustrated below:
Collection
└── Page
└── SegmentNode ── TextRec
To interact with the Collection, you can traverse through the nodes down to the leaves of the tree. Methods like traverse()
and leaves()
enable easy navigation through the tree, allowing you to apply functions to all nodes, specific levels, or leaf nodes only. Each pipeline step takes a Collection and returns an updated Collection. Essentially, you declaratively define at each step what type of model or function should run before updating the Collection instances.
Here's an illustration of how the tree structure is updated during the use of a HTR (TextRecognition) and a reading order (OrderLines) step:
# Before Processing: # After Processing:
Collection Collection
├── Page ├── Page ── ReadOrder1
| └── SegmentNode | └── SegmentNode1 ── TextRec
└── Page -> └── Page ── ReadOrder2
├── SegmentNode ├── SegmentNode21 ── TextRec
└── SegmentNode └── SegmentNode22 ── TextRec
By organizing the document into a hierarchical tree structure, HTRflow enables efficient processing and manipulation of data, ensuring that each element is handled appropriately. This structure not only preserves the document's natural layout but also allows for the results to be easily exported in various formats such as ALTO XML, PAGE XML, plain text, or JSON, facilitating further analysis and integration into other workflows.
Note that HTRflow is not currently an integrated library with Hugging Face 🤗; however, you can download models as you would with Transformers.
Using HTRflow
To use HTRflow simply install with pip:
pip install htrflow
Once HTRflow is installed, run it with:
htrflow pipeline <path/to/pipeline.yaml> <path/to/image>
The pipeline
sub-command tells HTRflow to apply the pipeline defined in pipeline.yaml
on image.jpg
. To get started, try the example pipeline in the next section.
An example pipeline
Here is an example of an HTRflow pipeline:
steps:
- step: Segmentation
settings:
model: yolo
model_settings:
model: Riksarkivet/yolov9-lines-within-regions-1
- step: TextRecognition
settings:
model: TrOCR
model_settings:
model: Riksarkivet/trocr-base-handwritten-hist-swe-2
- step: OrderLines
- step: Export
settings:
format: txt
dest: outputs
This pipeline consists of four steps:
- Segmentation: Segments the image into lines using a YOLO model.
- TextRecognition: Transcribes the segmented lines using TrOCR.
- OrderLines: Orders the lines according to the reading order.
- Export: Exports the result as a text file to a directory called
outputs
.
To run the demo pipeline on your selected image, paste the pipeline content into an empty text file and save it as pipeline.yaml. Assuming the input image is called image.jpg, run HTRflow with:
htrflow pipeline pipeline.yaml image.jpg
Here are the segmentation results:
And here is the transcribed text:
Monmouth den 29 1882.
Platskade Syster emot Svåger
Hå godt är min önskan
Jag får återigen göra försöket
att sända eder bref, jag har
förut skrifvitt men ej erhallett
någott wår från eder var. varför
jag tager det för troligt att
brefven icke har gått fram.
jag har erinu den stora gåfvan
att hafva en god helsa intill
skrifvande dag, och önskligt
voro att dessa rader trefar
eder vid samma goda gofva.
och jag får önska eder lycka
på det nya åratt samt god
fortsättning på detsamma.
And that's how you make an HTRflow pipeline! 🎉
For more details about HTRflow, visit the HTRflow documentation. To learn more about our models or dataset, check out our Hugging Face organization page.