What counts as "text"?
This model is awesome; I've been looking for something like it for a very long time. I just have a question so I can understand how to use it better: what did you consider "text" when you prepared the original training data? For instance, did you put bounding boxes around logical components like paragraphs or sections? Were there any specific criteria you used?
Hi @mstachow,
"Text" covers titles, section text, and other main sections that can be parsed with an OCR model later.
I excluded certain parts: page header, footer, footnotes, authors and affiliations.
The model will break long text sections into multiple smaller ones. This is by design: the idea is to keep the length of the text passed to OCR more uniform.
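If you would rather have one box per logical section, the split boxes can be stitched back together after detection. Below is a minimal sketch, not part of the model itself: it assumes a single-column page and boxes given as `(x_min, y_min, x_max, y_max)` tuples in pixels. The function name `merge_split_boxes` and the `max_gap` threshold are hypothetical and would need tuning for your documents.

```python
from typing import List, Tuple

# Assumed box format (not confirmed by the model): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def merge_split_boxes(boxes: List[Box], max_gap: float = 10.0) -> List[Box]:
    """Merge boxes that are stacked vertically with a small gap between them."""
    merged: List[Box] = []
    # Visit boxes top-to-bottom, then left-to-right.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if merged:
            px0, py0, px1, py1 = merged[-1]
            x0, y0, x1, y1 = box
            overlaps_horizontally = min(px1, x1) > max(px0, x0)
            close_vertically = y0 - py1 <= max_gap
            if overlaps_horizontally and close_vertically:
                # Grow the previous box so it also covers the current one.
                merged[-1] = (min(px0, x0), py0, max(px1, x1), max(py1, y1))
                continue
        merged.append(box)
    return merged


# Two pieces of one section (5 px apart) followed by a separate section.
boxes = [(50, 100, 500, 200), (50, 205, 500, 300), (50, 400, 500, 450)]
print(merge_split_boxes(boxes))
# -> [(50, 100, 500, 300), (50, 400, 500, 450)]
```

Each box is only compared with the most recently merged one, so multi-column layouts would need the boxes grouped by column first.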