What's the point of this?
LaTeX is the de facto standard markup language for typesetting pretty equations in academic papers. It is extremely feature-rich and flexible, but very verbose. This makes it great for typesetting complex equations, but not very convenient for quick note-taking on the fly.
For example, here's a short equation from this page on Wikipedia about Quantum Electrodynamics and the corresponding LaTeX code:
\mathcal{L} = \bar{\psi}(i\gamma^\mu D_\mu - m)\psi - \frac{1}{4}F_{\mu\nu}F^{\mu\nu}
This demo is a first step in solving this problem. Eventually, you'll be able to take a quick partial screenshot from a paper and a program built with this model will generate its corresponding LaTeX source code so that you can just copy/paste straight into your personal notes. No more endless googling obscure LaTeX syntax!
How does it work?
Because this problem involves looking at an image and generating valid LaTeX code, the model needs to combine Computer Vision (CV) and Natural Language Processing (NLP). There are other projects that aim to solve the same problem with some very interesting models. These generally involve some kind of "encoder" that looks at the image and extracts/encodes the information about the equation, and a "decoder" that translates that information into what is hopefully both valid and accurate LaTeX code. The encoding step can be done with classic CNN architectures commonly used for CV tasks, or with newer vision transformer architectures. The decoding step can be done with LSTMs or transformer decoders, using attention mechanisms to make sure the decoder captures long-range dependencies, e.g. remembering to close a bracket that was opened many tokens earlier.
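To make the decoder loop concrete, here is a toy sketch in Python. The model here is a stand-in stub, not a real network; the point is only the shape of the loop: the decoder conditions on the image features plus every token generated so far, one token at a time, until an end-of-sequence token.

```python
def greedy_decode(score_next_token, image_features, max_len=100):
    """Generate tokens one at a time until the end-of-sequence token.

    `score_next_token` stands in for a trained decoder: it sees the image
    features AND the full history of generated tokens, which is how a real
    model can remember to close a bracket opened many tokens earlier.
    """
    tokens = ["<s>"]  # start-of-sequence token
    for _ in range(max_len):
        next_tok = score_next_token(image_features, tokens)
        if next_tok == "</s>":  # end-of-sequence: stop generating
            break
        tokens.append(next_tok)
    return tokens[1:]  # drop the start token

# A fake "model" that just replays a fixed answer, to show the loop shape:
answer = r"\frac { 1 } { 2 }".split() + ["</s>"]
fake_model = lambda feats, toks: answer[len(toks) - 1]
print(" ".join(greedy_decode(fake_model, image_features=None)))
# prints: \frac { 1 } { 2 }
```

A real decoder would replace `fake_model` with a network that scores the whole vocabulary at each step; greedy decoding then picks the highest-scoring token.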
I chose to tackle this problem with transfer learning, taking an existing OCR model and fine-tuning it for this task. The biggest reason for this is compute constraints: GPU hours are expensive, so I wanted training to be reasonably fast, on the order of a couple of hours. This approach has other benefits too, e.g. the architecture is already proven to be robust. I chose TrOCR, a model trained at Microsoft for text recognition tasks, which uses a transformer architecture for both the encoder and the decoder.
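For reference, running TrOCR through the Hugging Face `transformers` library looks roughly like this. Note that `microsoft/trocr-base-stage1` is the public base checkpoint and `equation.png` is a placeholder path; the demo itself uses weights fine-tuned on im2latex-100k, which are not referenced here.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

def image_to_latex(image_path, processor, model):
    """Run the encoder-decoder on one equation image and decode the tokens."""
    image = Image.open(image_path).convert("RGB")
    # The processor resizes/normalizes the image into the encoder's input format
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # The decoder generates token ids autoregressively from the image features
    generated_ids = model.generate(pixel_values, max_length=100)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

if __name__ == "__main__":
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")
    print(image_to_latex("equation.png", processor, model))
```

Fine-tuning keeps this same encoder-decoder pair and continues training it on (equation image, LaTeX token sequence) pairs instead of plain text.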
For the data, I used the im2latex-100k dataset, which includes a total of roughly 100k formulas and images. Some preprocessing steps were done by Harvard NLP for the im2markup project.
To limit the scope of the project and simplify the task, I restricted the training data to equations containing 100 LaTeX tokens or fewer. This covers most single-line equations, including fractions, subscripts, symbols, etc., but not large multi-line equations, some of which can run to 500 LaTeX tokens.
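The length filter itself is simple. Assuming the Harvard NLP preprocessing, which stores each formula as a space-separated token string (an assumption about the data format, not shown in this post), counting tokens is just a split:

```python
def short_enough(formula, max_tokens=100):
    """Keep only formulas with at most `max_tokens` LaTeX tokens.

    Assumes the formula string is already space-tokenized, as in the
    Harvard NLP preprocessing of im2latex-100k.
    """
    return len(formula.split()) <= max_tokens

formulas = [
    r"\frac { 1 } { 4 } F _ { \mu \nu } F ^ { \mu \nu }",  # 19 tokens: kept
    " ".join(["x"] * 500),                                  # 500 tokens: dropped
]
kept = [f for f in formulas if short_enough(f)]
print(len(kept))  # prints: 1
```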
GPU training was done on a Kaggle GPU Kernel in roughly 3 hours.
You can find the full training code on my Kaggle profile here.
What's next?
There are several improvements that I'm hoping to make to this project.
More robust prediction
If you've tried the examples above (randomly sampled from the test set), you may have noticed that the model's predictions aren't quite perfect: it occasionally misses, duplicates, or confuses tokens. More training on the existing dataset could help with this.
More data
There's a lot of LaTeX data available on the internet besides im2latex-100k, e.g. on arXiv and Wikipedia, just waiting to be scraped and used for this project.
That would mean many hours of scraping, cleaning, and processing, but a more diverse set of input images could improve model accuracy significantly.
Faster and smaller model
The model currently takes a few seconds to process a single image. I would love to improve performance so that it can run in one second or less, maybe even on mobile devices. This may not be possible with TrOCR, which is a fairly large model designed to run on GPUs.
Made by Young Ho Shin