|
## What's the point of this? |
|
|
|
LaTeX is the de facto standard markup language for typesetting pretty equations in academic papers.
|
It is extremely feature-rich and flexible, but also very verbose.
|
This makes it great for typesetting complex equations, but not very convenient for quick note-taking on the fly. |
|
|
|
For example, here's a short equation from [this page](https://en.wikipedia.org/wiki/Quantum_electrodynamics) on Wikipedia about Quantum Electrodynamics, and the corresponding LaTeX code:
|
|
|
![Example]( https://wikimedia.org/api/rest_v1/media/math/render/svg/6faab1adbb88a567a52e55b2012e836a011a0675 ) |
|
|
|
```latex
{\displaystyle {\mathcal {L}}={\bar {\psi }}(i\gamma ^{\mu }D_{\mu }-m)\psi -{\frac {1}{4}}F_{\mu \nu }F^{\mu \nu },}
```
|
|
|
|
|
This demo is a first step in solving this problem. |
|
Eventually, you'll be able to take a quick partial screenshot from a paper, and a program built with this model will generate the corresponding LaTeX source code so that you can copy/paste it straight into your personal notes.
|
No more endless googling of obscure LaTeX syntax!
|
|
|
## How does it work? |
|
|
|
Because this problem involves looking at an image and generating valid LaTeX code, the model needs to combine Computer Vision (CV) and Natural Language Processing (NLP).
|
There are other projects that aim to solve the same problem with some very interesting models.

These generally involve some kind of "encoder" that looks at the image and extracts/encodes the information about the equation, and a "decoder" that takes that information and translates it into what is hopefully both valid and accurate LaTeX code.
|
The "encode" part can be done using classic CNN architectures commonly used for CV tasks, or newer vision transformer architectures. |
|
The "decode" part can be done with LSTMs or transformer decoders, using attention mechanism to make sure the decoder understands long range dependencies, e.g. remembering to close a bracket that was opened a long sequence away. |
|
|
|
I chose to tackle this problem with transfer learning, using an existing OCR model and fine-tuning it for this task. |
|
The biggest reason for this is computing constraints: GPU hours are expensive, so I wanted training to be reasonably fast, on the order of a couple of hours.
|
There are other benefits to this approach as well, e.g. the architecture has already been proven to be robust.
|
I chose [TrOCR](https://arxiv.org/abs/2109.10282), a model trained at Microsoft for text recognition tasks, which uses a transformer architecture for both the encoder and the decoder.
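
For reference, loading a TrOCR checkpoint and running it on an equation image with the Hugging Face `transformers` library looks roughly like this. The checkpoint name and image path are placeholders; the demo itself uses fine-tuned weights rather than the stock printed-text model.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Placeholder checkpoint: the stock printed-text TrOCR weights.
# For this demo, the fine-tuned LaTeX weights would be loaded instead.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("equation.png").convert("RGB")  # a cropped equation screenshot
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_length=100)
latex = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(latex)
```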
|
|
|
For the data, I used the `im2latex-100k` dataset, which contains roughly 100k formulas and their rendered images.
|
Some preprocessing steps were done by Harvard NLP for the [`im2markup` project](https://github.com/harvardnlp/im2markup). |
|
To limit the scope of the project and simplify the task, I restricted the training data to equations containing 100 LaTeX tokens or fewer.
|
This covers most single-line equations, including fractions, subscripts, symbols, etc., but does not cover large multi-line equations, some of which can have up to 500 LaTeX tokens.
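
As a rough sketch of that filtering step: the Harvard NLP preprocessing stores each formula as a line of whitespace-separated tokens, so the token count is just the length of the split line. The file name below is illustrative.

```python
MAX_TOKENS = 100

# Illustrative file name: the im2markup preprocessing produces a normalized
# formula list with one whitespace-tokenized formula per line.
with open("im2latex_formulas.norm.lst", encoding="utf-8") as f:
    formulas = f.read().splitlines()

kept = [(i, line) for i, line in enumerate(formulas) if len(line.split()) <= MAX_TOKENS]
print(f"Kept {len(kept)} of {len(formulas)} formulas with at most {MAX_TOKENS} tokens")
```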
|
GPU training was done on a Kaggle GPU Kernel in roughly 3 hours. |
|
You can find the full training code on my Kaggle profile [here](https://www.kaggle.com/code/younghoshin/finetuning-trocr/notebook). |
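
For a sense of what the fine-tuning involves, here is a simplified outline following the standard Hugging Face seq2seq recipe. It assumes a dataset that yields `pixel_values`/`labels` pairs (preprocessed images and tokenized LaTeX) and is not the exact notebook code.

```python
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          TrOCRProcessor, VisionEncoderDecoderModel,
                          default_data_collator)

def finetune(train_dataset, eval_dataset,
             base_checkpoint: str = "microsoft/trocr-base-printed"):
    """Fine-tune a TrOCR checkpoint on (equation image, LaTeX tokens) pairs."""
    processor = TrOCRProcessor.from_pretrained(base_checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(base_checkpoint)

    # Tell the decoder which special token ids to use during generation.
    model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
    model.config.pad_token_id = processor.tokenizer.pad_token_id

    args = Seq2SeqTrainingArguments(
        output_dir="trocr-im2latex",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        fp16=True,                    # mixed precision, assuming a CUDA GPU
        predict_with_generate=True,   # generate full sequences during evaluation
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=default_data_collator,
    )
    trainer.train()
    return model, processor
```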
|
|
|
## What's next? |
|
|
|
There are multiple improvements that I'm hoping to make to this project.
|
|
|
### More robust prediction |
|
|
|
If you've tried the examples above (randomly sampled from the test set), you may have noticed that the model's predictions aren't quite perfect: it occasionally misses, duplicates, or mixes up tokens.
|
More training on the existing data set could help with this. |
|
|
|
### More data |
|
|
|
There's a lot of LaTeX data available on the internet besides `im2latex-100k`, e.g. arXiv and Wikipedia. |
|
It's just waiting to be scraped and used for this project. |
|
This would mean a lot of hours of scraping, cleaning, and processing, but having a more diverse set of input images could improve model accuracy significantly.
|
|
|
### Faster and smaller model |
|
|
|
The model currently takes a few seconds to process a single image. |
|
I would love to improve performance so that it can run in one second or less, maybe even on mobile devices. |
|
This might not be possible with TrOCR, which is a fairly large model designed for use on GPUs.
|
|
|
|
|
<p style='text-align: center'>Made by Young Ho Shin</p> |
|
<p style='text-align: center'>
<a href='mailto:[email protected]'>Email</a> |
<a href='https://www.github.com/yhshin11'>GitHub</a> |
<a href='https://www.linkedin.com/in/young-ho-shin/'>LinkedIn</a>
</p>