---
library_name: transformers
license: mit
datasets:
- SpursgoZmy/MMTab
- apoidea/pubtabnet-html
language:
- en
base_model: google/pix2struct-base
pipeline_tag: image-to-text
---

# pix2struct-base-table2html

*Turn table images into HTML!*

## Demo app

Try the [demo app](https://huggingface.co/spaces/KennethTM/Table2html-table-detection-and-recognition), which contains both table detection and recognition!

## About

This model takes an image of a table and outputs HTML: it parses the image, performing both optical character recognition (OCR) and table structure recognition, and serializes the result as HTML.

The model expects an image containing only a table. If the table is embedded in a document, first use a table detection model to crop it out, e.g. [Microsoft's Table Transformer model](https://huggingface.co/microsoft/table-transformer-detection); a sketch of this step is included at the end of this page.

The model is finetuned from the [Pix2Struct base model](https://huggingface.co/google/pix2struct-base) using `max_patches=1024` and a maximum generation length of 1024 tokens. The `max_patches` value should likely not be changed for inference, but the generation length (`max_new_tokens`) can be adjusted.

The model has been trained using two datasets: [MMTab](https://huggingface.co/datasets/SpursgoZmy/MMTab) and [PubTabNet](https://huggingface.co/datasets/apoidea/pubtabnet-html).

## Usage

Below is a complete example of loading the model and performing inference on an example table image (example from the [MMTab dataset](https://huggingface.co/datasets/SpursgoZmy/MMTab)):

```python
import torch
from transformers import AutoProcessor, Pix2StructForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("KennethTM/pix2struct-base-table2html")
model = Pix2StructForConditionalGeneration.from_pretrained("KennethTM/pix2struct-base-table2html")
model.to(device)
model.eval()

# Load example image from URL
url = "https://huggingface.co/KennethTM/pix2struct-base-table2html/resolve/main/example_recog_1.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# Preprocess the image into flattened patches
encoding = processor(image, return_tensors="pt", max_patches=1024)

# Run model inference
with torch.inference_mode():
    flattened_patches = encoding.pop("flattened_patches").to(device)
    attention_mask = encoding.pop("attention_mask").to(device)
    predictions = model.generate(flattened_patches=flattened_patches, attention_mask=attention_mask, max_new_tokens=1024)

# Decode the generated token ids to an HTML string
predictions_decoded = processor.tokenizer.batch_decode(predictions, skip_special_tokens=True)

# Show predictions as text
print(predictions_decoded[0])
```

Example image:

![](https://huggingface.co/KennethTM/pix2struct-base-table2html/resolve/main/example_recog_1.jpg)

Model HTML output for example image:

```html
<table>
  <thead>
    <tr><th>Rank</th><th>Lane</th><th>Name</th><th>Nationality</th><th>Time</th><th>Notes</th></tr>
  </thead>
  <tbody>
    <tr><td></td><td>4</td><td>Michael Phelps</td><td>United States</td><td>51.25</td><td>OR</td></tr>
    <tr><td></td><td>3</td><td>Ian Crocker</td><td>United States</td><td>51.29</td><td></td></tr>
    <tr><td></td><td>5</td><td>Andriy Serdinov</td><td>Ukraine</td><td>51.36</td><td>EU</td></tr>
    <tr><td>4</td><td>1</td><td>Thomas Rupprath</td><td>Germany</td><td>52.27</td><td></td></tr>
    <tr><td>5</td><td>6</td><td>Igor Marchenko</td><td>Russia</td><td>52.32</td><td></td></tr>
    <tr><td>6</td><td>2</td><td>Gabriel Mangabeira</td><td>Brazil</td><td>52.34</td><td></td></tr>
    <tr><td>7</td><td>8</td><td>Duje Draganja</td><td>Croatia</td><td>52.46</td><td></td></tr>
    <tr><td>8</td><td>7</td><td>Geoff Huegill</td><td>Australia</td><td>52.56</td><td></td></tr>
  </tbody>
</table>
```

And the rendered HTML table:
<table>
  <thead>
    <tr><th>Rank</th><th>Lane</th><th>Name</th><th>Nationality</th><th>Time</th><th>Notes</th></tr>
  </thead>
  <tbody>
    <tr><td></td><td>4</td><td>Michael Phelps</td><td>United States</td><td>51.25</td><td>OR</td></tr>
    <tr><td></td><td>3</td><td>Ian Crocker</td><td>United States</td><td>51.29</td><td></td></tr>
    <tr><td></td><td>5</td><td>Andriy Serdinov</td><td>Ukraine</td><td>51.36</td><td>EU</td></tr>
    <tr><td>4</td><td>1</td><td>Thomas Rupprath</td><td>Germany</td><td>52.27</td><td></td></tr>
    <tr><td>5</td><td>6</td><td>Igor Marchenko</td><td>Russia</td><td>52.32</td><td></td></tr>
    <tr><td>6</td><td>2</td><td>Gabriel Mangabeira</td><td>Brazil</td><td>52.34</td><td></td></tr>
    <tr><td>7</td><td>8</td><td>Duje Draganja</td><td>Croatia</td><td>52.46</td><td></td></tr>
    <tr><td>8</td><td>7</td><td>Geoff Huegill</td><td>Australia</td><td>52.56</td><td></td></tr>
  </tbody>
</table>
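
Since the model emits plain HTML, the output can be handed to standard tooling downstream. Below is a minimal sketch of turning the output into a DataFrame, assuming pandas with an HTML parser backend (lxml or html5lib) is installed; `html_string` stands in for `predictions_decoded[0]` from the usage example above:

```python
from io import StringIO

import pandas as pd

# html_string stands in for predictions_decoded[0] from the usage example above
html_string = "<table><tr><th>Rank</th><th>Name</th></tr><tr><td>1</td><td>Michael Phelps</td></tr></table>"

# pandas.read_html returns one DataFrame per <table> found in the input;
# wrapping the string in StringIO avoids the deprecated literal-string path
df = pd.read_html(StringIO(html_string))[0]
print(df)
```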
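
Finally, the sketch referenced in the About section: cropping tables out of a full document image before recognition. This is a minimal example assuming Microsoft's Table Transformer detection checkpoint; the `crop_tables` helper, the 0.7 threshold, and the 10 px padding are illustrative choices, not values recommended by this model card:

```python
import torch
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image

# A sketch of the detection step, separate from this recognition model.
detector_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
detector = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
detector.eval()

def crop_tables(image: Image.Image, threshold: float = 0.7, padding: int = 10) -> list[Image.Image]:
    """Detect tables in a document image and return one cropped image per table."""
    inputs = detector_processor(images=image, return_tensors="pt")
    with torch.inference_mode():
        outputs = detector(**inputs)
    # target_sizes expects (height, width); PIL reports size as (width, height)
    target_sizes = torch.tensor([image.size[::-1]])
    results = detector_processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    crops = []
    for box in results["boxes"]:
        x0, y0, x1, y1 = box.tolist()
        # Pad the detected box slightly and clamp it to the image bounds
        crops.append(image.crop((
            max(0, x0 - padding),
            max(0, y0 - padding),
            min(image.width, x1 + padding),
            min(image.height, y1 + padding),
        )))
    return crops
```

Each cropped image can then be passed to the recognition model exactly as in the usage example above.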