metadata

license: apache-2.0
datasets:
  - HuggingFaceM4/DocumentVQA
language:
  - en
library_name: transformers

Model Card for Florence-2-FT-DocVQA

This model card describes Florence-2-FT-DocVQA, a Florence-2 model fine-tuned for Document Visual Question Answering (DocVQA).

Model Details

Model Description

Developed by: Mayank Chaudhary
Model type: Vision-language model (Florence-2), loaded with AutoModelForCausalLM
Language(s) (NLP): English
License: apache-2.0
Finetuned from model: Florence-2-base-ft

Given a document image and a natural-language question, the model generates a text answer.

Model Sources

Repository: GitHub - FineTuning-VLMs
Paper: arXiv:2311.06242

Uses

The model can be further fine-tuned for specific Document VQA tasks or integrated into applications requiring automated document question answering.
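
As a minimal sketch of preparing data for further fine-tuning, the helper below turns one dataset record into a (prompt, target) text pair. The `<DocVQA>` task prefix, the helper name `build_training_pair`, and the `question` / `answers` field names are assumptions made for illustration and are not confirmed by this card; check them against your own dataset and training setup.

```python
def build_training_pair(example, task_prefix="<DocVQA>"):
    """Return (prompt, target) strings for one VQA record.

    Assumes `example` is a mapping with a "question" string and an
    "answers" list of acceptable answer strings (hypothetical schema).
    """
    question = example["question"].strip()
    answers = example.get("answers") or []
    if not answers:
        raise ValueError("example has no reference answer")
    # Use the first reference answer as the generation target.
    return task_prefix + question, answers[0].strip()


sample = {
    "question": "What is the invoice date?",
    "answers": ["12 March 1998", "March 12, 1998"],
}
prompt, target = build_training_pair(sample)
print(prompt)
print(target)
```

Training on only the first reference answer keeps the target unambiguous; the other references are still useful at evaluation time.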

Requirements

datasets
transformers
torch
Pillow
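
A quick way to confirm that these packages are installed is sketched below. The helper name `missing_packages` and the `REQUIRED` mapping are illustrative, not part of any library; the only subtlety is that Pillow is imported as `PIL`.

```python
import importlib.util

# Map each pip package name to the module name it is imported as.
# Pillow is the one mismatch: it installs the `PIL` module.
REQUIRED = {
    "datasets": "datasets",
    "transformers": "transformers",
    "torch": "torch",
    "Pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All requirements are available.")
```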

How to Get Started with the Model

The following code loads the model and processor, then asks a question about each of the first three images in the DocumentVQA training split:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
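# Florence-2 checkpoints typically rely on custom modeling code hosted on the Hub,
# so trust_remote_code=True is usually needed. Review that code before enabling it.
model = AutoModelForCausalLM.from_pretrained("mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True)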

data = load_dataset("HuggingFaceM4/DocumentVQA")

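def run_example(task_prompt, text_input, image):
    # The prompt is the task tag followed directly by the question text.
    prompt = task_prompt + text_input
    # Convert non-RGB images (e.g. grayscale scans) to RGB before preprocessing.
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    # Beam search with 3 beams; generation stops at EOS or after 1024 new tokens.
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    # Keep special tokens so the processor's post-processing can handle them itself.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer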

for idx in range(3):
    print(run_example("DocVQA", 'What do you see in this image?', data['train'][idx]['image']))
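
To check an output against the dataset's reference answers, something like the following could be used. It is a rough sketch: the helper names are hypothetical, it assumes the post-processed output is either a dict keyed by the task prompt or a plain string, and normalized exact match is a deliberately simple stand-in for a proper DocVQA metric.

```python
def extract_answer(parsed, task_prompt):
    """Pull the answer string out of the post-processed output.

    Assumes `parsed` is a dict keyed by the task prompt, or a plain string.
    """
    if isinstance(parsed, dict):
        parsed = parsed.get(task_prompt, "")
    return str(parsed).strip()


def normalize(text):
    # Lowercase and collapse whitespace so trivial differences do not count.
    return " ".join(text.lower().split())


def is_correct(parsed, task_prompt, reference_answers):
    """True if the prediction equals any reference answer after normalization."""
    prediction = normalize(extract_answer(parsed, task_prompt))
    return any(prediction == normalize(ref) for ref in reference_answers)


print(is_correct({"DocVQA": " ITC Limited "}, "DocVQA", ["itc limited", "ITC Ltd."]))
```

With the code above, a call might look like `is_correct(run_example("DocVQA", ex["question"], ex["image"]), "DocVQA", ex["answers"])`, assuming each record `ex` exposes `question` and `answers` fields.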