license: apache-2.0
datasets:
  - HuggingFaceM4/DocumentVQA
language:
  - en
library_name: transformers


Model Card for Florence-2-FT-DocVQA

This model card describes Florence-2-FT-DocVQA, a Florence-2 model fine-tuned for Document Visual Question Answering (DocVQA).

Model Details

Model Description

Developed by: Mayank Chaudhary
Model type: Vision-language model, loaded through AutoModelForCausalLM
Language(s) (NLP): English
License: apache-2.0
Finetuned from model: Florence-2-base-ft

Given a document image and a natural-language question about it, Florence-2-FT-DocVQA generates a text answer.

Model Sources

Uses

The model can be further fine-tuned for specific Document VQA tasks or integrated into applications requiring automated document question answering.
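As a rough illustration of the data preparation that further fine-tuning would need, the sketch below turns one dataset record into a (prompt, target) text pair. The helper name `build_training_pair` is hypothetical, the field names `question` and `answers` are assumed from the DocumentVQA dataset, and the default prefix simply mirrors the task prompt used in this card's example; the prefix used when this checkpoint was trained is not stated here, so treat it as a parameter to set yourself.

```python
def build_training_pair(example, task_prefix="DocVQA"):
    """Return (prompt, target) strings for one record.

    Assumes `example` has a "question" string and a non-empty
    "answers" list; both field names are assumptions.
    """
    answers = example.get("answers") or []
    if not answers:
        raise ValueError("record has no reference answer")
    # Same prompt construction as the inference example: prefix + question.
    prompt = task_prefix + example["question"]
    # Use the first reference answer as the training target.
    return prompt, answers[0]


record = {"question": "What is the date?", "answers": ["1 May 1990", "May 1990"]}
print(build_training_pair(record))
```

The resulting strings would still need to go through the processor and tokenizer, together with the image, before they can be fed to the model.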

Requirements

  • datasets
  • transformers
  • torch
  • Pillow
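A quick way to confirm that the packages listed above are installed is to look each one up without importing it. This is a minimal sketch, and the helper name `missing_requirements` is made up for illustration. Note that Pillow is imported under the module name `PIL`.

```python
from importlib.util import find_spec

# Package name as listed above -> module name it is imported under.
REQUIRED = {
    "datasets": "datasets",
    "transformers": "transformers",
    "torch": "torch",
    "Pillow": "PIL",
}


def missing_requirements(required=REQUIRED):
    """Return the listed package names whose module cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


missing = missing_requirements()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All requirements found.")
```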

How to Get Started with the Model

The following code loads the model and its processor, then asks a question about the first three training images of the DocumentVQA dataset:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

MODEL_ID = "mynkchaudhry/Florence-2-FT-DocVQA"

# Use the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# trust_remote_code=True is an addition to the original snippet: Florence-2
# checkpoints usually ship custom modeling code that must be explicitly allowed.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

data = load_dataset("HuggingFaceM4/DocumentVQA")

def run_example(task_prompt, text_input, image):
    """Answer `text_input` about `image`, using `task_prompt` as the prompt prefix."""
    prompt = task_prompt + text_input
    # Convert grayscale or palette images to RGB before preprocessing.
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Special tokens are kept so that post_process_generation sees the raw output.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed_answer

# Ask the same question about the first three training images.
for idx in range(3):
    print(run_example("DocVQA", 'What do you see in this image?', data['train'][idx]['image']))
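The value printed above is whatever `post_process_generation` returns. For plain-text tasks this appears to be a dict keyed by the task prompt, but that is an assumption about the Florence-2 processor rather than something this card states. The sketch below uses a hypothetical helper, `extract_answer`, to pull out the answer string and falls back gracefully if the shape differs.

```python
def extract_answer(parsed, task_prompt):
    """Return the answer text from a run_example result.

    Assumes `parsed` is either a plain string or a dict that maps the
    task prompt to the generated text; anything else is stringified.
    """
    if isinstance(parsed, dict):
        parsed = parsed.get(task_prompt, "")
    # Drop leading and trailing whitespace left over from decoding.
    return str(parsed).strip()


print(extract_answer({"DocVQA": " 1 May 1990 "}, "DocVQA"))
```

With the code above, it could be used as `extract_answer(run_example("DocVQA", question, image), "DocVQA")`.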