metadata
license: apache-2.0
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
library_name: transformers
Model Card for Florence-2-FT-DocVQA
This model card describes Florence-2-FT-DocVQA, a Florence-2 model fine-tuned for Document Visual Question Answering (DocVQA).
Model Details
Model Description
Developed by: Mayank Chaudhary
Model type: Vision-language model (Florence-2), loaded with AutoModelForCausalLM
Language(s) (NLP): English
License: apache-2.0
Finetuned from model: Florence-2-base-ft
The Florence-2-FT-DocVQA model takes a document image and a natural-language question and generates a text answer, enabling automated question answering over document images.
Model Sources
- Repository: GitHub - FineTuning-VLMs
- Paper: arXiv:2311.06242
Uses
The model can be further fine-tuned for specific Document VQA tasks or integrated into applications requiring automated document question answering.
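For further fine-tuning, each dataset example has to be turned into a prompt and a target answer before it reaches the processor. The sketch below shows one way this could look. It is not the author's training code: the helper names `build_training_pair` and `make_collate_fn` are hypothetical, the field names `question`, `answers`, and `image` are assumed to match the DocumentVQA dataset, and the `"DocVQA"` prefix simply mirrors the quick-start call in this card and should be replaced with whatever prefix the checkpoint was actually trained with.

```python
from typing import Any, Callable, Dict, List, Tuple

# Assumption: same prefix string as the quick-start example in this card.
TASK_PROMPT = "DocVQA"


def build_training_pair(
    example: Dict[str, Any], task_prompt: str = TASK_PROMPT
) -> Tuple[str, str]:
    """Return (prompt, target) for one example.

    Assumes the example has a 'question' string and an 'answers' list;
    the first listed answer is used as the training target.
    """
    answers = example.get("answers") or [""]
    return task_prompt + example["question"], answers[0]


def make_collate_fn(processor: Any, device: Any) -> Callable[[List[Dict[str, Any]]], Tuple[Any, List[str]]]:
    """Build a collate function for a torch DataLoader (hypothetical helper)."""

    def collate_fn(batch: List[Dict[str, Any]]) -> Tuple[Any, List[str]]:
        pairs = [build_training_pair(example) for example in batch]
        prompts = [prompt for prompt, _ in pairs]
        answers = [answer for _, answer in pairs]
        # The processor expects RGB images, as in the inference example.
        images = [example["image"].convert("RGB") for example in batch]
        inputs = processor(
            text=prompts, images=images, return_tensors="pt", padding=True
        ).to(device)
        return inputs, answers

    return collate_fn
```

The returned function can be passed as `collate_fn` to a `torch.utils.data.DataLoader`; tokenizing the answers into labels and running the training loop are left out of this sketch.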
Requirements
- datasets
- transformers
- torch
- Pillow
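Before running the example code, it can help to confirm that the packages listed above are installed. The following is a minimal sketch using only the standard library; `missing_requirements` is a hypothetical helper name, and the check looks up distribution names (so `Pillow`, not its import name `PIL`).

```python
from importlib import metadata
from typing import Iterable, List

# Distribution names taken from the requirements list in this card.
REQUIRED = ["datasets", "transformers", "torch", "Pillow"]


def missing_requirements(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the names of packages that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_requirements()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
```

Any package reported as missing can then be installed with pip.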
How to Get Started with the Model
The following code loads the model and processor, then runs a sample question against the first three images of the DocumentVQA training split:
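from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and its processor from the Hugging Face Hub.
# Note: Florence-2 checkpoints that ship custom modeling code may also need
# trust_remote_code=True in both from_pretrained calls.
model = AutoModelForCausalLM.from_pretrained("mynkchaudhry/Florence-2-FT-DocVQA").to(device)
processor = AutoProcessor.from_pretrained("mynkchaudhry/Florence-2-FT-DocVQA")

# Download the DocumentVQA dataset used for the sample images.
data = load_dataset("HuggingFaceM4/DocumentVQA")

def run_example(task_prompt, text_input, image):
    # The prompt is the task prefix followed by the question.
    prompt = task_prompt + text_input

    # The processor expects RGB input.
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    # Keep special tokens so that post-processing can parse the raw output.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

# Ask the same question about the first three training images.
for idx in range(3):
    print(run_example("DocVQA", 'What do you see in this image?', data['train'][idx]['image']))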
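The value returned by `run_example` is whatever `processor.post_process_generation` produces. The sketch below assumes this is a dictionary keyed by the task string (for example `{"DocVQA": "..."}`), which should be checked against the actual output. Under that assumption, a small wrapper can return just the answer text. The helper names `extract_answer` and `ask_document` are hypothetical and not part of the model's API.

```python
from typing import Any, Callable


def extract_answer(parsed: Any, task_prompt: str = "DocVQA") -> str:
    """Pull the answer text out of a post-processed result.

    Assumes a dict keyed by the task string; any other value is
    converted to a string unchanged apart from whitespace trimming.
    """
    if isinstance(parsed, dict):
        return str(parsed.get(task_prompt, "")).strip()
    return str(parsed).strip()


def ask_document(
    run_example_fn: Callable[[str, str, Any], Any],
    question: str,
    image: Any,
    task_prompt: str = "DocVQA",
) -> str:
    """Run one question through run_example_fn and return plain answer text."""
    parsed = run_example_fn(task_prompt, question, image)
    return extract_answer(parsed, task_prompt)
```

With the function from the example above, a call would look like `ask_document(run_example, "What is the invoice date?", image)`, where `image` is a PIL image of the document and the question is only an illustration.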