metadata

license: mit
datasets:
  - EDGEwww25/EDGE-Dataset
  - liuhaotian/LLaVA-Instruct-150K
  - echo840/Monkey_Data
language:
  - en
base_model:
  - echo840/Monkey-Chat

This is the model repository for the paper "EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data".

The model is fine-tuned from Monkey. To speed up training, we also made a few minor modifications:

- Instead of using the LoRA adapters in Monkey, the five patches of the raw image are stacked along an extra batch dimension and sent through the image encoder together in a single pass.
- Inside the image encoder, we use flash attention instead of the manually implemented attention.
- Image reading is separated from the forward pass and moved into dataset preprocessing, so that the PyTorch DataLoader can speed up image loading.
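The first modification can be illustrated with a minimal sketch. It is not the code used in EDGE: `ToyEncoder` and `encode_patches` are hypothetical names, and the tiny convolutional encoder only stands in for the real image encoder. The sketch assumes the patches arrive as a tensor of shape `(B, P, 3, H, W)` with `P = 5`, folds the patch dimension into the batch dimension, and runs the encoder once instead of once per patch.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the image encoder (not the EDGE/Monkey encoder)."""

    def __init__(self, dim: int = 8, patch: int = 14):
        super().__init__()
        # Non-overlapping patch embedding, as in a ViT-style encoder.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, H, W) -> (N, tokens, dim)
        x = self.proj(x)
        return x.flatten(2).transpose(1, 2)


def encode_patches(encoder: nn.Module, patches: torch.Tensor) -> torch.Tensor:
    """Encode all patches of every image in one encoder call.

    patches: (B, P, 3, H, W), where P is the number of patches per image.
    returns: (B, P, tokens, dim)
    """
    b, p = patches.shape[:2]
    flat = patches.flatten(0, 1)   # (B * P, 3, H, W): patches become batch entries
    feats = encoder(flat)          # single forward pass over all patches
    return feats.reshape(b, p, *feats.shape[1:])


if __name__ == "__main__":
    encoder = ToyEncoder()
    patches = torch.randn(2, 5, 3, 28, 28)  # 2 images, 5 patches each
    print(encode_patches(encoder, patches).shape)
```

Because every patch is processed independently by the encoder, the batched call should give the same features as looping over the five patches, while issuing one larger forward pass.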

The training dataset (i.e. all training QAs in .jsonl format, excluding images) is published in the EDGE-Dataset repository.
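Since the training QAs are distributed as .jsonl files, a small reader like the one below may be enough to inspect them. This is only a sketch: `read_jsonl` is a hypothetical helper, and it makes no assumption about the field names inside each record, which should be checked against the dataset itself.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                # Report the offending line so a damaged file is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({e.msg})") from e
```

For example, `for record in read_jsonl("train.jsonl"): print(record.keys())` lists the fields of each record; the file name `train.jsonl` here is a placeholder, not a file known to exist in the repository.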

The model training and inference scripts are published in the anonymous repository EDGE.