tarekziade/test-push

Tags: vision-encoder-decoder, image-text-to-text, image-captioning

distilvit

This model is a work in progress. It is a fine-tuned version of the following base models:

A ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
A distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
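
As a rough illustration of how two such checkpoints can be joined into one captioning model, here is a minimal sketch using the `VisionEncoderDecoderModel` class from Transformers. The function name `build_base_captioner` and the token settings are assumptions made for this example, not necessarily what the training code does. The combined model gets freshly initialized cross-attention layers, so it produces useful captions only after fine-tuning.

```python
ENCODER_ID = "google/vit-base-patch16-224-in21k"
DECODER_ID = "distilbert/distilgpt2"


def build_base_captioner():
    """Join the two base checkpoints into a single encoder-decoder model.

    Hypothetical helper for illustration; calling it downloads both checkpoints.
    """
    # Imported lazily so that importing this module stays cheap.
    from transformers import (
        AutoTokenizer,
        ViTImageProcessor,
        VisionEncoderDecoderModel,
    )

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        ENCODER_ID, DECODER_ID
    )
    image_processor = ViTImageProcessor.from_pretrained(ENCODER_ID)
    tokenizer = AutoTokenizer.from_pretrained(DECODER_ID)

    # GPT-2 style tokenizers ship without a padding token. Reusing the
    # end-of-sequence token is a common workaround (an assumption here).
    tokenizer.pad_token = tokenizer.eos_token
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, image_processor, tokenizer
```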

This model was trained on:

Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
COCO 2017: https://cocodataset.org

That checkpoint is available at commit 3083a3cef6e3c8dd90df3f088074bbe836b0f403.
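
One way to fetch the model at a specific commit is the `revision` argument accepted by the Transformers `pipeline` function. The sketch below assumes the repository id `tarekziade/test-push` and the `image-to-text` pipeline task; the helper names are hypothetical.

```python
import re

MODEL_ID = "tarekziade/test-push"  # assumed repository id
FIRST_STAGE_COMMIT = "3083a3cef6e3c8dd90df3f088074bbe836b0f403"


def is_full_commit_hash(revision):
    """Return True if `revision` looks like a full 40-character git SHA-1."""
    return re.fullmatch(r"[0-9a-f]{40}", revision) is not None


def load_pinned_captioner(commit):
    """Load the captioning pipeline pinned to one exact commit.

    Hypothetical helper: it rejects branch names and short hashes so that
    the loaded weights cannot change between runs.
    """
    if not is_full_commit_hash(commit):
        raise ValueError(f"expected a full commit hash, got {commit!r}")

    # Imported lazily: building the pipeline downloads the model files.
    from transformers import pipeline

    return pipeline("image-to-text", model=MODEL_ID, revision=commit)
```

Calling `load_pinned_captioner(FIRST_STAGE_COMMIT)` returns a pipeline that takes an image path or URL and returns a list of dictionaries, each with a `generated_text` entry.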

It was then further fine-tuned on:

Flickr30k debiased: https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions
DocOrNot: https://huggingface.co/datasets/Mozilla/docornot

You can find the code used to create the model here: https://github.com/mozilla/distilvit

Framework versions

Transformers 4.40.2
PyTorch 2.3.0+cu121
Datasets 2.19.1
Tokenizers 0.19.1



Evaluation results

Self-reported on nlphuji/flickr30k:

ROUGE-1: 43.006
ROUGE-2: 16.994
ROUGE-L: 38.892
ROUGE-LSUM: 38.888
loss: 0.199
gen_len: 11.327
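
For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated caption and a reference caption. The sketch below is a simplified pure-Python version using lowercasing and whitespace tokenization. It is not the implementation that produced the scores above, which is not stated here, and the assumption that those scores are on a 0 to 100 scale is also unconfirmed.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Library implementations typically add stemming and more careful tokenization, so their numbers can differ from this sketch.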
