Use Molmo vision encoder for classification.

#20

by shafeeq007 - opened 4 days ago

I want to use Molmo's vision encoder to encode images and train a classification head on top of the embeddings. I have a few questions.

How can I encode images in batches, given that the processor creates multiple random crops of each input image depending on its resolution?
What is the best way to combine or pool the embeddings of the crops of a single image before passing them to the classification head?
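
To make the second question concrete, here is a minimal sketch of one option I am considering: pad the per-image crop embeddings to a common length, keep a validity mask, and take a masked mean over the crops. It assumes each crop has already been reduced to a single vector (for example by averaging its patch tokens). It does not call Molmo at all; `pad_crops` and `masked_mean_pool` are hypothetical helper names, and the random arrays only stand in for real encoder outputs. It is written in NumPy, but the same indexing should carry over to PyTorch tensors.

```python
import numpy as np


def pad_crops(per_image_crops):
    """Stack variable-length crop sets into one padded batch plus a mask.

    per_image_crops: list of arrays, each shaped (n_crops_i, dim).
    Returns (batch, mask) shaped (B, max_crops, dim) and (B, max_crops).
    """
    max_crops = max(c.shape[0] for c in per_image_crops)
    dim = per_image_crops[0].shape[1]
    batch = np.zeros((len(per_image_crops), max_crops, dim), dtype=np.float32)
    mask = np.zeros((len(per_image_crops), max_crops), dtype=bool)
    for i, crops in enumerate(per_image_crops):
        n = crops.shape[0]
        batch[i, :n] = crops   # real crop embeddings
        mask[i, :n] = True     # remaining slots stay padded and masked out
    return batch, mask


def masked_mean_pool(batch, mask):
    """Average crop embeddings per image, ignoring padded slots."""
    weights = mask[..., None].astype(batch.dtype)   # (B, max_crops, 1)
    summed = (batch * weights).sum(axis=1)          # (B, dim)
    counts = np.maximum(weights.sum(axis=1), 1.0)   # avoid division by zero
    return summed / counts


# Stand-in data (assumption): two images with 3 and 7 crops, 8-dim embeddings.
rng = np.random.default_rng(0)
crops_a = rng.normal(size=(3, 8)).astype(np.float32)
crops_b = rng.normal(size=(7, 8)).astype(np.float32)

batch, mask = pad_crops([crops_a, crops_b])
pooled = masked_mean_pool(batch, mask)  # one vector per image, shape (2, 8)
```

The pooled array has one row per image, so it could be fed directly to a linear classification head. Mean pooling is only one choice here; max pooling or a small attention layer over the crops would use the same padded batch and mask.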
