Use Molmo vision encoder for classification.

#20
by shafeeq007 - opened

I want to use Molmo's vision encoder to encode images and train a classification head on top of the embeddings. I have a few questions.

  1. How can I encode images in a batch, given that the processor creates multiple random crops of each input image and the number of crops depends on the image resolution?
  2. What is the best way to combine or pool the embeddings of the crops of a single image before passing them to the classification head?
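
To make the two questions concrete, here is a minimal sketch of one possible approach, not a confirmed Molmo recipe. It assumes each image has already been encoded into one embedding per crop, giving an array of shape (n_crops, dim) where n_crops varies per image. The helper names `pad_crops` and `masked_mean_pool` are hypothetical, and NumPy stands in for whatever tensor library is actually used. The idea is to pad every image to the same number of crops so they can be batched, keep a mask of the real crops, and average only over those.

```python
import numpy as np


def pad_crops(crop_embeddings):
    """Pad a list of (n_i, dim) arrays to (batch, max_n, dim) plus a (batch, max_n) mask.

    Hypothetical helper: the per-crop embeddings are assumed to come from
    the vision encoder already; no Molmo API is called here.
    """
    batch = len(crop_embeddings)
    max_crops = max(e.shape[0] for e in crop_embeddings)
    dim = crop_embeddings[0].shape[1]
    padded = np.zeros((batch, max_crops, dim), dtype=np.float32)
    mask = np.zeros((batch, max_crops), dtype=bool)
    for i, e in enumerate(crop_embeddings):
        padded[i, : e.shape[0]] = e   # real crops first, zero padding after
        mask[i, : e.shape[0]] = True  # True marks a real crop
    return padded, mask


def masked_mean_pool(padded, mask):
    """Average crop embeddings per image, ignoring padded positions."""
    weights = mask[..., None].astype(padded.dtype)   # (batch, max_n, 1)
    summed = (padded * weights).sum(axis=1)          # (batch, dim)
    counts = np.maximum(weights.sum(axis=1), 1.0)    # avoid division by zero
    return summed / counts


# Two images with a different number of crops (2 and 3), embedding dim 4.
rng = np.random.default_rng(0)
image_a = rng.normal(size=(2, 4)).astype(np.float32)
image_b = rng.normal(size=(3, 4)).astype(np.float32)

padded, mask = pad_crops([image_a, image_b])
pooled = masked_mean_pool(padded, mask)  # one vector per image for the head
```

Mean pooling is only the simplest choice here. Max pooling or a small attention layer over the crops would plug in at the same point using the same mask.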