A Touch, Vision, and Language Dataset for Multimodal Alignment

by Max (Letian) Fu, Gaurav Datta*, Huang Huang*, William Chung-Ho Panitch*, Jaimyn Drake*, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg at UC Berkeley, Meta AI, TU Dresden, and CeTI (*equal contribution).

[Paper] | [Project Page] | [Checkpoints] | [Dataset] | [Citation]

This repo contains the official checkpoints for A Touch, Vision, and Language Dataset for Multimodal Alignment.

The tactile encoders comes in three different sizes: ViT-Tiny, ViT-Small, and ViT-Base, all of which are stored in

ckpt/tvl_enc

TVL-LLaMA, the generative counterparts, are stored in

ckpt/tvl_llama

Inference

For zero-shot classification, we would require OpenCLIP with the following configuration:

CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"

For TVL-LLaMA, please request access to the pre-trained LLaMA-2 from this form. In particular, we use llama-2-7b as the base model. The weights here contains the trained adapter, the tactile encoder, and the vision encoder for the ease of loading.

For the complete info, please take a look at the GitHub repo to see instructions on pretraining, fine-tuning, and evaluation with these models.

Citation

Please give us a star 🌟 on Github to support us!

Please cite our work if you find our work inspiring or use our code in your work:

@article{fu2024tvl,
    title={A Touch, Vision, and Language Dataset for Multimodal Alignment}, 
    author={Letian Fu and Gaurav Datta and Huang Huang and William Chung-Ho Panitch and Jaimyn Drake and Joseph Ortiz and Mustafa Mukadam and Mike Lambeta and Roberto Calandra and Ken Goldberg},
    journal={arXiv preprint arXiv:2402.13232},
    year={2024}
}