Model Card for SpaceLLaVA-lite

SpaceLLaVA-lite fine-tunes MobileVLM on a dataset built with VQASynth to enhance spatial reasoning, following the approach of SpatialVLM.

Model Details

Model Description

This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM, which enhances the spatial reasoning of multimodal models. A pipeline of expert models infers spatial relationships between objects in a scene, and those relationships are used to create a VQA dataset for spatial reasoning.

  • Developed by: remyx.ai
  • Model type: MultiModal Model, Vision Language Model, MobileVLM
  • License: Apache-2.0
  • Finetuned from model: MobileVLM
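To make the data synthesis idea in the description concrete, here is a minimal sketch of how a spatial relationship could be turned into a VQA pair. It is not the VQASynth implementation: the function name `make_distance_qa`, the question and answer templates, and the assumption that upstream expert models have already produced a 3D centroid in meters for each object are all hypothetical and chosen for illustration.

```python
import math
from itertools import combinations


def make_distance_qa(objects):
    """Build question/answer pairs from object centroids.

    `objects` maps an object name to an (x, y, z) centroid in meters.
    The centroids are assumed to come from upstream expert models
    (for example detection plus depth estimation); they are hard-coded
    below purely for illustration.
    """
    pairs = []
    for (name_a, pos_a), (name_b, pos_b) in combinations(objects.items(), 2):
        # Euclidean distance between the two centroids, in meters.
        distance = math.dist(pos_a, pos_b)
        pairs.append({
            "question": f"How far is the {name_a} from the {name_b}?",
            "answer": f"The {name_a} is about {distance:.2f} meters from the {name_b}.",
        })
    return pairs


# Hypothetical scene with two objects.
scene = {"chair": (0.0, 0.0, 2.0), "table": (3.0, 0.0, 6.0)}
qa_pairs = make_distance_qa(scene)
for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
```

Templated pairs like these, paired with the source image, are the kind of supervision a spatial reasoning VQA dataset would contain.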

Model Sources

Uses

Use this model to query spatial relationships between objects in a scene.

Run it using MobileVLM inference code:

# assuming cwd is /path/to/MobileVLM/
from scripts.inference import inference_once

# Local paths to the SpaceLLaVA-lite weights and the image to query
model_path = "/path/to/SpaceLLaVA-lite"
image_file = "/path/to/your-image.jpg"
prompt_str = "For each object in the scene, describe the distance between objects in meters"

# inference_once expects an object with these attributes,
# so build a throwaway class and instantiate it
args = type('Args', (), {
    "model_path": model_path,
    "image_file": image_file,
    "prompt": prompt_str,
    "conv_mode": "v1",
    # decoding settings
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    # quantized loading, disabled here
    "load_8bit": False,
    "load_4bit": False,
})()

inference_once(args)
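If you want to ask several questions about the same image, the argument object above can be built by a small helper. The sketch below is an assumption-laden convenience, not part of MobileVLM: `make_args` is a hypothetical name, and it assumes `inference_once` only reads the attributes shown in the example above, so a `types.SimpleNamespace` can stand in for the `type('Args', ...)()` idiom. The call to `inference_once` is left commented out so the snippet runs without the MobileVLM repository.

```python
from types import SimpleNamespace


def make_args(model_path, image_file, prompt, max_new_tokens=512):
    """Return an object carrying the same attributes as the example above."""
    return SimpleNamespace(
        model_path=model_path,
        image_file=image_file,
        prompt=prompt,
        conv_mode="v1",
        temperature=0,
        top_p=None,
        num_beams=1,
        max_new_tokens=max_new_tokens,
        load_8bit=False,
        load_4bit=False,
    )


# Hypothetical prompts; replace with your own spatial questions.
prompts = [
    "For each object in the scene, describe the distance between objects in meters",
    "Which object is closest to the camera?",
]

arg_sets = [
    make_args("/path/to/SpaceLLaVA-lite", "/path/to/your-image.jpg", p)
    for p in prompts
]

# With cwd set to /path/to/MobileVLM/, each set could then be run with:
# from scripts.inference import inference_once
# for args in arg_sets:
#     inference_once(args)
```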

Try it on Discord: http://discord.gg/b2yGuCNpuC

Citation

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@article{chu2023mobilevlm,
  title={MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices},
  author={Chu, Xiangxiang and Qiao, Limeng and Lin, Xinyang and Xu, Shuang and Yang, Yang and Hu, Yiming and Wei, Fei and Zhang, Xinyu and Zhang, Bo and Wei, Xiaolin and others},
  journal={arXiv preprint arXiv:2312.16886},
  year={2023}
}

@article{chu2024mobilevlm,
  title={MobileVLM V2: Faster and Stronger Baseline for Vision Language Model},
  author={Chu, Xiangxiang and Qiao, Limeng and Zhang, Xinyu and Xu, Shuang and Wei, Fei and Yang, Yang and Sun, Xiaofei and Hu, Yiming and Lin, Xinyang and Zhang, Bo and others},
  journal={arXiv preprint arXiv:2402.03766},
  year={2024}
}