---
license: apache-2.0
language:
- en
---

# SPHINX-V Model Card

## Model type:

**SPHINX-V** is a new multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. It supports multiple visual prompts of various types simultaneously, which significantly enhances user flexibility and enables a fine-grained, open-world understanding of visual prompts.
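
As a usage illustration, the minimal sketch below shows what a multi-prompt query could look like. The `sphinx_v` module, the `SPHINXVModel` class, and the `from_pretrained`/`chat` signatures are hypothetical placeholders, not the repository's actual API; see the code link below for the real inference entry points.

```python
# Minimal usage sketch. The sphinx_v module, the SPHINXVModel class, and the
# from_pretrained/chat signatures are hypothetical placeholders; consult the
# Draw-and-Understand repository for the actual inference interface.
from PIL import Image

from sphinx_v import SPHINXVModel  # hypothetical import

model = SPHINXVModel.from_pretrained("Afeng-x/SPHINX-V")  # hypothetical loader

image = Image.open("street_scene.jpg")

# Several visual prompts of different types can be passed in one call,
# e.g. a point (x, y) and a bounding box (x1, y1, x2, y2).
visual_prompts = [
    {"type": "point", "coords": (320, 240)},
    {"type": "box", "coords": (100, 80, 400, 360)},
]

answer = model.chat(
    image=image,
    visual_prompts=visual_prompts,
    question="Describe the object marked by each visual prompt.",
)
print(answer)
```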

## Paper or resources for more information:

Project Page: [Draw-and-Understand](https://draw-and-understand.github.io/)

Paper: [https://arxiv.org/abs/2403.20271](https://arxiv.org/abs/2403.20271)

Code: [https://github.com/AFeng-x/Draw-and-Understand](https://github.com/AFeng-x/Draw-and-Understand)

Dataset: [MDVP-Data & MDVP-Bench](https://huggingface.co/datasets/Afeng-x/Draw-and-Understand)

## Intended use

**Primary intended uses:**

The primary use of SPHINX-V is research on visual prompting with large multimodal models and chatbots.

**Primary intended users:**

The model is intended primarily for researchers and enthusiasts in computer vision, natural language processing, and interactive artificial intelligence.

## License

SPHINX-V is built upon Llama 2, which is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

## Citations

```
@misc{lin2024drawandunderstand,
  title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want},
  author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
  year={2024},
  eprint={2403.20271},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```