LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
Abstract
In this paper, we introduce LLaVA-φ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its strong performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.
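For readers who want a sense of the intended image-grounded dialogue workflow, the sketch below runs a single visual question through an LLaVA-style checkpoint using the Hugging Face `transformers` generic LLaVA classes. This is an illustrative assumption, not the project's official loading code: the checkpoint id, image URL, and prompt template are placeholders, and the actual LLaVA-Phi weights may require the loader provided in the linked repository.

```python
# Minimal sketch of multi-modal inference with an LLaVA-style model via
# Hugging Face transformers. The model id below is a HYPOTHETICAL placeholder;
# consult the LLaVA-Phi repository for the official weights and loading code.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "path/to/llava-phi-checkpoint"  # placeholder, not a real Hub id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Load an example image (placeholder URL) and pose a visual question.
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this picture?\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The prompt template shown here follows the common LLaVA chat convention; a compact 2.7B-parameter backbone like Phi-2 is what makes this kind of interactive loop attractive for latency-sensitive settings such as embodied agents.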
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- InfMLLM: A Unified Framework for Visual-Language Tasks (2023)
- Generative Multimodal Models are In-Context Learners (2023)
- Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models (2023)
- Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs (2023)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models (2023)