Jointly Training Large Autoregressive Multimodal Models
Abstract
In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
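The abstract describes fusing two existing pretrained generative models into one joint autoregressive model. As a loose illustration of one simple fusion baseline (parameter averaging of two models that share an architecture), here is a minimal sketch; it is an assumption for illustration only, not the JAM paper's actual recipe, and the names `text_model`, `image_model`, and `joint_model` are hypothetical toy stand-ins.

```python
# Minimal sketch: fuse two same-architecture models by linearly interpolating
# their parameters, then continue training the result on mixed-modal data.
# This is an illustrative assumption, not the paper's exact method.
import torch
import torch.nn as nn


def average_weights(model_a: nn.Module, model_b: nn.Module, alpha: float = 0.5) -> dict:
    """Return alpha * params(model_a) + (1 - alpha) * params(model_b)."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    assert state_a.keys() == state_b.keys(), "models must share an architecture"
    return {k: alpha * state_a[k] + (1.0 - alpha) * state_b[k] for k in state_a}


# Toy stand-ins for two pretrained transformer backbones with identical shapes.
text_model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
image_model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

# Initialize the joint model from the interpolated weights; in practice this
# would be followed by continued pretraining and instruction tuning on
# interleaved text-image token sequences.
joint_model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
joint_model.load_state_dict(average_weights(text_model, image_model, alpha=0.5))
```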
Community
The following papers were recommended by the Semantic Scholar API
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (2023)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (2023)
- Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts (2023)
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model (2023)
- Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions (2023)