|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- liuhaotian/LLaVA-CC3M-Pretrain-595K |
|
- liuhaotian/LLaVA-Instruct-150K |
|
- FreedomIntelligence/ALLaVA-4V-Chinese |
|
- shareAI/ShareGPT-Chinese-English-90k |
|
language: |
|
- zh |
|
- en |
|
pipeline_tag: visual-question-answering |
|
--- |
|
|
|
# Model Card for IAA: Inner-Adaptor Architecture |
|
|
|
**GitHub**: https://github.com/360CVGroup/Inner-Adaptor-Architecture
|
|
|
**[IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities](https://www.arxiv.org/abs/2408.12902)** |
|
|
|
|
|
Bin Wang*, Chunyu Xie*, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
|
|
|
[![arXiv](https://img.shields.io/badge/arXiv-2408.12902-b31b1b.svg)](https://www.arxiv.org/abs/2408.12902) |
|
|
|
We propose an MLLM based on the Inner-Adaptor Architecture (IAA). IAA demonstrates that training with a frozen language model can surpass models with fine-tuned LLMs on both multimodal comprehension and visual grounding tasks. Moreover, after deployment, our approach supports multiple workflows, thereby preserving the NLP proficiency of the language model. With a single download, the model can be fine-tuned to cater to various task specifications. Enjoy the seamless experience of using our IAA model.
|
|
|
|
|
<p align="center"> |
|
<img src="overview.png" width=80%/> |
|
</p> |
|
|
|
|
|
## Model Performance |
|
### Main Results on General Multimodal Benchmarks
|
|
|
<p align="center"> |
|
<img src="mmresult.png" width=90%/> |
|
</p> |
|
|
|
### Results on Visual Grounding Benchmarks
|
|
|
|
<p align="center"> |
|
<img src="grounding_re.png" width=90%/> |
|
</p> |
|
|
|
### Comparison on Text-only Question Answering
|
|
|
|
<p align="center"> |
|
<img src="NLPresult.png" width=90%/> |
|
</p> |
|
|
|
## Quick Start 🤗 |
|
### First, load our model
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/Inner-Adaptor-Architecture"

# Load the model in fp16 on GPU; trust_remote_code is required for the custom architecture.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision tower and its image processor.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token

# Stop generation at the Llama 3 end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
```
|
|
|
|
|
|
|
### Multimodal Workflow: task_type="MM" |
|
```python
# Multimodal comprehension: pass an image and a question with task_type="MM".
image = Image.open("readpanda.jpg").convert("RGB")
query = "What animal is in the picture?"

inputs = model.build_conversation_input_ids(
    tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device="cuda", non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device="cuda", non_blocking=True)

output_ids = model.generate(
    input_ids,
    task_type="MM",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
|
|
|
### Grounding Workflow: task_type="G" |
|
```python
# Visual grounding: ask for a bounding box with task_type="G".
image = Image.open("COCO_train2014_000000014502.jpg").convert("RGB")
query = "Please provide the bounding box coordinate of the region this sentence describes: dude with black shirt says circa."

inputs = model.build_conversation_input_ids(
    tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device="cuda", non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device="cuda", non_blocking=True)

output_ids = model.generate(
    input_ids,
    task_type="G",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
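
The grounding workflow returns the box as text. The exact output format depends on the checkpoint; the sketch below assumes a normalized `[x1, y1, x2, y2]` string (a common convention for LLaVA-style grounding models) and converts it to pixel coordinates on the original image. The helper `parse_box` is our own illustration, not part of the released API.

```python
# Hypothetical post-processing: assumes the reply contains a normalized
# "[x1, y1, x2, y2]" box; adjust the parsing if the actual format differs.
import re

def parse_box(reply, pil_image):
    nums = re.findall(r"-?\d+(?:\.\d+)?", reply)
    if len(nums) < 4:
        raise ValueError(f"no box found in: {reply!r}")
    x1, y1, x2, y2 = map(float, nums[:4])
    w, h = pil_image.size
    # Scale normalized [0, 1] coordinates to pixel space.
    return int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)

print("pixel box:", parse_box(outputs, image))
```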
|
|
|
### Text-only Workflow: task_type="Text" |
|
|
|
```python
# Text-only QA: no image is passed; task_type="Text" uses the frozen language model path.
query = "What is the approximate weight of an adult red panda?"
inputs = model.build_conversation_input_ids(tokenizer, query=query)

input_ids = inputs["input_ids"].to(device="cuda", non_blocking=True)
images = None

output_ids = model.generate(
    input_ids,
    task_type="Text",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
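
Because the three workflows share one loaded checkpoint, a small wrapper can route requests by `task_type`. Below is a minimal sketch under the same assumptions as the snippets above; the helper `run_iaa` is our own convenience function, not part of the released API.

```python
# Convenience wrapper (our own helper, not part of the released API):
# dispatches a query to the MM / G / Text workflow of the already-loaded model.
def run_iaa(query, image=None, task_type="Text", max_new_tokens=512):
    if image is None:
        inputs = model.build_conversation_input_ids(tokenizer, query=query)
        images = None
    else:
        inputs = model.build_conversation_input_ids(
            tokenizer, query=query, image=image, image_processor=image_processor)
        images = inputs["image"].to(dtype=torch.float16, device="cuda", non_blocking=True)
    input_ids = inputs["input_ids"].to(device="cuda", non_blocking=True)
    output_ids = model.generate(
        input_ids,
        task_type=task_type,
        images=images,
        do_sample=False,
        eos_token_id=terminators,
        num_beams=1,
        max_new_tokens=max_new_tokens,
        use_cache=True)
    return tokenizer.batch_decode(
        output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0].strip()

# One call per workflow:
# run_iaa("What animal is in the picture?", image=image, task_type="MM")
# run_iaa("Please provide the bounding box coordinate of ...", image=image, task_type="G")
# run_iaa("What is the approximate weight of an adult red panda?")
```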
|
|
|
## We Are Hiring |
|
We are seeking academic interns in the Multimodal field. If interested, please send your resume to [email protected]. |
|
|
|
## Citation |
|
If you find IAA useful for your research and applications, please cite using this BibTeX: |
|
|
|
```bibtex
|
@article{Wang2024IAA, |
|
title={IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities}, |
|
author={Bin Wang and Chunyu Xie and Dawei Leng and Yuhui Yin}, |
|
journal={arXiv preprint arXiv:2408.12902}, |
|
year={2024}, |
|
} |
|
``` |
|
|
|
## License |
|
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. |
|
The content of this project itself is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
|
|
**Where to send questions or comments about the model:** |
|
https://github.com/360CVGroup/Inner-Adaptor-Architecture |
|
|
|
|
|
|
|
## Related Projects |
|
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks! |
|
- [Meta Llama 3](https://github.com/meta-llama/llama3) |
|
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA) |
|
- [360VL](https://github.com/360CVGroup/360VL) |
|
|
|
|
|
|
|
|