---
license: apache-2.0
language:
- en
base_model:
- meta-llama/Meta-Llama-3.1-8B
---

# 🦙 Llama3.1-8b-vision-audio Model Card

## Model Details

This repository contains a version of the [LLaVA](https://github.com/haotian-liu/LLaVA) model that supports image and audio input, built on the [Llama 3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) foundation model and trained with the [PKU-Alignment/align-anything](https://github.com/PKU-Alignment/align-anything) library.

- **Developed by:** the [PKU-Alignment](https://github.com/PKU-Alignment) Team.
- **Model Type:** An auto-regressive multimodal language model based on the transformer architecture.
- **License:** Non-commercial license.
- **Fine-tuned from model:** [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B).

## Model Sources

- **Repository:**
- **Dataset:**
  -
  -

## How to use the model (reproduction)

- Using align-anything:

```python
from align_anything.models.llama_vision_audio_model import (
    LlamaVisionAudioForConditionalGeneration,
    LlamaVisionAudioProcessor,
)
import torch
import torchaudio
from PIL import Image

path = ""  # TODO: set this to the local path or Hugging Face repo id of this model
processor = LlamaVisionAudioProcessor.from_pretrained(path)
model = LlamaVisionAudioForConditionalGeneration.from_pretrained(path)

# Text-only generation
prompt = "<|start_header_id|>user<|end_header_id|>: Where is the capital of China?\n<|start_header_id|>assistant<|end_header_id|>: "
inputs = processor(text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(outputs[0], skip_special_tokens=True))

# Audio input example
prompt = "<|start_header_id|>user<|end_header_id|>: Summarize the audio's contents.