metadata

license: cc
datasets:
  - liuhaotian/LLaVA-Instruct-150K
  - liuhaotian/LLaVA-Pretrain
language:
  - en

Model Card for LLaVA-Video-Llama-3.1-8B

Please see my GitHub repo LLaVA-Unified for more details on fine-tuning the VidLM model with Llama-3/Llama-3.1 as the foundation LLM.

Updates

[10/11/2024] A completely new video-based LLM, LLaVA-Video-Llama-3.1-8B, is released, with SigLIP-g-384px as the vision encoder and an average-pooling vision-language projector. By sampling one frame out of every 30, VidLM can comprehend videos up to 14 minutes long.
[6/4/2024] The codebase supports fine-tuning on video data for video understanding tasks.
[5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). Now it supports the latest llama-3, phi-3, mistral-v0.1-7b models.

Model Details

Max Frame Input: Each frame is represented as 144 tokens, and VidLM accepts up to 800 video frames as input.
Template: We follow the LLaVA-v1 template for constructing the conversation.
Architecture: visual encoder (SigLIP-so400m-384px) + Average Pooling projector + LLM backbone
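
The figures above imply a rough token and duration budget. The sketch below works through the arithmetic. The 30 fps source frame rate and the 128K-token (131,072) context window for Llama-3.1 are assumptions made for illustration and are not stated in this card. The function names are hypothetical.

```python
# Back-of-the-envelope budget for the figures quoted in this card.
TOKENS_PER_FRAME = 144      # from the model card
MAX_FRAMES = 800            # from the model card
SAMPLING_STRIDE = 30        # one frame kept per 30 source frames (model card)
ASSUMED_FPS = 30            # assumption: frame rate of the source video
ASSUMED_CONTEXT = 131_072   # assumption: Llama-3.1 128K-token context window


def visual_tokens(num_frames: int) -> int:
    """Number of visual tokens consumed by `num_frames` sampled frames."""
    return num_frames * TOKENS_PER_FRAME


def max_video_seconds(max_frames: int = MAX_FRAMES,
                      stride: int = SAMPLING_STRIDE,
                      fps: float = ASSUMED_FPS) -> float:
    """Longest video (in seconds) covered when keeping one frame per `stride`."""
    return max_frames * stride / fps


budget = visual_tokens(MAX_FRAMES)
print(budget)                               # 115200
print(ASSUMED_CONTEXT - budget)             # 15872 tokens left for text
print(round(max_video_seconds() / 60, 2))   # 13.33 minutes
```

Under these assumptions, 800 frames cover a little over 13 minutes of video, in line with the roughly 14-minute figure quoted in the updates.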

How to Use

First, install llava via:

pip install git+https://github.com/Victorwz/LLaVA-Unified.git

You can load the model and perform inference as follows:

from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from PIL import Image
import requests
import cv2
import torch

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Video-Llama-3.1-8B", None, "Video-Language-Model-Llama-3.1-8B", False, False, device=device)

# prepare video input
url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

def read_video(video_url):
    # download the video to a temporary local file
    response = requests.get(video_url, stream=True)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to download video (HTTP {response.status_code})")
    with open("tmp_video.mp4", 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)
    
    video = cv2.VideoCapture("tmp_video.mp4")
    video_frames = []
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pil_image = Image.fromarray(frame_rgb)
        video_frames.append(pil_image)

    video.release()
    print(len(video_frames), "frames read.")
    return video_frames

video_frames = read_video(video_url=url)
image_tensors = []

# Change the number of sampled frames based on your video length
total_num_frames = 30
# guard against a zero step when the video has fewer frames than requested
sampling_interval = max(1, len(video_frames) // total_num_frames)
for i in range(0, len(video_frames), sampling_interval):
    image_tensor = image_processor.preprocess(video_frames[i], return_tensors='pt')['pixel_values'][0].half().to(device)
    image_tensors.append(image_tensor)

# prepare inputs for the model
text = "\n".join(['<image>' for i in range(len(image_tensors))]) + '\n' + "Why is this video funny"
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors='pt').unsqueeze(0).to(device)

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(outputs[0])

The model's output looks like:

The video is funny because the baby is wearing glasses while reading a book, which is an unusual and amusing sight. Babies typically do not wear glasses, and it is not common for them to read books at such a young age. The combination of the baby's actions and the fact that they are wearing glasses creates a humorous and endearing scene.
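
The frame-sampling step in the example above picks frames at a fixed stride. As a variation, the sketch below selects evenly spaced indices and returns exactly the requested number whenever the video is long enough. `uniform_frame_indices` is a hypothetical helper name and is not part of the llava package.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return up to `num_samples` evenly spaced indices in [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Fewer frames than requested: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples  # fractional stride, always > 1 here
    return [int(i * step) for i in range(num_samples)]


print(uniform_frame_indices(10, 5))  # [0, 2, 4, 6, 8]
```

For instance, `[video_frames[i] for i in uniform_frame_indices(len(video_frames), total_num_frames)]` would give the frames to pass to `image_processor.preprocess`.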

Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data

Please refer to our LLaVA-Unified git repo for fine-tuning data preparation and scripts. The data loading function and the fastchat conversation template have been changed to accommodate the different tokenizer.

Citation

@misc{wang2024llavavideollama3,
  title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}