metadata

language:
  - ja
tags:
  - vision-language
  - image-captioning
  - japanese-stable-vlm
pipeline_tag: image-to-text
license: other
extra_gated_fields:
  Name: text
  Email: text
  Country: text
  Organization or Affiliation: text
  I allow Stability AI to contact me about information related to its models and research: checkbox

Japanese Stable VLM

Model Details

Japanese Stable VLM is a vision-language instruction-following model that generates Japanese descriptions of input images, optionally conditioned on input text such as a question.

Usage

import torch
from transformers import AutoTokenizer, AutoModelForVision2Seq, AutoImageProcessor
from PIL import Image
import requests

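# Helper that formats the input prompt in the "### 指示 / ### 応答" layout.
# The Japanese strings are part of the prompt itself and are kept verbatim.
def build_prompt(prompt="", sep="\n\n### "):
    # System message: "Below is a combination of an instruction that describes a task
    # and an input with context. Write a response that appropriately fulfills the request."
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]  # "instruction", "response"
    # Default prompt: "Describe the given image in detail."
    default_prompt = "与えられた画像について、詳細に述べてください。"
    if not prompt:
        prompt = default_prompt
    # The response slot is left empty for the model to complete.
    msgs = [": \n" + prompt, ": \n"]
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p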

# load model
model = AutoModelForVision2Seq.from_pretrained("stabilityai/japanese-stable-vlm", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("stabilityai/japanese-stable-vlm")
tokenizer = AutoTokenizer.from_pretrained("stabilityai/japanese-stable-vlm")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# prepare inputs
url = "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = ""  # an empty string yields an image caption; a question can be passed as the prompt instead
prompt = build_prompt(prompt)
inputs = processor(images=image, return_tensors="pt")
text_encoding = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
inputs.update(text_encoding)

# generate
outputs = model.generate(
    **inputs.to(device, dtype=model.dtype),
    num_beams=5,
    max_new_tokens=64,
    min_length=1,
    repetition_penalty=1.5,
)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
# 桜と東京スカイツリー  ("Cherry blossoms and Tokyo Skytree")
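
The comment in the usage code notes that a question can be passed instead of an empty prompt. As a minimal sketch of what that looks like, the block below repeats the `build_prompt` helper from above and prints the formatted prompt for a question. The question text is a hypothetical example and is not taken from the model card; the printed string is what would be passed to the tokenizer in place of the captioning prompt.

```python
# Sketch: formatting a question with the same helper used in the usage code.
def build_prompt(prompt="", sep="\n\n### "):
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]  # "instruction", "response"
    default_prompt = "与えられた画像について、詳細に述べてください。"
    if not prompt:
        prompt = default_prompt
    msgs = [": \n" + prompt, ": \n"]
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p

# Hypothetical question: "What is shown in this image?"
question = "この画像には何が写っていますか？"
vqa_prompt = build_prompt(question)

# The result is: system message, then the instruction section holding the
# question, then an empty response section for the model to complete.
print(vqa_prompt)
```

The rest of the pipeline (image preprocessing, tokenization, `model.generate`) stays the same as in the captioning example; only the string passed to `build_prompt` changes.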

Model Details

Developed by: Stability AI
Model type: Auto-regressive Vision Language Model
Language(s): Japanese
License: STABILITY AI JAPANESE STABLE VLM COMMUNITY LICENSE.

Training

This model is a vision-language instruction-following model built on the LLaVA 1.5 architecture. It uses stabilityai/japanese-stablelm-instruct-gamma-7b as the language model and openai/clip-vit-large-patch14 as the image encoder. Training proceeded in two stages: in the first, the MLP projection was trained from scratch; in the second, the language model and the MLP projection were trained further.
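
The two-stage schedule can be summarized as a small table of which components are updated in each stage. The sketch below is purely illustrative: the component names are labels chosen here, not attribute names from the released model, and it assumes the image encoder stays frozen throughout, which the description implies (it is never listed as trained) but does not state outright.

```python
from dataclasses import dataclass

# Illustrative labels for the three parts of the model (not real attribute names).
COMPONENTS = ("image_encoder", "mlp_projection", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components whose weights are updated in this stage

    def frozen_components(self):
        """Return the components that are not updated in this stage."""
        return tuple(c for c in COMPONENTS if c not in self.trainable)


# Stage 1: only the MLP projection is trained (from scratch).
# Stage 2: the language model and the MLP projection are trained further.
# Assumption: the image encoder is frozen in both stages.
STAGES = (
    Stage("stage_1", frozenset({"mlp_projection"})),
    Stage("stage_2", frozenset({"mlp_projection", "language_model"})),
)

for stage in STAGES:
    print(stage.name, "trains", sorted(stage.trainable),
          "and freezes", stage.frozen_components())
```

In an actual training script this schedule would typically translate into enabling or disabling gradient updates per component before each stage; the details of how that was done for this model are not given here.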

Training Dataset

The training dataset includes the following public datasets:

CC12M with captions translated into Japanese
MS-COCO with STAIR Captions
Japanese Visual Genome VQA dataset

Use and Limitations

Intended Use

This model is intended to be used by the open-source community in vision-language applications.

Limitations and bias

The training dataset may have contained offensive or inappropriate content even though we applied data filters. We recommend that users exercise reasonable caution when using this model in production systems. Do not use the model for any application that may cause harm or distress to individuals or groups.

How to cite

@misc{JapaneseStableVLM,
    url    = {https://huggingface.co/stabilityai/japanese-stable-vlm},
    title  = {Japanese Stable VLM},
    author = {Shing, Makoto and Akiba, Takuya}
}