	---
	license: cc-by-nc-sa-4.0
	language:
	- ja
	tags:
	- clip
	- ja
	- japanese
	- japanese-clip
	pipeline_tag: feature-extraction
	---

	# Japanese CLIP ViT-H/14 (Wider)

	## Table of Contents

	1. [Overview](#overview)
	1. [Usage](#usage)
	1. [Model Details](#model-details)
	1. [Evaluation](#evaluation)
	1. [Limitations and Biases](#limitations-and-biases)
	1. [Citation](#citation)
	1. [See Also](#see-also)
	1. [Contact Information](#contact-information)

	## Overview

	* Developed by: [HAKUHODO Technologies Inc.](https://www.hakuhodo-technologies.co.jp/)
	* Model type: Contrastive Language-Image Pre-trained Model
	* Language(s): Japanese
	* LICENSE: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

	This is a Japanese [CLIP (Contrastive Language-Image Pre-training)](https://arxiv.org/abs/2103.00020) model
	that maps Japanese text and images into a shared embedding space.
	It supports multimodal tasks such as zero-shot image classification,
	text-to-image retrieval, and image-to-text retrieval.
	Combined with other components, it can also serve as a building block
	for generative models, for example in image-to-text and text-to-image generation.

	## Usage

	### Dependencies

	```bash
	python3 -m pip install pillow sentencepiece torch torchvision transformers
	```

	### Inference

	The usage is similar to [`CLIPModel`](https://huggingface.co/docs/transformers/model_doc/clip)
	and [`VisionTextDualEncoderModel`](https://huggingface.co/docs/transformers/model_doc/vision-text-dual-encoder).

	```python
	import requests
	import torch
	from PIL import Image
	from transformers import AutoModel, AutoProcessor, BatchEncoding

	# Download
	model_name = "hakuhodo-tech/japanese-clip-vit-h-14-bert-wider"
	device = "cuda" if torch.cuda.is_available() else "cpu"
	model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
	processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

	# Prepare raw inputs
	url = "http://images.cocodataset.org/val2017/000000039769.jpg"
	image = Image.open(requests.get(url, stream=True).raw)

	# Process inputs
	inputs = processor(
	    text=["犬", "猫", "象"],
	    images=image,
	    return_tensors="pt",
	    padding=True,
	)

	# Infer and output
	outputs = model(**BatchEncoding(inputs).to(device))
	probs = outputs.logits_per_image.softmax(dim=1)
	print([f"{x:.2f}" for x in probs.flatten().tolist()]) # ['0.00', '1.00', '0.00']
	```
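
To make the meaning of `logits_per_image` more concrete, the following is a minimal sketch of how CLIP-style models typically turn embeddings into zero-shot probabilities: L2-normalize both sides, take scaled cosine similarities, and apply a softmax over the candidate texts. The helper name `zero_shot_probs`, the toy 4-dimensional vectors, and the `logit_scale` value of 100 are illustrative assumptions, not values read from this model; it is also an assumption that this model follows the standard CLIP formulation exactly.

```python
import torch
import torch.nn.functional as F


def zero_shot_probs(
    image_embeds: torch.Tensor,
    text_embeds: torch.Tensor,
    logit_scale: float = 100.0,  # assumed value, for illustration only
) -> torch.Tensor:
    """Return per-image probabilities over candidate texts, CLIP-style."""
    # L2-normalize so that the dot product equals cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Scaled cosine similarities, shape (num_images, num_texts).
    logits_per_image = logit_scale * image_embeds @ text_embeds.T
    return logits_per_image.softmax(dim=1)


# Toy embeddings standing in for real encoder outputs (hypothetical values).
text_embeds = torch.tensor([
    [1.0, 0.0, 0.0, 0.0],  # stands in for "犬"
    [0.0, 1.0, 0.0, 0.0],  # stands in for "猫"
    [0.0, 0.0, 1.0, 0.0],  # stands in for "象"
])
image_embeds = torch.tensor([[0.1, 0.9, 0.1, 0.0]])  # closest to the second text

probs = zero_shot_probs(image_embeds, text_embeds)
best = probs.argmax(dim=1).item()  # index of the most likely text
```

The same similarity matrix, without the softmax, can be used for retrieval by ranking texts per image or images per text.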

	## Model Details

	### Components

	The model consists of a frozen ViT-H image encoder from
	[laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K)
	and a 12-layer 24-head BERT text encoder initialized from
	[hakuhodo-tech/japanese-clip-vit-h-14-bert-base](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-base)
	with [Model Fusion](https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/B6-5.pdf).

	### Training

	The model was trained by Zhi Wang on 8 A100 (80 GB) GPUs
	using [Locked-image Tuning (LiT)](https://arxiv.org/abs/2111.07991).
	See [the paper](https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/B6-5.pdf) for details.
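
The idea behind locked-image tuning can be sketched as follows: the image tower is frozen and only the text tower is updated with a contrastive loss. This is a toy illustration, not the authors' training code. The tiny `nn.Linear` encoders, the batch of random inputs, the temperature of 0.07, and the optimizer settings are all assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny stand-in encoders (hypothetical); the real model uses a ViT-H image
# tower and a BERT text tower.
image_encoder = nn.Linear(16, 8)
text_encoder = nn.Linear(12, 8)

# "Lock" the image tower: disable gradients and switch to eval mode.
image_encoder.requires_grad_(False)
image_encoder.eval()

# Only the text tower's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-2)


def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over scaled cosine similarities."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature
    # The i-th image matches the i-th text within the batch.
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2


# Random features standing in for a batch of 4 image-text pairs.
images = torch.randn(4, 16)
texts = torch.randn(4, 12)

image_weight_before = image_encoder.weight.clone()
text_weight_before = text_encoder.weight.clone()

# One training step: image embeddings are computed without gradients.
with torch.no_grad():
    image_embeds = image_encoder(images)
loss = contrastive_loss(image_embeds, text_encoder(texts))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the image encoder's weights are unchanged while the text encoder's weights have moved, which is the defining property of this setup.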

	### Dataset

	The model was trained on the Japanese subset of the [laion2B-multi](https://huggingface.co/datasets/laion/laion2B-multi) dataset, which contains ~120M image-text pairs.

	## Evaluation

	### Testing Data

	The 5K validation split (val2017) of [MS-COCO](https://cocodataset.org/),
	paired with the Japanese captions from [STAIR Captions](http://captions.stair.center/).

	### Metrics

	Zero-shot image-to-text and text-to-image recall@1, 5, 10.
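
As a rough illustration of the metric, recall@k can be computed from a query-by-candidate similarity matrix as the fraction of queries with at least one correct item among the top-k candidates. The function name `recall_at_k`, the toy similarity scores, and the ground-truth sets below are hypothetical; the exact evaluation protocol used for the reported numbers is described in the paper.

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, relevant: list[set[int]], k: int) -> float:
    """Fraction of queries with at least one relevant item in the top-k candidates."""
    # Indices of the k highest-scoring candidates for each query (row).
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = [bool(relevant[i] & set(top_k[i].tolist())) for i in range(len(relevant))]
    return float(np.mean(hits))


# Toy example: 3 query images scored against 4 candidate captions.
similarity = np.array([
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.3, 0.2, 0.6, 0.5],
])
# Ground-truth caption indices for each image (a set, since an image
# may have several correct captions).
relevant = [{0}, {1}, {3}]

r1 = recall_at_k(similarity, relevant, k=1)  # only the first query hits at k=1
r2 = recall_at_k(similarity, relevant, k=2)  # all three queries hit at k=2
```

For image retrieval, the same function applies to the transposed matrix, with captions as queries and images as candidates.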

	### Results

	| Model | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 |
	| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
	| [recruit-jp/japanese-clip-vit-b-32-roberta-base](https://huggingface.co/recruit-jp/japanese-clip-vit-b-32-roberta-base) | 23.0 | 46.1 | 57.4 | 16.1 | 35.4 | 46.3 |
	| [rinna/japanese-cloob-vit-b-16](https://huggingface.co/rinna/japanese-cloob-vit-b-16) | 37.1 | 63.7 | 74.2 | 25.1 | 48.0 | 58.8 |
	| [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 36.9 | 64.3 | 74.3 | 24.8 | 48.8 | 60.0 |
	| [Japanese CLIP ViT-H/14 (Base)](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-base) | 39.2 | 66.3 | 76.6 | 28.9 | 53.3 | 63.9 |
	| [Japanese CLIP ViT-H/14 (Deeper)](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-deeper) | 48.7 | 74.0 | 82.4 | 36.5 | 61.5 | 71.8 |
	| [Japanese CLIP ViT-H/14 (Wider)](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-wider) | 47.9 | 74.2 | 83.2 | 37.3 | 62.8 | 72.7 |

	\* [Japanese Stable CLIP ViT-L/16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) is excluded from the zero-shot retrieval evaluation because
	[the model was partially pre-trained with MS-COCO](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16#training-dataset).

	## Limitations and Biases

	Despite our data filtering, the training dataset may still
	contain offensive or inappropriate content.
	Users should be mindful of the potential societal impact
	and ethical considerations associated with the outputs
	generated by the model when deploying it in production systems.
	We recommend against using the model in applications
	that could cause harm or distress
	to individuals or groups.

	## Citation

	If you find this model useful, please consider citing:

	```bibtex
	@article{japanese-clip-vit-h,
	  author = {王直 and 細野健人 and 石塚湖太 and 奥田悠太 and 川上孝介},
	  journal = {言語処理学会年次大会発表論文集},
	  month = {Mar},
	  pages = {1547--1552},
	  title = {日本語特化の視覚と言語を組み合わせた事前学習モデルの開発 Developing Vision-Language Pre-Trained Models for {J}apanese},
	  volume = {30},
	  year = {2024}
	}
	```

	## See Also

	* [Japanese CLIP ViT-H/14 (Base)](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-base)
	* [Japanese CLIP ViT-H/14 (Deeper)](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-deeper)

	## Contact Information

	Please contact
	[hr-koho\@hakuhodo-technologies.co.jp](mailto:hr-koho@hakuhodo-technologies.co.jp?subject=Japanese%20CLIP%20ViT-H/14%20Models)
	with questions and comments about the model,
	or with business and partnership inquiries.

	For inquiries in Japanese, please contact
	[hr-koho\@hakuhodo-technologies.co.jp](mailto:hr-koho@hakuhodo-technologies.co.jp?subject=日本語CLIP%20ViT-H/14モデルについて).