Model Information
This is meta-llama/Llama-3.2-11B-Vision-Instruct converted to the OpenVINO Intermediate Representation (IR) format for inference on CPU devices.
The model consists of two parts:
- Image Encoder (openvino_vision_encoder.bin), which encodes input images into the cross-attention state space of the LLM;
- Language Model (openvino_language_model.bin), which generates the answer from the input tokens and the cross-attention states provided by the Image Encoder.
To reduce memory consumption, weight compression has been applied using the Neural Network Compression Framework (NNCF), which provides 4-bit/8-bit mixed weight quantization, a compression method designed primarily for optimizing LLMs.
Note: the compressed model is available as llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin/.xml, with the following settings:
- 4 bits (INT4)
- group size = 64
- Asymmetrical Quantization
- method AWQ
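To make the settings above concrete, here is a minimal pure-Python sketch of asymmetric 4-bit quantization with a group size of 64. It is an illustration only, not NNCF's implementation: it stores a floating-point offset per group instead of an integer zero point, it leaves out AWQ entirely, and the function names are hypothetical. The storage estimate assumes one 16-bit scale and one 4-bit offset per group.

```python
import random

GROUP_SIZE = 64   # number of consecutive weights that share one scale and offset
MAX_LEVEL = 15    # unsigned 4-bit codes cover 0..15


def quantize_group(weights):
    """Map one group of floats to 4-bit codes plus a (scale, offset) pair."""
    w_min, w_max = min(weights), max(weights)
    # A constant group has zero range; fall back to scale 1.0 to avoid dividing by zero.
    scale = (w_max - w_min) / MAX_LEVEL or 1.0
    codes = [min(MAX_LEVEL, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, offset):
    """Recover approximate floats from the 4-bit codes."""
    return [offset + c * scale for c in codes]


def quantize(weights, group_size=GROUP_SIZE):
    """Quantize a flat list of weights group by group."""
    return [quantize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


def effective_bits(group_size=GROUP_SIZE, scale_bits=16, offset_bits=4):
    """Average storage per weight, assuming one scale and one offset per group."""
    return 4 + (scale_bits + offset_bits) / group_size


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.6f}")
print(f"effective bits per weight: {effective_bits()}")
```

Because every group has its own scale and offset, the rounding error of each weight is bounded by half of that group's scale. Under the storage assumption above, the per-group metadata adds 20/64 bits, so each weight costs about 4.3125 bits instead of 16.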
Finally, an INT8-quantized version of the Image Encoder alone is available as openvino_vision_encoder_int8.bin/.xml.
Replication Recipe
Step 1 Install Requirements
I suggest installing the requirements into a dedicated Python virtualenv or conda environment.
pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
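After installation, it can help to confirm which versions actually ended up in the environment. The following is a small sketch using only the standard library; the package list mirrors the pip commands above, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands in this step.
REQUIRED = ["torch", "torchvision", "transformers", "datasets", "nncf", "openvino"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Compare the printed versions against the minimums requested above (for example transformers 4.45 or newer, nncf 2.13.0 or newer).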
Step 2 Convert the model to OpenVINO Intermediate Representation (IR)
from pathlib import Path
from ov_mllama_helper import convert_mllama
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OpenVino"
convert_mllama(model_id, model_dir)
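Once the conversion has finished, a quick check that both parts of the model were written can catch an interrupted run early. This sketch assumes each part is stored as an .xml/.bin pair, as is usual for OpenVINO IR; the .bin names come from the Model Information section, the language model's .xml name is inferred from that convention, and `missing_ir_files` is a hypothetical helper.

```python
from pathlib import Path

# File stems taken from the Model Information section.
EXPECTED_STEMS = ["openvino_vision_encoder", "openvino_language_model"]


def missing_ir_files(model_dir):
    """Return the expected IR file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [
        f"{stem}{ext}"
        for stem in EXPECTED_STEMS
        for ext in (".xml", ".bin")
        if not (model_dir / f"{stem}{ext}").exists()
    ]


if __name__ == "__main__":
    model_dir = Path("Llama-3.2-11B-Vision-Instruct") / "OpenVino"
    missing = missing_ir_files(model_dir)
    print("conversion looks complete" if not missing else f"missing: {missing}")
```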
Step 3 INT4 Compression
from ov_mllama_compression import compress
from ov_mllama_compression import compression_widgets_helper
compression_scenario, compress_args = compression_widgets_helper()
compression_scenario
compression_kwargs = {key: value.value for key, value in compress_args.items()}
language_model_path = compress(model_dir, **compression_kwargs)
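The widget-driven helper above hides the underlying NNCF call. As a rough sketch of what an equivalent direct call to `nncf.compress_weights` might look like, the block below uses the settings listed in Model Information. Several parts are assumptions: `ratio=1.0` and `scale_estimation=True` are inferred from the "r10" and "scale" fragments of the output file name, the language model's .xml name is inferred from its .bin name, and `calibration_items` is a hypothetical iterable of model inputs that the caller has to provide. This is not necessarily how `ov_mllama_compression.compress` is implemented.

```python
from pathlib import Path

# Settings from the list in Model Information; "ratio" and "scale_estimation"
# are inferred from the output file name (an assumption).
INT4_SETTINGS = {
    "mode": "INT4_ASYM",
    "group_size": 64,
    "ratio": 1.0,
    "awq": True,
    "scale_estimation": True,
    "all_layers": True,
    "sensitivity_metric": "MAX_ACTIVATION_VARIANCE",
}


def compress_language_model(model_dir, calibration_items, settings=INT4_SETTINGS):
    """Compress the language model weights with nncf.compress_weights.

    calibration_items is an iterable of model inputs supplied by the caller
    (hypothetical here); AWQ and scale estimation are data-aware and need it.
    """
    import nncf            # lazy imports: the settings can be inspected without NNCF
    import openvino as ov

    model_dir = Path(model_dir)
    model = ov.Core().read_model(model_dir / "openvino_language_model.xml")
    compressed = nncf.compress_weights(
        model,
        mode=getattr(nncf.CompressWeightsMode, settings["mode"]),
        group_size=settings["group_size"],
        ratio=settings["ratio"],
        awq=settings["awq"],
        scale_estimation=settings["scale_estimation"],
        all_layers=settings["all_layers"],
        sensitivity_metric=getattr(nncf.SensitivityMetric, settings["sensitivity_metric"]),
        dataset=nncf.Dataset(calibration_items),
    )
    out_name = "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml"
    out_path = model_dir / out_name
    ov.save_model(compressed, out_path)
    return out_path
```

Keeping the settings in a plain dictionary makes it easy to record them next to the output file or to try a different group size without touching the call itself.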
Step 4 INT8 Image Encoder Optimization
from ov_mllama_compression import vision_encoder_selection_widget
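# NOTE: `device` is not created in this recipe; it must be a device-selection widget
# (an object whose `.value` is a device name such as "CPU") defined beforehand.
vision_encoder_options = vision_encoder_selection_widget(device.value)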
vision_encoder_options
from transformers import AutoProcessor
import nncf
import openvino as ov
import gc
from data_preprocessing import prepare_dataset_vision
processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()
fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)
calibration_dataset = nncf.Dataset(calibration_data)
quantized_model = nncf.quantize(
    model=ov_model,
    calibration_dataset=calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)
ov.save_model(quantized_model, int8_vision_encoder_path)
del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()
vision_encoder_path = int8_vision_encoder_path
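As a sanity check after quantization, one can compare the on-disk size of the original and INT8 encoders and confirm that the INT8 model compiles for CPU. This is a minimal sketch: the helper names are hypothetical, the paths follow the file names used in this step, and the OpenVINO import is deferred so the size helper also works without OpenVINO installed.

```python
from pathlib import Path


def ir_size_bytes(xml_path):
    """Total on-disk size of an IR model (.xml plus .bin); missing files count as 0."""
    xml_path = Path(xml_path)
    parts = (xml_path, xml_path.with_suffix(".bin"))
    return sum(p.stat().st_size for p in parts if p.exists())


def compile_on_cpu(xml_path):
    """Compile an IR model for CPU and return the compiled model."""
    import openvino as ov  # lazy import: only needed when actually compiling
    return ov.Core().compile_model(str(xml_path), "CPU")


if __name__ == "__main__":
    model_dir = Path("Llama-3.2-11B-Vision-Instruct") / "OpenVino"
    fp_path = model_dir / "openvino_vision_encoder.xml"
    int8_path = model_dir / "openvino_vision_encoder_int8.xml"
    fp_size, int8_size = ir_size_bytes(fp_path), ir_size_bytes(int8_path)
    print(f"FP encoder: {fp_size / 2**20:.1f} MiB, INT8 encoder: {int8_size / 2**20:.1f} MiB")
    if int8_size:
        compiled = compile_on_cpu(int8_path)
        print("inputs:", [port.get_any_name() for port in compiled.inputs])
```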
License
Disclaimer
This quantized model comes with no warranty. It has been developed for research purposes only.
Model tree for fbaldassarri/meta-llama_Llama-3.2-11B-Vision-Instruct-OpenVino
Base model
meta-llama/Llama-3.2-11B-Vision-Instruct