Error when attempting to run either model... ValueError: embed_dim must be divisible by num_heads (got `embed_dim`: 1152 and `num_heads`: 14).

#4
by jdc4429 - opened

You are using a model of type llava to instantiate a model of type llava_onevision. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "C:\Detection\run3.py", line 7, in <module>
    model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-onevision-qwen2-7b-si", torch_dtype=torch.float16, low_cpu_mem_usage=True)
  File "C:\Detection\transformers\src\transformers\modeling_utils.py", line 3848, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "C:\Detection\transformers\src\transformers\models\llava_onevision\modeling_llava_onevision.py", line 356, in __init__
    self.vision_tower = AutoModel.from_config(
  File "C:\Detection\transformers\src\transformers\models\auto\auto_factory.py", line 434, in from_config
    return model_class._from_config(config, **kwargs)
  File "C:\Detection\transformers\src\transformers\modeling_utils.py", line 1510, in _from_config
    model = cls(config, **kwargs)
  File "C:\Detection\transformers\src\transformers\models\siglip\modeling_siglip.py", line 1147, in __init__
    self.vision_model = SiglipVisionTransformer(config)
  File "C:\Detection\transformers\src\transformers\models\siglip\modeling_siglip.py", line 1062, in __init__
    self.encoder = SiglipEncoder(config)
  File "C:\Detection\transformers\src\transformers\models\siglip\modeling_siglip.py", line 846, in __init__
    self.layers = nn.ModuleList([SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "C:\Detection\transformers\src\transformers\models\siglip\modeling_siglip.py", line 846, in <listcomp>
    self.layers = nn.ModuleList([SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "C:\Detection\transformers\src\transformers\models\siglip\modeling_siglip.py", line 618, in __init__
    self.self_attn = SIGLIP_ATTENTION_CLASSES[config._attn_implementation](config=config)
  File "C:\Detection\transformers\src\transformers\models\siglip\modeling_siglip.py", line 366, in __init__
    raise ValueError(
raise ValueError(
ValueError: embed_dim must be divisible by num_heads (got embed_dim: 1152 and num_heads: 14).

This error happens with both the 0.5B SI and the 7B SI models, even after upgrading transformers to the latest main.
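For reference, the check that fails here is plain arithmetic: multi-head attention splits the embedding evenly across heads, so `embed_dim` has to be a multiple of `num_heads`. The snippet below is a minimal standalone sketch of that check, not the actual transformers code. The function name `head_dim_or_raise` is hypothetical, and 16 is used only as an example of a head count that divides 1152.

```python
def head_dim_or_raise(embed_dim: int, num_heads: int) -> int:
    """Return the per-head dimension, or raise if the split is uneven.

    Hypothetical helper that mirrors the kind of divisibility check
    reported in the traceback; it is not the library's own function.
    """
    head_dim = embed_dim // num_heads
    if head_dim * num_heads != embed_dim:
        raise ValueError(
            "embed_dim must be divisible by num_heads "
            f"(got `embed_dim`: {embed_dim} and `num_heads`: {num_heads})."
        )
    return head_dim


# 1152 / 14 is not a whole number, so this combination is rejected.
# 1152 / 16 is a whole number, so this combination is accepted.
```

So the values 1152 and 14 cannot both be right for the vision tower, which suggests the vision config is not being read the way the model class expects.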

Llava Hugging Face org

Hey @jdc4429! As the warning says: "You are using a model of type llava to instantiate a model of type llava_onevision. This is not supported for all configurations of models and can yield errors." You have to load the model with LlavaOnevisionForConditionalGeneration.from_pretrained().

Please let me know if you saw LlavaForConditionalGeneration used anywhere in the docs, since that would have to be fixed.
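One quick way to see what the warning is complaining about is to look at the `model_type` field in the checkpoint's `config.json` before loading. The sketch below assumes the checkpoint is a local directory that contains a `config.json` file; the helper names `read_model_type` and `matches_expected` are hypothetical and not part of transformers.

```python
import json
from pathlib import Path


def read_model_type(checkpoint_dir: str) -> str:
    """Return the `model_type` declared in <checkpoint_dir>/config.json.

    Hypothetical helper: assumes a local checkpoint directory with a
    config.json file. Returns "<missing>" if the field is absent.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("model_type", "<missing>")


def matches_expected(checkpoint_dir: str, expected: str = "llava_onevision") -> bool:
    """True if the checkpoint declares the model type the loading class expects."""
    return read_model_type(checkpoint_dir) == expected
```

For the local directory in the traceback, `read_model_type("llava-onevision-qwen2-7b-si")` would be expected to return `llava`, going by the warning text. A mismatch like that means the checkpoint config and the class used to load it do not agree.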
