Loading with Quantization?

#12 opened by danielplominski

Hello Mistralai Team,

Is there any chance of loading this model on less powerful hardware?

Our largest VM can use 2x NVIDIA A6000 cards (48 GB of VRAM each, 96 GB in total).
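For a rough sense of scale (assuming the ~124B parameter count listed for Pixtral Large): the weights alone come to roughly 248 GB in bf16 and still roughly 124 GB with fp8 quantization, before any KV cache or activation overhead, so 96 GB of total VRAM looks very tight even when quantized.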

Launching it with Docker does not work:

#!/bin/sh
export CUDA_VISIBLE_DEVICES="0,1"

docker run \
       --gpus='"device=0,1"' \
       --runtime nvidia \
       -v /opt/cache/huggingface:/root/.cache/huggingface \
       --env "HUGGING_FACE_HUB_TOKEN=SECRET" \
       -p 8000:8000 \
       --ipc=host \
       vllm/vllm-openai:latest \
       --model mistralai/Pixtral-Large-Instruct-2411 \
       --tokenizer_mode mistral \
       --load_format mistral \
       --config_format mistral \
       --limit_mm_per_prompt 'image=10' \
       --tensor-parallel-size 8 \
       --max_model_len=1024 \
       --quantization=fp8
# EOF

Errors:

... ... ...
INFO 11-20 01:54:28 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-20 01:54:28 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-20 01:54:28 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-20 01:54:28 config.py:791] Possibly too large swap space. 32.00 GiB out of the 62.84 GiB total CPU memory is allocated for the swap space.
INFO 11-20 01:54:33 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-20 01:54:33 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-20 01:54:33 config.py:791] Possibly too large swap space. 32.00 GiB out of the 62.84 GiB total CPU memory is allocated for the swap space.
2024-11-20 01:54:35,429 INFO worker.py:1819 -- Started a local Ray instance.
Process SpawnProcess-1:
ERROR 11-20 01:54:36 engine.py:366] The number of required GPUs exceeds the total number of available GPUs in the placement group.
ERROR 11-20 01:54:36 engine.py:366] Traceback (most recent call last):
ERROR 11-20 01:54:36 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-20 01:54:36 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
... ... ...
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
root@ai-ubuntu22gpu-big:/opt#
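
From the error, the placement-group failure seems to come from the GPU count rather than from quantization itself: --tensor-parallel-size 8 requests 8 GPUs, while only devices 0 and 1 are handed to the container. A minimal sketch with the tensor-parallel size matched to the two A6000s would be (whether the fp8 weights then actually fit into 96 GB is exactly what I am unsure about):

#!/bin/sh
# Same invocation as above, but the tensor-parallel size matches the 2 visible GPUs.
docker run \
       --gpus='"device=0,1"' \
       --runtime nvidia \
       -v /opt/cache/huggingface:/root/.cache/huggingface \
       --env "HUGGING_FACE_HUB_TOKEN=SECRET" \
       -p 8000:8000 \
       --ipc=host \
       vllm/vllm-openai:latest \
       --model mistralai/Pixtral-Large-Instruct-2411 \
       --tokenizer_mode mistral \
       --load_format mistral \
       --config_format mistral \
       --limit_mm_per_prompt 'image=10' \
       --tensor-parallel-size 2 \
       --max_model_len=1024 \
       --quantization=fp8
# EOF

Even with the GPU count corrected, fp8 weights for a model of this size would most likely still not fit into 96 GB, which is why I am asking whether a more aggressive quantization is supported or planned.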
