Loading with Quantization?

#12 opened by danielplominski

Hello Mistralai Team,

Is there any chance of loading this model on less powerful hardware?

Our largest VM can use 2x NVIDIA A6000 cards (48 GB of VRAM each, 96 GB in total).
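For a rough sense of scale (assuming the ~124B parameter count listed for Pixtral Large): the weights alone come to roughly 248 GB in bf16 and still roughly 124 GB with fp8 quantization, before any KV cache or activation overhead, so 96 GB of total VRAM looks very tight even when quantized.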

Launching it with Docker does not work:

#!/bin/sh
export CUDA_VISIBLE_DEVICES="0,1"

docker run \
       --gpus='"device=0,1"' \
       --runtime nvidia \
       -v /opt/cache/huggingface:/root/.cache/huggingface \
       --env "HUGGING_FACE_HUB_TOKEN=SECRET" \
       -p 8000:8000 \
       --ipc=host \
       vllm/vllm-openai:latest \
       --model mistralai/Pixtral-Large-Instruct-2411 \
       --tokenizer_mode mistral \
       --load_format mistral \
       --config_format mistral \
       --limit_mm_per_prompt 'image=10' \
       --tensor-parallel-size 8 \
       --max_model_len=1024 \
       --quantization=fp8
# EOF

Errors:

... ... ...
INFO 11-20 01:54:28 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-20 01:54:28 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-20 01:54:28 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-20 01:54:28 config.py:791] Possibly too large swap space. 32.00 GiB out of the 62.84 GiB total CPU memory is allocated for the swap space.
INFO 11-20 01:54:33 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-20 01:54:33 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-20 01:54:33 config.py:791] Possibly too large swap space. 32.00 GiB out of the 62.84 GiB total CPU memory is allocated for the swap space.
2024-11-20 01:54:35,429 INFO worker.py:1819 -- Started a local Ray instance.
Process SpawnProcess-1:
ERROR 11-20 01:54:36 engine.py:366] The number of required GPUs exceeds the total number of available GPUs in the placement group.
ERROR 11-20 01:54:36 engine.py:366] Traceback (most recent call last):
ERROR 11-20 01:54:36 engine.py:366]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-20 01:54:36 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
... ... ...
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
root@ai-ubuntu22gpu-big:/opt#
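
From the error, the placement-group failure seems to come from the GPU count rather than from quantization itself: --tensor-parallel-size 8 requests 8 GPUs, while only devices 0 and 1 are handed to the container. A minimal sketch with the tensor-parallel size matched to the two A6000s would be (whether the fp8 weights then actually fit into 96 GB is exactly what I am unsure about):

#!/bin/sh
# Same invocation as above, but the tensor-parallel size matches the 2 visible GPUs.
docker run \
       --gpus='"device=0,1"' \
       --runtime nvidia \
       -v /opt/cache/huggingface:/root/.cache/huggingface \
       --env "HUGGING_FACE_HUB_TOKEN=SECRET" \
       -p 8000:8000 \
       --ipc=host \
       vllm/vllm-openai:latest \
       --model mistralai/Pixtral-Large-Instruct-2411 \
       --tokenizer_mode mistral \
       --load_format mistral \
       --config_format mistral \
       --limit_mm_per_prompt 'image=10' \
       --tensor-parallel-size 2 \
       --max_model_len=1024 \
       --quantization=fp8
# EOF

Even with the GPU count corrected, fp8 weights for a model of this size would most likely still not fit into 96 GB, which is why I am asking whether a more aggressive quantization is supported or planned.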
