Hardware Requirements
What are the exact hardware requirements (storage, RAM, GPU, cache/buffer, etc.) to run mistralai/Mixtral-8x7B-Instruct-v0.1 locally on a machine or VM? For example, what would it take for the locally downloaded model to reach about 5 tokens/sec during inference? Figures for other speeds would also be helpful, as would the minimum requirements.
The Mixtral website states: "Mixtral requires 64GB of RAM and 2 GPUs, which increases the cost by a factor of 3 ($1.3/h vs. $4.5/h)." Can anyone elaborate on this?
You can run it with 8-bit precision on a single A100 (80GB), which costs ~$1.89/h on Runpod.
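Something along these lines should work (a minimal sketch, assuming transformers, accelerate, and bitsandbytes are installed and the weights are already downloaded; the int8 weights are roughly 47GB, which leaves headroom on an 80GB card for the KV cache):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Quantize the linear layers to int8 at load time (bitsandbytes backend).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",           # place layers on the single GPU
    torch_dtype=torch.bfloat16,  # dtype for the non-quantized parts
)

# Quick smoke test
inputs = tokenizer("Explain mixture-of-experts in one sentence.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```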
I failed to run it on an A100 (40GB):
INFO 07-30 01:44:07 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mixtral-8x7B-Instruct-v0.1)
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 223, in init
[rank0]: self.model_executor = executor_class(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in init
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 122, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 148, in load_model
[rank0]: self.model = get_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 261, in load_model
[rank0]: model = _initialize_model(model_config, self.load_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 98, in _initialize_model
[rank0]: return model_class(config=model_config.hf_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 508, in init
[rank0]: self.model = MixtralModel(config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 449, in init
[rank0]: self.layers = nn.ModuleList([
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 450, in
[rank0]: MixtralDecoderLayer(config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 388, in init
[rank0]: self.block_sparse_moe = MixtralMoE(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 103, in init
[rank0]: self.w13_weight = nn.Parameter(torch.empty(self.num_total_experts,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 78, in torch_function
[rank0]: return func(*args, **kwargs)
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU
Hi there, Mixtral 8x7B requires around 100GB of VRAM for full-precision (bf16) inference. To fit on less, you will have to quantize the model and run it at lower precision.
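For a 40GB A100, one option is to point vLLM at a pre-quantized checkpoint instead of the bf16 weights. A minimal sketch (the AWQ repo name and the reduced context length are assumptions, not requirements; any 4-bit Mixtral checkpoint that vLLM supports would do):

```python
from vllm import LLM, SamplingParams

# Community 4-bit AWQ quantization of Mixtral (assumed repo name; substitute
# whichever quantized checkpoint you use). ~24GB of weights instead of ~94GB.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    dtype="half",                 # AWQ kernels run in fp16
    max_model_len=8192,           # shrink the 32k default so the KV cache fits in 40GB
    gpu_memory_utilization=0.90,
)

out = llm.generate(["[INST] What hardware does Mixtral need? [/INST]"],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```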