Hardware Requirements
What are the exact hardware requirements (storage, RAM, GPU, cache/buffer, etc.) to run mistralai/Mixtral-8x7B-Instruct-v0.1 locally on a machine or VM? For example, what would it take for the locally downloaded model to reach about 5 tokens/sec during inference? Figures for other speeds would also be helpful, as would the minimum requirements.
The Mixtral website states: "Mixtral requires 64GB of RAM and 2 GPUs, which increases the cost by a factor of 3 ($1.3/h vs. $4.5/h)." Can anyone elaborate on this?
You can run it with 8-bit precision on a single A100 (80GB), which costs ~$1.89/h on Runpod.
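Something along these lines should work (a minimal sketch, assuming transformers, accelerate, and bitsandbytes are installed and the weights are already downloaded; the int8 weights are roughly 47GB, which leaves headroom on an 80GB card for the KV cache):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Quantize the linear layers to int8 at load time (bitsandbytes backend).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",           # place layers on the single GPU
    torch_dtype=torch.bfloat16,  # dtype for the non-quantized parts
)

# Quick smoke test
inputs = tokenizer("Explain mixture-of-experts in one sentence.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```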
I failed to run it on an A100 (40GB):
INFO 07-30 01:44:07 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mixtral-8x7B-Instruct-v0.1)
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 223, in init
[rank0]: self.model_executor = executor_class(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in init
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 122, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 148, in load_model
[rank0]: self.model = get_model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 261, in load_model
[rank0]: model = _initialize_model(model_config, self.load_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 98, in _initialize_model
[rank0]: return model_class(config=model_config.hf_config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 508, in init
[rank0]: self.model = MixtralModel(config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 449, in init
[rank0]: self.layers = nn.ModuleList([
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 450, in
[rank0]: MixtralDecoderLayer(config,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 388, in init
[rank0]: self.block_sparse_moe = MixtralMoE(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 103, in init
[rank0]: self.w13_weight = nn.Parameter(torch.empty(self.num_total_experts,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 78, in torch_function
[rank0]: return func(*args, **kwargs)
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU
Hi there, Mixtral 8x7B requires around 100GB of VRAM for full-precision (bf16) inference. To fit on less, you will have to quantize the model and run it at lower precision.
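For a 40GB A100, one option is to point vLLM at a pre-quantized checkpoint instead of the bf16 weights. A minimal sketch (the AWQ repo name and the reduced context length are assumptions, not requirements; any 4-bit Mixtral checkpoint that vLLM supports would do):

```python
from vllm import LLM, SamplingParams

# Community 4-bit AWQ quantization of Mixtral (assumed repo name; substitute
# whichever quantized checkpoint you use). ~24GB of weights instead of ~94GB.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    dtype="half",                 # AWQ kernels run in fp16
    max_model_len=8192,           # shrink the 32k default so the KV cache fits in 40GB
    gpu_memory_utilization=0.90,
)

out = llm.generate(["[INST] What hardware does Mixtral need? [/INST]"],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```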