
Can I run this model on my desktop?

#14
by mitu820 - opened

Hi, I have an 8-core Ryzen 7 desktop PC with 64GB of RAM, and an old 4GB GPU. Is it possible to run this model on my PC?

If yes, is there a guide for this?

BigScience Workshop org

You can try running it on your CPUs but it will be extremely slow.
If you want to run it on a single GPU, I'd recommend at least a 40GB GPU with FP16 support.

There's some inference information here for the 176B model: https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/scripts/inference
It should also be applicable to this model, except that you need less memory and FP16 instead of BF16.
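
Roughly, single-GPU inference in FP16 looks something like the sketch below (the model name, prompt, and generation settings are just illustrative; accelerate is needed for `device_map="auto"`):

```python
# Minimal sketch: load BLOOM 7B1 in FP16 and generate a few tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # FP16 weights: ~14GB for ~7B parameters
    device_map="auto",          # place layers on the available GPU(s)
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```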


Thank you, I thought the 7B model would need fewer resources.


Is this the largest version I can do inference on using, say, 4x NVIDIA A10s? The total GPU memory would be 96GB. It would be nice to have a breakdown of the system requirements for each model size.

BigScience Workshop org

You may be able to run the 176B model by sacrificing performance or time, see https://huggingface.co/bigscience/bloom/discussions/87 or https://huggingface.co/bigscience/bloom/discussions/88

The thing is, there are no hard system requirements. It depends on how fast you want it to be and how much performance you're willing to sacrifice (e.g., by reducing precision).
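
As a rough illustration (not the exact recipe from those discussions), transformers can offload the layers that don't fit on your GPUs to CPU RAM or disk via accelerate; the memory limits below are just placeholders for a 4x A10 box:

```python
# Sketch: load a model larger than total VRAM by offloading to CPU/disk.
# This trades generation speed for the ability to fit the weights at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # 176B; swap in bloom-7b1 for a quicker test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",            # fill GPUs first, then CPU, then disk
    offload_folder="offload",     # where disk-offloaded weights go
    max_memory={0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", "cpu": "60GiB"},
)
```

Expect generation to be much slower than an all-GPU setup, since offloaded weights have to be streamed in for every forward pass.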

If you want to run it on a single GPU, I'd recommend at least a 40GB GPU with FP16 support.

This is not true. I was able to run the 7b1 model in fp16 on my GPU with 24GB VRAM.

BigScience Workshop org


Very nice! Didn't say it wasn't possible 👍

I ran into a strange issue after a few runs though. Plenty of VRAM is free, yet PyTorch reports OOM.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 24.00 GiB total capacity; 7.04 GiB already allocated; 15.73 GiB free; 7.04 GiB reserved in total by PyTorch)

My guess is this is some sort of memory leak, since the issue doesn't occur after a system restart. I know this might not be the right place to ask, but is this a known issue?
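
In case it helps anyone hitting the same thing, a rough cleanup between runs (assuming `model` is the BLOOM instance loaded in the previous run) looks like this:

```python
# Sketch: release a previously loaded model so the allocator can reuse its memory.
import gc
import torch

del model                  # drop Python references to the model and its tensors
gc.collect()               # collect cycles that may still hold CUDA tensors
torch.cuda.empty_cache()   # return cached blocks to the driver
print(torch.cuda.memory_summary())  # inspect the allocator state

# If the error persists with plenty of free memory, fragmentation may be the
# culprit; setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 before
# launching Python is a commonly suggested workaround.
```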

I can run BLOOM 7B1 with LoRA on a 24GB GPU; it takes about 17GB.
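
For context, a minimal sketch of what attaching LoRA adapters to BLOOM 7B1 with the peft library might look like (the rank and target modules are assumptions, not the exact configuration used above):

```python
# Sketch: BLOOM 7B1 with LoRA adapters; only the small adapter matrices are
# trainable, so memory on top of the frozen FP16 weights stays modest.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",
    torch_dtype=torch.float16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                                 # adapter rank (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # a small fraction of the 7B parameters
```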

christopher changed discussion status to closed
