"Model is overloaded, please wait for a bit"
Any way to stop this message from popping up?
I was facing this issue a few hours ago, but now it is working on Huggingface's Accelerated Inference! It normally takes more than 90s to generate 64 tokens with "use_gpu": True, but it runs.
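For reference, the call looks roughly like this (a minimal sketch; the token is a placeholder, and I'm assuming the standard Inference API payload shape with use_gpu and wait_for_model in the options field):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # placeholder token

payload = {
    "inputs": "The answer to the ultimate question is",
    "parameters": {"max_new_tokens": 64},
    # Inference API options: use_gpu requests accelerated hardware;
    # wait_for_model makes the call block instead of returning an
    # error while the model is loading or overloaded.
    "options": {"use_gpu": True, "wait_for_model": True},
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())
```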
I have the same problem here. I can only get generated_text from bloom-3b and have never succeeded with bloom. Any solutions?
Are you still having the issue? We've recently been moving to AzureML, so service might have been disrupted at some point, but it should be a lot more stable now.
Just to be clear, we're talking about the Inference API?
Now it's working (yes, it is the Inference API). But now I have another problem. It seems that even if I set num_return_sequences to more than 1, I only get 1 generated_text from bloom. I get the right number of generated_text entries with bloom-3b. Is it because bloom is too big that it can only do greedy decoding?
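For reference, the kind of request I'm sending looks roughly like this (a sketch; the token is a placeholder, and num_return_sequences goes through the standard parameters field):

```python
import requests

headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # placeholder token
payload = {
    "inputs": "Once upon a time",
    "parameters": {"num_return_sequences": 3, "do_sample": True},
}

# bloom-3b returns 3 generated_text entries as expected...
r = requests.post(
    "https://api-inference.huggingface.co/models/bigscience/bloom-3b",
    headers=headers, json=payload,
)
print(len(r.json()))  # 3

# ...but bloom only ever returns 1
r = requests.post(
    "https://api-inference.huggingface.co/models/bigscience/bloom",
    headers=headers, json=payload,
)
print(len(r.json()))  # 1
```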
We have a custom deployment set up for BLOOM right now (in order to improve inference speed and such), which doesn't support all the options yet. We'll try to support new options as the requests come in, I guess.
Is it because bloom is too big so that it can only do greedy decoding?
Actually, it does more than greedy decoding; you can add top_k and top_p options.
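For example, something along these lines (a sketch, using the standard text-generation parameters):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # placeholder token

payload = {
    "inputs": "The sky is",
    "parameters": {
        "do_sample": True,   # sample instead of decoding greedily
        "top_k": 50,         # consider only the 50 most likely next tokens
        "top_p": 0.9,        # nucleus sampling: smallest set with 90% probability mass
        "max_new_tokens": 32,
    },
}

print(requests.post(API_URL, headers=headers, json=payload).json())
```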
Hi @TimeRobber, the max tokens for the API seems to be 1024, although I believe the BLOOM model can take longer sequences. Anything greater than 1024 and I get the message "Model is overloaded, please wait for a bit". Is this max length fixed for all users, or can paid plans increase it?
I think we hard-limit incoming requests beyond a specific length so that people don't spam our service. In theory, if you host the model yourself, you can go to arbitrarily long sequences, as it uses a relative positional embedding scheme (ALiBi) that can extrapolate to any length regardless of the sequence length used during training. More details can be found here: https://arxiv.org/abs/2108.12409
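To make the extrapolation point concrete, here's a minimal sketch of the ALiBi bias from that paper (not our deployment code; it assumes a power-of-two head count):

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper
    # (this simple form assumes n_heads is a power of two).
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Bias added to attention scores: -slope * distance(query, key).
    # It depends only on relative distance, not absolute position, so it
    # works for any seq_len; that's what makes extrapolation possible.
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(max=0)  # future keys zeroed; masked anyway
    return alibi_slopes(n_heads)[:, None, None] * rel[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=16)  # (heads, queries, keys); added to attention scores
```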
Max length should be fixed for all users, at least from this API endpoint. cc @olivierdehaene
Concerning whether there's a paid plan, you'd have to ask @Narsil to confirm, but I think there is none.
Thanks. Re: hosting this ourselves, can I confirm that when I call API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom" I am hitting the large bigscience/bloom version and not one of the smaller versions, like bigscience/bloom7b1? And can I also confirm whether this is running on CPU?
It seems very fast for a CPU model on large bloom (I get 10 seconds). If it were feasible to get this speed on an ONNX-accelerated large bloom, I could try hosting it myself.
P.S. I saw in the docs that you can check the x-compute-type header of the response to see whether it is CPU or GPU, but I could not see that value.
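For reference, this is roughly how I checked (a sketch; the token is a placeholder):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # placeholder token

response = requests.post(API_URL, headers=headers, json={"inputs": "Hello"})
# The docs mention an x-compute-type response header indicating cpu/gpu,
# but it doesn't show up for me:
print(response.headers.get("x-compute-type"))  # -> None in my case
```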
when I call API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom" I am hitting large bigscience/bloom version and not one of the smaller versions, like bigscience/bloom7b1
Yes, you're running the big model.
can I confirm if this is running on CPU
No, it runs on GPUs in a parallel fashion. You can find more details at https://huggingface.co/blog/bloom-inference-optimization
BLOOM is a more special deployment, currently powered by AzureML. There won't be CPU inference for BLOOM.
Thanks a lot @TimeRobber, this is very helpful.
If you are interested in the code behind our BLOOM deployment, you can find the new version currently running here: https://github.com/huggingface/text-generation-inference.
The original code described in the blog post can also be found here: https://github.com/huggingface/transformers_bloom_parallel/.
I have had the same problem all night. I just wanted to try it out and see if I could get any kind of response. Maybe I have to wait, or maybe calling the API myself would work better. Does anyone have a solution?
Hi! Bloom hosting is currently undergoing maintenance by the AzureML team and will be back up as soon as this has been completed. We'll try to get it back up ASAP.
Model is back up.