Running an Inference Server
I'm struggling to start a server capable of running this model, simply because of its size; I have little experience running very large models. How should I approach this on a machine with 8x A100 GPUs (80GB each)? This is for inference only, not training.
Hi, even 2 A100s are more than enough to host the model. Have you looked into TGI? The documentation is good, and it's the easiest route: it boils down to pulling their Docker image and writing a correct launch command (don't forget to set bfloat16 precision), as sketched below: https://huggingface.co/docs/text-generation-inference/en/index
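A minimal sketch of such a launch command, assuming two GPUs and a model pulled from the Hub (the model id, port, and cache path are placeholders you'd adjust):

```bash
# Pull and run the TGI container, sharding the model across 2 GPUs in bfloat16.
# <model-id> is a placeholder; add -e HF_TOKEN=<token> if the model is gated.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id <model-id> \
    --num-shard 2 \
    --dtype bfloat16
```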
I created an inference endpoint on AWS using 2x A100s (the recommended configuration). After setting the endpoint URL and running generate_story() from the sample Python code, I get the following error on the first API request:
```
Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: captures_underway == 0 INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1699449201336/work/c10/cuda/CUDACachingAllocator.cpp":2939, please report a bug to PyTorch.
```
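In case it helps to reproduce, the request my code sends should be roughly equivalent to the call below (endpoint URL, token, and prompt are placeholders, and the exact generation parameters come from the sample code, so treat this as a sketch):

```bash
# Plain HTTP request against the deployed endpoint; the JSON body follows
# TGI's standard inputs/parameters schema.
curl "https://<endpoint-url>" \
    -X POST \
    -H "Authorization: Bearer <hf-token>" \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Once upon a time", "parameters": {"max_new_tokens": 200}}'
```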