Runs out of memory on free tier Google Colab
I tried inference on free-tier Google Colab with this code, but it crashed with an out-of-memory error.
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Wanfq/FuseLLM-7B", use_fast=False)
model = AutoModel.from_pretrained("Wanfq/FuseLLM-7B", torch_dtype="auto")
model.cuda()
inputs = tokenizer("", return_tensors="pt").to(model.device)
tokens = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
Hello, you can set "load_in_8bit=True" when you load the model if you don't have enough GPU memory:
model = AutoModelForCausalLM.from_pretrained(
    "Wanfq/FuseLLM-7B",
    load_in_8bit=True,
)
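Note that load_in_8bit=True needs the bitsandbytes and accelerate packages installed (pip install bitsandbytes accelerate on Colab), and the quantized model is placed on the GPU at load time, so you can drop the separate model.cuda() call.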
It loads now, but I got this error:
TypeError: The current model class (LlamaModel) is not compatible with .generate(), as it doesn't have a language model head. Please use one of the following classes instead: {'LlamaForCausalLM'}
Please suggest a fix.
Never mind, it runs once I change to AutoModelForCausalLM. I missed that part of your solution above.
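For anyone landing here later, a minimal end-to-end sketch combining both fixes (assuming bitsandbytes and accelerate are installed on the Colab runtime; the empty prompt is kept from the original snippet):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Wanfq/FuseLLM-7B", use_fast=False)

# AutoModelForCausalLM attaches the language model head, so .generate() works.
# load_in_8bit=True quantizes the weights at load time to fit the free-tier GPU,
# and device_map="auto" lets accelerate place them, so no .cuda() call is needed.
model = AutoModelForCausalLM.from_pretrained(
    "Wanfq/FuseLLM-7B",
    load_in_8bit=True,
    device_map="auto",
)

inputs = tokenizer("", return_tensors="pt").to(model.device)
tokens = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))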