Great work. I had an issue running this in Colab:
/usr/local/lib/python3.9/dist-packages/bitsandbytes/functional.py in transform(A, to_order, from_order, out, transpose, state, ld)
1696
1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=None):
-> 1698 prev_device = pre_call(A.device)
1699 if state is None: state = (A.shape, from_order)
1700 else: from_order = state[1]
AttributeError: 'NoneType' object has no attribute 'device'
Can you please check?
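In case anyone else hits this in Colab: the traceback shows transform() being handed a None tensor (A.device on None), and one thing worth double-checking is that a GPU runtime is actually selected and that the weights really end up on the GPU via the 8-bit path before generation. A minimal sanity check, assuming the usual 8-bit loading route (the model id below is a placeholder, not from this thread):

import torch
from transformers import AutoModelForCausalLM

assert torch.cuda.is_available(), "Select a GPU runtime: Runtime -> Change runtime type -> GPU"

model = AutoModelForCausalLM.from_pretrained(
    "your-base-model-id",       # placeholder: whatever checkpoint you are loading
    load_in_8bit=True,          # int8 path that goes through bitsandbytes
    torch_dtype=torch.float16,
    device_map="auto",          # keep the quantized weights on the GPU
)
print(next(model.parameters()).device)  # expect cuda:0, not cpu or meta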
Thanks. I checked and got it working.
Hi!
I've been running this model for the past couple of days, really nice model, tysm for open-sourcing it!
Anyway, I'm currently having the same issue with VRAM usage; any developments on this?
If it's of any help, it doesn't look like the increase happens on every call, just occasionally.
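A quick way to watch the usage between calls, in case it's useful (just a sketch; prompts is a placeholder for whatever instructions you're feeding it, and generate is the function from the script):

import torch

for i, prompt in enumerate(prompts):  # prompts: placeholder list of test instructions
    generate(prompt)
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    print(f"call {i}: {allocated_mb:.0f} MiB allocated")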
Messed around with it today, and it seems like adding
torch.cuda.empty_cache()
import gc; gc.collect()
to the generate() function helped! :)
Sure! It's really just adding those calls into the function (I don't think the exact placement matters much; they're just cache-clearing and garbage-collection calls, and I added them in two places just to be safe).
import gc
import torch
from transformers import GenerationConfig

# tokenizer, model, and generate_prompt are defined earlier in the script (not shown here).

def generate(
    instruction,
    input=None,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    **kwargs,
):
    # Release cached CUDA blocks and run Python garbage collection before generating.
    torch.cuda.empty_cache()
    gc.collect()

    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)

    # Clean up again after decoding so the next call starts from a smaller footprint.
    torch.cuda.empty_cache()
    gc.collect()

    return output.split("### Response:")[1].strip().split("Below")[0]
What I've found is that this seems to occur only with large prompts. I'm not sure where the threshold is that triggers it, but from what I can tell the prompt size is really what does it.
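If anyone wants to pin down where that threshold sits, a rough sketch is to log the prompt length in tokens against peak VRAM per call (reusing generate, generate_prompt, and tokenizer from the snippet above):

import torch

def generate_and_measure(instruction):
    torch.cuda.reset_peak_memory_stats()
    prompt = generate_prompt(instruction, None)
    n_tokens = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    out = generate(instruction)
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"{n_tokens} prompt tokens -> peak {peak_mb:.0f} MiB")
    return out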