Issue trying to tune using the qlora fork: KeyError: 'response'?

#2
by GentlePickle - opened

Hey Jon, big fan over here! I've been following your airoboros models for a while now, and when I saw you post about this on reddit I had to try it. Being able to fine-tune this model on my hardware would be an amazing step for me.

Unfortunately, I can't seem to get the qlora.py script to recognize the 'response' key in any dataset I give it. I'm specifying airoboros as the dataset format, as per your suggestion in the github fork that goes with this model, and I've even tried running it with an 'instructions.jsonl' dataset copied from an airoboros model just to check whether I was screwing up my own dataset formatting, but it gives the same error. Below are the part of the code it's failing on, the stack trace it produces, and the script call with the parameters I'm using.

This is where it's failing; I added two counter print statements for debugging:

    elif dataset_format == 'airoboros':
        count_without_response = sum(1 for x in dataset if 'response' not in x)
        count_dicts_in_dataset = sum(1 for x in dataset)

        print(f"Number of dictionaries: {count_dicts_in_dataset}")
        print(f"Number of dictionaries without 'response' key: {count_without_response}")
        
        dataset = dataset.map(lambda x: {
            'input': AIROBOROS_PROMPT.format(instruction=x["instruction"]),
            'output': x["response"],
        })
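Side note on why both counters above come back as 1: a `datasets.DatasetDict` behaves like a mapping from split name to split, so iterating over it yields split names ("train", ...), not rows. A minimal sketch of the same pitfall using a plain dict as a stand-in (hypothetical toy data, not the actual qlora objects):

```python
# A DatasetDict is mapping-like: iterating yields split names, not examples.
# Plain-dict stand-in with hypothetical rows; the second row lacks 'response'.
dataset = {
    "train": [
        {"instruction": "Say hi.", "response": "Hi!"},
        {"instruction": "Say bye."},
    ]
}

n_iterated = sum(1 for x in dataset)                 # counts splits, not rows
n_rows = sum(len(rows) for rows in dataset.values())  # counts actual examples
n_missing = sum(
    "response" not in row for rows in dataset.values() for row in rows
)

print(n_iterated)  # 1 (one split named "train")
print(n_rows)      # 2
print(n_missing)   # 1
```

So a count of 1 only tells you there is one split, not how many rows it holds or how many are missing the key.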

Output, directly followed by stack trace error:

    Number of dictionaries: 1  # This seems odd: a count of 1 for both my dataset and the instructions.jsonl file
    Number of dictionaries without 'response' key: 1  # Count of rows where no 'response' key was found
    Traceback (most recent call last):
      File "/workspace/qlora/qlora.py", line 715, in <module>
        train()
      File "/workspace/qlora/qlora.py", line 648, in train
        data_module = make_data_module(tokenizer=tokenizer, args=args)
      File "/workspace/qlora/qlora.py", line 566, in make_data_module
        dataset = format_dataset(dataset, args.dataset_format)
      File "/workspace/qlora/qlora.py", line 551, in format_dataset
        dataset = dataset.map(lambda x: {
      File "/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py", line 851, in map
        {
      File "/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py", line 852, in <dictcomp>
        k: dataset.map(
      File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 580, in wrapper
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 545, in wrapper
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3087, in map
        for rank, done, content in Dataset._map_single(**dataset_kwargs):
      File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3441, in _map_single
        example = apply_function_on_filtered_inputs(example, i, offset=offset)
      File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3344, in apply_function_on_filtered_inputs
        processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
      File "/workspace/qlora/qlora.py", line 553, in <lambda>
        'output': x["response"],
      File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 270, in __getitem__
        value = self.data[key]
    KeyError: 'response'
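A guard like the following would surface this kind of failure with a clearer message before `.map` ever runs. This is a hypothetical helper sketch, not part of qlora, and it uses a plain list of dicts to stand in for one split:

```python
def check_required_keys(rows, required):
    """Raise a readable error if any row is missing a required key.

    Hypothetical debugging helper: `rows` is a list of dicts standing in
    for one split of the dataset; `required` is the list of column names
    the downstream lambda expects (e.g. 'instruction' and 'response').
    """
    for i, row in enumerate(rows):
        missing = [k for k in required if k not in row]
        if missing:
            raise KeyError(
                f"Row {i} is missing {missing}; available keys: {sorted(row)}"
            )

# Usage with toy data:
good = [{"instruction": "Say hi.", "response": "Hi!"}]
check_required_keys(good, ["instruction", "response"])  # passes silently

bad = [{"instruction": "Say hi."}]
try:
    check_required_keys(bad, ["instruction", "response"])
except KeyError as e:
    print(e)  # reports the missing key and what keys are actually present
```

Listing the available keys in the error makes it immediately obvious whether the dataset is empty, mis-keyed, or simply not the file you thought you loaded.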

Script call with params:

    CUDA_VISIBLE_DEVICES=1,2 python qlora.py \
        --learning_rate 0.0001 \
        --model_name_or_path jondurbin_mpt-30b-qlora-compatible \
        --dataset_format 'airoboros' \
        --dataset_name 'airoboros_dataset.json' \
        --mpt True \
        --max_new_tokens 8192 \
        --bits 4 \
        --model_max_len 8192 \
        --num_train_epochs 1 \
        --trust_remote_code True

For --dataset_name I also tried instructions.jsonl, and my own dataset as both newline-delimited JSON and standard JSON.

Any advice you can give, or anything I can do to help, would be much appreciated. I'd really like to be able to tune the MPT model: my company has a relationship with Databricks, and they just bought Mosaic, so getting this running would be awesome! Keep doing everything you're doing; the airoboros models are great and we all appreciate your work!

Interesting... Can you add a debug print of the dataset before the lambda (line 546), or at line 465 for a local dataset? It seems the dataset may be empty.

I see you tried this, but to confirm: the dataset format I've been using is a single JSON object per line, with keys "instruction" and "response".
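That one-object-per-line format is easy to sanity-check in isolation. A minimal sketch of writing such a file and reading it back (hypothetical filename and record contents):

```python
import json

# One JSON object per line, with "instruction" and "response" keys,
# matching the format described above (hypothetical example records).
records = [
    {"instruction": "Say hi.", "response": "Hi!"},
    {"instruction": "Say bye.", "response": "Bye!"},
]
with open("instructions.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back line by line and verify the expected keys are present.
with open("instructions.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(all("instruction" in r and "response" in r for r in loaded))  # True
```

If a dataset round-trips like this but the training script still sees nothing, the problem is likely in how the file is passed to the script rather than in the file itself.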

Thanks for the reply. I went back, added that print statement, and confirmed that the dataset was empty. Then I checked every param against your example in the model card, and realized I was using --dataset_name instead of --dataset...

Fixed that and now I'm getting an OOM error, so that's a step in the right direction!
