Attention mask for generation function in the future?
In the card it states:
In the generation function, our model currently does not support beam search (num_beams > 1) and the `attention_mask` parameter. Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings (instead of the model's).
I was just wondering whether there is any intention to support the `attention_mask` parameter in the future? Has it just not been implemented yet?
Hello @rchan26 ! I hope everything is going well with you.
This was our first deployment of a model to HF, so we wanted to be sure everything was running smoothly. We already have `attention_mask` working locally, and our plan is to update both Phi-1 and Phi-1.5 over the next few days.
Regards,
Gustavo.
Thanks! Looking forward to testing it out! 😄
Hey @gugarosa,
I assume the attention_mask parameter isn't set up yet because you're using a torch.nn.Sequential wrapper.
Has the team created a custom torch class to square this away yet? Looking to fine-tune and run inference in batches. If it's not something in the pipeline, I'll probably just write the custom torch class myself, but if it is something that's going to get squared away soon, I won't waste my time. Let me know, and thanks.
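For context, the kind of thing I had in mind is a thin wrapper that threads the mask through each block explicitly instead of relying on nn.Sequential. A rough, hypothetical sketch (the block list and block signature here are placeholders, not the actual Phi internals):

import torch
import torch.nn as nn

class MaskedBlockStack(nn.Module):
    # Hypothetical replacement for an nn.Sequential stack that also forwards an attention mask.
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor = None):
        # nn.Sequential can only pass a single positional argument between layers,
        # so we iterate manually and hand the mask to every block ourselves.
        for block in self.blocks:
            hidden_states = block(hidden_states, attention_mask=attention_mask)
        return hidden_states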
Hello @rchan26 and @bennicholl !
I just updated the model files and added `attention_mask` support. Sorry for taking so long. This should serve as a proxy until Phi gets fully implemented in `transformers`.
However, please note that we still do not have support for `attention_mask` during training/fine-tuning, only inference. It shouldn't be a problem to add that in the coming days.
Thanks for working on this! When I try to perform inference with the code below

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
encoded_inputs = tokenizer(["this is the first sentence"])  # tokened_words / attention_mask come from here
model(input_ids = torch.tensor(encoded_inputs['input_ids']), attention_mask = torch.tensor(encoded_inputs['attention_mask']))
I get a ValueError:
ValueError: not enough values to unpack (expected 3, got 2)
It seems the error comes from this part of the modeling code:
541 kv = update_kv_cache(qkv[:, :, 1:], past_key_values, self.layer_idx)
543 if attention_mask is not None:
--> 544 attention_mask, cu_seqlens, max_seqlen = attention_mask
545 attention_mask = attention_mask.to(qkv.device)
547 attention_kwargs = {"attention_mask": attention_mask}
But I'm not sure what's going on with that code.
My bad @bennicholl !
Just fixed this. We use the flash-attn style for performing cached inference, and the attention layer was not aware that `attention_mask` could be passed as a single tensor.
Should be working now; I tested an inference the way you posted and it worked!
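Roughly speaking, the attention layer now accepts either form of the mask. A simplified sketch of the idea (not the exact code in the repo):

import torch

def normalize_attention_mask(attention_mask):
    # The flash-attn style path packs the mask together with cumulative sequence
    # lengths, while a plain HF-style call just passes a single [batch, seq_len] tensor.
    if attention_mask is None:
        return None, None, None
    if isinstance(attention_mask, torch.Tensor):
        return attention_mask, None, None
    # Otherwise assume the packed (mask, cu_seqlens, max_seqlen) tuple.
    attention_mask, cu_seqlens, max_seqlen = attention_mask
    return attention_mask, cu_seqlens, max_seqlen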
@gugarosa Thanks for the quick response, man! I think there may be a bug in the attention masking: the output for a sentence is different if I run two examples instead of one. Here is some code to reproduce.
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
tokenizer.pad_token = tokenizer.eos_token
#HERE IS CODE RUNNING ONE SENTENCE
encoded_inputs = tokenizer(["this is the first sentence"])
print(encoded_inputs)
# {'input_ids': [[5661, 318, 262, 717, 6827]], 'attention_mask': [[1, 1, 1, 1, 1]]}
tokened_words = encoded_inputs['input_ids']
attention_mask = encoded_inputs['attention_mask']
model(input_ids = torch.tensor(tokened_words), attention_mask = torch.tensor(attention_mask))
OUTPUT:
CausalLMOutputWithPast(loss=None, logits=tensor([[[15.9766, 16.5625, 13.4219, ..., 2.6074, 2.6074, 2.6074],
[12.3047, 15.2344, 10.3672, ..., 2.3027, 2.3047, 2.3027],
[ 8.8672, 11.7188, 6.6055, ..., 1.0361, 1.0371, 1.0371],
[12.4844, 13.6406, 7.1406, ..., 0.2700, 0.2722, 0.2703],
[20.4688, 22.5625, 14.8438, ..., 3.3477, 3.3477, 3.3457]]],
#HERE IS CODE RUNNING TWO SENTENCES
encoded_inputs = tokenizer(["this is the first sentence", "this is another sentence and is longer than the first"], padding = 'longest')
print(encoded_inputs)
# {'input_ids': [[5661, 318, 262, 717, 6827, 50256, 50256, 50256, 50256, 50256], [5661, 318, 1194, 6827, 290, 318, 2392, 621, 262, 717]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}  # MASKING LOOKS CORRECT
tokened_words = encoded_inputs['input_ids']
attention_mask = encoded_inputs['attention_mask']
model(input_ids = torch.tensor(tokened_words), attention_mask = torch.tensor(attention_mask))
OUTPUT:
CausalLMOutputWithPast(loss=None, logits=tensor([[[15.9688, 16.5625, 13.4219, ..., 2.6074, 2.6094, 2.6074],
[12.3125, 15.2344, 10.3672, ..., 2.3047, 2.3047, 2.3047], # NOTICE SOME OF THE VALUES, SUCH AS THE VERY FIRST VALUE IN UPPER LEFT HAND CORNER IS DIFFERENT THAN THE
[ 8.8672, 11.7188, 6.6055, ..., 1.0391, 1.0400, 1.0400], # VALUE IN THE SAME LOCATION IN THE FIRST MATRIX
...,
[13.9922, 17.0156, 18.8750, ..., 2.4453, 2.4453, 2.4453],
[13.8750, 16.8750, 18.7500, ..., 2.4082, 2.4062, 2.4062],
[13.7109, 16.6094, 18.5625, ..., 2.3477, 2.3457, 2.3457]],
[[15.9688, 16.5625, 13.4219, ..., 2.6074, 2.6094, 2.6074],
[12.3125, 15.2344, 10.3672, ..., 2.3047, 2.3047, 2.3047],
[12.3125, 14.6250, 7.8828, ..., 0.5962, 0.5967, 0.5972],
...,
[10.6875, 15.7188, 9.0234, ..., 1.4434, 1.4424, 1.4414],
[ 8.5469, 12.7188, 6.2656, ..., 0.2693, 0.2688, 0.2676],
[17.5000, 20.3906, 12.9453, ..., 2.2891, 2.2891, 2.2891]]],
Some of the values in the first output are different from the corresponding values in the first row of the batched output. For example, the value in the upper left-hand corner should be the same in both, but it is 15.9766 in one and 15.9688 in the other.
@bennicholl
I found the issue: it was related to the precision. `torch_dtype="auto"` was forcing the model to use FP16 (maybe with `model.half()`), whereas the model is expected to be used with AMP, as follows:
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
    model(input_ids = torch.tensor(tokened_words), attention_mask = torch.tensor(attention_mask))
I compared the logits and now they seem to match, and I updated the README with this information. Regarding the source of the issue, I need to double-check, but it should be related to the `RotaryEmbedding` class.
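If you want to double-check on your end, this is roughly the comparison I ran (a sketch along the lines of your snippet; with AMP the remaining difference should be down at the fp16 noise level):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

single = tokenizer(["this is the first sentence"], return_tensors="pt")
batch = tokenizer(["this is the first sentence", "this is another sentence and is longer than the first"],
                  padding="longest", return_tensors="pt")

with torch.no_grad(), torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
    logits_single = model(**single).logits  # shape [1, 5, vocab]
    logits_batch = model(**batch).logits    # shape [2, 10, vocab]

# The first sentence is right-padded in the batch, so its logits live in the
# first 5 positions of row 0; the difference should now be negligible.
print((logits_single[0] - logits_batch[0, :5]).abs().max())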
Awesome, thanks man. Seems to be good now
Hello, thanks for adding the `attention_mask`. It seems like it still does not work for fine-tuning. Is it possible to add it, or maybe disable it somehow, in the HF Trainer (`trainer = transformers.Trainer`), where the data collator is `data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)`? See the rough sketch of my setup below.
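For reference, this is roughly what I mean (a sketch; `train_dataset` and the `TrainingArguments` values are placeholders for my own data and config):

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    args=transformers.TrainingArguments(output_dir="phi-1_5-finetune", per_device_train_batch_size=4),
    train_dataset=train_dataset,  # placeholder: my own tokenized dataset with input_ids/labels
    data_collator=transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
trainer.train()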
Maybe cc @muellerzr regarding the trainer question
Any plans to support training for the Phi series?
What do we mean when we say this model is not supported for training? If I call
loss = phi_model(x_batch, attention_mask=mask_batch, labels = labels_batch)[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()
will the gradients not compute properly? Is this because the mask does not zero out the tokens that should be masked during backprop?
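For concreteness, here is a self-contained version of the loop I mean (a sketch; the optimizer, learning rate, and example batch are placeholders, and padded label positions are set to -100 so the HF loss ignores them):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

phi_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
optimizer = torch.optim.AdamW(phi_model.parameters(), lr=2e-5)  # placeholder optimizer/lr

batch = tokenizer(["this is the first sentence",
                   "this is another sentence and is longer than the first"],
                  padding="longest", return_tensors="pt")
x_batch, mask_batch = batch["input_ids"], batch["attention_mask"]
labels_batch = x_batch.clone()
labels_batch[mask_batch == 0] = -100  # ignore pad tokens in the loss

loss = phi_model(x_batch, attention_mask=mask_batch, labels=labels_batch)[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()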
Yeah, maybe. During my fine-tuning, I encountered a warning:
`attention_mask` is not supported during training. Using it might lead to unexpected results.
{'loss': 1.3228, 'learning_rate': 1.999875577156579e-05, 'epoch': 0.02}
1%|▍ | 300/59745 [06:19<20:47:29, 1.26s/it]`attention_mask` is not supported during training. Using it might lead to unexpected results.
... (the same warning repeats on every subsequent step) ...
1%|▍ | 309/59745 [06:30<20:49:49, 1.26s/it]`attention_mask` is not supported during training. Using it might lead to unexpected results.
{'loss': 1.5263, 'learning_rate': 1.9998671442394832e-05, 'epoch': 0.02}
@SinclairWang I've encountered that warning as well. While fine-tuning, my loss kept going down, but my outputs for my specific task were clearly not improving. That's why I'm curious about the reason for the warning, and how the attention mask could work for the forward pass but not for backprop.
Here is my training loss curve:
"While fine tuning my loss was continuing to go down, but my outputs for my specific task were clearly not improving. "
I also observed the same case. It confused me. I may not continue to fine-tune this model as I can not be sure the processing is ok due to the issue of the attention mask until this issue is solved.
I also get bad results when fine-tuning, probably because of the attention mask problem.
I also get the same warning.
So, any solutions?
I hope @gugarosa will help. He said they will fix that as well. It is probably not easy to fix and test such a thing.
I am not aware of any model of similar size and performance.
I am also looking for powerful models with about 1B parameters.