Multi-GPU training fails when using device_map = "auto"

#23
by aveer30 - opened

Hi, I get an error when finetuning the model using device_map = "auto". The issue looks similar to the one reported for the 128k variant, and a fix is proposed in the discussion linked below. Could any of you verify this and push a fix? Thanks.
https://huggingface.co/microsoft/Phi-3-small-128k-instruct/discussions/19#6677dc5020ff491d382a0221
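For context, here is a minimal sketch of the kind of setup that triggers the failure (the model id and loading arguments are assumptions based on this thread, not a verified reproduction):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id for the 8k variant discussed in this thread.
model_id = "microsoft/Phi-3-small-8k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# device_map="auto" shards the layers across all visible GPUs. With more than
# one GPU, the forward pass through the model's custom Triton kernels fails
# with the pointer error shown in the traceback below.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model(**inputs)  # raises ValueError inside the Triton kernel
```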

File "/opt/conda/lib/python3.10/site-packages/triton/runtime/jit.py", line 425, in run
kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas, # number of warps/ctas per instance
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
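Until the modeling code on the Hub is patched, one possible workaround (a sketch, assuming the model fits on a single GPU) is to skip sharding and pin every module to one device instead of using device_map = "auto":

```python
# Pin the whole model to cuda:0, which avoids the cross-device dispatch that
# appears to trigger the Triton pointer error. Only viable if the model fits
# in a single GPU's memory.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map={"": 0},
)
```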

Same error here for phi-3-small-8k
