Performance and Stability

#47
by TeaCult - opened

I have managed to write scripts that translate in batches: I first split the text into sentences, then send them to the GPU as a batch. This is the approach that gives the highest performance. My GPU is an RX 7600 XT, and I am running ROCm 6.0.2 on Arch Linux.
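For anyone who wants to try the same thing, here is a minimal sketch of the sentence-split-then-batch approach. The checkpoint id and the naive sentence splitter are placeholders, assuming an NLLB-style seq2seq model loaded through transformers, not necessarily the exact script I run:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint -- substitute the model you are actually using.
MODEL_ID = "facebook/nllb-200-distilled-600M"

device = "cuda"  # ROCm builds of PyTorch also expose the GPU as "cuda"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(device)

def translate_batch(sentences, max_new_tokens=50):
    # Tokenize the whole batch with padding so it runs as a single GPU call.
    inputs = tokenizer(sentences, return_tensors="pt", padding=True,
                       truncation=True).to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            # NLLB-style models select the target language via the first token.
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("tur_Latn"),
            max_new_tokens=max_new_tokens,
        )
    return tokenizer.batch_decode(out, skip_special_tokens=True)

text = "Once upon a time there was a little girl. She loved her red ball."
# Naive splitter for illustration; a real one (e.g. nltk.sent_tokenize) is better.
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
print(translate_batch(sentences))
```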

Performance is about 4 MB of generated text per hour, which works out to roughly 1M tokens per hour (assuming ~4 characters per token).
However, there are crashes. Although I am cleaning up VRAM after batches, it still reaches 100% at some point. I configure the batch size and the maximum number of generated tokens dynamically. It can reliably handle 4000-character batches with 50 max tokens (in this model the limit somehow applies per batch item, not per batch), but it still crashes sometimes even at 4000/50. The default max_tokens is 250, so it crashes more frequently at that setting.
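The cleanup itself is nothing fancy. Here is a sketch of what I mean, reusing `translate_batch` from the snippet above; the halve-the-batch-on-OOM retry is illustrative, not exactly my script:

```python
import gc
import torch

def free_vram():
    # ROCm builds of PyTorch reuse the torch.cuda namespace, so this call
    # releases cached allocator blocks on both CUDA and ROCm.
    gc.collect()
    torch.cuda.empty_cache()

def safe_translate(sentences, max_new_tokens=50):
    # Retry on OOM with half the batch instead of letting the run crash.
    try:
        return translate_batch(sentences, max_new_tokens)
    except torch.cuda.OutOfMemoryError:
        free_vram()
        if len(sentences) == 1:
            raise  # a single sentence that still OOMs is a real failure
        mid = len(sentences) // 2
        return (safe_translate(sentences[:mid], max_new_tokens)
                + safe_translate(sentences[mid:], max_new_tokens))
```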

What I also figured out is that the model sometimes generates unrelated output, repeats itself, and hallucinates from time to time. I am mainly batch-translating a small part of the TinyStories dataset from Ronen Eldan's repository.
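Decoding constraints may help with the repetition. Below is a variant of the batch function above using transformers' standard anti-repetition knobs; the values are starting points to tune, not something I benchmarked:

```python
import torch

def translate_batch_stable(sentences, max_new_tokens=50):
    # Same batching as above, plus decoding constraints that usually curb
    # repetition loops. Reuses tokenizer, model, device from the first sketch.
    inputs = tokenizer(sentences, return_tensors="pt", padding=True,
                       truncation=True).to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("tur_Latn"),
            max_new_tokens=max_new_tokens,
            num_beams=4,
            no_repeat_ngram_size=3,   # forbid repeating any 3-gram
            repetition_penalty=1.2,   # mildly penalize already-seen tokens
        )
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```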

I was not able to quantize the model using ONNX, Torch, or bitsandbytes. There is an FP16 version of this model on Hugging Face, but it does not save VRAM or increase tok/sec generation speed at all. I tried with both CUDA and ROCm.
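For context, the FP16 and 8-bit load paths look roughly like this in transformers (the checkpoint id is again a placeholder; note that bitsandbytes has historically required CUDA, which likely explains the ROCm failure):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

MODEL_ID = "facebook/nllb-200-distilled-600M"  # placeholder checkpoint

# FP16 load: in principle halves weight memory relative to FP32, though
# activations and beam-search buffers during generate() can still dominate.
model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

# 8-bit load via bitsandbytes; needs `pip install bitsandbytes accelerate`
# and, historically, a CUDA GPU.
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```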

Bottom line:
While this model is accurate and performant, there are stability issues, and it cannot be easily quantized.

PS: I am translating from English to Turkish and vice versa.

!!! Thank you, Meta researchers, for providing open-source AI models. I would love to see links to the datasets the models are trained on in the README file. Thank you again, and happy researching :)

Hello sir, can we connect?
