How is inference with this version of grok-1?

#1
by dyoung - opened

First off, thank you hpcai-tech for taking the time to convert this model to PyTorch. I'm hoping this helps the open community. It does seem that Python + PyTorch is the main setup used for most open-source LLMs.

I don't have access to anything that can run it. So, as an interested bystander, I'm curious whether this model has been run successfully by someone with enough hardware and funds to run it in PyTorch. (It is a very large model.)
I'm mostly curious about what is similar and what is different compared to other open-source MoEs. (Or compared to itself in its Rust + JAX form, or to other base models as well.)
Is it coherent, for starters? How efficient is the setup with PyTorch (how many tokens per second)? A rough sketch of what such a check might look like is below.
Were there any hardware surprises/special accommodations needed when trying to run it?
If it seems to run OK, can known evals be run against it to see how it fares against other baseline models?
What other bumps/surprises were there along the way that are unique to this model compared to others?
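
For the coherence and tokens-per-second questions, something along these lines is what I have in mind. This is an untested sketch; it assumes the hpcai-tech/grok-1 repo loads through transformers with trust_remote_code, that a tokenizer is available, and that there is enough GPU memory for device_map="auto" to place the bf16 weights:

```python
# Minimal, untested sketch of a coherence / throughput check.
# Assumptions: "hpcai-tech/grok-1" loads via transformers with trust_remote_code,
# a tokenizer is available, and enough GPU memory exists for device_map="auto".
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hpcai-tech/grok-1"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt").to(model.device)

start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(tok.decode(out[0], skip_special_tokens=True))
print(f"~{new_tokens / elapsed:.2f} tokens/sec")
```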

I understand that this could be asking a lot, and it's OK if questions like these can't be answered anytime soon.

Thanks for taking your time to read this.

It is a mixture-of-experts model!
Quite large, and we don't know how it was trained either.
Was it trained as a whole, or by layer-by-layer expansion?

How did you determine the size when you planned this model, and how did you determine the compute requirements, especially for a mixture-of-experts model?
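
Only the original authors can say how the training compute was budgeted, but just for the inference footprint, a back-of-envelope estimate from the publicly stated figures (314B total parameters, roughly 25% of weights active per token) gives an idea of why this model is hard to run:

```python
# Back-of-envelope inference footprint from the publicly stated grok-1 figures
# (314B total parameters, ~25% of weights active per token), assuming bf16 weights.
total_params = 314e9
bytes_per_param = 2                                   # bf16 = 2 bytes per parameter
weights_gb = total_params * bytes_per_param / 1e9     # ~628 GB just for the weights
active_params_b = 0.25 * total_params / 1e9           # ~79B parameters touched per token
print(f"~{weights_gb:.0f} GB of weights -> several 80GB-class GPUs before any KV cache")
print(f"~{active_params_b:.0f}B active parameters per token")
```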

If the model was trained correctly layer by layer, then it is justified to chop a smaller model out of this model, e.g. by instantiating the model with a custom config limiting the layers to the first 32 for a 7B!
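
A hypothetical sketch of that idea, assuming the converted checkpoint uses a transformers-style config with a num_hidden_layers field (the actual field name in grok-1's custom modeling code may differ), and noting that the truncated model would still need further training/alignment:

```python
# Hypothetical sketch: "chop out" the first 32 layers by shrinking the config.
# Assumes a num_hidden_layers field and that from_pretrained simply skips the
# weights for layers beyond the new depth (they show up as unexpected keys).
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "hpcai-tech/grok-1"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.num_hidden_layers = 32                      # keep only the first 32 layers

small = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, trust_remote_code=True, torch_dtype=torch.bfloat16
)
small.save_pretrained("grok-1-first-32-layers")    # hypothetical output path
```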

Or even using mergekit to take out specific layers, such as 16 from the front and 16 from the back, which after an alignment pass would also be good!
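
mergekit's passthrough merge is the usual off-the-shelf way to do this; the same idea in raw PyTorch might look roughly like the sketch below (attribute names such as model.model.layers are assumptions about the converted checkpoint's structure, and loading the full model on CPU in bf16 needs on the order of 630 GB of RAM):

```python
# Rough sketch of the "16 front + 16 back" layer slice without mergekit.
# Assumptions: the decoder stack is exposed as model.model.layers and the
# config tracks depth in num_hidden_layers; both may differ for grok-1.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "hpcai-tech/grok-1", trust_remote_code=True, torch_dtype=torch.bfloat16
)

layers = model.model.layers                         # full stack of decoder layers
keep = list(layers[:16]) + list(layers[-16:])       # 16 from the front, 16 from the back
model.model.layers = torch.nn.ModuleList(keep)
model.config.num_hidden_layers = len(keep)

model.save_pretrained("grok-1-front16-back16")      # would still need alignment/fine-tuning
```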

Super-large models, if trained correctly, can be valuable resources for model creation, since we can chop out the size we need from the trained layers; hence the need to know whether the model was trained as a whole or layer by layer. So in the future it should be trained layer by layer, especially if you have the compute, because the model can then be used as a base for business models of custom sizes!

Hence I'm also interested to see how the model actually runs from the open-source release! Also, the model should be shipped as safetensors and not PyTorch .bin files, as the industry has moved on quickly, again!
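
If anyone wants to convert it themselves, a minimal sketch is below (the shard filename is hypothetical, and safetensors refuses to serialize aliased/shared tensors, so a real conversion may need to de-duplicate those first). Newer transformers versions can also just re-save the whole model with model.save_pretrained(..., safe_serialization=True).

```python
# Minimal sketch of converting a PyTorch .bin shard to safetensors.
# The filename below is hypothetical; repeat per shard and regenerate the index.
import torch
from safetensors.torch import save_file

shard = "pytorch_model-00001-of-000NN.bin"          # hypothetical shard name
state_dict = torch.load(shard, map_location="cpu")
state_dict = {k: v.contiguous() for k, v in state_dict.items()}  # safetensors needs contiguous tensors
save_file(state_dict, shard.replace("pytorch_model", "model").replace(".bin", ".safetensors"))
```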
