What is a way to verify that the model I am running is performing as expected?

#18
by MarkWard0110 - opened

I would like to know whether there is a way to verify that the model I am executing performs the same as the one published by meta-llama, i.e. a way to verify both the model and the software hosting it.
For example, how would I verify that the output of a Llama 3.1-70B-Instruct model pulled from Ollama's model library and executed by the Ollama host (which may run on various hardware profiles) is comparable to the model and environment used for the published benchmarks?
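
To make the question concrete, this is the kind of comparison I have in mind (a minimal sketch; my assumptions are a local Ollama server on its default port and the example model tags shown below). I understand that even with temperature 0 and a fixed seed, token-for-token equality across different hardware and builds is not guaranteed, but a large divergence on simple prompts would at least be a signal:

```python
# Sketch: send the same prompt, with pinned sampling, to two locally hosted
# model tags (or the same tag on two different hosts) and diff the output.
# Assumes a local Ollama server on the default port; the tags are examples.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(model_tag: str, prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model_tag,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0, "seed": 42},  # pin sampling
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompt = "List the first five prime numbers."
a = generate("llama3.1:70b-instruct-q8_0", prompt)
b = generate("llama3:70b-instruct-q8_0", prompt)
print("Identical output:", a == b)
print("--- 3.1 ---\n" + a)
print("--- 3.0 ---\n" + b)
```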

There are subtle factors that might affect the model, such as how the host was compiled, which GGUF quantization was selected, and which hardware features are available.
I am asking because I have a situation where Llama 3.1 70B Instruct Q8_0 cannot complete a task that Llama 3 70B Instruct completes at both Q8_0 and Q6_K.

My assumption is that Llama 3.1 is supposed to be better than Llama 3, yet I do not observe this. That is why I am looking for some way to confirm that the model is performing as designed.
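
One concrete check I am considering is to re-run a public benchmark locally and compare the score against Meta's published number. Below is a minimal sketch, assuming EleutherAI's lm-evaluation-harness (`lm_eval`) is installed and there is enough GPU memory for the full-precision 70B weights; the task, repo id, and settings are just examples:

```python
# Sketch: reproduce a published benchmark score with lm-evaluation-harness
# and compare it against the number reported on the model card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # parallelize=True shards the model across available GPUs (needed for 70B).
    model_args="pretrained=meta-llama/Llama-3.1-70B-Instruct,dtype=bfloat16,parallelize=True",
    tasks=["mmlu"],   # example task; pick one the model card reports numbers for
    num_fewshot=5,
    batch_size=1,
)
print(results["results"])  # per-task and aggregate accuracies
```

If the locally measured score lands close to the published one, the setup is at least in the right ballpark; a large gap would be evidence that something in my stack is off.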

I know each of the quantized GGUF versions will perform somewhat differently from the original model. However, I assume the FP16 GGUF would perform the same as the original model.
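
As a first sanity check on the artifact itself, I could at least confirm that the GGUF file on disk is byte-identical to the published one by comparing its SHA-256 with the checksum shown on the hosting page (a minimal sketch; the file name and expected hash below are placeholders):

```python
# Sketch: verify the downloaded GGUF is byte-identical to the published file
# by comparing SHA-256 checksums. File name and expected hash are placeholders.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

local_path = "Meta-Llama-3.1-70B-Instruct-Q8_0.gguf"  # placeholder path
expected = "<sha256 listed on the model page>"        # placeholder value
actual = sha256_of(local_path)
print("OK" if actual == expected else f"MISMATCH: {actual}")
```

This only proves the file is the one that was published; it says nothing about how the runtime executes it, but it would rule out a corrupted or mismatched download.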

Llama 3.1 is a small update to Llama 3, mainly increasing the context size and improving coding performance and tool usage.

For groundbreaking changes you will have to wait for Llama 4. I'm pretty sure it will have native multimodal capabilities, similar to GPT-4o, and also much better text-based performance.
