How was the rope_theta value determined?
#6, opened by ddh0
Hi, I see you're using a rope_theta value of 3M. The original Llama 3 70B model used 500k, so I would normally expect a proportionate increase of 32768/8192, or a factor of 4, which would give 2M. Instead you're using a value 6 times the original.
This is intriguing to me because I've recently been experimenting with RoPE theta values and I've currently settled on the formula ((n_ctx_desired/n_ctx_train)^(2^(1/4))) * rope_freq_base_train.
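A minimal sketch of that formula, next to the plain proportional scaling, for the numbers in this thread. The function and variable names are illustrative only and do not come from any library:

```python
def scaled_rope_theta(n_ctx_desired: int, n_ctx_train: int,
                      rope_freq_base_train: float) -> float:
    """Formula from the post: (n_ctx_desired / n_ctx_train) ** (2 ** (1/4)) * base."""
    ratio = n_ctx_desired / n_ctx_train
    exponent = 2 ** 0.25  # fourth root of 2, about 1.189
    return (ratio ** exponent) * rope_freq_base_train


# Values quoted in the post: 8192 training context, 32768 desired, base 500k.
n_ctx_train = 8192
n_ctx_desired = 32768
base_train = 500_000.0

# Plain proportional scaling: multiply the base by the context ratio.
linear_theta = (n_ctx_desired / n_ctx_train) * base_train

# Scaling with the formula from the post.
formula_theta = scaled_rope_theta(n_ctx_desired, n_ctx_train, base_train)

print(f"proportional: {linear_theta:,.0f}")
print(f"formula:      {formula_theta:,.0f}")
```

For these inputs the proportional value is 2M and the formula gives roughly 2.6M, so both fall below the 3M used in the model's config.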
I'd be interested to know more about why this specific rope_theta value was chosen. Thanks!
It came over from abacusai/Smaug-Llama-3-70B-Instruct-32K and I don't know how they determined that value.
sophosympatheia changed discussion status to closed