context window size

#4
by stephen-standd - opened

I can't find this in the blog post or the readme, but what is the context window available for these models? I might have missed it!

An 8K sequence length is noted on their product page: https://mistral.ai/product/
For a more detailed specification, see the announcement: https://mistral.ai/news/announcing-mistral-7b/

Mistral AI_ org

Thanks @ZeroXClem for the answer, and it's also in the Transformers documentation: https://huggingface.co/docs/transformers/v4.34.0/en/model_doc/mistral#model-details
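
For anyone who wants to check the numbers themselves, a minimal sketch of inspecting the model config with Transformers follows. It assumes the mistralai/Mistral-7B-v0.1 model id and relies on the max_position_embeddings and sliding_window fields exposed by the Mistral config; actual values may differ for other checkpoints.

```python
from transformers import AutoConfig

# Assumed model id; swap in the checkpoint you are actually using.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

print(cfg.max_position_embeddings)  # maximum positions the model was configured for
print(cfg.sliding_window)           # sliding-window attention span (4096 for Mistral 7B)
```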

lerela changed discussion status to closed

Technically the context is unlimited, with a 4k sliding-window attention span.

Under the hood, the stacked layers make it possible to indirectly attend to tokens more than 4k positions back, but that requires multiple attention hops, and there is no backward propagation of the attention query to positions before the sliding window. In effect, if something lies outside the sliding window, the model can only use it if it already attended to it at a position less than 4k tokens earlier; beyond the window, the model's capabilities are therefore closer to LSTM-level than to full attention.
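
To make that concrete, here is a small illustrative sketch (not Mistral's actual implementation) of a sliding-window causal mask: each token attends only to itself and the previous window - 1 tokens, and stacking layers widens the effective receptive field by roughly one window per layer.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to position j iff
    j <= i (causal) and i - j < window (sliding window)."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]              # lower-triangular (no peeking ahead)
    in_window = (idx[:, None] - idx[None, :]) < window  # only the last `window` tokens
    return causal & in_window

# With window=4, token 9 attends directly to tokens 6-9; information from
# token 0 can still reach it, but only by hopping through intermediate
# tokens across several layers (roughly one extra window per layer).
print(sliding_window_causal_mask(seq_len=10, window=4).int())
```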
