Inference API
Hi @Omnibus,
I want to use the Inference API for the gemma-7b model. Could you please share some documentation I can use as a reference for how to pass my context and question in the request body, and which standard parameters should be set to get the desired output?
I am using the Omnibus Space and passed a fairly large prompt and system prompt, but it shows this error:
"Wait, that's too many tokens, please reduce the 'Chat Memory' value, or reduce the 'Max new tokens' value"
How can I avoid this? I checked the Space's files and found the condition len(prompt + system_prompt) > 8000, which returns that text.
Please suggest a solution to this; it would be really helpful.
Thanks & Regards.
Shiv Kumar
Documentation for model: https://huggingface.co/google/gemma-7b
This demo uses the huggingface_hub InferenceClient. That is not a documented way of deploying these models, so they may not be optimized for it.
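As a rough, untested sketch (not this demo's exact code), passing your context/question plus the standard generation parameters through the InferenceClient could look like this; the prompt follows Gemma's turn format, and the parameter values are only illustrative:

```python
# Rough sketch of calling gemma-7b through huggingface_hub's InferenceClient.
# Parameter values are illustrative, not this demo's defaults.
from huggingface_hub import InferenceClient

client = InferenceClient(model="google/gemma-7b", token="hf_...")  # your HF token

prompt = (
    "<start_of_turn>user\n"
    "You are a helpful assistant.\n"               # context / "system" text
    "What does the 'Chat Memory' setting do?<end_of_turn>\n"
    "<start_of_turn>model\n"
)

response = client.text_generation(
    prompt,
    max_new_tokens=256,      # how many tokens the model may generate
    temperature=0.7,         # sampling temperature
    top_p=0.95,              # nucleus sampling
    repetition_penalty=1.1,
)
print(response)
```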
This model has a maximum limit of 8192 tokens, which covers the input + output combined.
This demo retains a number of previous [(input, output), (input, output)] conversations that are in the Chatbot window (Chat Memory).
These previous conversations + the new input + the new output must total fewer than 8000 tokens, or the error you mentioned will be raised.
The default Chat Memory is 4 conversations, but this can be reduced to provide more tokens for new (input, output), at the expense of losing some context of the conversation.
Reducing the 'Max new tokens' value tells the model to return fewer tokens in its output, which also helps stay within the 8192 total token range.
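In simplified form, the check behind that error looks roughly like this (a sketch only; the variable names and exact logic are hypothetical, not the Space's actual code, and note it compares string length as a rough proxy for tokens, matching the len(prompt + system_prompt) > 8000 condition you found):

```python
# Simplified, illustrative sketch of the budget check described above.
PROMPT_BUDGET = 8000  # must leave room for the output within the 8192-token limit

def build_prompt(chat_pairs, chat_memory, system_prompt, new_input):
    # Keep only the last `chat_memory` (input, output) pairs from the Chatbot window.
    history = "".join(f"{u}{a}" for u, a in chat_pairs[-chat_memory:])
    prompt = history + system_prompt + new_input
    if len(prompt) > PROMPT_BUDGET:
        return ("Wait, that's too many tokens, please reduce the 'Chat Memory' "
                "value, or reduce the 'Max new tokens' value")
    return prompt
```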
To modify the parameters that are sent with the prompt + system prompt using this demo's UI, there is an accordion labelled "Modify Prompt Format".
The default prompt format is: "<start_of_turn>userUSER_INPUT<end_of_turn><start_of_turn>model", where USER_INPUT will be replaced by the values in the "Prompt" + "System Prompt" input boxes.
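A small illustrative sketch of that substitution (the variable names are hypothetical, not the Space's code):

```python
# Illustrative sketch of filling the template; not the Space's exact code.
prompt_template = "<start_of_turn>userUSER_INPUT<end_of_turn><start_of_turn>model"

system_prompt = "You are a concise assistant."   # "System Prompt" box
user_prompt = "Explain what Chat Memory does."   # "Prompt" box

# USER_INPUT is replaced by the combined System Prompt + Prompt text.
final_prompt = prompt_template.replace("USER_INPUT", f"{system_prompt}\n{user_prompt}")
print(final_prompt)
```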