mistralai/Mistral-7B-Instruct-v0.1 · Using FastAPI to query the model

moc1pher

Feb 26

How can I use FastAPI to query the model? Is there an example I can use?

Suparious

Feb 26

edited Feb 26

tabbyAPI is built on FastAPI, so you can probably learn a lot from it:
https://github.com/theroyallab/tabbyAPI

moc1pher

Feb 27

Is there an option that is purely FastAPI?

Suparious

Feb 27

edited Feb 27

This might fit your use case better: https://github.com/c0sogi/LLMChat
It is based on FastAPI, but without the OpenAI API layer.
I just like tabbyAPI for running local models as exl2 quants and consuming them the way you would the OpenAI API.
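
For anyone wondering what "consuming it like the OpenAI API" looks like from the client side, here is a rough sketch using only the standard library. The base URL, port, API key, and model name are all placeholders I made up, not tabbyAPI defaults I can vouch for, so check your server's config. Only the request shape (a `messages` list posted to `/chat/completions` with a Bearer token) follows the OpenAI chat completions format.

```python
import json
import urllib.request

# Assumed values: point these at wherever your OpenAI-compatible server listens.
BASE_URL = "http://localhost:5000/v1"  # hypothetical host and port
API_KEY = "your-api-key"               # hypothetical placeholder


def build_chat_request(prompt: str, model: str = "local-model") -> urllib.request.Request:
    """Build a POST request in the OpenAI chat completions format."""
    body = {
        "model": model,  # hypothetical model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` only works while such a server is actually running at `BASE_URL`.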

hamza13-10

13 days ago

edited 13 days ago

I wrote the following example that you can use:

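from typing import Optional

import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setup that the snippet relies on (assumed): load the model and tokenizer once at startup
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

app = FastAPI()

# In-memory store for conversation history, keyed by session_id
session_history = {}

class ChatInput(BaseModel):
    session_id: str
    prompt: str
    clear_history: bool = False
    tone: Optional[str] = None  # optional tone hint prepended to the prompt

class ChatOutput(BaseModel):
    response: str

@app.post("/bot", response_model=ChatOutput)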
def chat_bot(input_data: ChatInput):
    try:
        session_id = input_data.session_id
        prompt = input_data.prompt
        clear_history = input_data.clear_history
        tone = input_data.tone

        if clear_history:
            session_history.pop(session_id, None)

        if session_id not in session_history:
            session_history[session_id] = []

        chat_history = session_history[session_id]

        if prompt:
            if tone:
                prompt = f"[Tone: {tone}] {prompt}"

            chat_history.append({"role": "user", "content": prompt})

            input_ids = tokenizer.apply_chat_template(chat_history, return_tensors="pt").to(device)

            input_length = input_ids.shape[1]

            # Generate output from the model
            outputs = model.generate(
                input_ids,
                max_new_tokens=200,
                do_sample=True
            )

            generated_tokens = outputs[0][input_length:]

            assistant_response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

            chat_history.append({"role": "assistant", "content": assistant_response})

            # Return the response to the client
            return {"response": assistant_response}
        else:
            return {"response": "Chat history cleared." if clear_history else "No prompt provided."}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
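
To try the endpoint from another script, something like the following should work. This is a minimal sketch: it assumes the server was started with uvicorn on its default address (127.0.0.1:8000), and `build_payload` and `ask` are helper names I made up for illustration. The payload fields mirror the `ChatInput` model above.

```python
import json
import urllib.request

# Assumed address: uvicorn's default host and port; change it if you run elsewhere.
BOT_URL = "http://127.0.0.1:8000/bot"


def build_payload(session_id, prompt, tone=None, clear_history=False):
    """Build the JSON body expected by the /bot endpoint (mirrors ChatInput)."""
    payload = {
        "session_id": session_id,
        "prompt": prompt,
        "clear_history": clear_history,
    }
    if tone is not None:
        payload["tone"] = tone  # only sent when a tone is requested
    return payload


def ask(session_id, prompt, tone=None, clear_history=False):
    """POST a prompt to the bot and return the text of its reply."""
    body = json.dumps(build_payload(session_id, prompt, tone, clear_history))
    req = urllib.request.Request(
        BOT_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

With the server running (for example `uvicorn main:app`, assuming the code above is saved as main.py), `ask("demo", "Hello!", tone="friendly")` sends one turn. Reusing the same `session_id` on later calls continues the conversation, and passing `clear_history=True` resets it.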