m-ric posted an update 9 days ago
> Article read: Simple guide to LLM inference and TGI

I've just read the article "LLM inference at scale with TGI" by @martinigoyanes. It's really good content, a must-read if you want a good low-level intro to LLM inference with TGI!

My takeaways:

How does inference work?
🧠 Prefill: the input prompt is tokenized on CPU, then transferred to GPU. A single forward pass over the whole prompt then generates the first output token.
🔄 Decode: the model generates ("decodes") tokens one by one: each new token is appended to the current input of length N, and this augmented input of length N+1 is fed back through the model to produce the next token. The loop ends either when a special "end-of-sequence" (EOS) token is generated or when the completion reaches a pre-specified maximum length. The sequence is then de-tokenized on CPU to yield text again.
⏱️ The speed of this decode step determines the Time Per Output Token, which directly translates to the key metric: throughput. (A minimal sketch of both phases follows below.)
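
To make the two phases concrete, here is a minimal greedy-decoding sketch using Hugging Face transformers (gpt2, the prompt, and the token budget are arbitrary choices for illustration, not anything from the article). Note that without a KV cache it re-runs the forward pass over the whole growing sequence at every step:

```python
# Minimal sketch of prefill + decode (greedy decoding, no KV cache yet).
# gpt2 is just an arbitrary stand-in model for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval().to(device)

# Prefill: tokenize on CPU, move the token ids to the model's device,
# then one forward pass over the whole prompt yields the first new token.
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    logits = model(input_ids).logits
next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
input_ids = torch.cat([input_ids, next_token], dim=-1)

# Decode: repeatedly append the new token (input grows from N to N+1)
# and run another full forward pass, until EOS or a max length is reached.
max_new_tokens = 20
for _ in range(max_new_tokens - 1):
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

# De-tokenize back to text on CPU.
print(tokenizer.decode(input_ids[0].cpu(), skip_special_tokens=True))
```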

🤔 How was the separation between the two steps decided? Like, why does prefill include this strange generation of only one token at the end?
➡️ The cost of attention scales quadratically with the number of tokens, so it can explode really quickly.
To compensate for that, a really important technique called KV caching was devised. When generating token N+1, the Key and Value (K and V) matrices computed inside the Transformer are a simple extension of the K and V matrices from the previous step, so the model caches them between steps. Hence the separation: prefill is the phase that builds this KV cache, while decoding is the phase that leverages it and extends it by one entry at each step.
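
Here is the same toy loop rewritten to use the KV cache via transformers' past_key_values: after prefill, each decode step feeds only the single newest token and extends the cached K and V by one entry (again just a sketch, with gpt2 as an arbitrary stand-in):

```python
# Greedy decoding with KV caching: prefill builds the cache,
# decode reuses and extends it instead of recomputing K/V for the whole sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval().to(device)

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to(device)

# Prefill: one pass over the prompt builds the KV cache and yields the first token.
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
generated = [next_token]

# Decode: each step feeds only the newest token and extends the cache by one.
for _ in range(19):
    with torch.no_grad():
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated.append(next_token)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(torch.cat(generated, dim=-1)[0].cpu(), skip_special_tokens=True))
```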

TGI-specific takeaways:
⚙️ TGI has many SOTA techniques for decoding: Paged Attention, KV Caching and Flash Attention…
🔀 TGI's router handles generations finishing early because of an EOS token: instead of static batching, it continuously batches requests to the inference engine & filters away finished requests.
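
To illustrate the idea (this is NOT TGI's actual router code, just a hypothetical toy scheduler): each iteration admits waiting requests into the running batch, runs one decode step for the whole batch, and immediately filters out sequences that hit EOS or their length limit, freeing slots for new requests instead of waiting for the whole batch to finish:

```python
# Toy sketch of continuous batching; all names and values are made up.
from collections import deque
from dataclasses import dataclass, field
import random

EOS = -1  # hypothetical end-of-sequence token id

@dataclass
class Request:
    rid: int
    generated: list = field(default_factory=list)
    max_new_tokens: int = 8

def decode_one_step(batch):
    # Stand-in for one forward pass of the inference engine over the batch.
    return {req.rid: random.choice([EOS, 42]) for req in batch}

waiting = deque(Request(rid=i) for i in range(5))
running = []

while waiting or running:
    # Continuous batching: new requests join mid-flight, up to a batch limit.
    while waiting and len(running) < 4:
        running.append(waiting.popleft())

    new_tokens = decode_one_step(running)
    for req in running:
        req.generated.append(new_tokens[req.rid])

    # Filter away requests that hit EOS or their max length,
    # freeing their slots for the next waiting requests.
    finished = [r for r in running
                if r.generated[-1] == EOS or len(r.generated) >= r.max_new_tokens]
    running = [r for r in running if r not in finished]
    for r in finished:
        print(f"request {r.rid} finished after {len(r.generated)} tokens")
```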