The insane waste of time and money in LLM token generation
If you've worked with GPT and other LLMs like Claude or Mistral, you've likely noticed a big boost in productivity. However, the more you use Generative AI, the more you run into the Pareto Principle: the finishing touches become harder than creating the main piece.
This issue becomes more noticeable as your content (code, text, etc.) grows, making even small changes feel tedious, inefficient, and costly in both time (delays) and resources. I've studied this pattern for years, especially in code generation, and I want to share my observations and potential solutions to these challenges, helping you create great work with **a lot less cost, effort, and delay**.
Sounds abstract? Let's take an example.
I need to create a landing page for my new product/service. The landing page should be designed to attract and convert visitors by clearly communicating the value proposition, features, and benefits...
You inspect it and like it, but then you want this tiny change.
So, you ask GPT to make the modification.
... the hero section should ...
**A few seconds** (or minutes) later, depending on the length of the original document, it regenerates the entire document.
It makes some changes, but it's not what you expect.
So, you ask again.
... make some adjustments. The hero section is too cluttered.
A few seconds (or minutes) later, depending on the length of the modified document... and again, it regenerates the entire document.
It's better, but still not what you expected.
And on and on.
Let's illustrate this behaviour (thanks, Claude):
The larger the document, the more tedious, time-consuming, and expensive this undertaking becomes.
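To put a number on it, here's a tiny back-of-the-envelope simulation in Python. The document sizes, the number of edit rounds, and the size of the actual change are all assumptions I picked for illustration, not measurements from any model:

```python
# Toy model (not a benchmark): assume each small edit request
# forces the model to regenerate the whole document.
DOC_TOKENS = [500, 2_000, 8_000]  # assumed document sizes, in output tokens
EDIT_ROUNDS = 5                   # assumed number of "tiny change" iterations
CHANGED_TOKENS = 50               # assumed size of the part we actually wanted changed

for doc in DOC_TOKENS:
    regenerated = doc * EDIT_ROUNDS        # full rewrite every round
    useful = CHANGED_TOKENS * EDIT_ROUNDS  # what we actually asked for
    waste = 100 * (1 - useful / regenerated)
    print(f"{doc:>6} tokens/doc: {regenerated:>6} tokens regenerated, "
          f"~{waste:.0f}% of them never asked for")
```

In this toy model, over 90% of the tokens you pay for were never part of the change you requested, and the share only grows with document size.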
It can be very frustrating, especially after the hope and enthusiasm you felt about the initial generation.
We encounter these behaviours and challenges across many use cases. Let's take a quick look at some examples (very briefly):
- Summarising: big context and a summary in one piece.
  ☞ Works well if you have a good prompt and a good model.
- Answering questions on a long document
  ☞ Works pretty well, even for long documents, especially if you have caching.
- Initial code generation (greenfield) for, say, a hello world application
  ☞ Usually easy and quick if the model is good.
- **Modifying generated code (after initial generation)**
  ☞ Depends heavily on the reasoning capabilities of the underlying model, the size of the files, and the change you want to make.
- **Editing existing documents, contracts, etc.**
  ☞ Depends on the number of changes, the length of those changes, and how they are distributed (in one paragraph or over many paragraphs, etc.); see the sketch after this list.
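That last point about distribution deserves a number. Here's a small sketch (my own simplification, not the behaviour of any particular tool) assuming the editor regenerates one contiguous selection that has to span from the first edit to the last:

```python
# Toy model: a document is a list of paragraph sizes (in tokens).
paragraphs = [120] * 40  # assumed: 40 paragraphs of ~120 tokens each
doc_tokens = sum(paragraphs)

def selection_cost(edited):
    """Tokens regenerated if one contiguous selection must cover all edits."""
    lo, hi = min(edited), max(edited)
    return sum(paragraphs[lo:hi + 1])

clustered = [0, 1, 2]    # three edits in one place
scattered = [0, 13, 27]  # the same three edits spread across the document

print("full rewrite:   ", doc_tokens, "tokens")
print("clustered edits:", selection_cost(clustered), "tokens")
print("scattered edits:", selection_cost(scattered), "tokens")
```

Three clustered edits cost 360 tokens to regenerate; the same three edits scattered across the document drag in everything between them.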
Let's look at it in a more abstract way. If we view these use cases from 1,000 feet up, we can group them into a few high-level scenarios.
**Initial generation of long text:**
Duration and cost depend heavily on the length of the text, the model's generation speed, and its price per token. Progress is being made in this area almost daily across the industry.
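For a rough feel of what that means, here's the arithmetic; the throughput and price below are placeholder assumptions, not quotes for any real model:

```python
# Back-of-the-envelope: generation time and cost for a long document.
output_tokens = 8_000        # assumed length of the generated document
tokens_per_second = 60       # assumed model throughput
price_per_1k_output = 0.015  # assumed price in USD per 1K output tokens

seconds = output_tokens / tokens_per_second
cost = output_tokens / 1_000 * price_per_1k_output
print(f"~{seconds:.0f}s and ~${cost:.3f} per full generation")
```

The point is the shape of the arithmetic: both the wait and the bill scale linearly with output length, and every full regeneration pays them again in full.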
**Editing parts of a long text that are close together:**
In many editors, such as Canvas or Artifacts, you can select just the part you want to edit, which limits the regeneration to that selection and avoids re-emitting a lot of unmodified text.
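Conceptually, that selection trick is simple string surgery: send only the selected span and the instruction to the model, then splice the result back into the untouched document. A minimal sketch, where `llm_rewrite` is a hypothetical stand-in for whatever model call you actually use:

```python
def llm_rewrite(snippet: str, instruction: str) -> str:
    """Stand-in for a real model call (hypothetical, not a real API)."""
    return f"[{len(snippet)}-char span rewritten per: {instruction}]"

def edit_selection(doc: str, start: int, end: int, instruction: str) -> str:
    """Regenerate only doc[start:end]; the rest of the document is reused verbatim."""
    return doc[:start] + llm_rewrite(doc[start:end], instruction) + doc[end:]

page = "<header>...</header><section id='hero'>cluttered hero</section><footer>...</footer>"
start = page.index("<section")
end = page.index("</section>") + len("</section>")
print(edit_selection(page, start, end, "declutter the hero section"))
```

The tokens you are billed for scale with the selection, not with the whole page.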
**Editing several parts of a short text:**
When we are dealing with short text, it does not really matter whether the changes are spread out or not: regenerating the whole thing does not take much time, so this should not be an issue for most fast, cheap models.