Good LLM Dev and Usage Patterns
- published
Introduction
This is a list of patterns regarding usage of LLMs (Large Language Models) that I’ve observed result in positive outcomes. I’ve split them into two categories:
- Usage, for patterns used in systems that are partly agentic, i.e. the system utilizes LLMs during its operation. The examples given are focused on coding agents, but they are generally applicable.
- Development, for patterns used during software development
If the category split feels arbitrary, feel free to ignore the (ir)relevant headings.
Usage Patterns
Criticism helps LLMs too
The actor/critic pattern is where input is given to one LLM agent, called the ‘Actor’, and the output of that is fed into a different LLM agent, called the ‘Critic’. The process is iterative; the Critic is instructed to be stringent in what they accept, and failure to meet their standards results in the actor being given the Critic’s feedback, and asked to produce a better result.
This can be an expensive way to dramatically increase the quality of output, similar to the “four eyes” method / pair programming. Note that the model(s) used, and any processing of the context, are implementation details. It is a good idea to keep track of decisions made by the system in a way that persists across iterations, such that the agents gradually refine the approach with a decreasing set of changes each time, until the Critic fully approves.
An alternative (or complement, if you have infinite budget) of this is to generate multiple responses to each prompt, and pick the best one. I believe that is not good practice, as it does not scale very well, and iterative improvements generally pay off better vs attempting to pick between what most often is different ways of expressing a very similar one-shot result. The tokens you spend on the critic + an additional iteration are therefore more efficiently spent than the tokens on multiple permutations of any given step.
Minimize the context needed to accomplish each task
Humans split work into tasks. Each one is a building block towards creating a bigger piece of software. Agents too should only receive the information they need to accomplish one (relatively) small task at a time.
In practice, this means that the documents that the agent needs to read to get up to speed are concise, the agent’s harness only has the necessary tools that the agent needs to do their job, and QA / testing / deployment are separate concerns from development. The more each agent has to do, the more opportunities they have to forget or deviate from their assigned tasks.
This also ties in with good privacy & security practices: You don’t need the agent to read or provide e.g. user IDs to call a tool, your software can provide that info itself when the call is made. This both makes it easier for the agent to call a tool, because less tokens are needed to call it successfully, and prevents the agent from having to handle confidential information.
Making sure that the agent needs to output the minimum number of tokens to accomplish something makes the likelihood of success higher (in general at least, with diminishing benefits). In general, the less the agent is expected to do (at least, per invocation), the higher the likelihood of the agent producing an acceptable result. As the software being built starts to take shape, the context needed to make even a small change will naturally increase. Keeping the increase minimal ensures that agents can work cheaper (less tokens of context), faster (less tokens to process) and better (greater likelihood of producing good results).
Be cautious and defensive
On a less positive note, remember that what an agent can do, it should be expected to (eventually) do. So never give any agent enough rope to hang you with, because it is not impossible that it might.
It is important to expect that the agent will abuse anything given to it, especially if it ever misinterprets an instruction. Some providers suggest having a second LLM assess users’ prompts before sending them to your agent. This does make attacks harder, but not impossible, and it should not be considered a robust defense on its own. A better approach is to never give access to extremely damaging tools to agents, make the tools require user confirmation, or have the agent work in an environment where any damage that it does can be easily reverted (which technically makes the tools provided to the agent not extremely damaging, but if I don’t mention sandboxes explicitly someone might think I forgot).
Do note that having tools require user confirmation is not ideal if it will be frequently requested, as humans quickly get alert fatigue. If you cannot reliably alert humans to only the agents’ actions that are impactful enough to require a second set of eyes, instead opt for not giving the agent access to damaging tools in the first place.
Regarding defensiveness, imposing constraints on what valid agent output looks like, that are simple enough to be checked deterministically (i.e. enforcing that a field only takes a number), means that you can catch simple mistakes that the agent could make and re-prompt the agent gracefully, without surfacing errors to the user. Forcing structured output out of agents also helps with parsing it more easily, which can be extremely important - having agents embody the robustness principle helps with making sure that your system is robust.
An easy example of both minimizing context and being defensive is having a coding agent’s harness run formatting, linting, and static analysis tools once the agent commits a set of changes to the Version Control Software (VCS) of choice. This absolves the agent of the responsibility of running the quality/testing pipelines itself, and the user from having to trust the agent to follow good development practices.
You don’t have to use the latest and greatest
State-Of-The-Art (SOTA) LLMs are very good. They are also very expensive. If you do not have a task that warrants using them, opt for something cheaper. If you can have agents produce a non-agentic solution to the problem, even better.
A lot of times, a better harness / prompt / process can provide a meaningful boost to the performance of LLMs. Unless someone is forcing you to spend tokens, you should be mindful of your expenditure, because LLMs are not free, and cost-efficient solutions could mean your bill has one less zero tacked on.
In certain cases, there may not be a need to add LLMs to a process at all. Taking a step back might reveal a better way to architect a process or a system that is simpler, and more efficient, than what currently exists. If you can simplify a process to the point where no intelligence is needed to perform it, automating it will be cheaper, faster, and more reliable than using AI.
Mind the compaction
There are many context reduction strategies that can be implemented, depending on what the agent is doing. Assuming that you will not hit the context window limit is bad practice; compaction should be an explicit consideration. Keeping the context amount low tends to give better results; minimal instructions are easier to follow well.
Tool calls and results are (typically) less useful than the accompanying thought stream. If you compact on a value-of-each-token-in-the-output basis, removing the tool calls and outputs of all-but-the-last-X invocations helps maintain coherency for longer.
All types of context compaction are lossy; even if the text can be compacted while preserving all the information/meaning, the semantics will change, which might have an impact on future responses. So there is pressure to delay compaction as much as possible. Summarizing all but the latest X messages is preferable if the agent works on a task in ‘chunks’, to try and preserve recent reasoning and maintain the agent’s current course better. Different approaches will have different costs, and minding how cached tokens work & are priced is important.
There is no compaction strategy that is clearly better than all the others, since a lot will depend on budget and use-case. You can drop messages, drop tokens, summarize, ask users to start a new session, or anything else as appropriate.
Development Patterns
A failure to plan is a plan to fail
You need to have a good idea of what success looks like before you begin work. If you do not, you will not be able to draft a comprehensive enough plan for agents to execute. If you do not draft a comprehensive plan before you ask agents to execute, you will get bad and/or incomplete solutions. For toy projects, that can be acceptable. For production stuff, that will mean that your agent will forget to turn off anonymous signup/login in your SaaS app, it will wire things ten different ways using JavaScript libraries whose creators forgot they exist, a crash at 0100 will result in having to re-deploy everything from scratch when you wake up, and the occasional data loss will mean that you inadvertently meet your GDPR obligations.
Restated a bit more clearly: For tasks that warrant more than a throwaway script, build a comprehensive plan, and iterate on it until neither you nor your agent(s) can find anything more (of substance) to consider. One more idiom, for the road: An ounce of prevention is worth a pound of cure.
Test away uncertainty
There are a lot of unknown unknowns when it comes to development. Creating a lot of Proof-of-Concept (PoC) scripts to test aspects of the approach is cheap, and can surface issues before committing to writing production-ready code. PoCs can also help document why certain decisions were made, as they can, for example, demonstrate performance and usability differences between libraries.
Any decision you neglected to opine on, agents will decide for you. If your overall plan is not made clear enough, some decisions could accelerate the accumulation of tech debt and result in increased code churn. It is important to have a good idea of what is important and what is not. What is important should be tested, validated, and documented, what is unimportant can be left for the agent to decide.
Frequently do reviews and revisions
LLMs can also (help you) review code. If it’s been a while since a piece of code was looked at, ask an LLM if it’s well-written, if it has any bugs, if it can be improved, if a cautious and diligent senior-principal super 10x rockstar engineer would approve of it, if it was written today. Doing that a couple of times should surface most of the issues worth fixing, if any. Both people and LLMs make mistakes; adding more eyeballs surfaces bugs that were previously overlooked . Code does not have to languish abandoned once written; maintenance is now cheaper, so we should take advantage.
Write your own benchmarks
Benchmarks can be, and are, gamed by labs to get a higher ‘score’. In certain cases, content from the benchmarks could find its way into the training data, which makes the results of benchmarks dubious as LLMs can recall information very well . In some cases, benchmarks may be poorly constructed, allowing agents to ‘cheat’. In all cases, benchmarks test LLMs in a way that most likely will not match your own usage. It is therefore wise to test what models fit your use case the best, and most importantly, have some means of scoring them such that you can identify and use whatever model is good enough for your purposes, without over/under spending.
Scoring should not be based on vibes. There is a lot that goes into making a good benchmark, and a lot of details are going to vary based on use-case. If token costs are going to be a factor at any point (and they typically become one eventually), investing in figuring out what LLM is the most efficient for you is going to be worthwhile. Note that we have reached a point where for modern LLMs, using more tokens will almost always give better results, at least for tasks with defined acceptance criteria. There is, naturally, a limit to how much you can express per token. Make sure that you are aware of both your time and money budget(s), and be consistent with these across your benchmarks.
Testing your own harness/tools that the agent is meant to use is also really important, as you will be able to see if your tools help the agent perform better or worse than whatever baseline you have established.
Again, it is entirely possible that the best model for your use case is not the most expensive one available - do not default to any model just because it is popular.
Test multiple prompt/response iterations
LLM output is non-deterministic. You should definitely test agents’ responses to e.g. malicious prompts more than once, especially when using smaller models. You don’t need to collect a lot of samples, but you do need to be sure that your agentic pipelines work properly (much) more often than not.
Some thoughts
It is easy to both overstate and understate the impact of LLMs. Depending on what ‘good enough’ looks like, LLMs can accelerate output and help people ship more, faster. The quality of engineering is not important to most people, as evidenced by the wide adoption of OpenClaw and derivatives . Organizations and individuals should decide for themselves how strict they want to be with setting a baseline for the quality of output shipped, at the cost of the speed at which the output is generated.