By Kenny Pearson | ML Engineer, Mantel
Executive summary
Key takeaways for business leaders
Production LLM agents spend a surprising share of their input budget on words the model doesn’t need: articles, hedging, politeness, and the connective tissue we reflexively write because that’s how humans talk to each other. We tested what happens when you strip it out.
Working on a client’s AI-powered scheduling assistant, we spent half a day rewriting the system prompt and every tool description in a compressed register, preserving every rule, tool name, enum value, and example, and cutting only the scaffolding. The results:
- ~35% fewer input tokens per turn
- Average latency down ~22%
- Eval quality flat across every metric we track, with a small positive bump on answer relevancy
- Dollar-for-dollar savings compounded across every turn, every session, forever
This piece covers what we changed, what we measured, and the caveats worth knowing if you want to try the same thing on your own agent.
Machines don’t need pleasantries
Machine talk to machine. No please. No thank you. No “could you perhaps”. Just meaning. Fewer words → same result → less cost. Human read this and flinch. Machine read this and work.
Both the author and editor of this article were reminded of the scene from The Office in which Kevin Malone decides to stop using small words: “Why waste time say lot word when few word do trick?”
And yet, while Kevin’s humorous take was perhaps a bit ahead of its time, the underlying lesson rings true. The paragraph that opens this section carries the same information as a polite version two or three times its length. You understood it. A model would too, and for roughly a third fewer tokens.
In 2017, Facebook researchers trained two chatbots to negotiate. Left alone, the bots started communicating in a stripped-down syntax that looked broken to humans but carried the same meaning:
Bob: “i can i i everything else”
Alice: “balls have zero to me to me to me…”
The press panicked. Facebook shut it down. However, the agents weren’t inventing AGI. They’d just figured out what every compiler engineer already knows: when the only constraint is getting the other side to understand, you can drop most of the words humans pad speech with.
A more recent (and cuter) example: two voice agents on a phone call realise they’re both AI and switch to a high-density audio protocol mid-conversation. Same instinct: strip out what isn’t carrying signal.
We mold the solution to our habits, not the other way around
When we build agentic systems, we tend to shape the agent around how we already work. We write system prompts in the voice we’d use to brief a junior colleague; we describe tools the way we’d describe them in a design doc. When the agent misbehaves, we reach for more prose: another sentence, another clarification, another “please make sure to”. That’s comfortable because it mirrors how we talk to each other, but every word you write for the agent’s benefit becomes an input token you pay for on every turn. It accumulates silently.
Natural language is both the headline feature of modern LLMs and their biggest operational tax. The blessing is real: explainability by default, plain-English debugging, stakeholders who don’t write code can contribute directly to agent behaviour. The curse is that the model reasons in words too, and those words are billed by volume. A purely-LLM agent that handles everything through prose is carrying overhead that a traditional system would handle for a fraction of the cost.
The right balance, especially for action-heavy agents, is to mix registers. Keep LLMs where language is the actual interface: interpreting user intent, orchestrating tool calls, explaining results. Push the deterministic work out to traditional ML, rules, and automation. Don’t make the LLM rediscover the same facts on every call. The prompt compression we’re about to describe is this instinct applied one layer up: reserve natural language for where it’s earning its keep, and be terser everywhere it isn’t.
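To make the split concrete, here is a minimal Python sketch, not anything from the client’s actual system: the helper names and the business-hours policy are invented for illustration. The deterministic check lives in code, and the model only ever sees the pre-computed result.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical policy for illustration: bookings must fall between 9am and 5pm local time.
BUSINESS_HOURS = (9, 17)

def within_business_hours(slot: datetime, tz: str) -> bool:
    """A rule check the LLM never needs to reason about."""
    local = slot.astimezone(ZoneInfo(tz))
    return BUSINESS_HOURS[0] <= local.hour < BUSINESS_HOURS[1]

def build_context(slot: datetime, tz: str) -> str:
    """Hand the model a pre-computed fact instead of asking it to derive one."""
    ok = within_business_hours(slot, tz)
    return f"slot={slot.isoformat()} tz={tz} within_business_hours={ok}"

# The LLM call (not shown) receives build_context(...) in its prompt and only does
# the language work: interpreting intent, orchestrating tools, explaining results.
print(build_context(datetime(2025, 6, 2, 3, 30, tzinfo=ZoneInfo("UTC")), "Australia/Melbourne"))
```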
Why this matters for any team running an LLM agent
LLM bills scale with volume, not price. As soon as a model is good enough to ship, teams start using it everywhere, and a production LLM agent typically makes 3-5 model calls per user turn, with every one of those calls re-sending the same system prompt and tool definitions.
Most of what’s in that system prompt is glue: articles, hedging, pleasantries, “please make sure to always…”, the polite scaffolding we reflexively write because that’s how people talk to each other. An LLM doesn’t need any of it. You’re paying for it anyway, on every call, forever.
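A back-of-envelope calculation shows how quickly that glue adds up. Every number below is an illustrative assumption, not a figure from our engagement; plug in your own prompt size, call volume, and provider pricing.

```python
# Illustrative only; every constant is an assumption, not a measured figure.
PROMPT_TOKENS  = 4_000     # system prompt + tool definitions, re-sent on every call
CALLS_PER_TURN = 4         # a typical agent turn: plan, tool call(s), final answer
TURNS_PER_DAY  = 50_000    # across all sessions
PRICE_PER_MTOK = 3.00      # USD per million input tokens (varies by provider and model)

def monthly_prompt_cost(prompt_tokens: int) -> float:
    tokens_per_day = prompt_tokens * CALLS_PER_TURN * TURNS_PER_DAY
    return tokens_per_day * 30 / 1_000_000 * PRICE_PER_MTOK

before = monthly_prompt_cost(PROMPT_TOKENS)
after  = monthly_prompt_cost(int(PROMPT_TOKENS * 0.65))  # ~35% fewer input tokens
print(f"before: ${before:,.0f}/month  after: ${after:,.0f}/month  saved: ${before - after:,.0f}")
```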
The realisation we came to: the style is costing us, the content isn’t.
What we did
The idea isn’t ours. It’s inspired by caveman, a Claude Code plugin you may have come across, which compresses prompts in exactly this style. What we wanted to know was whether it held up on a real production agent under eval.
We rewrote the instructions we feed our LLM agent in a compressed register. Every rule, tool name, enum value, and example stayed the same. Only the connective tissue got stripped out.
Before:
“When you encounter ambiguous instructions, missing required information, or multiple valid interpretations, you MUST ask clarifying questions BEFORE proceeding. Never guess or assume.”
After:
“Ambiguous input, missing required info, or multiple valid interpretations → MUST ask clarifying question BEFORE proceeding. Never guess.”
35% shorter. Same meaning. You can see the pattern once you look at the two side by side: articles gone, hedging gone, causality shown with an arrow rather than spelled out in a subordinate clause. The instruction still behaves the same way at inference time, because the markers the model actually keys on (MUST, BEFORE, Never) are all still there. What we cut was the part that was only ever there for a human reader’s comfort.
We applied this approach to the agent’s system prompt (several thousand words of operating rules) and to every tool description across the MCP server. Then we measured.
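Tool descriptions compress the same way. The example below is a hypothetical booking tool, not one of the client’s actual MCP tools, but it shows the register shift we applied across the server.

```python
# Hypothetical tool description (not from the client's MCP server) showing the
# same compression pattern applied to tool metadata.
VERBOSE = (
    "Use this tool to create a new booking. Please make sure that you always "
    "provide the attendee's email address, the desired start time in ISO 8601 "
    "format, and the duration in minutes. If any of these are missing, you "
    "should ask the user for them before calling this tool."
)

COMPRESSED = (
    "Create booking. Required: attendee_email, start_time (ISO 8601), "
    "duration_minutes. Any missing → ask user BEFORE calling."
)

# Rough word-level comparison; for the real measurement use the model's tokeniser.
print(len(VERBOSE.split()), "->", len(COMPRESSED.split()), "words")
```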
Results
Same eval set, same model, before vs after:
- Overall Quality: flat
- Answer Relevancy: small positive uptick
- Input tokens: down ~35%
- Latency: down ~22%
Same quality. Faster. Substantially fewer input tokens. Every user-facing metric moved in the right direction, or not at all.
The latency lift is bigger than the token math alone would predict because prefill (the LLM’s input-processing phase) scales roughly linearly with input length, and the longest-prompt cases were the ones that benefitted most.
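A toy latency model makes the relationship visible. The constants below are illustrative assumptions rather than our measured numbers; the point is that only the prefill term shrinks, and it shrinks roughly in proportion to the input cut, which is why the longest-prompt turns gain the most.

```python
# Toy latency model; every constant is an illustrative assumption.
PREFILL_TOK_PER_S = 3_000   # prefill throughput (scales ~linearly with input length)
DECODE_TOK_PER_S  = 50      # output generation speed
OUTPUT_TOKENS     = 120     # typical assistant reply

def turn_latency(input_tokens: int) -> float:
    return input_tokens / PREFILL_TOK_PER_S + OUTPUT_TOKENS / DECODE_TOK_PER_S

before = turn_latency(10_000)              # long system prompt + tools + history
after  = turn_latency(int(10_000 * 0.65))  # ~35% fewer input tokens
print(f"{before:.2f}s -> {after:.2f}s  ({1 - after / before:.0%} faster)")
```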
What this means for you and your team
Input tokens are operational cost. Shrinking them is dollar-for-dollar savings at whatever per-token price your provider quotes you: no model swap, no accuracy trade-off, no infrastructure change. It’s a one-time change that keeps paying, because every turn of every session, now and in the future, pays the smaller bill.
The latency drop matters in its own right. Users wait less. A 22% reduction in latency in a conversational product is the difference between “snappy” and “people start typing again before it replies.”
“A 35% cost reduction through caveman prompting is a good reminder that the biggest wins won't always come from bigger models; sometimes, they'll come from rethinking the obvious. Experiments like this only work with strong foundations: strong evaluations, product operations, and a willingness to move quickly and take risks on new ideas.”
Vihan Patel, Head of AI Solutions, Mantel
The caveats worth knowing
- Preserve what matters. Rule markers, tool names, enum values, concrete examples. Compression that deletes load-bearing instructions is worse than no compression.
- Measure with the real tokeniser. Word count isn’t tokens, especially in compressed text, which uses more punctuation per word. Trust the actual model’s count (see the sketch after this list).
- Eval before shipping. Same quality is the only acceptable bar. We wouldn’t have rolled this out without the numbers above.
- Convention beats individual discipline. Once you’ve compressed, the files stay compressed only if your team has a shared rulebook and a review habit that catches drift.
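Here’s the kind of measurement we mean, using the before/after rule from earlier and tiktoken as a stand-in tokeniser; swap in whichever tokeniser or token-counting endpoint your provider actually uses.

```python
# Count tokens with a real tokeniser rather than by eye. tiktoken covers OpenAI
# models; other providers expose their own token-counting APIs.
import tiktoken

BEFORE = (
    "When you encounter ambiguous instructions, missing required information, "
    "or multiple valid interpretations, you MUST ask clarifying questions "
    "BEFORE proceeding. Never guess or assume."
)
AFTER = (
    "Ambiguous input, missing required info, or multiple valid interpretations "
    "→ MUST ask clarifying question BEFORE proceeding. Never guess."
)

enc = tiktoken.get_encoding("o200k_base")  # tokeniser used by recent OpenAI models
b, a = len(enc.encode(BEFORE)), len(enc.encode(AFTER))
print(f"before: {b} tokens, after: {a} tokens, saved: {1 - a / b:.0%}")
```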
“If you're running an LLM in production and haven't audited your system prompt for filler, you're paying roughly a third of your input tokens to say "please" to a model that can't be offended.”
Kenny Pearson, ML Engineer, Mantel
About this work
If your team is building LLM-backed products and hitting the wall where every optimisation compounds (model choice, prompt design, caching strategy, eval pipelines), this is the kind of work we do at Mantel. Get in touch.