How Mantel turned a $4-per-prompt demo into a profitable B2B product.
By Rico Wismans | Senior Data & AI Consultant, Mantel
“The first 90% of building an agent takes a weekend. The last 10% – making it reliable and production-ready – is the rest of your roadmap.” Harrison Chase (CEO, LangChain)
At Mantel, scaling an autonomous workforce to thousands of B2B SaaS users proved this quote entirely true. While our initial prototype felt like magic, the unit economics painted a grim picture: $4 per interaction and 20+ seconds of latency.
Surviving that “last 10%” required abandoning the smartest, most expensive frontier models for smaller, faster alternatives. This strict financial constraint became our greatest strategic advantage. It forced a fundamental rethink of every workflow step, replacing raw compute with disciplined engineering.
Here is our playbook for scaling agentic systems for speed, reliability, and enterprise ROI.
Executive summary
Key takeaways for the C-suite
- Constraint breeds innovation: Forcing engineers to use smaller, cheaper models leads to better architectural breakthroughs than simply “throwing more compute” at a problem.
- Engineering as discovery: The old sequential change model cannot absorb unpredictable AI evolution. Developers and PMs must validate technical feasibility before design even starts.
- The investment portfolio strategy: AI delivery is probabilistic. Dedicated experimentation is not a distraction from delivery; it is the ultimate form of risk mitigation.
- Agent-ready APIs: “AI readiness” isn’t about buying a smarter model; it’s about refactoring your backend infrastructure so machines can interact with it efficiently.
- The AI testing imperative: Stop paying engineers to fix broken test suites for volatile AI features. Rely on mocked integration and deterministic unit tests until your product is stable.
The 5 Lessons
Lesson 1
Deconstructing the Monolith (A Strategy Beyond Frontier Models)
In software engineering, the pendulum often swings from monoliths to microservices, and back to consolidated architectures. Agentic AI is experiencing this exact evolution at warp speed.
Initially, our AI was a monolith: a massive “frontier” model (e.g., GPT-5.2, Opus 4.6) handling every task. This is the equivalent of hiring a PhD physicist to assemble IKEA furniture: expensive, slow, and prone to overthinking.
We then tried a “microservices” approach: a multi-agentic system. It failed. Agent-to-Agent (A2A) communication introduced unnecessary friction, hallucinated hand-offs, and a compounding latency tax for every “agent hop.”
The breakthrough was AI re-monolithisation. We consolidated back to a single, central core agent for tool execution. But instead of reverting to the massive frontier models of our first phase, we kept the cheaper, lightning-fast models from our microservices experiment.
- The Tool Filter: Instead of confusing the agent with 50+ tools, a lightweight classifier hands the core only the exact tools it needs for the specific prompt.
- Named Entity Recognition (NER): LLMs are probabilistic; databases are deterministic. We use ML models to extract exact IDs before the core agent even wakes up.
Innovation requires constraints. Challenge engineering teams to solve hard problems using small, cheap models to drive true architectural breakthroughs.
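To make the pre-processing idea concrete, here is a minimal sketch of the two deterministic stages that run before the core agent sees a prompt. All names (`select_tools`, `extract_ids`, the tool catalogue, the `INV-` ID pattern) are illustrative assumptions, not Mantel's actual implementation; a real system would use a trained classifier and NER model rather than keywords and a regex.

```python
# Hypothetical sketch: deterministic pre-processors run before the core
# agent, so a small, cheap model receives only relevant tools and exact IDs.
import re

# Illustrative tool catalogue: tool name -> trigger keywords
ALL_TOOLS = {
    "get_invoice": ["invoice", "billing", "charge"],
    "get_customer": ["customer", "account", "client"],
    "reset_password": ["password", "login", "reset"],
}

def select_tools(prompt: str, max_tools: int = 2) -> list[str]:
    """Lightweight classifier: hand the core agent only the tools
    relevant to this prompt instead of the full 50+ catalogue."""
    text = prompt.lower()
    scored = [
        (sum(kw in text for kw in keywords), name)
        for name, keywords in ALL_TOOLS.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0][:max_tools]

def extract_ids(prompt: str) -> dict[str, str]:
    """Deterministic entity extraction: pull exact IDs (a regex here,
    a trained NER model in practice) before the LLM even wakes up."""
    match = re.search(r"\b(INV-\d+)\b", prompt)
    return {"invoice_id": match.group(1)} if match else {}

prompt = "Why was invoice INV-1042 charged twice on my account?"
tools = select_tools(prompt)   # a short, relevant tool list
ids = extract_ids(prompt)      # exact IDs, resolved deterministically
```

Because both stages are cheap and deterministic, the expensive probabilistic step only ever sees a trimmed tool list and pre-resolved identifiers.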
Lesson 2
The Inversion of Product Design
In traditional SaaS, the workflow is linear:
Design → Spec → Build.
With Agentic AI, this sequence introduces friction. Because LLMs are probabilistic, it is nearly impossible to write a rigid specification or design a UI for a capability that hasn’t been technically validated. If product and design lead discovery, they often design beautiful interfaces for workflows that the AI cannot reliably execute.
To move faster, we inverted the lifecycle, adopting a “capability-first” approach:
Prototype → Design → Spec → Build.
In a landscape where agentic frameworks and model capabilities change weekly, the prototype becomes the specification. This requires a fundamental shift in organisational roles. Product managers can no longer simply write requirements and wait for design; they must become hands-on in the AI sandbox.
Using prompt playgrounds and AI-assisted prototyping, PMs must work alongside engineering to validate the “physics” of the AI workflow. Only after the team proves the model can actually execute the logic does design step in to shape the user experience around those hard constraints.
Inverting the software lifecycle:
| Development Phase | Traditional SaaS Workflow | Agentic AI Workflow (Capability-First) |
| --- | --- | --- |
| Step 1 | Design (UI/UX) | Prototype (Prompt playgrounds) |
| Step 2 | Specification | Design (Shaped around AI constraints) |
| Step 3 | Build | Specification |
| Step 4 | | Build |
Lesson 3
Treat Innovation Like a Diversified Portfolio
Traditional software delivery is deterministic. AI delivery is probabilistic – returns are never guaranteed. Yet, enterprise customers still demand to see value delivered within a strict timeframe.
Managing an agentic project like a linear software delivery roadmap inevitably leads to missed deadlines. To balance immediate delivery with the non-deterministic nature of AI, we adopted an investment portfolio strategy. Capacity was divided into two distinct buckets to align with customer timelines while explicitly pricing in “invisible” ROI:
- Core delivery: Tweaking existing tools and logic to guarantee the project hits immediate customer milestones.
- Strategic exploration: Time-boxed architectural experiments that may not serve the current sprint, but map out tomorrow’s capabilities.
It is tempting to cut experimentation time to meet a tight deadline. However, doing so sacrifices two massive forms of unpriced value:
- The option to seamlessly pivot and build upon a new, better solution in the future.
- The internal muscle engineering teams develop by staying at the bleeding edge of AI innovation.
In a landscape where the tech stack changes weekly, dedicating sprint capacity to these strategic options is not a distraction from delivery; it is the ultimate risk mitigation strategy.
Lesson 4
API Strategy: Building for Agents, Not Humans
An AI is only as smart as its weakest API. Legacy SaaS APIs were designed for human UIs: permissive with inputs and verbose with outputs. While frontends hide visual noise, AI agents read (and pay for) every single word.
To scale an autonomous workforce, teams must transition from “human-ready” to “agent-ready” APIs across three architectural pillars:
- Token Economy (Strict Filtering): Returning a 500-line JSON payload to extract a single email address maxes out the context window and skyrockets inference costs. Endpoints must support strict field filtering.
- Schema Standardisation: If a CRM uses `customer_id` and your billing system uses `clientID`, the agent burns billable tokens trying to translate the difference. Aligning data dictionaries prevents the model from wasting compute.
- Semantic Error Handling: A generic `400 Bad Request` causes AI hallucinations and endless retry loops. APIs must return explicit instructions, e.g. `Error: Invalid Date Format. Required: YYYY-MM-DD`, so the LLM can autonomously self-correct.
Ultimately, “AI readiness” is less about the model’s intelligence and entirely about refactoring infrastructure for autonomous machine interaction.
Lesson 5
The “Automated Eval” Illusion
State management is a notorious bottleneck in traditional testing, but Agentic AI amplifies this friction. A single intent can be phrased ten different ways, and a prompt that passes today might fail tomorrow.
Allowing the hype around “LLM-as-a-Judge” to drive premature optimisation destroys ROI. While the product is volatile, rely on the AI Testing Pyramid:
- Deterministic unit tests (base): Verify underlying tools work perfectly. This is standard software engineering, but non-negotiable for AI.
- Integration tests (middle): Verify the agent’s reasoning by mocking tool responses, completely bypassing the state management overhead.
- Automated E2E evals (peak): Only invest in complex sandbox testing once the schema and logic are stable.
Building automated evaluations too early means paying engineers to fix broken test suites instead of shipping the product.
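The pyramid's lower two layers can be sketched in a few lines. Everything here is illustrative (the `lookup_order` tool, the toy `agent_answer` loop, and the mock): the point is that the base layer tests the tool deterministically, while the middle layer swaps in a canned tool response to exercise the agent's handling with no live state to set up or tear down.

```python
# Hypothetical sketch of the AI Testing Pyramid's lower layers.

def lookup_order(order_id: str) -> dict:
    """A real tool (base layer): covered by deterministic unit tests."""
    return {"order_id": order_id, "status": "shipped"}

def agent_answer(question: str, tool) -> str:
    """Toy stand-in for the agent loop: call a tool, phrase a reply.
    Taking the tool as a parameter is what makes mocking trivial."""
    order = tool("ORD-7")
    return f"Order {order['order_id']} is {order['status']}."

# Base layer: deterministic unit test of the tool itself.
assert lookup_order("ORD-7")["status"] == "shipped"

# Middle layer: integration test with the tool mocked out, bypassing
# all live state management while still verifying the agent's handling.
def mock_tool(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delayed"}

assert "delayed" in agent_answer("Where is my order?", mock_tool)
```

Only once the tool schemas and agent logic stabilise does it pay to add the expensive top layer of sandboxed end-to-end evals.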
Conclusion: The Truth About Scaling Predictable AI
The transition from an expensive prototype to a profitable enterprise product is never achieved by finding a “better prompt.” It is achieved by applying disciplined software engineering to a stochastic technology.
Waiting for Artificial General Intelligence (AGI) to arrive via the next massive frontier model ignores a fundamental reality: the next major leap in AI will come from cleverer engineering. The explosive rise of open-source agent frameworks like OpenClaw proves that massive value is unlocked not by the underlying LLM, which is rapidly commoditising, but by the proprietary, autonomous harness built around it.
Furthermore, innovation requires friction. Providing teams with infinite budgets breeds lazy architecture. Organisations should enforce hard constraints around token usage, model size, latency, and costs to drive true architectural breakthroughs.
To scale predictably, organisations must stop treating AI as a magic wand and manage it like a junior employee – requiring strict boundaries, clear instructions, and agent-ready APIs. Real enterprise value is captured by building the invisible infrastructure that makes the magic efficient.
FAQs
What is an “agent-ready” API? An agent-ready API is optimised for Large Language Models. It features strict data filtering to minimise token usage, standardised schemas across platforms, and semantic error handling that gives the AI explicit instructions on how to self-correct a failed request.
Why do multi-agent AI systems often fail in production? Multi-agent systems (where multiple AI models talk to each other) often fail at scale due to compounding latency and “hallucinated hand-offs.” Consolidating to a single, cheaper core model surrounded by deterministic pre-processors is often faster and more reliable.
How should you test an AI agent during early development? Avoid building expensive end-to-end automated evaluations too early. Instead, focus on deterministic unit tests for your tools and use mocked integration tests to verify the agent’s reasoning without the overhead of live state management.