By Jim Hortle | Lead ML Engineer, Mantel

Executive summary

Key takeaways for business leaders

Implementing voice functionality for your agentic application comes with unique challenges not typically encountered elsewhere. Wrapping your existing text-based agent stack in a voice engine is unlikely to succeed unless your agents can respond in under one second. While UI improvements help provide a fluid voice experience for users, the investment most likely to pay off is upfront effort on a decomposed, modular approach to building agentic applications that prioritises efficient response times.

To ensure a successful outcome and realise value from your investment:

  • Understand and validate your use-case, ensuring it is well-mapped out and that voice is indeed a core part of your business’ roadmap. Make sure that shared, quantifiable metrics are established and tracked so that engineering effort and customer value can be directly targeted. Demoing functionality is impressive, but demonstrating value is worthwhile.
  • Invest in solid agentic foundations to enable voice functionality while simultaneously reducing friction for this and future use-cases. Voice agent applications require architectural and framework considerations not found in traditional chatbots or IVR systems. Architecting for speed, observability and parallelism is fundamental so that your voice application can achieve a time-to-first-token of one second or less.
  • Understand that with new functionality comes new ways of unlocking value. Conversational approaches come with their own set of design challenges but can shift the paradigm for customer interactions. Embrace these challenges to ensure that you are prepared for upcoming innovations such as speech-to-speech.

Jump to the end of this article for some practical takeaways that you can start using today, or, for an in-depth, technical discussion of the points raised here, see the accompanying post on the Mantel Community Hub.

Enterprise voice AI: the shift from chatbots to conversation

As enterprise Generative AI reaches maturity, customer expectations are shifting. Users no longer just want turn-based text chatbots; they are demanding fluid, human-like, real-time voice interactions.

However, enabling voice functionality is rarely as simple as taking a “plug-and-play” approach. Simply wrapping your existing text-based agent in a speech-to-text and text-to-speech engine is a recipe for user frustration and wasted development effort. Implementing enterprise-grade voice comes with unique challenges, and to succeed, organisations must achieve sub-one-second response times.

In parallel, many CIOs and CTOs struggle to articulate the value derived from these initiatives, as noted in another recent post in this blog. MIT research found that 95% of AI initiatives fail to demonstrate value, for a variety of reasons; chief among them are a poorly understood use-case and weak alignment to business and user outcomes.

Based on our recent deployments, here is a practical guide for business leaders to avoid common pitfalls, architect for true conversation, and realise genuine value from their Voice AI investments.

 

What is enterprise voice AI?

Enterprise voice AI refers to AI-powered voice agents deployed within business applications to automate or assist customer and staff interactions. Unlike traditional interactive voice response (IVR) systems, which follow rigid decision trees, voice AI agents use large language models to understand natural speech, manage back-and-forth dialogue, and respond dynamically. The core technical challenge is achieving a response time (time-to-first-token) of under one second to meet user expectations.

1. Validate your use-case and measure what matters

Because most of us converse daily, it is easy to rely on “gut feel” to guide how a voice application should work. But for enterprise applications, rigorous use-case validation is essential.

Map out your customer journey and pinpoint the specific stages where voice will be most effective. The best voice use-cases target the rapid exchange of simple information and involve only low-stakes decision-making. Leave complex, data-heavy, or non-reversible decisions to specialist human operators or advanced text-based agents; the careful deliberation those decisions require will make a voice experience feel sluggish.

Once you have identified a high-impact, low-risk use case, you must focus on the metrics. The golden metric for voice AI is time-to-first-token (TTFT), which measures the time between when a user stops speaking and when the agent begins its response. End-users are highly unlikely to tolerate a delay longer than one second, and will quickly disengage from laggy calls. Ensure your engineering and business teams establish clear, shared baselines for these latency and product metrics before development begins, so efforts can be directly targeted at optimising speed and linked back to business outcomes.
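To make the metric concrete, here is a minimal sketch of how TTFT might be measured against any agent that streams its reply token by token. The `measure_ttft` helper and the `fake_agent` generator are hypothetical stand-ins, not a real library API; in production the timer would start at the end-of-speech event from your voice activity detector.

```python
import time

def measure_ttft(agent_stream):
    """Measure time-to-first-token for a streaming agent response.

    `agent_stream` is any iterator yielding response tokens. The timer
    starts when the user's speech ends (modelled here as the moment this
    function is called) and stops at the first yielded token.
    """
    start = time.monotonic()
    first_token = next(agent_stream)  # blocks until the agent emits something
    ttft = time.monotonic() - start
    return first_token, ttft

# Stand-in generator for an agent's streamed reply:
def fake_agent():
    yield "Sure,"
    yield " let me check that for you."

token, ttft = measure_ttft(fake_agent())
print(f"first token {token!r} after {ttft * 1000:.1f} ms")
```

Logging this value on every turn, alongside product metrics, is what lets you tie engineering work on latency directly back to user disengagement rates.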

2. Build an architecture ready for voice

You cannot expect a monolithic text agent to handle the rigours of real-time conversation. Throwing every tool, prompt, and chat history log into a single agent takes too much processing time for a voice application.

Invest upfront in a decomposed, modular architecture; this is the change most likely to improve your end result. By breaking apart a monolithic agent into smaller, single-responsibility subagents, you can observe exactly which tasks are causing delays. This modularity allows your team to optimise subagent configuration and allocate computing resources precisely where they are needed, significantly reducing response times.

Furthermore, voice requires processing conversation management (like understanding interruptions) and data querying in parallel. Architecting for speed, observability, and parallelism not only helps you achieve that crucial sub-one-second response time, but it also prepares your business for upcoming innovations like advanced speech-to-speech models.
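The parallelism described above can be sketched with `asyncio`. The subagents here are hypothetical placeholders (a real system would call models or backend services), but the shape is the point: running conversation management and data querying concurrently means a turn's latency approaches the slowest subagent rather than the sum of all of them.

```python
import asyncio

# Hypothetical single-responsibility subagents; each sleep stands in
# for a model call or backend lookup.
async def manage_conversation(utterance: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for turn/interruption handling
    return "user finished a complete request"

async def query_data(utterance: str) -> str:
    await asyncio.sleep(0.08)  # stand-in for a backend/RAG lookup
    return "account balance: $42.00"

async def handle_turn(utterance: str) -> str:
    # Run both subagents concurrently: total latency is roughly
    # max(0.05, 0.08) seconds, not their sum.
    dialogue_state, data = await asyncio.gather(
        manage_conversation(utterance),
        query_data(utterance),
    )
    return f"{dialogue_state}; {data}"

print(asyncio.run(handle_turn("what's my balance?")))
```

Because each subagent is a separate awaitable, it is also straightforward to wrap each one in its own timing and tracing, which is where the observability benefit of decomposition comes from.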

3. Design for human conversation, not chat

Human conversation is noisy, messy, and non-linear. If you try to force users down the brittle, discrete paths of a traditional IVR system, you will simply recreate a frustrating experience using newer technology.

Instead, leverage the messiness of human conversation to your advantage. Design your agents to handle interruptions gracefully, manage back-and-forth dialogue, and ask open-ended questions. Putting the user in the driver’s seat allows them to navigate to their solution much faster, skipping lengthy triage steps.

Finally, you must keep the user engaged. Paradoxically, humans expect an AI voice system to have an immediate answer, even though human operators routinely ask for a moment to search their systems. To overcome this, use audio cues or play background office sounds while the agent processes complex queries. If the call is web-based, display visual indicators that show whether the agent is “listening,” “thinking,” or “talking”.
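One way to wire up those status indicators is to run a lightweight status task alongside the slow query and cancel it once the agent is ready to speak. This sketch uses hypothetical names (`emit_status`, `answer_with_feedback`) and prints events where a real web client would push them over a websocket:

```python
import asyncio

async def emit_status(state: str):
    # In a web client this would push "listening"/"thinking"/"talking"
    # events to the UI; here we simply print them periodically.
    while True:
        print(f"status: {state}")
        await asyncio.sleep(0.2)

async def answer_with_feedback(query: str) -> str:
    status = asyncio.create_task(emit_status("thinking"))
    try:
        await asyncio.sleep(0.5)  # stand-in for a slow backend query
        return "Here is what I found."
    finally:
        status.cancel()  # stop the indicator once the agent can talk
```

The same pattern works for audio: swap the printed status for a looping filler sound that is stopped the moment the first response token arrives.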

 

Comparison table: Traditional IVR vs Enterprise voice AI agents

|  | Traditional IVR | Enterprise voice AI agent |
| --- | --- | --- |
| Dialogue style | Scripted, linear | Open-ended, adaptive |
| Input handling | Menu selections or simple keywords | Natural speech and full sentences |
| Interruption handling | Limited or none | Designed to handle interruptions |
| Response time expectation | Variable | Under one second (TTFT) |
| Architecture | Monolithic | Modular, decomposed |
| Best suited for | Simple routing tasks | Conversational, low-complexity exchanges |

 

Practical next steps

Voice AI can genuinely transform how customers interact with your organisation, but only if you treat it as a first-class capability rather than an add-on feature.

  • Be ruthless about where voice adds value: Map customer journeys and pick low-risk, high-impact slices where voice is a necessity, not just a gimmick.
  • Make metrics your source of truth: Track and optimise your response latency to keep delays under one second and application metrics to demonstrate business value.
  • Invest in a modular, parallelised architecture: Decompose large agents into smaller, parallelised components to unlock the speed required for real-time conversation and prepare for future upgrades.
  • Leverage messy human interaction: Build resilient systems that can handle interruptions, noisy environments, and non-linear flows that put the user in the driver’s seat.
  • Keep users engaged: Use clear status indicators and natural acknowledgement phrases to bridge processing pauses and maintain customer trust.

Frequently asked questions:

What is enterprise voice AI?

Enterprise voice AI refers to AI-powered systems that enable natural spoken conversation between users and business applications. Unlike traditional IVR systems, they use large language models to understand open-ended speech and manage dynamic, back-and-forth dialogue.

What is time-to-first-token, and why does it matter for voice agents?

Time-to-first-token (TTFT) measures the delay between when a user finishes speaking and when the agent begins its response. For voice applications, users become frustrated with delays longer than one second, making TTFT the primary performance metric to optimise.

What types of tasks are best suited to voice AI?

Voice AI works best for fast, low-complexity exchanges, such as retrieving simple information, answering common questions, or routing users quickly. It is less well-suited to data-heavy tasks, multi-step decisions, or situations requiring irreversible actions.

How does a modular architecture improve voice agent performance?

Breaking a monolithic agent into smaller, single-responsibility components allows tasks to run in parallel rather than sequentially. This reduces overall latency and makes it easier to identify and fix the specific component causing delays.

How do voice AI agents differ from text-based chatbots?

Voice AI agents must handle the natural messiness of spoken language, including interruptions, false starts, and open-ended phrasing. They also face a strict latency requirement that text-based agents do not, and require additional infrastructure for conversation management and audio processing.

How should organisations measure the success of a voice AI initiative?

Set shared baselines for TTFT and establish application-level metrics tied to business outcomes before development begins. Demonstrating measurable value, such as reduced handling time or improved resolution rates, is more useful than showcasing functionality alone.