Lab / Case Study

Voice AI in production.

What twelve to eighteen months shipping voice AI to Fortune 500 enterprises taught me about demos, adoption, integration surface, and kill criteria.

Case Study · PM lead, end-to-end 12–18 month arc
The challenge
Real-time voice agents for Fortune 500 contact centers. Sub-second interruption handling, multilingual code-switching, reliable tool execution against enterprise back-office systems. Generic voice-bot platforms failed all three.
AI architecture
Real-time voice runtime with LLM core, deterministic tool layer, telephony partner integration, multilingual prompt stack. Three verticals shipped, one airline deployment live in production, multi-million-dollar pipeline.
Business impact
Mid-seven-figure annual contact-center spend addressable, with 70-80% of inbound volume on repeatable intents. Demo-to-adoption gap closed by treating integration surface as the primary qualification signal, not stack benchmarks.

Context

Over roughly twelve to eighteen months I led product for a voice AI stack inside an enterprise platform company. The product was real-time voice agents for large enterprises. A customer calls, the agent handles the conversation end to end, executes transactions, hands off when it can't. My scope was the full product surface. Stack choices, demo scripts, customer-facing POCs, integration playbook, multilingual deployment, and launch.

Problem I was solving

Large enterprises wanted voice agents that could actually replace a human on defined intents. The bar was specific. Sub-second interruption handling, multilingual fluency including code-switching, reliable tool execution against their back-office systems. Generic voice bot platforms didn't clear any of the three. Incumbents were IVR-era stacks with LLM glue. New entrants demoed well and fell over under real telephony load.

The pain was concentrated in high-volume inbound centers. A mid-seven-figure annual contact-center spend, with 70 to 80 percent of it going to repeatable intents a voice agent should own. Every quarter of delay was real money on the floor.

What I considered

Three architecture paths.

Build the full stack ourselves. Our own ASR, our own turn-taking, our own TTS integration. Attractive because we'd own latency end to end with no vendor risk. Rejected because the team was small and the frontier on turn-taking was moving faster than we could chase. Going custom would have cost us a year we didn't have.

Wrap a voice bot platform. Attractive because time to demo was a week. Rejected because the platforms abstracted away the exact things I needed control over. Interruption behavior, prompt architecture, streaming boundaries. When we stress-tested them with real enterprise conversations, the abstraction leaked and we couldn't fix it from our layer.

Compose a stack. A leading real-time voice runtime for turn-taking and ASR, a frontier open-weight LLM for reasoning, a premium TTS provider for voice. Wired together at our orchestration layer. We controlled the prompt, the tools, the state machine. The runtime vendor owned the hard real-time bits. This was what I shipped.
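
If it helps to see the seam, here is a sketch of the orchestration loop under that composition. Everything in it is a placeholder, not the actual vendor SDKs, and it compresses a lot; the point is which side of the loop each responsibility sits on.

```python
# Orchestration-seam sketch, not production code. Every object here (runtime,
# llm, tts, agent) is a placeholder for a vendor SDK or an internal service.

def handle_call(runtime, llm, tts, agent):
    """One call at the seam we owned: prompt, tools, and state machine on our side;
    turn-taking, ASR, interruption, and playback on the runtime's side."""
    state = agent.initial_state()
    for event in runtime.events():              # runtime owns barge-in and turn boundaries
        if event.type != "user_turn_final":
            continue                            # partial transcripts never reach our layer
        prompt = agent.build_prompt(state)      # we own the prompt architecture
        reply = llm.respond(prompt, event.text)
        while reply.tool_calls:                 # we own the deterministic tool layer
            results = [agent.tools[c.name].execute(c.args) for c in reply.tool_calls]
            reply = llm.continue_with(results)
        runtime.speak(tts.synthesize(reply.text))   # runtime owns playback and interruption
        state = agent.advance(state, event.text, reply.text)
        if state.stage == "handoff":
            runtime.transfer_to_human()
            break
```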

In parallel, a go-to-market call. Depth in one vertical, or breadth across three. Depth compounds reference value. Breadth gives three shots at signal, three pipelines, three logos. I chose breadth. Three demo agents, three industry activity tables as acceptance criteria. The bet was that the technical stack was the solved problem and remaining variance was vertical fit.

What I shipped

The composed stack went into production behind a telephony partner layer. We owned the agent configuration: prompts, tool registry, call stages, guardrails, voice settings. We did not own the telephony carrier or the on-prem SIP stack. That was a deliberate boundary.

Prompt architecture used a single system prompt per agent, no tool calling for demos, with conditionals for language mode and call stage. Tools were MCP servers against customer systems. Turn-taking, interruption, and call state were handled by the runtime natively. I wrote the stack reference myself so every demo and customer engagement used the same primitives and nothing drifted between agents.
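
A minimal sketch of that pattern, assuming hypothetical section names, stage labels, and an illustrative MCP endpoint; the production prompt carried far more than this, but the shape is the same.

```python
# Single-system-prompt pattern: one base prompt, sections gated by conditionals.
# Section contents, stages, languages, and the MCP endpoint are illustrative.

STAGE_SECTIONS = {
    "identify": "Confirm the caller's booking reference before taking any action.",
    "resolve": "Only state transaction outcomes that came back from a tool.",
    "handoff": "Summarize the call so far and transfer to a human agent.",
}

LANGUAGE_SECTIONS = {
    "en": "Respond only in English.",
    "bilingual": "Mirror the caller's language turn by turn; never switch unprompted.",
}

TOOL_REGISTRY = {
    # tool name -> MCP server that fronts the customer's back-office system
    "rebook_flight": "mcp://customer-systems/rebooking",
}


def build_system_prompt(base: str, call_stage: str, language_mode: str) -> str:
    """One prompt per agent; conditionals decide which sections are included."""
    return "\n\n".join([base, STAGE_SECTIONS[call_stage], LANGUAGE_SECTIONS[language_mode]])


demo_prompt = build_system_prompt(
    base="You are a contact-center voice agent. Keep every turn short and interruptible.",
    call_stage="identify",
    language_mode="en",
)
```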

For launch, three verticals, three flagship agents, a demo script each. The plan: demos open conversations, POCs close them, references compound.

What happened

Airline adoption went better than planned. A Fortune 500 airline committed to a POC inside the first month, moved to paid production inside the quarter, and is live with the agent handling a material share of inbound volume. Performance is holding in prod. That deal alone was the validation the product needed.

The other two verticals did not convert the same way. Hospitality demos landed in the room and then the pipeline slowed. Telecom demos opened a long RFP cycle with a regional operator in a market where code-switching between English and a low-resource language was mandatory. That POC is progressing but on a very different timeline and with very different technical demands than the airline deal.

The pattern I didn't see coming. Segments with clean intent taxonomies and high contact center spend converted fast. Segments where the voice agent had to negotiate across multiple back-office systems without a clean intent boundary stalled. Not because the voice worked worse, but because the integration surface multiplied. The voice was never the blocker. The tools behind the voice were.

A second surprise. The telephony partner layer became the single biggest deal risk. I had rejected owning telephony because it was a scope boundary. That call was right. But I underestimated how often a partner's SIP integration timeline would sit on the critical path instead of ours. We had multiple conversations where our agent was ready and the integration was four weeks behind. That friction ate into deal momentum in ways I couldn't fix from our side.

A third surprise. Prompt drift in non-English deployments. Our first bilingual prompt, built for a regional airline engagement, and the telecom POC hit the same failure mode. The model drifted into English mid-conversation when reference text was not gated behind language conditionals. Solvable. I re-architected the prompt with explicit language gates, and drift collapsed. But it should have been the default from day one. I had assumed a frontier LLM would hold register from a few examples. It did not hold under real call pressure.
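
To make the fix concrete, a before-and-after sketch of what "gated behind language conditionals" means. The reference text and language labels are placeholders.

```python
# Illustrative only. The failure mode: reference text for every language is always
# in the prompt, so the model slides back into English under real call pressure.

REFERENCE = {
    "en": "Reference phrasing, policies, and product names written in English: ...",
    "local": "The same reference material written in the local language: ...",
}


def drifting_prompt(active_language: str) -> str:
    # Before: all reference blocks concatenated regardless of the active language.
    return "\n\n".join([f"Active language: {active_language}", *REFERENCE.values()])


def gated_prompt(active_language: str) -> str:
    # After: only the active language's block enters the prompt, with an explicit
    # instruction never to switch languages unless the caller does.
    return "\n\n".join([
        f"Respond only in {active_language}. Do not switch languages unless the caller does.",
        REFERENCE[active_language],
    ])
```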

A fourth, smaller surprise. Demo scripts initially supported tool calling. We removed it after the first two demos showed the tool-call round trip blew the latency budget on real networks. Demos now use a single system prompt, no tools. Tool execution became its own product surface, demonstrated separately.

Metrics, ranges only

Conversion from demo to paid POC ran 40 to 60 percent in the airline vertical, 0 to 20 percent in the other two over the first 90 days. Latency to interruption held sub-second in English, 2 to 3x target in some non-English conditions before the prompt fix. Drop-off inside POC was minimal in airline, noticeably higher in telecom before language gates were rebuilt. Two paying enterprise customers in production, a multi-million-dollar deal pipeline, and an airline deployment performing in prod.

What I learned as a PM

Demo quality is not adoption. We had demo parity across three verticals. Conversion was nowhere near parity. The real variable was integration surface area per intent. I now screen opportunities on that ratio before the demo, not after the POC stalls.
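
The screen itself is simple arithmetic; a crude sketch below, with invented intents and systems, just to show the ratio I now look at before committing a demo.

```python
# Crude screening sketch. Intents and systems are made up; the ratio is the point.

def integration_surface_per_intent(intents: dict[str, set[str]]) -> float:
    """Average number of back-office systems a voice agent must touch per intent."""
    return sum(len(systems) for systems in intents.values()) / len(intents)


clean_taxonomy = {                       # converts fast: one system per intent
    "rebooking": {"reservations"},
    "baggage_status": {"baggage"},
}
tangled_taxonomy = {                     # stalls: every intent fans out across systems
    "plan_change": {"billing", "crm", "provisioning"},
    "outage_report": {"network_ops", "crm", "ticketing"},
}

for name, intents in [("clean", clean_taxonomy), ("tangled", tangled_taxonomy)]:
    print(name, integration_surface_per_intent(intents))
```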

Scope boundaries hold value, but partner dependencies need their own playbook. Not owning telephony was correct. Not having a named engineering owner on the partner's side for each deal was a mistake. I would formalize a joint integration contract at deal start, not at POC.

Multilingual voice needs prompt architecture, not prompt examples. The cheapest way to find this out is to run a bilingual conversation on day one of the stack, not at the first non-English POC. I would put a bilingual evaluation harness into the core stack before the first demo script next time.
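
A minimal version of that harness, assuming a scripted set of bilingual turns, an agent endpoint, and a language-ID function; all three are placeholders here, not the production setup.

```python
# Minimal bilingual evaluation harness sketch. agent_reply and detect_language are
# placeholders for the real agent endpoint and a language-identification model.
from typing import Callable


def bilingual_hold_rate(script: list[tuple[str, str]],
                        agent_reply: Callable[[str, str], str],
                        detect_language: Callable[[str], str]) -> float:
    """Fraction of scripted turns where the reply stayed in the expected language."""
    held = 0
    for user_turn, expected_language in script:
        reply = agent_reply(user_turn, expected_language)
        if detect_language(reply) == expected_language:
            held += 1
    return held / len(script) if script else 1.0


# Alternate languages mid-script to put code-switching pressure on the prompt.
SCRIPT = [
    ("I need to move my flight to tomorrow morning.", "en"),
    ("<the same request phrased in the local language>", "local"),
]
```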

Breadth bets need explicit kill criteria. Launching three verticals was defensible. Not setting "if segment X has no paid POC by day N, we reallocate" was the mistake. Without that, breadth silently turns into distraction. Next time I write the kill criteria before the demo scripts.
