Anatomy of a Sub-400ms STT→LLM→TTS Pipeline

February 23, 2026

The One Second Rule

At the one-second mark, users begin to disengage; the conversation simply breaks. Human speech has a rhythm. When you ask someone a question and they pause for more than 800ms before responding, your brain starts to file it as awkward. At 1.2 seconds, you start wondering if they heard you. At two seconds, you reach for the "end call" button.

Voice AI that can’t match that rhythm is broken, regardless of how many parameters its model has or how many benchmarks it passed.

At PolarGrid, our current pre-production benchmarks on Ada Lovelace architecture show 364ms p50 end-to-end latency, measured from audio input to audio output across real network round-trip time. With our Blackwell deployment underway, we are projecting a p50 of sub-300ms. This post explains the architecture behind both numbers and the path between them.

The Latency Budget

We started with a 300ms production ceiling and worked backwards, allocating budget to each stage before writing a line of code.

The breakdown we designed for Blackwell:

  • STT: 80ms. Transcription fast enough that the LLM can begin processing immediately.
  • LLM: 100ms. Time-to-first-token: TTS does not need to wait for a complete response before starting.
  • TTS: 70ms. Time-to-first-byte of audio: streaming synthesis runs in parallel with continued LLM generation.
  • Network: 30ms. Round-trip: the speed-of-light tax.
  • Overhead: 20ms. VAD onset, serialization between stages, and WebRTC framing overhead.
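The budget above can be written down as a few lines of code and checked mechanically. This is only a sketch; the names and the enforcement style are illustrative, not our production tooling:

```python
# Hypothetical encoding of the Blackwell latency budget described above.
BUDGET_MS = {
    "stt_final_transcript": 80,   # transcription done, LLM can start
    "llm_first_token": 100,       # time-to-first-token, not the full response
    "tts_first_byte": 70,         # streaming synthesis starts here
    "network_rtt": 30,            # edge-node round trip
    "overhead": 20,               # VAD onset, serialization, WebRTC framing
}

def check_budget(budget: dict[str, int], ceiling_ms: int = 300) -> int:
    """Sum the per-stage allocations and fail loudly if a stage
    change pushes the total past the production ceiling."""
    total = sum(budget.values())
    if total > ceiling_ms:
        raise ValueError(f"budget {total}ms exceeds {ceiling_ms}ms ceiling")
    return total
```

Keeping the budget in code means any proposed stage regression shows up as a failing check rather than a surprise in production dashboards.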

On Ada Lovelace today, we land at 364ms p50: STT at 148ms, LLM at 96ms, TTS at 117ms. Blackwell’s higher memory bandwidth, improved tensor core throughput, and enhanced FP8 quantization support are what close the gap to 300ms.

The Three Stages

Stage 1: Speech-to-Text (Ada: 148ms | Blackwell target: 80ms)

We run Whisper large-v3-turbo for transcription. The turbo variant hits a specific inflection point: it handles real-world speech (accents, background noise, incomplete sentences) without the latency penalty of the full large model.

Our approach separates voice detection from transcription entirely. We run Silero VAD continuously in the background on the CPU. It’s cheap, highly accurate, and leaves all GPU headroom strictly for inference. As the user speaks, we build an audio buffer in memory. The exact millisecond Silero detects the end of speech, that complete, clean audio chunk is sent to the STT model via the Gateway.

We don't waste time spinning up Whisper until we know the user has finished their thought. On queries under ten words, this clean handoff keeps our STT latency tight, and because the audio boundary is clearly defined, the LLM starts with perfect context rather than correcting speculative transcription errors mid-flight.

Stage 2: LLM Inference (Ada: 96ms | Blackwell target: already met)

The LLM stage is where most voice pipelines give up. They reuse the serving infrastructure built for chatbots, but chatbots optimize for throughput over latency. With servers on the other side of the continent, these teams then wonder why they are sitting at 800ms time-to-first-token.

The other problem is batching. Standard LLM serving (vLLM's continuous batching, TensorRT-LLM's static batching) groups requests together to maximize GPU utilization. This is the right design for email summarization. For voice, batching is the enemy. Every millisecond your request sits in a queue waiting for other requests is a millisecond of the user's patience burned.

We run Llama 3.1 8B via Triton with a dedicated serving configuration optimized for single-request latency over aggregate throughput. The memory footprint is ~18GB of VRAM, leaving headroom to serve concurrent voice sessions without model contention. Our measured TTFT on this stack is 96ms.
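A Triton model configuration tuned this way might look roughly like the following config.pbtxt fragment. The model name, backend, and instance count are illustrative, not our production values:

```
# Hypothetical Triton config.pbtxt favoring single-request latency.
name: "llama31_8b"
backend: "tensorrtllm"
max_batch_size: 1              # no request ever waits in a batching queue
instance_group [
  { count: 2, kind: KIND_GPU } # concurrency via extra instances, not batching
]
# dynamic_batching is deliberately omitted: for voice, queueing
# requests to fill a batch trades user-perceived latency for throughput.
```

The key choice is serving concurrent sessions with additional model instances rather than batched requests, which is what keeps TTFT flat as load grows.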

There are two things that make 8B the right size for this application. First, it's fast enough to hit sub-100ms TTFT on Ada Lovelace architecture. Second, voice AI typically operates with tightly scoped system prompts and limited context windows; you don't need the reasoning depth of a 70B model for "schedule a meeting on Thursday" or "what's the status of my order." The failure mode for voice agents is "the model was too slow," rather than "the model wasn't smart enough."

On Blackwell, we are also rolling out support for 70B+ models for workloads where reasoning depth matters more than raw speed, giving us a different operating point with a different latency profile. Having already met the LLM target with our existing Ada architecture, we have room to support larger models.

The streaming handoff matters as much as the model itself. In a standard pipeline, you wait for the LLM to generate an entire paragraph before synthesizing audio. That is how you get 2-second delays.

We pipeline generation and synthesis by anchoring to natural speech boundaries. As the LLM streams tokens, we buffer them in memory. We don't trigger TTS on the first raw token, since synthesizing half a word sounds incredibly jarring to the user. Instead, we use a "Fast Start" heuristic. We scan the incoming token stream for the first natural pause, such as a comma, a colon, or the end of a short opening phrase.

The moment we hit that boundary, we dispatch that sentence chunk to Kokoro. Synthesis begins. The LLM continues generating the rest of the response in the background while the user is already hearing the beginning of the reply. The effective user-perceived latency is anchored to that first generated phrase.
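A minimal sketch of the Fast Start heuristic, assuming a simple punctuation-based boundary check; the boundary set and minimum chunk length are illustrative tuning knobs, not our production values:

```python
BOUNDARY_CHARS = {",", ":", ";", ".", "!", "?"}
MIN_CHUNK_CHARS = 4   # avoid firing on a stray leading punctuation mark

def fast_start_chunks(token_stream):
    """Yield text chunks at natural pause boundaries.

    The first yield is the "Fast Start" chunk dispatched to TTS while
    the LLM keeps generating; later chunks follow at each boundary.
    """
    buf = []
    for token in token_stream:
        buf.append(token)
        text = "".join(buf)
        if len(text) >= MIN_CHUNK_CHARS and text.rstrip()[-1:] in BOUNDARY_CHARS:
            yield text
            buf = []
    if buf:                      # flush whatever remains at end of stream
        yield "".join(buf)
```

Feeding it a streamed reply like "Sure, I can schedule that. Thursday works." emits "Sure," first, so synthesis starts on the opening phrase while the rest of the sentence is still being generated.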

Stage 3: Text-to-Speech (Ada: 117ms | Blackwell target: 70ms)

Rather than waiting for the complete token stream, synthesis starts as soon as it has the first sentence chunk: enough tokens to produce the first 200ms of audio.

We run Kokoro, an 82-million-parameter TTS model. The size is intentional: for conversational AI, the latency cost of a 500M+ parameter TTS model outweighs its quality gains. Kokoro delivers 117ms time-to-first-byte on Ada Lovelace and supports a range of human voices. On Blackwell, we project that number coming in under our 70ms budget.

We stream audio output immediately. The user starts hearing the response while the LLM is still generating tokens and Kokoro is still synthesizing the tail of the response. The effective user-perceived latency is anchored to that first audio byte.

The Architecture That Makes It Work

Model choices are table stakes. When teams hit latency walls or build systems that can't scale, the root cause is usually architectural.

We use a deliberate three-tier architecture that isolates stateful orchestration from stateless inference: Tier 1 handles WebRTC and orchestration via our Voice Agent, Tier 2 is an API Gateway, and Tier 3 is pure GPU muscle served by NVIDIA Triton Inference Server.

If the orchestrator and the models are tightly coupled in the same memory space, scaling becomes a nightmare. By modularizing the pipeline, we can independently scale GPU backends based on workload and dynamically route STT, LLM, and TTS requests through the Gateway. The overhead of an internal cluster network hop is a few milliseconds, a negligible tax for the ability to hot-swap models, enforce rate limits at the Gateway (running securely in Kata containers), and prevent a heavy LLM load from starving WebRTC audio processing.
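The routing idea can be sketched as follows; the stage names, endpoint addresses, and round-robin policy are illustrative assumptions, not our Gateway's implementation:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    endpoints: list[str]   # Triton instances serving this stage
    _next: int = 0

    def pick(self) -> str:
        """Round-robin across instances. Scaling a stage is just
        appending endpoints; the orchestrator never changes."""
        ep = self.endpoints[self._next % len(self.endpoints)]
        self._next += 1
        return ep

# Illustrative topology: each stage scales independently behind the Gateway.
ROUTES = {
    "stt": Backend("whisper-large-v3-turbo", ["gpu-a:8001"]),
    "llm": Backend("llama-3.1-8b", ["gpu-b:8001", "gpu-c:8001"]),
    "tts": Backend("kokoro-82m", ["gpu-a:8002"]),
}
```

Because the orchestrator only sees stage names, an LLM backend can be hot-swapped or scaled out under load without touching the WebRTC tier.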

The Speed-of-Light Tax

There is a constraint that no amount of model optimization can overcome: physics.

Fiber optic cable transmits data at approximately 200km per millisecond. A user in Vancouver connecting to a data center in Virginia faces roughly 50-80ms of one-way network latency before any processing begins. Add the return trip, and you have spent 100-160ms on geography before the first model has touched a single token.
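The physics floor falls out of a one-line calculation. The distance figure below is an approximate great-circle number for illustration; real routes add switching, queuing, and non-ideal fiber paths, which is why observed latency sits well above this floor:

```python
FIBER_KM_PER_MS = 200   # light in fiber covers roughly 200km per millisecond

def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over ideal straight-line fiber.
    Real-world routing pushes actual RTT well past this number."""
    return 2 * distance_km / FIBER_KM_PER_MS

# Vancouver to Virginia is roughly 3,800km great-circle (approximate).
vancouver_virginia_floor = min_rtt_ms(3800)   # ~38ms, before any routing
```

Even the ideal-fiber floor is more than the entire 30ms network allocation in our budget, which is the whole argument for edge colocation.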

Our current infrastructure deployment includes inference nodes in Toronto, Montreal, Vancouver, and Washington. When a user connects, they hit the nearest edge node. Their round-trip network latency is 30ms instead of 160ms. That is a 130ms reduction that no software optimization can replicate.

Our current p50 of 364ms increases to 494ms when we centralize inference. Sub-300ms requires edge colocation of the full inference stack.

Where We're Going

364ms proves the architecture. Blackwell closes the remaining gap. Higher memory bandwidth accelerates Whisper’s audio processing. Better tensor core throughput drops Kokoro’s synthesis latency. FP8 quantization keeps 8B well inside the 100ms TTFT ceiling even as we add 70B support on the same infrastructure for higher-complexity workloads.

The architecture stays constant across both hardware generations: budget-first latency allocation, streaming handoffs at every stage, all three models co-located on edge nodes close to users, continuous VAD running cheaply on CPU, and Triton as the unified serving layer with real-time observability.

The 300ms target is the floor for a voice experience that users do not notice. The goal is a conversation that feels like a conversation, where users stop thinking about the AI and start thinking about what they are trying to accomplish.


By Sev Geraskin, Co-Founder and VP of Engineering, PolarGrid