Inside an AI phone call: how Twilio, Deepgram, Cartesia, and Claude make the line sound human

2026-05-07 · 9 min read · By The booth73 team · Tags: technical, voice-ai, deepgram, cartesia, twilio, claude, latency

A typical booth73 call ends with a transcript in your inbox and a charge of about 79 cents. Between the first ring and that email, five services (Twilio, Vapi, Deepgram, Claude, and Cartesia) pass audio and text to each other in a tight loop that aims to respond within about 350 milliseconds of the recipient finishing a sentence. This post walks through what happens in that loop and why it matters for how human the call feels.

The pipeline at a glance

When a booth73 call starts, the audio path looks like this:

Recipient's phone
  ↓ (PSTN, audio)
Twilio carrier gateway
  ↓ (SIP, real-time audio)
Vapi voice orchestrator
  ↓ (audio chunks, ~20ms each)
Deepgram (Nova-2)
  ↓ (text tokens as they're recognized)
Claude (Sonnet 4.6)
  ↓ (text tokens as they're generated)
Cartesia (Sonic-2 multilingual)
  ↓ (audio chunks, streaming)
Vapi voice orchestrator
  ↓ (SIP, real-time audio)
Twilio carrier gateway
  ↓ (PSTN, audio)
Recipient's phone

Two things make this work: streaming end-to-end (no service waits for the full message before passing what it has), and first-token latency budgeting (every service has a target for how fast it produces the first chunk of output).
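To make the streaming idea concrete, here is a minimal sketch using Python generators. The stage functions are stand-ins invented for illustration, not the real Claude or Cartesia clients; the point is only that each stage hands on what it has as soon as it has it.

```python
from typing import Iterator, List


def run_pipeline(reply: str) -> List[str]:
    """Run a toy LLM -> TTS -> playback chain and return the event order."""
    events: List[str] = []

    def llm_tokens() -> Iterator[str]:
        # Stand-in for a streaming LLM: one token at a time.
        for token in reply.split():
            events.append(f"llm:{token}")
            yield token

    def tts_chunks(tokens: Iterator[str]) -> Iterator[bytes]:
        # Stand-in for streaming TTS: one audio chunk per token.
        for token in tokens:
            events.append(f"tts:{token}")
            yield token.encode("utf-8")

    # Playback pulls chunks lazily, so nothing upstream runs ahead of it.
    for chunk in tts_chunks(llm_tokens()):
        events.append(f"play:{chunk.decode('utf-8')}")
    return events


events = run_pipeline("table for two")
```

In the resulting event list, "play:table" appears before "llm:two": the first word is already being played while the last one has not yet been generated.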

The latency budget

A natural human conversation has a turn-taking cadence of roughly 200ms — the gap between when one speaker stops and the other starts. Anything longer than 800ms feels stilted. Anything under 100ms is impossible to hit because of network round-trips alone. So the design target for an AI phone call is somewhere in the 300–500ms range from when the recipient stops speaking to when the AI's first phoneme hits their ear.

booth73 targets 350ms, near the low end of that range. Here's how the budget is split across stages:

Stage	Budget	Reality (typical)
Twilio → Vapi (SIP)	30ms	25–40ms
Deepgram (final transcript)	80ms	60–120ms
Claude (first token)	150ms	200–500ms
Cartesia (first audio chunk)	50ms	40–60ms
Vapi → Twilio → recipient	40ms	30–60ms
Total	350ms	355–780ms
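The arithmetic behind the table can be checked with a short script. The numbers are copied from the table above; the helper names are illustrative, not part of booth73's code.

```python
from typing import List, Tuple

# (stage, budget_ms, typical_low_ms, typical_high_ms), copied from the table.
STAGES: List[Tuple[str, int, int, int]] = [
    ("Twilio -> Vapi (SIP)", 30, 25, 40),
    ("Deepgram (final transcript)", 80, 60, 120),
    ("Claude (first token)", 150, 200, 500),
    ("Cartesia (first audio chunk)", 50, 40, 60),
    ("Vapi -> Twilio -> recipient", 40, 30, 60),
]


def totals(stages: List[Tuple[str, int, int, int]]) -> Tuple[int, int, int]:
    """Sum the budget, best-case, and worst-case columns."""
    return (
        sum(s[1] for s in stages),
        sum(s[2] for s in stages),
        sum(s[3] for s in stages),
    )


def never_within_budget(stages: List[Tuple[str, int, int, int]]) -> List[str]:
    """Stages whose best typical case is already over budget."""
    return [name for name, budget, low, _ in stages if low > budget]


def share_of_spread(stages: List[Tuple[str, int, int, int]], name: str) -> float:
    """Fraction of the total low-to-high spread contributed by one stage."""
    spreads = {n: high - low for n, _, low, high in stages}
    return spreads[name] / sum(spreads.values())


print(totals(STAGES))                # (350, 355, 780)
print(never_within_budget(STAGES))   # ['Claude (first token)']
print(round(share_of_spread(STAGES, "Claude (first token)"), 2))  # 0.71
```

Claude's first token is the only stage whose best typical case exceeds its budget, and it accounts for about 71% of the gap between the best and worst totals.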

Claude is the variable. The rest of the pipeline is steady. This is also why model choice matters more than people realize: switching from Sonnet to Haiku cuts ~150ms off first-token latency at the cost of conversational depth, which is the trade booth73 explicitly does NOT make. We're optimizing for "this AI sounds smart" not "this AI replies fast."

The trick is that streaming keeps everything after the first token off the critical path. By the time Claude's third token comes back, Cartesia has already started speaking the first one. By the time Cartesia finishes the first sentence, Claude has the whole reply queued up. The recipient hears one short gap before the reply starts and then continuous speech; the time Claude spends generating the rest of the reply is never audible.
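A rough back-of-the-envelope model shows why this matters. The reply length and inter-token gap below are assumed values chosen for illustration, and the model simplifies by letting TTS start on the very first token.

```python
def first_audio_ms(
    first_token_ms: int,
    tts_first_chunk_ms: int,
    n_tokens: int,
    inter_token_ms: int,
    streaming: bool,
) -> int:
    """Time from LLM request to first audio chunk, in milliseconds."""
    if streaming:
        # TTS starts as soon as the first token arrives.
        return first_token_ms + tts_first_chunk_ms
    # Without streaming, TTS waits for the last token of the reply.
    return first_token_ms + (n_tokens - 1) * inter_token_ms + tts_first_chunk_ms


# Assumed: a 40-token reply at 20ms per token, 300ms first token, 50ms TTS.
streamed = first_audio_ms(300, 50, n_tokens=40, inter_token_ms=20, streaming=True)
buffered = first_audio_ms(300, 50, n_tokens=40, inter_token_ms=20, streaming=False)
print(streamed, buffered)  # 350 1130
```

Under these assumptions, buffering the full reply would add 780ms before the recipient hears anything, which on its own is close to the 800ms point where a pause starts to feel stilted.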

Voice activity detection — the hardest part

The pipeline above describes one turn. The harder problem is detecting when a turn ends.

Humans don't speak in cleanly bounded chunks. We pause mid-sentence to think. We trail off. We say "uh" and "like" and "you know." A naive system would either (a) jump in too early and interrupt the recipient, or (b) wait too long after they stop and feel mechanical.

booth73 uses Vapi's built-in voice activity detection (VAD), which combines silence detection with semantic-completion heuristics. The model isn't just listening for silence — it's also asking "does this look like a complete thought?" If the recipient says "I'd like to make a reservation for…" and pauses, VAD waits longer than if they say "Reservations are full tonight." Both are silences, but only one is the end of a turn.

Getting this wrong is the single biggest reason AI calls feel robotic. Vapi's VAD is good but not perfect; we tune the silence threshold per language because Japanese and Korean have different turn-taking rhythms than English.
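The idea of combining silence with a "complete thought" check can be sketched as follows. This is not Vapi's algorithm: the threshold values, the English-only word list, and the doubling rule are all invented placeholders that show the shape of the heuristic.

```python
import re

# Hypothetical per-language base silence thresholds, in milliseconds.
BASE_SILENCE_MS = {"en": 500, "ja": 650, "ko": 600}

# Crude English-only signal that the speaker is mid-thought: the utterance
# ends on a connective, a filler, a comma, or an ellipsis.
INCOMPLETE_ENDING = re.compile(
    r"(?:\b(?:for|to|and|but|or|the|a|an|of|with|uh|um)|\.\.\.|…|,)\s*$",
    re.IGNORECASE,
)


def looks_incomplete(transcript: str) -> bool:
    return bool(INCOMPLETE_ENDING.search(transcript.strip()))


def silence_threshold_ms(transcript: str, language: str) -> int:
    """Wait longer when the utterance does not look like a finished thought."""
    base = BASE_SILENCE_MS.get(language, 500)
    return base * 2 if looks_incomplete(transcript) else base


def turn_ended(transcript: str, language: str, silence_ms: int) -> bool:
    return silence_ms >= silence_threshold_ms(transcript, language)


print(turn_ended("I'd like to make a reservation for…", "en", 700))  # False
print(turn_ended("Reservations are full tonight.", "en", 700))       # True
```

The same 700ms of silence ends the turn in one case and not the other, which is the behavior described above. A production system would use a learned model rather than a word list.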

Why these specific vendors

Every layer of the stack is replaceable in principle. The choices in production reflect specific tradeoffs.

Twilio for PSTN telephony because it's the most reliable. We've tested Bandwidth, Telnyx, and Plivo; for outbound to consumer numbers worldwide, Twilio's reachability and call quality are still the benchmark. The price premium (~$0.014/minute) buys uptime.

Vapi as the voice orchestrator because rebuilding the SIP-to-WebSocket-to-streaming-LLM glue layer is months of work and they've spent two years on it. Their endCallFunctionEnabled and firstMessageMode: "assistant-speaks-first" settings save us from reimplementing call control. Disclosure: booth73 runs on top of Vapi.

Deepgram Nova-2 for STT because it's the only enterprise model that does sub-100ms streaming transcription with diarization across the languages we support. Whisper is more accurate but it's not designed for real-time streaming; AssemblyAI is good but slower. We're watching Speechmatics and may switch eventually.

Claude Sonnet 4.6 as the reasoning model because it's better than the alternatives at the specific failure modes that matter on a phone call: not making things up, deferring to "let me check with the user" when off-script, and ending the call cleanly when the objective is met. We tested GPT-4o and Gemini 2.5 Pro; when pushed, both invented reservation details that weren't in the script.

Cartesia Sonic-2 for TTS because it's currently the best-in-class for the combination of (a) latency under 60ms, (b) multilingual coverage with a single voice (no separate voice per language), and (c) prosody that doesn't sound flat. ElevenLabs sounds slightly warmer in English but their multilingual cadence is uneven; OpenAI's TTS is fast but the voices are limited. Sonic-2 is the only option that lets us use one voice ID — Sarah — across English, Japanese, French, German, Spanish, Portuguese, Italian, Korean, and Mandarin.

Multilingual is an under-appreciated property

When booth73 calls a hotel in Tokyo, the entire call is in Japanese — opener, conversation, sign-off — and the same Cartesia voice that you'd hear on an English call handles it natively. No language switching, no accent artifacts, no awkward pronunciation of place names.

This matters because the alternative, a different TTS voice per language, creates a trust problem. Customer service reps and front desks immediately register that an "American voice" is unusual on a call in Japan, and that triggers suspicion. A native-cadence Japanese voice saying "I'm calling on behalf of Mr. Chen" comes across as an English speaker's assistant who happens to speak Japanese. Subtle but real.

The technical magic: Cartesia's Sonic-2 is trained on multilingual data with shared prosodic embeddings, so the voice's identity (timbre, pace, cadence patterns) is decoupled from the language being synthesized. We pass in raw UTF-8 Japanese text and the model produces Japanese phonemes in Sarah's voice without any explicit language flag. The transcriber knows it's Japanese (we tell Deepgram via the language parameter) but the synthesizer doesn't need to.
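In configuration terms, the asymmetry looks roughly like this. The dictionary layout, key names, and voice ID are assumptions made for illustration and should be checked against the vendor docs; only the idea (language goes to the transcriber, not the voice) comes from the description above.

```python
from typing import Any, Dict

# Hypothetical placeholder; the real voice ID is not given in this post.
SARAH_VOICE_ID = "sarah-voice-id-placeholder"


def build_call_config(language_code: str) -> Dict[str, Any]:
    """Build per-call settings: only the transcriber is told the language."""
    return {
        "transcriber": {
            "provider": "deepgram",
            "model": "nova-2",
            "language": language_code,  # STT needs an explicit language
        },
        "voice": {
            "provider": "cartesia",
            "voiceId": SARAH_VOICE_ID,  # same voice for every language
        },
    }


english = build_call_config("en")
japanese = build_call_config("ja")
```

The two configs differ only in the transcriber's language field; the voice section is identical for both calls.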

The system prompt

The text passed to Claude isn't just the user's script. Every booth73 call carries a hardcoded prompt structure:

You are placing a phone call on behalf of {caller_name}.
Be polite at all times — this is a hard constraint.

LANGUAGE: Conduct this entire call in {language_name}.

IDENTITY: You are an AI assistant representing {caller_name}.
If asked who you are, say you are {caller_name}'s virtual assistant.
Never claim to be human.

OBJECTIVE: {one-line objective the user filled in}

SCRIPT / DIRECTIONS: {user's free-form instructions}

RECORDING DISCLOSURE: This call is being recorded so we can email
a transcript afterward. If asked, answer honestly. If the recipient
objects, end the call politely.

RULES:
- Speak naturally; do not read the script verbatim.
- Use a calm, measured, professional tone. Match the recipient's
  energy. Never be more enthusiastic than they are.
- NEVER agree to charges, sign up for services, accept Terms of
  Service, or commit {caller_name} to any contract.
- NEVER share sensitive info about {caller_name}.
- If the recipient indicates they're a minor, end politely.
- If the recipient becomes hostile, end within one polite sentence.
- TIME BUDGET: this call has approximately {N} minutes of paid
  time. Pace the conversation accordingly.

The OBJECTIVE and SCRIPT / DIRECTIONS sections are the only parts the user controls. Everything else is invariant — every booth73 call carries the same identity discipline, the same tone constraints, the same safety floor. This is how we keep the assistant from being weaponized.
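One way to picture that split is a template whose fixed text lives in code and whose user-controlled fields are filled per call. The sketch below uses an abbreviated version of the prompt; the function and parameter names are hypothetical.

```python
# Abbreviated version of the prompt above; the fixed text is never
# supplied by the user.
PROMPT_TEMPLATE = (
    "You are placing a phone call on behalf of {caller_name}.\n"
    "LANGUAGE: Conduct this entire call in {language_name}.\n"
    "IDENTITY: You are an AI assistant representing {caller_name}.\n"
    "Never claim to be human.\n"
    "OBJECTIVE: {objective}\n"
    "SCRIPT / DIRECTIONS: {script}\n"
    "TIME BUDGET: this call has approximately {minutes} minutes of paid time.\n"
)


def build_prompt(
    caller_name: str,
    language_name: str,
    objective: str,
    script: str,
    minutes: int,
) -> str:
    """Fill the fixed template; objective and script are the user's text."""
    return PROMPT_TEMPLATE.format(
        caller_name=caller_name,
        language_name=language_name,
        objective=objective,
        script=script,
        minutes=minutes,
    )


prompt = build_prompt(
    caller_name="Mr. Chen",
    language_name="Japanese",
    objective="Book a table for two on Friday.",
    script="Ask for a window seat. Literal braces like {caller_name} stay as typed.",
    minutes=5,
)
```

Because str.format does not re-scan substituted values, braces typed into the script are passed through as plain text rather than expanded as template fields.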

What gets logged

Every call writes a row to SQLite with: started_at, ended_at, phone, caller_name, language, objective, script, first_message, transcript, summary, duration_seconds, cost_cents, card_code, client_ip, recording_url, and a metadata JSON blob with the TCPA attestation paper trail and the use-case classification.
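As a sketch, a table with those columns might be created like this. The column names come from the list above; the column types, the id primary key, and the metadata key names are assumptions.

```python
import json
import sqlite3
from typing import Any, Dict

# Column names from the list above; the types are assumed.
COLUMN_TYPES = {
    "started_at": "TEXT", "ended_at": "TEXT", "phone": "TEXT",
    "caller_name": "TEXT", "language": "TEXT", "objective": "TEXT",
    "script": "TEXT", "first_message": "TEXT", "transcript": "TEXT",
    "summary": "TEXT", "duration_seconds": "INTEGER", "cost_cents": "INTEGER",
    "card_code": "TEXT", "client_ip": "TEXT", "recording_url": "TEXT",
    "metadata": "TEXT",  # JSON blob stored as text
}


def create_calls_table(conn: sqlite3.Connection) -> None:
    cols = ", ".join(f"{name} {ctype}" for name, ctype in COLUMN_TYPES.items())
    conn.execute(f"CREATE TABLE calls (id INTEGER PRIMARY KEY, {cols})")


def insert_call(conn: sqlite3.Connection, row: Dict[str, Any]) -> int:
    """Insert a call row, serializing the metadata dict to JSON."""
    row = dict(row)
    row["metadata"] = json.dumps(row.get("metadata", {}))
    names = [c for c in COLUMN_TYPES if c in row]
    marks = ", ".join("?" for _ in names)
    cur = conn.execute(
        f"INSERT INTO calls ({', '.join(names)}) VALUES ({marks})",
        [row[c] for c in names],
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
create_calls_table(conn)
call_id = insert_call(
    conn,
    {"language": "ja", "cost_cents": 79, "metadata": {"use_case": "reservation"}},
)
stored = conn.execute(
    "SELECT language, cost_cents, metadata FROM calls WHERE id = ?", (call_id,)
).fetchone()
```

Keeping the attestation and classification data in a JSON column means new fields can be added without a schema migration, at the cost of not being able to index them directly.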

Vapi sends an end-of-call-report webhook when a call terminates; we use that to settle the row, debit the card, and email the transcript. If the webhook is dropped (it happens), our _recover_call_from_vapi helper polls Vapi's GET /call/{id} endpoint when the user opens the lookup page or admin view — same settlement logic, just triggered by user activity instead of webhook receipt.

Idempotency is enforced via a metadata.outcome_email_sent_at flag, so a delayed webhook that arrives after a recovery can't send the transcript a second time.
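The flag check can be sketched as below. Only the outcome_email_sent_at name comes from the description above; the function, its arguments, and the in-memory row are hypothetical, and a real implementation would need the check and the update to happen atomically in the database.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


def settle_call(
    row: Dict[str, Any],
    report: Dict[str, Any],
    send_email: Callable[[Dict[str, Any]], None],
) -> bool:
    """Settle a call once; return False if it was already settled."""
    metadata = row.setdefault("metadata", {})
    if metadata.get("outcome_email_sent_at"):
        return False  # webhook and recovery both land here after the first run
    row["transcript"] = report["transcript"]
    send_email(row)
    metadata["outcome_email_sent_at"] = datetime.now(timezone.utc).isoformat()
    return True


sent: List[str] = []
row: Dict[str, Any] = {"id": 1, "metadata": {}}
report = {"transcript": "Hello, I'm calling on behalf of Mr. Chen."}

# Recovery path runs first, then the delayed webhook arrives, then a retry.
results = [
    settle_call(row, report, lambda r: sent.append(r["transcript"]))
    for _ in range(3)
]
print(results, len(sent))  # [True, False, False] 1
```

Whichever path reaches the row first does the work; every later caller sees the flag and returns without sending.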

What can go wrong

The failure modes that occur in the wild, in rough order of frequency:

The recipient hangs up on the AI immediately. Not a bug, but it's worth knowing — about 12% of cold calls get a "no thanks" + click within 8 seconds.
Voicemail picks up. Vapi's voicemail-detection module catches this most of the time and ends the call. When it misses, the AI ends up leaving a brief message.
Background noise breaks STT. Restaurants, construction sites, busy lobbies. Deepgram's noise reduction helps but can't recover speech that was never clearly audible.
The recipient code-switches mid-call. Someone starts in English, switches to Spanish. Deepgram is configured for one language per call; the second language transcribes as garbage. Rare but catastrophic when it happens.
Vapi or one of the underlying services has an outage. Twilio uptime is famously good; Vapi's is good but not perfect. We've had two outage windows in six months that resulted in failed calls, both refunded automatically.

Where to go next

If you want to actually use this stack:

How to give Claude a phone number — the practical install
Comparison: AI voice calling APIs — when booth73 is the right layer vs. building your own on Vapi
/llms.txt — the agent quickstart with code examples

If you're curious about the underlying components, the vendor docs are excellent: Vapi, Cartesia, Deepgram, Anthropic.

— The booth73 team