
Voice — Best Practices

Practical guidance from teams shipping v3 voice bots. Read once before launch, revisit before any voice-related change.

Pick the right runtime mode

If your goal is… | Pick
Production voice bot, broad provider/language support, full custom voice access | Text & TTS (pipeline)
Lowest possible latency for a short demo, with a small voice catalog | Realtime Audio
Anything multilingual or with a custom-cloned voice | Text & TTS (pipeline); Realtime can't use your custom voices

Most production deployments stay on Text & TTS for the lifetime of the bot. Don't switch modes unless you have a specific reason.

Pick the right TTS provider

Provider | Sweet spot
Yellow AI (default) | Broad coverage, full custom-voice support. Right default for most bots.
ElevenLabs | Best-in-class English voice quality. Premium voices for high-stakes brand interactions.
MiniMax | Multilingual presets including Mandarin. Good for Asia-Pacific deployments.

Don't switch providers per agent within the same bot if you can avoid it — keeps the brand voice consistent.

Pick the right STT (speech recognition)

Yellow AI is the default and right for most cases. Switch only when:

  • Your audience speaks a language Yellow handles less well — try Deepgram, Sarvam (Indic), or Microsoft Azure.
  • You're getting consistently bad transcripts despite good audio — try a different provider as a diagnostic.

Test STT with real recordings of your audience, not just your own voice. Accents, background noise, and domain vocabulary all matter.

Cloning a voice

  • Record clean. A 5-second clean clip beats a 30-second noisy one. Quiet room, good mic, single take.
  • Read the suggested sample text. It's tuned for clean clones.
  • Name for the voice character, not the use case. "Aria — warm support EN" beats "Customer Support Voice."
  • Test cloned voices in the Voice Playground and on a real phone before assigning to a production bot.

VAD tuning

Default values (threshold 0.85, prefix 300 ms, silence 500 ms) are right for most calls. Don't tune speculatively.

When you do tune:

  • Slow / non-native speakers → raise silence duration to 700–800 ms.
  • Fast-paced sales / outbound → drop silence duration to 300 ms.
  • Noisy environments → raise threshold to 0.9.

Tune one knob at a time, retest, and document why you changed it.
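
The "one knob at a time, document why" rule can be sketched in a few lines of Python. This is an illustration only: the VadSettings class, its field names, and the tune helper are hypothetical and are not part of the product's configuration API. Only the default values and the suggested adjustments come from this section.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VadSettings:
    # Defaults quoted in this section; the field names are hypothetical.
    threshold: float = 0.85
    prefix_ms: int = 300
    silence_ms: int = 500


# Every change is recorded with its reason so it can be reviewed later.
change_log: list[str] = []


def tune(base: VadSettings, reason: str, **change) -> VadSettings:
    """Return a copy of `base` with exactly one knob changed, logging the reason."""
    if len(change) != 1:
        raise ValueError("tune one knob at a time")
    updated = replace(base, **change)
    change_log.append(f"{change}: {reason}")
    return updated


DEFAULT = VadSettings()
slow_speakers = tune(DEFAULT, "callers cut off mid-sentence", silence_ms=750)
noisy_floor = tune(DEFAULT, "background noise triggering turns", threshold=0.9)
```

Keeping the settings immutable means each experiment starts from a known baseline, and the log gives you the "why" when someone asks months later.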

Voice instructions vs bot identity

  • Bot identity says what the agent does and is.
  • Voice instructions say how it speaks — pace, language, tone delivery.

Don't repeat identity in voice instructions. Layer, don't duplicate.

Latency

Conversational voice feels broken above ~2 seconds end-to-end (user finishes speaking → user hears reply start). Aim for under 1.5 seconds.

Where latency hides:

  • Slow model — large prompts, slow model choice. Trim the system prompt and try a faster model.
  • Slow KB lookup — too many results, too-aggressive history concatenation. Tune the KB tool.
  • Slow TTS — provider choice, voice choice. Some voices are slower to start than others.
  • VAD too patient — silence duration high. Drop it.
  • Telephony round-trip — fixed cost; can't tune from the bot config.

Measure before you optimize. Isolate model latency from TTS latency by swapping each in turn and re-testing with the Voice → Telephony Web Call.
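
A rough way to apply "measure before you optimize" is to record a per-stage timing for one turn and compare the total against the 1.5-second target and the ~2-second ceiling above. The sketch below assumes you already have per-stage timings from your own instrumentation. The stage names and the sample numbers are invented for illustration and are not measurements of any real deployment.

```python
def latency_report(
    stages_ms: dict[str, float],
    target_ms: float = 1500,
    broken_ms: float = 2000,
) -> tuple[float, str, str]:
    """Sum per-stage latencies and name the slowest stage.

    Returns (total_ms, slowest_stage, verdict).
    """
    total = sum(stages_ms.values())
    slowest = max(stages_ms, key=stages_ms.get)
    if total < target_ms:
        verdict = "ok"
    elif total <= broken_ms:
        verdict = "over target"
    else:
        verdict = "feels broken"
    return total, slowest, verdict


# Hypothetical timings for one turn, in milliseconds.
sample_turn = {
    "vad_silence": 500,
    "model": 700,
    "kb_lookup": 300,
    "tts_first_audio": 250,
    "telephony": 150,
}
total_ms, slowest_stage, verdict = latency_report(sample_turn)
```

The slowest stage is where to look first; re-run the same report after each single change to confirm it actually helped.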

Tool-call acknowledgements (_voice_ack)

The other big place latency hides is the silent gap on tool calls. When the model dispatches a tool call (KB lookup, workflow, API call), the tool may take a second or two to return. Without anything to play, TTS goes silent, and to a voice user that silence reads as "the bot is broken."

v3 voice handles this automatically. On voice channels, the runtime decorates every tool's parameter schema with a _voice_ack field — a short spoken phrase the LLM is expected to fill in whenever it dispatches a tool ("Let me check that for you", "One moment while I pull that up"). The runtime emits the _voice_ack value to TTS immediately, then strips it from the tool's actual arguments so the tool implementation never sees it.

You don't configure _voice_ack — it's automatic on voice channels. But:

  • If the model skips the field (uncommon, but happens), the runtime falls back to your bot's static toolCallFiller. Make sure you've set a sensible fallback ("One moment.") in the voice config.
  • The acknowledgement is one short utterance per tool call, not a sentence per argument. If a tool routinely takes more than 3 seconds, follow up the ack with a second message inside the tool's response stream rather than relying on the ack alone to cover the wait.
  • Realtime Audio mode (OpenAI Realtime) handles this differently — see Voice on Nexus for the realtime-mode notes.
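
To make the mechanism concrete, here is a minimal sketch of the decorate, speak, and strip flow described above. It is not the runtime's actual code. It assumes tool parameters are described with a JSON-Schema-style dict, and the function names decorate_schema and split_ack are hypothetical. The fallback argument stands in for the bot's static toolCallFiller.

```python
import copy

ACK_FIELD = "_voice_ack"


def decorate_schema(schema: dict) -> dict:
    """Return a copy of a tool's parameter schema with the ack field added."""
    decorated = copy.deepcopy(schema)
    decorated.setdefault("properties", {})[ACK_FIELD] = {
        "type": "string",
        "description": "Short phrase to speak while the tool runs.",
    }
    return decorated


def split_ack(arguments: dict, fallback: str) -> tuple[str, dict]:
    """Separate the spoken ack from the arguments the tool should receive.

    If the model left the field out (or empty), use the static fallback.
    """
    tool_args = {k: v for k, v in arguments.items() if k != ACK_FIELD}
    ack = arguments.get(ACK_FIELD) or fallback
    return ack, tool_args


kb_schema = {"type": "object", "properties": {"query": {"type": "string"}}}
voice_schema = decorate_schema(kb_schema)

# Model filled in the ack: speak it, pass only the real arguments to the tool.
ack, args = split_ack(
    {"query": "refund policy", "_voice_ack": "Let me check that for you"},
    fallback="One moment.",
)

# Model skipped the ack: the static filler is spoken instead.
fallback_ack, fallback_args = split_ack({"query": "refund policy"}, fallback="One moment.")
```

In this sketch the ack would go to TTS before the tool is invoked, which is what closes the silent gap.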

Multilingual voice bots

  • Set the spoken language explicitly in voice instructions. "Respond in Hindi" is more reliable than expecting language detection.
  • Cloned voices generalize across languages, but always test in the target language before going live.
  • STT accuracy varies dramatically by language. Pick the STT provider that's best for your audience's language(s) — not just the default.

Escalation paths

A voice bot without an escalation path is a customer service liability.

  • Add a Transfer Call tool. Wire it to a human destination.
  • Add a Routing Logic rule that forces escalation on explicit triggers ("agent", "human", "representative", or repeated failure).
  • Customize the escalation announcement so the user knows what's happening.

See Escalation tools.
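
The forced-escalation rule above amounts to a simple predicate. The sketch below is plain Python, not the Routing Logic rule syntax. The should_escalate function and the max_failures threshold of 2 are assumptions chosen for illustration.

```python
import re

# Explicit trigger words from the rule above, matched as whole words.
TRIGGER_WORDS = re.compile(r"\b(agent|human|representative)\b", re.IGNORECASE)


def should_escalate(utterance: str, consecutive_failures: int, max_failures: int = 2) -> bool:
    """True when the caller asks for a person or the bot keeps failing."""
    asked_for_person = bool(TRIGGER_WORDS.search(utterance))
    return asked_for_person or consecutive_failures >= max_failures
```

For example, should_escalate("I want to talk to a human", 0) is true on the trigger word alone, while should_escalate("What's my balance?", 2) is true because of repeated failure.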

Recording compliance

Recording rules vary by jurisdiction. Before enabling recording:

  • Confirm the legal posture for each region you operate in.
  • Add a clear consent prompt at the start of the call when required.
  • Configure the Recording action on Transfer Call tools according to policy.

When in doubt, talk to legal before launch.

Pre-launch checklist

  • Voice mode set deliberately (Text & TTS for production, Realtime only for narrow use cases).
  • Provider and voice tested on a real call, not just the playground.
  • Cloned voices (if any) tested in target languages.
  • VAD tuned for the audience's speaking pace and environment.
  • Voice instructions cover language and pace explicitly.
  • Telephony provider configured for inbound and outbound.
  • Caller ID set sensibly.
  • Escalation tool wired and tested end-to-end.
  • Fallback wired to a human-handoff workflow (not a generic apology string).
  • Recording compliance reviewed for each region.
  • Voice regression suite (5+ real test prompts) defined and run before each release.

Common mistakes

  • Switching voice providers casually. Each provider has different voice character and latency profile. Customers notice.
  • Skipping outbound testing. "It sounds great in WebRTC" doesn't mean it sounds great over PSTN.
  • Tuning VAD without measurement. Random changes usually make things worse.
  • No escalation path. A frustrated voice user without a human option churns immediately.
  • Long welcomes. Voice users have even less patience for intros than chat users. Get to the point.
  • Multilingual without locking the language. Don't trust auto-detection in production. State the language in voice instructions.
  • Not listening to real calls. Production recordings (where compliant) are the only ground truth. Schedule time to listen.

Return to: Voice Overview.