Voice — Best Practices

Practical guidance from teams shipping v3 voice bots. Read once before launch, revisit before any voice-related change.

Pick the right runtime mode

If your goal is…	Pick
Production voice bot, broad provider/language support, full custom voice access	Text & TTS (pipeline)
Lowest possible latency for a short demo, with a small voice catalog	Realtime Audio
Anything multilingual or with a custom-cloned voice	Text & TTS (pipeline) — Realtime can't use your custom voices

Most production deployments stay on Text & TTS for the lifetime of the bot. Don't switch modes unless you have a specific reason.

Pick the right TTS provider

Provider	Sweet spot
Yellow AI (default)	Broad coverage, full custom-voice support. Right default for most bots.
ElevenLabs	Best-in-class English voice quality. Premium voices for high-stakes brand interactions.
MiniMax	Multilingual presets including Mandarin. Good for Asia-Pacific deployments.

Avoid switching providers per agent within the same bot where you can. A single provider keeps the brand voice consistent.

Pick the right STT (speech recognition)

Yellow AI is the default and right for most cases. Switch only when:

Your audience speaks a language Yellow handles less well — try Deepgram, Sarvam (Indic), or Microsoft Azure.
You're getting consistently bad transcripts despite good audio — try a different provider as a diagnostic.

Test STT with real recordings of your audience, not just your own voice. Accents, background noise, and domain vocabulary all matter.
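One way to make that comparison concrete is word error rate (WER): transcribe the same real recordings with each candidate provider and score each result against a human-written reference transcript. The sketch below is a minimal, provider-agnostic WER function. It assumes you already have the transcripts as plain strings, and it does not call any Yellow AI or provider API.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Score every provider on the same set of recordings and compare averages. A single clip says little, and this simple version does not normalize punctuation or numerals, so strip those first if your transcripts include them.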

Cloning a voice

Record clean. A 5-second clean clip beats a 30-second noisy one. Quiet room, good mic, single take.
Read the suggested sample text. It's tuned for clean clones.
Name for the voice character, not the use case. "Aria — warm support EN" beats "Customer Support Voice."
Test cloned voices in the Voice Playground and on a real phone before assigning to a production bot.

VAD tuning

Default values (threshold 0.85, prefix 300 ms, silence 500 ms) are right for most calls. Don't tune speculatively.

When you do tune:

Slow / non-native speakers → raise silence duration to 700–800 ms.
Fast-paced sales / outbound → drop silence duration to 300 ms.
Noisy environments → raise threshold to 0.9.

Tune one knob at a time, retest, and document why you changed it.
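The "one knob at a time, with a documented reason" rule can be enforced in whatever tooling you use to manage settings. The sketch below is illustrative only: `VadSettings`, its field names, and `tune` are hypothetical and do not reflect the actual bot config schema. Only the default values come from this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VadSettings:
    # Defaults quoted above; the field names are assumptions for this sketch.
    threshold: float = 0.85
    prefix_ms: int = 300
    silence_ms: int = 500


DEFAULTS = VadSettings()


def tune(base: VadSettings, *, reason: str, **change) -> tuple[VadSettings, dict]:
    """Return a copy of `base` with exactly one setting changed, plus a change record."""
    if len(change) != 1:
        raise ValueError("change exactly one VAD setting per experiment")
    if not reason.strip():
        raise ValueError("document why the setting is being changed")
    (name, new_value), = change.items()
    record = {
        "setting": name,
        "from": getattr(base, name),
        "to": new_value,
        "reason": reason,
    }
    return replace(base, **change), record
```

For example, `tune(DEFAULTS, reason="slow speakers cut off mid-sentence", silence_ms=750)` returns the adjusted settings together with a record you can keep next to the test results.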

Voice instructions vs bot identity

Bot identity says what the agent does and is.
Voice instructions say how it speaks — pace, language, tone delivery.

Don't repeat identity in voice instructions. Layer, don't duplicate.

Latency

Conversational voice feels broken above ~2 seconds end-to-end (user finishes speaking → user hears reply start). Aim for under 1.5 seconds.

Where latency hides:

Slow model — large prompts, slow model choice. Trim the system prompt and try a faster model.
Slow KB lookup — too many results, too-aggressive history concatenation. Tune the KB tool.
Slow TTS — provider choice, voice choice. Some voices are slower to start than others.
VAD too patient — silence duration high. Drop it.
Telephony round-trip — fixed cost; can't tune from the bot config.

Measure before you optimize. Isolate model latency from TTS latency by swapping each in turn and re-testing in the Voice → Telephony Web Call.
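A simple per-turn latency budget makes "measure before you optimize" easier to act on. The sketch below assumes you can obtain per-stage timings in milliseconds from your own logs or test harness. The stage names and the example numbers are hypothetical, and only the 1.5-second target and 2-second ceiling come from this page.

```python
TARGET_MS = 1500   # aim for under this, end-to-end
CEILING_MS = 2000  # above roughly this, the conversation feels broken


def summarize_turn(stage_ms: dict[str, float]) -> dict:
    """Total the per-stage timings for one turn and name the slowest stage."""
    if not stage_ms:
        raise ValueError("no stage timings supplied")
    total = sum(stage_ms.values())
    slowest = max(stage_ms, key=stage_ms.get)
    if total < TARGET_MS:
        verdict = "ok"
    elif total <= CEILING_MS:
        verdict = "over target"
    else:
        verdict = "feels broken"
    return {"total_ms": total, "slowest_stage": slowest, "verdict": verdict}


# Illustrative numbers only, not measurements.
example_turn = {
    "vad_silence": 500,
    "stt": 150,
    "model": 700,
    "kb_lookup": 400,
    "tts_first_audio": 300,
    "telephony": 200,
}
summary = summarize_turn(example_turn)
```

In this made-up turn the total is 2250 ms and the model is the slowest stage, so the model (prompt size, model choice) is where to look first, before touching VAD or TTS.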

Tool-call acknowledgements (`_voice_ack`)

The other place latency hides is the silent gap on tool calls. When the model dispatches a tool call (KB lookup, workflow, API call), the tool may take a second or two to return. Without anything to play, TTS goes silent, and to a voice user that silence reads as "the bot is broken."

v3 voice handles this automatically. On voice channels, the runtime decorates every tool's parameter schema with a _voice_ack field — a short spoken phrase the LLM is expected to fill in whenever it dispatches a tool ("Let me check that for you", "One moment while I pull that up"). The runtime emits the _voice_ack value to TTS immediately, then strips it from the tool's actual arguments so the tool implementation never sees it.

You don't configure _voice_ack — it's automatic on voice channels. But:

If the model skips the field (uncommon, but happens), the runtime falls back to your bot's static toolCallFiller. Make sure you've set a sensible fallback ("One moment.") in the voice config.
The acknowledgement is one short utterance per tool call, not a sentence per argument. If a tool routinely takes more than 3 seconds, follow up the ack with a second message inside the tool's response stream rather than relying on the ack alone to cover the wait.
Realtime Audio mode (OpenAI Realtime) handles this differently — see Voice on Nexus for the realtime-mode notes.
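To make the decorate, speak, strip sequence concrete, here is a conceptual sketch. It is not the runtime's implementation: it assumes tool parameters are described with a JSON-Schema-style `properties` object, and the function names `decorate_schema` and `split_ack` are hypothetical. Only the `_voice_ack` field name and the fallback-to-filler behavior come from the description above.

```python
import copy

ACK_FIELD = "_voice_ack"


def decorate_schema(schema: dict) -> dict:
    """Return a copy of a tool's parameter schema with the ack field added."""
    decorated = copy.deepcopy(schema)
    decorated.setdefault("properties", {})[ACK_FIELD] = {
        "type": "string",
        "description": "Short phrase to speak while the tool runs.",
    }
    return decorated


def split_ack(arguments: dict, fallback: str) -> tuple[str, dict]:
    """Separate the spoken acknowledgement from the arguments the tool receives.

    If the model left the field out or empty, the static fallback is used.
    """
    ack = arguments.get(ACK_FIELD)
    spoken = ack.strip() if isinstance(ack, str) and ack.strip() else fallback
    clean = {key: value for key, value in arguments.items() if key != ACK_FIELD}
    return spoken, clean
```

Given model arguments `{"order_id": "A-17", "_voice_ack": "Let me check that for you"}`, `split_ack` yields the phrase to send to TTS and `{"order_id": "A-17"}` for the tool. With the field missing, the phrase is whatever fallback you pass in, which corresponds to toolCallFiller in the description above.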

Multilingual voice bots

Set the spoken language explicitly in voice instructions. "Respond in Hindi" is more reliable than expecting language detection.
Cloned voices generalize across languages, but always test in the target language before going live.
STT accuracy varies dramatically by language. Pick the STT provider that's best for your audience's language(s) — not just the default.

Escalation paths

A voice bot without an escalation path is a customer service liability.

Add a Transfer Call tool. Wire it to a human destination.
Add a Routing Logic rule that forces escalation on explicit triggers ("agent", "human", "representative", or repeated failure).
Customize the escalation announcement so the user knows what's happening.
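The forced-escalation rule boils down to a small predicate. The sketch below only illustrates the logic. Routing Logic rules are configured in the product rather than written as code, and the function name and the failure limit of 2 are assumptions made for this example.

```python
import re

# Explicit triggers named above, matched as whole words.
TRIGGER_PATTERN = re.compile(r"\b(agent|human|representative)\b", re.IGNORECASE)

# Assumption: how many consecutive failed turns count as "repeated failure".
MAX_FAILED_TURNS = 2


def should_escalate(utterance: str, consecutive_failures: int) -> bool:
    """True when the caller asks for a person or the bot keeps failing."""
    asked_for_person = bool(TRIGGER_PATTERN.search(utterance))
    return asked_for_person or consecutive_failures >= MAX_FAILED_TURNS
```

Whole-word matching keeps words such as "humanity" from triggering a transfer. In practice you would test the trigger list against real transcripts, since STT errors can mangle these keywords.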

See Escalation tools.

Recording compliance

Recording rules vary by jurisdiction. Before enabling recording:

Confirm the legal posture for each region you operate in.
Add a clear consent prompt at the start of the call when required.
Configure the Recording action on Transfer Call tools according to policy.

When in doubt, talk to legal before launch.

Pre-launch checklist

Common mistakes

Switching voice providers casually. Each provider has different voice character and latency profile. Customers notice.
Skipping outbound testing. "It sounds great in WebRTC" doesn't mean it sounds great over PSTN.
Tuning VAD without measurement. Random changes usually make things worse.
No escalation path. A frustrated voice user without a human option churns immediately.
Long welcomes. Voice users tolerate even shorter intros than chat users. Get to the point.
Multilingual without locking the language. Don't trust auto-detection in production. State the language in voice instructions.
Not listening to real calls. Production recordings (where compliant) are the only ground truth. Schedule time to listen.

Return to: Voice Overview.
