Voice Settings

Voice Settings is where you configure how an agent sounds and listens. It's the Voice Settings tab inside AI Agent → Profile, alongside Profile Settings, Conversation Rules, and Routing Logic.

Step 1: Open Voice Settings

In your bot, go to AI Agent → Profile and click Voice Settings in the left menu.

The page is split into four sections:

Voice Configuration (mode, provider, voice).
Voice Instructions.
Audio Settings.
Advanced VAD.

You can leave most defaults alone for a first build — the only sections you must touch are the first two.

Voice settings sub-page on the super-agent Configuration — Voice configuration (Mode, TTS provider, Voice), Voice instructions textarea, Audio settings (Noise Cancellation), and Advanced settings with VAD threshold, Prefix Padding, and Silence Duration controls

Step 2: Pick a mode

The first decision is Mode. Two options:

Mode	What you get
Text & TTS (default)	Speech recognized → text answered by your agent → text spoken back. Three independent stages, lots of provider choice.
Realtime Audio	Realtime audio API (OpenAI today; multi-provider adapters in flight for Gemini Live, Anthropic, MiniMax). Audio straight in, audio straight out — lowest end-to-end latency. Now supports registered tools (DynamicTool / MCP / KB), mid-call agent switch, greet-on-connect, per-turn traces.

How to choose:

Stick with Text & TTS when you need the widest provider catalog, broadest language coverage, and access to your custom voices.
Pick Realtime Audio when end-to-end latency matters more than fine-grained provider control. With the May 2026 update, realtime has reached parity with pipeline mode on tools, agent switching, and traces — it's a viable production option for new builds, not just demos.

You can switch between modes any time. Each mode has its own voice and provider list, so you'll re-pick a voice if you change.

Realtime mode internals: routing is driven by the existing UI-editable voiceOptions.mode === "realtime" field — no separate feature flag. Schema config (provider, audio.{input,output}SampleRate, audioFormat, greetOnConnect) is exposed through the same Voice Settings sub-page; defaults are sensible for the OpenAI provider.

Step 3: Pick a provider (Text & TTS mode)

The TTS Provider dropdown lists what's available for the speak-back stage. Currently:

Provider	Notes
Yellow AI	The default. Good general quality, broad language coverage, and full access to your custom voices.
ElevenLabs	Premium voice quality, strong English. 10 built-in presets (Rachel, Domi, Bella, Antoni, Adam, etc.).
MiniMax	Multilingual presets including Mandarin, Japanese, French, Spanish.

Speech-to-text (STT) provider is configured separately in pipeline runtime config. Yellow AI is the default STT choice.

Step 4: Pick a voice

The Voice dropdown shows every voice available for the chosen provider:

Built-in presets (Rachel, Domi, Bella, etc. for ElevenLabs; Mandarin and multilingual options for MiniMax).
Yellow's global curated voices.
Your custom voices — anything you've cloned for this bot. See Custom Voices.

Click the play icon next to a voice to preview it before saving.

Best practice: match the voice to your audience. A persuasive sales bot wants a different voice than a calm support bot. Preview a few before committing.

Step 5: (Realtime Audio mode) Pick an OpenAI voice

If you switched to Realtime Audio, the voice list narrows to OpenAI's options: Alloy, Ash, Ballad, Coral, Echo, Sage, Shimmer, Verse, Marin, Cedar.

Each has a distinct character — preview to pick.

Step 6: Write voice instructions

The Voice Instructions field (≤500 characters) is freeform text that biases the bot's spoken behavior:

"Respond only in Hindi, even if the user types in English."
"Speak slowly and clearly. Pause briefly between sentences."
"Sound friendly but concise — keep replies under two sentences."

Best practices:

Use this for delivery (pace, tone, language); use bot identity for what to say.
Don't repeat your bot identity here. Voice instructions layer on top.
For multilingual bots, this is often the cleanest place to lock the spoken language.

Step 7: Audio Settings

There's currently one option here:

Noise Cancellation — on by default. Suppresses background noise on the user's mic. Leave it on unless you have a specific reason (some studio setups can do better with raw audio).

Step 8: Advanced VAD (Voice Activity Detection)

VAD is what tells the bot "the user has finished speaking, my turn." Default values work for most calls; tune only if testing reveals a specific problem.

Setting	Default	What it does
VAD Threshold	0.85	How aggressively to detect speech vs silence. Higher = stricter (less likely to trigger on background noise but may cut off quiet speech).
Prefix Padding	300 ms	How much audio before detected speech to include. Keeps the start of words from being clipped.
Silence Duration	500 ms	How long the user must be silent before the bot considers the turn ended. Lower = snappier, more interruptions; higher = more natural pauses, but a small delay.

Tuning by use case:

Outbound sales / fast-paced chat → drop Silence Duration to ~300 ms; you'll feel more responsive.
Slow / elderly / non-native speakers → raise Silence Duration to 700–800 ms so the bot doesn't cut them off.
Noisy environments (vehicles, public spaces) → raise VAD Threshold to 0.9.

VAD section at the bottom of the Voice settings page — VAD Threshold (0.85 default), Prefix Padding (300 default), and Silence Duration controls

Step 9: Save

Click Save. The configuration applies to all conversations on this bot from the next call onward.

Best practices

Pick the mode that matches your goal, not the newest one. Pipeline (Text & TTS) is right for most production bots.
Preview voices before committing. A 10-second preview saves a lot of "this voice is wrong" feedback later.
Keep voice instructions short and behavioral. This isn't where you re-state the agent's purpose.
Don't change VAD without a measurement. Test the bot, find a specific problem, then tune. Random changes can make things worse.
For multilingual bots, set the spoken language explicitly in Voice Instructions. Don't rely on auto-detection unless you've tested it on real calls.

Continue to: Custom Voices.

Step 1: Open Voice Settings​

Step 2: Pick a mode​

Step 3: Pick a provider (Text & TTS mode)​

Step 4: Pick a voice​

Step 5: (Realtime Audio mode) Pick an OpenAI voice​

Step 6: Write voice instructions​

Step 7: Audio Settings​

Step 8: Advanced VAD (Voice Activity Detection)​

Step 9: Save​

Best practices​