Key components of a voice agent

Voice agents combine speech recognition, backend logic, and speech synthesis to hold interactive conversations over voice channels. Understanding these key components is vital for designing and implementing an effective voice agent system.

This article highlights the basics of voice agent components.


Recap of the voice agent workflow

To understand the components of a voice agent, you first need a basic understanding of the voice agent architecture. A typical call proceeds as follows:

The customer initiates a call to the voice agent and makes a service request.
The customer's speech is converted into text. This text is then sent to the telephony platform, which acts as the interface between the cloud and the customer.
The text representing the customer's request is processed by the backend logic, which can be implemented through flows or custom code on the Yellow.ai Cloud platform.
Once the cloud platform understands the user's request, it generates a response in the form of text, which is then sent back to the telephony platform.
The voice agent converts the response text into speech and delivers it to the customer, allowing them to hear the response in a natural voice.
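
The workflow above can be sketched as a single conversational turn. This is a minimal illustration only: `handle_turn` and the three stand-in stages (`fake_stt`, `fake_backend`, `fake_tts`) are hypothetical names, not part of the Yellow.ai platform, and real STT and TTS stages would call an external engine.

```python
from typing import Callable


def handle_turn(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    backend_logic: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Run one turn: caller audio in, agent audio out."""
    request_text = speech_to_text(audio)          # speech -> text
    response_text = backend_logic(request_text)   # flows or custom code
    return text_to_speech(response_text)          # text -> speech


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_backend(text: str) -> str:
    if "balance" in text.lower():
        return "Your balance is available in the app."
    return "Sorry, I did not understand that."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply = handle_turn(b"What is my balance?", fake_stt, fake_backend, fake_tts)
print(reply.decode("utf-8"))
```

Passing the stages in as functions keeps the turn logic independent of any particular STT or TTS engine.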

Key components of voice agent workflow

The voice agent architecture consists of three key components:

Capture user input: The agent records and captures the customer's input.
Understand user input: The agent processes and comprehends the customer's input to generate a response.
Customize agent response: The agent plays or communicates the response back to the customer.

1. Capturing user inputs

There are three methods for capturing user input on the IVR channel, summarized below.

Capturing user speech: The Yellow.ai platform incorporates Speech-to-Text (STT) and Automatic Speech Recognition (ASR) technologies for accurate transcription and real-time recognition of user speech. Through partnerships with leading STT engine providers such as Microsoft and Google, we ensure high-quality transcription. ASR uses machine learning algorithms to recognize speech precisely, enabling natural, interactive conversations between users and voice agents. You can configure parameters such as the STT engine, mode, and silence parameters for optimal performance.
Capturing keypad (DTMF) input: Users can provide input through the keypad while on a call. This method is typically used for numeric inputs. Two configurations are available:
- DTMF Digit Length: Captures fixed-length user input, like a mobile number.
- DTMF Finish Character: Used for variable-length inputs, such as an application ID. Users can press "*" or "#" to indicate the completion of input.
  
  If there is no activity for more than 10 seconds, the voice agent treats it as the end of input.
Capturing keypad and speech as input: In certain cases, the voice agent can allow users to provide input either by pressing keys on the keypad or by speaking. The agent treats the first response it receives, whether keypad activity or speech, as the final input.
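
The two DTMF configurations and the 10-second inactivity rule can be illustrated with a small sketch. The function `collect_dtmf` and its parameters are hypothetical and only model the behavior described above; they are not the platform's API. Key presses are represented as `(timestamp_in_seconds, key)` pairs.

```python
from typing import Iterable, List, Optional, Tuple

INACTIVITY_TIMEOUT_S = 10.0  # from the text: more than 10 s of silence ends input


def collect_dtmf(
    events: Iterable[Tuple[float, str]],
    digit_length: Optional[int] = None,   # models "DTMF Digit Length"
    finish_chars: Tuple[str, ...] = (),   # models "DTMF Finish Character"
    timeout_s: float = INACTIVITY_TIMEOUT_S,
    start_time: float = 0.0,
) -> str:
    """Collect key presses until a stop condition is met."""
    digits: List[str] = []
    last_activity = start_time
    for timestamp, key in events:
        # Stop if the gap since the last activity exceeds the timeout.
        if timestamp - last_activity > timeout_s:
            break
        last_activity = timestamp
        # Variable-length mode: a finish character ends the input.
        if key in finish_chars:
            break
        digits.append(key)
        # Fixed-length mode: stop once enough digits have arrived.
        if digit_length is not None and len(digits) >= digit_length:
            break
    return "".join(digits)


# Fixed length: only the first three keys are kept.
print(collect_dtmf([(1.0, "9"), (1.5, "8"), (2.0, "7"), (2.5, "6")], digit_length=3))
# Variable length: "#" marks the end of input.
print(collect_dtmf([(1.0, "4"), (2.0, "2"), (3.0, "#")], finish_chars=("*", "#")))
```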

2. Understanding & responding to user responses

Once the user input is available as text, it is analyzed on the Yellow.ai Cloud platform, which also produces the output in text format. This involves operations such as:

Training the bot by adding intents and entities.
Configuring output for failure or no-response messages.
Configuring validations against the context (intents, entities, pre-trained entities, etc.).
Many more...
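
As a rough illustration of validation combined with failure and no-response messages, the sketch below checks a transcribed account number. Everything here is an assumption made for the example: the 10-digit format, the message wording, and the name `validate_account_number`. On the platform these are configured rather than hand-coded.

```python
import re
from typing import Optional, Tuple

ACCOUNT_PATTERN = re.compile(r"\d{10}")  # assumed format: exactly 10 digits

NO_RESPONSE_MESSAGE = "Sorry, I did not hear anything. Please say your account number."
FAILURE_MESSAGE = "That does not look like a valid account number. Please try again."


def validate_account_number(transcript: Optional[str]) -> Tuple[bool, str]:
    """Return (True, digits) on success or (False, message_to_play) otherwise."""
    if transcript is None or not transcript.strip():
        return False, NO_RESPONSE_MESSAGE
    # Transcribed speech may contain spaces between digits, so keep digits only.
    digits = re.sub(r"\D", "", transcript)
    if ACCOUNT_PATTERN.fullmatch(digits):
        return True, digits
    return False, FAILURE_MESSAGE


print(validate_account_number("12345 67890"))
print(validate_account_number(""))
```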

3. Customizing agent responses

For users to hear the agent's responses during a call, the text generated by the agent must be converted into spoken words. This conversion is achieved using text-to-speech (TTS) technology in collaboration with leading TTS engine providers such as Microsoft, Google, and Amazon. These engines provide a range of capabilities to meet different business requirements, including the ability to create customized and lifelike speech.

There are three ways to convert agent-generated text into speech on the IVR channel:

Using pre-built neural voice models: TTS providers offer pre-built neural voice models that can be used for quick prototyping and for building demo voice agents. The design module and the voice input node convert text into speech, giving a basic understanding of how the agent's responses would sound.
Applying customization with SSML: For more advanced use cases where precise control over speech synthesis is needed, Speech Synthesis Markup Language (SSML) can be employed. SSML in the design module and the voice input node allows fine-grained customization of speech, such as adjustments to speech rate, pitch, style, and pauses. By incorporating SSML in the text, the resulting speech can be highly tailored and expressive.
Playing pre-recorded messages: In some scenarios, it may be necessary to play pre-recorded messages during a call, for example when a custom proprietary sound or a welcome tone needs to be played. Pre-recorded messages, in .wav or .mp3 formats, can be integrated into the call flow to provide a specific audio experience.
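
To make the SSML option more concrete, here is a minimal sketch that wraps response text in standard SSML elements (`<speak>`, `<prosody>`, `<break>`) and checks that the result is well-formed XML. The helper `build_ssml` is hypothetical. The attributes required on `<speak>` and the supported `rate` and `pitch` values vary by TTS engine, so treat the values below as placeholders.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(text: str, rate: str = "medium", pitch: str = "medium", pause_ms: int = 0) -> str:
    """Wrap text in a prosody element, optionally preceded by a pause."""
    # escape() protects characters such as "&" and "<" in the spoken text.
    body = f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
    if pause_ms > 0:
        body = f'<break time="{pause_ms}ms"/>' + body
    return f"<speak>{body}</speak>"


ssml = build_ssml("Welcome to our store. How can I help?", rate="slow", pause_ms=500)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Checking well-formedness before sending the markup to a TTS engine catches unescaped characters early, which is a common cause of failed synthesis.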

Additional response customization on yellow.ai

Yellow.ai offers additional features for customizing agent responses:

Adding variables for custom responses (Dynamic messages): Personalize interactions with the agent by incorporating variables. By storing and using variables in the design module or voice input node, you can dynamically include user-specific information, such as the user's name or city, in the agent's responses, creating a more tailored experience.

Conversation 1
User: I live in Bangalore.
Bot: Bangalore is a great place! May I know your account number?

Conversation 2
User: My name is Sam.
Bot: Welcome to our store, Sam. What would you like to try first?
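
The two conversations above can be reproduced with simple templating. This sketch uses Python's `string.Template` purely for illustration. The platform's own variable syntax is not shown in this article, and the names `WELCOME`, `CITY_REPLY`, and `render` are hypothetical.

```python
from string import Template
from typing import Mapping

WELCOME = Template("Welcome to our store, $name. What would you like to try first?")
CITY_REPLY = Template("$city is a great place! May I know your account number?")


def render(template: Template, variables: Mapping[str, str]) -> str:
    """Fill a response template; raises KeyError if a variable is missing."""
    return template.substitute(variables)


print(render(CITY_REPLY, {"city": "Bangalore"}))
print(render(WELCOME, {"name": "Sam"}))
```

Failing loudly on a missing variable is a deliberate choice here: on a voice channel, reading an unfilled placeholder aloud is worse than falling back to a generic message.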

Using the translations page for multilingual agents (Multi-lingual flows): For multilingual agents, the translations feature can be employed to customize speech in different languages. This means that you can build a single flow and use the translations feature on the platform to store configurations for each language. You can provide localized experiences to users in different languages without the need to duplicate or recreate entire flows. This streamlines the development process and ensures consistency across different language versions of the voice agent.

The translation text should be in SSML format.
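
Conceptually, a single flow with per-language translations behaves like a lookup keyed by message and language. The sketch below assumes a hypothetical `TRANSLATIONS` table, English as the fallback language, and illustrative translation text. On the platform this data lives on the translations page rather than in code.

```python
from typing import Dict

# Each entry is stored as SSML, as the note above requires.
TRANSLATIONS: Dict[str, Dict[str, str]] = {
    "welcome": {
        "en": "<speak>Welcome to our store.</speak>",
        "es": "<speak>Bienvenido a nuestra tienda.</speak>",
    },
}
DEFAULT_LANGUAGE = "en"  # assumed fallback when a language has no entry


def localized_prompt(key: str, language: str) -> str:
    """Return the SSML for a message key in the requested language."""
    variants = TRANSLATIONS[key]
    return variants.get(language, variants[DEFAULT_LANGUAGE])


print(localized_prompt("welcome", "es"))
```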
