
Key components of a voice bot

Voice bots combine speech recognition, language understanding, and speech synthesis to hold interactive conversations over voice channels. Understanding these components is essential for designing and implementing an effective voice bot system.

This article highlights the basics of voice bot components.


Recap of the voice bot workflow

To understand the components of a voice bot, you must have a basic understanding of the voice bot architecture.

  1. The customer initiates a call to the voice bot and makes a service request.
  2. The customer's speech is converted into text. This text is then sent to the telephony platform, which acts as the interface between the cloud and the customer.
  3. The text representing the customer's request is processed by the backend logic, which can be implemented through flows or custom code on the cloud platform.
  4. Once the cloud platform understands the user's request, it generates a response in the form of text, which is then sent back to the telephony platform.
  5. The voice bot converts the response text into speech and delivers it to the customer, allowing them to hear the response in a natural voice.
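The steps above can be sketched as a single function that chains the three stages. This is only an illustration of the data flow, not the platform's implementation: `handle_turn` and the `fake_*` helpers are hypothetical names, and the stand-in functions replace real STT, backend, and TTS services.

```python
from typing import Callable


def handle_turn(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    backend_logic: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Run one conversational turn: caller audio in, bot audio out."""
    request_text = speech_to_text(audio)         # step 2: speech -> text
    response_text = backend_logic(request_text)  # steps 3-4: text -> response text
    return text_to_speech(response_text)         # step 5: response text -> speech


# Stand-in implementations (assumptions) so the sketch runs without real engines.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_backend(text: str) -> str:
    if "balance" in text.lower():
        return "Your balance is 100 dollars."
    return "Sorry, I did not understand that."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = handle_turn(b"What is my balance?", fake_stt, fake_backend, fake_tts)
```

Passing the three stages in as callables keeps the flow independent of any particular STT or TTS engine.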

Key components of the voice bot workflow

The voice bot architecture consists of three key components:

  1. Capture user input: The bot records and captures the customer's input.
  2. Understand user input: The bot processes and comprehends the customer's input to generate a response.
  3. Customize bot response: The bot plays or communicates the response back to the customer.

1. Capturing user inputs

There are three methods to capture user input on the IVR channel, summarized below.

  1. Capturing user speech: The platform incorporates Speech-to-Text (STT) and Automatic Speech Recognition (ASR) technologies to transcribe and recognize user speech in real time. Through partnerships with leading STT engines like Microsoft and Google, the platform provides high-quality transcription. ASR uses machine learning algorithms so that users and voice bots can interact naturally. You can configure parameters such as the STT engine, mode, and silence parameters for optimal performance.

  2. Capturing keypad (DTMF) input: Users can provide input through the keypad while on a call, typically used for numeric inputs. Two configurations are available:

    • DTMF Digit Length: Captures fixed-length user input, like a mobile number.
    • DTMF Finish Character: Used for variable-length inputs, such as an application ID. Users can press "*" or "#" to indicate the completion of input.

      If there is no activity for more than 10 seconds, the bot treats it as the end of input.

  3. Capturing keypad and speech as input: In certain cases, the bot can allow users to provide input by either typing on the keypad or speaking it out. The bot recognizes the first response, whether it's from keypad activity or speech, as the final input.
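The two DTMF configurations and the inactivity rule can be sketched as follows. This is a simplified model, not the platform's API: `collect_dtmf` and its parameters are hypothetical names, key presses are simulated as `(timestamp_seconds, key)` pairs, and the prompt is assumed to end at time 0.

```python
from typing import Iterable, List, Optional, Tuple

INACTIVITY_TIMEOUT_S = 10.0  # from the text: more than 10 s without activity ends input


def collect_dtmf(
    key_presses: Iterable[Tuple[float, str]],
    digit_length: Optional[int] = None,
    finish_chars: str = "",
) -> str:
    """Collect keypad input from (timestamp_seconds, key) events.

    Stops when digit_length keys are captured, when a finish character
    is pressed, or when the gap since the last activity exceeds the timeout.
    """
    collected: List[str] = []
    last_time = 0.0  # assumption: the prompt ended at t = 0
    for timestamp, key in key_presses:
        if timestamp - last_time > INACTIVITY_TIMEOUT_S:
            break  # inactivity: treat what was captured so far as complete
        if key in finish_chars:
            break  # finish character marks the end of variable-length input
        collected.append(key)
        last_time = timestamp
        if digit_length is not None and len(collected) == digit_length:
            break  # fixed-length input is complete
    return "".join(collected)


fixed = collect_dtmf([(1, "9"), (2, "8"), (3, "7"), (4, "6")], digit_length=3)
variable = collect_dtmf([(1, "4"), (2, "2"), (3, "#"), (4, "5")], finish_chars="*#")
timed_out = collect_dtmf([(1, "1"), (2, "2"), (13, "3")], finish_chars="#")
```

In the last call the third key arrives 11 seconds after the second, so only the first two digits are kept.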

2. Understanding & responding to user responses

Once the user input is available as text, operations on the cloud platform analyze it and produce an output, also in text format. For example:

  • Bot training by adding intent and entities.
  • Configuring output for failure or no-response messages.
  • Configuring validations against the context (intents, entities, pre-trained entities, etc).
  • Many more...
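As a rough illustration of these operations, the sketch below matches an intent by keyword, extracts one entity with a regular expression, and falls back to failure or no-response messages. It is a toy model under stated assumptions: the intent names, the six-digit account number format, the message texts, and the `understand` function are all hypothetical, and a trained bot would use a language model rather than keyword matching.

```python
import re
from typing import Dict, Optional

# Hypothetical intents and trigger words, for illustration only.
INTENT_KEYWORDS = {
    "check_balance": ("balance",),
    "block_card": ("block", "lost"),
}
FAILURE_MESSAGE = "Sorry, I did not catch that. Could you repeat?"
NO_RESPONSE_MESSAGE = "I did not hear anything. Are you still there?"


def understand(text: Optional[str]) -> Dict[str, Optional[str]]:
    """Map transcribed text to an intent, an entity, and a text reply."""
    if text is None or not text.strip():
        return {"intent": None, "account_number": None, "reply": NO_RESPONSE_MESSAGE}

    lowered = text.lower()
    intent = next(
        (name for name, words in INTENT_KEYWORDS.items()
         if any(word in lowered for word in words)),
        None,
    )
    # Entity validation: assume account numbers are exactly six digits.
    match = re.search(r"\b\d{6}\b", text)
    account_number = match.group(0) if match else None

    if intent is None:
        reply = FAILURE_MESSAGE
    elif account_number is None:
        reply = "Please tell me your six-digit account number."
    else:
        reply = f"Okay, handling {intent} for account {account_number}."
    return {"intent": intent, "account_number": account_number, "reply": reply}


result = understand("What is my balance for account 123456?")
```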

3. Customizing bot responses

For users to hear the bot's responses during a call, the text generated by the bot must be converted into speech. This conversion is achieved using text-to-speech (TTS) technology in collaboration with leading TTS engines such as Microsoft, Google, and Amazon. These engines provide a range of capabilities to meet different business requirements, including the ability to create customized and lifelike speech.

There are three ways to convert bot-generated text into speech on the IVR channel:

  1. Using pre-built neural voice models: TTS providers offer pre-built neural voice models that can be used for quick prototyping and building demo voice bots. The design module and the voice input node convert text into speech, giving a basic understanding of how the bot's responses would sound.
  2. Applying customization with SSML: For more advanced use cases where precise control over speech synthesis is needed, Speech Synthesis Markup Language (SSML) can be employed. SSML in the design module and the voice input node allows fine-grained customization of speech, such as adjustments to speech rate, pitch, style, and pauses. By incorporating SSML in the text, the resulting speech can be highly tailored and expressive.
  3. Playing pre-recorded messages: In some scenarios, it may be necessary to play pre-recorded messages during a call, for example when a custom proprietary sound or a welcome tone needs to be played. Pre-recorded messages, in .wav or .mp3 formats, can be integrated into the call flow to provide a specific audio experience.
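To make the SSML option more concrete, here is a minimal sketch that wraps plain text in standard SSML elements (`speak`, `prosody`, `break`). The `to_ssml` helper and its parameters are hypothetical, and individual TTS engines may require additional attributes or a `voice` element, so treat the output as a starting point rather than engine-ready markup.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str, rate: str = "medium", pitch: str = "medium", pause_ms: int = 0) -> str:
    """Wrap plain text in SSML with prosody settings and an optional trailing pause."""
    # escape() protects characters such as & and < that would break the XML.
    body = f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
    if pause_ms > 0:
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"


greeting = to_ssml("Welcome to our store.", rate="slow", pause_ms=500)
```

Escaping the text before embedding it matters because bot responses often contain user-supplied values.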

Additional response customization

Yellow AI offers additional features for customizing bot responses:

  1. Adding variables for custom responses (Dynamic messages): Personalize interactions with the bot by incorporating variables. By storing and using variables in the design module or voice input node, you can dynamically include user-specific information in the bot's responses, creating a more tailored experience. For example, you can dynamically include the user's name or city in the responses.

    Conversation 1
    User: I live in Bangalore.
    Bot: Bangalore is a great place! May I know your account number?

    Conversation 2
    User: My name is Sam.
    Bot: Welcome to our store, Sam. What would you like to try first?

  2. Using the translations page for multilingual bots (Multi-lingual flows): For multilingual bots, the translations feature can be employed to customize speech in different languages. This means that you can build a single flow and use the translations feature on the platform to store configurations for each language. You can provide localized experiences to users in different languages without the need to duplicate or recreate entire flows. This streamlines the development process and ensures consistency across different language versions of the voice bot.

    The translation text should be in SSML format.
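Both ideas, dynamic variables and per-language text stored in SSML, can be sketched together. The example below is an assumption-laden illustration rather than the platform's translations feature: the `TRANSLATIONS` table, the `welcome` key, the Spanish wording, and `render_response` are all hypothetical.

```python
from string import Template
from typing import Dict
from xml.sax.saxutils import escape

# Hypothetical translation table: one SSML template per language for each message key.
TRANSLATIONS: Dict[str, Dict[str, Template]] = {
    "en": {"welcome": Template("<speak>Welcome to our store, $name.</speak>")},
    "es": {"welcome": Template("<speak>Bienvenido a nuestra tienda, $name.</speak>")},
}


def render_response(
    key: str,
    language: str,
    variables: Dict[str, str],
    default_language: str = "en",
) -> str:
    """Fill a language-specific SSML template with user-specific variables."""
    # Fall back to the default language when no translation exists.
    messages = TRANSLATIONS.get(language, TRANSLATIONS[default_language])
    # Escape variable values so they cannot break the SSML markup.
    safe_values = {name: escape(value) for name, value in variables.items()}
    return messages[key].substitute(safe_values)


english = render_response("welcome", "en", {"name": "Sam"})
spanish = render_response("welcome", "es", {"name": "Sam"})
```

A single call site serves every language, which mirrors the idea of building one flow and storing only the per-language text separately.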