Understand delays in a conversation

A delay is the time between the end of the user's response and the start of the bot's response. For example, if the user says "is there any discount?", the delay is the time the bot takes to process that line and respond to it.

The objective of the Telephony and Yellow cloud platform is to reduce the conversational delay and make the conversation more human-like.

Before configuring the voice bot to minimise the conversational delay without cutting the user off mid-sentence, let's look at how normal dialogue works.

  1. When the response is a Yes/No value: The response is expected to be almost instantaneous and short.
  2. When the response is an address: The response contains multiple pauses/gaps and takes longer to complete.

Ex: Door #1 < pause > Sector-D1 < pause > Kanpur Road < pause > Lucknow

  3. When the response is a phone number: The pauses follow a pattern. There are a few pauses, but the whole response is not very long.

Ex: 99-44-32-06-11 or +1-202-795-3213

1. Types of delays

Now that we have seen how delays work in normal dialogue, let's look at the other kinds of delay that are introduced in a bot conversation. With a clear idea of how these delays arise, we can better optimise the conversation around them.

Delay (or the delay perceived by the user) is the time between the user completing the query/response and the bot voicing its next response.

  1. STT delay: This delay occurs after the user has responded and the telephony platform has received the audio file, while the STT engine converts the audio to text. It depends on the number of characters: a simple name takes less time to process than an address.

Ex: Audio-to-text conversion of "My name is Jake, what is my bank balance?"


The Yellow cloud platform defines a range of durations for which the telephony platform accepts the response. For example, if the user is responding with a phone number, we can ask the bot to record for at most 1 minute by setting the Recording max duration.

  2. Telephony to Yellow cloud platform: The text converted from the audio is transported from the telephony platform to the Yellow cloud platform.
  3. NLP engine response time within the cloud platform: The text received on the Yellow cloud platform is sent to the NLP engine. Internally, a logical function interprets the user's text response, finds the next step in the flow and generates a bot response.
  4. Yellow cloud platform to telephony: The text response generated by the NLP engine on the Yellow cloud platform is transported to the telephony platform.
  5. TTS delay: This delay occurs while the TTS engine converts the bot response text to audio so that it can be played to the user.
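The five components listed above add up to the delay the user perceives. The following is a minimal sketch of that arithmetic, not platform code: the class name, field names and millisecond values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class DelayBreakdown:
    """Delay components for one conversational turn, in milliseconds.

    Field names and values are illustrative assumptions, not platform metrics.
    """
    stt_ms: int                 # audio-to-text conversion on the STT engine
    telephony_to_cloud_ms: int  # text sent from telephony to the cloud platform
    nlp_ms: int                 # NLP engine interprets the text and builds a reply
    cloud_to_telephony_ms: int  # reply text sent back to the telephony platform
    tts_ms: int                 # text-to-audio conversion on the TTS engine

    def total_ms(self) -> int:
        """Perceived delay: the sum of every component."""
        return sum(getattr(self, f.name) for f in fields(self))

    def largest_component(self) -> str:
        """Name of the component that contributes the most delay."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Made-up numbers for a single turn.
turn = DelayBreakdown(
    stt_ms=600,
    telephony_to_cloud_ms=40,
    nlp_ms=120,
    cloud_to_telephony_ms=40,
    tts_ms=300,
)
print(turn.total_ms())           # 1100
print(turn.largest_component())  # stt_ms
```

Breaking the delay down this way makes it easier to see which stage is worth optimising first.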

TTS delay optimisation using cache memory

Use case: Assume you own a voice bot that converses with 1000 users a day and the conversation flow is mostly the same. Ex: The first bot question will be "what is your name" and the next message informs the user: "we have introduced a 0% intro APR offer on both purchases and transfers".

These repetitive messages, and their small audio files, remain the same across all calls, so the TTS delay that occurs while the NLP text response is converted to speech can be optimised using a cache. The bot response is fetched from the cache database (present in the telephony platform), which skips the TTS delay.

The audio files generated during the first few user conversations are stored in the database and reused for later calls, reducing the overall latency.
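The caching idea can be sketched as a lookup keyed by the response text. This is only an illustration under stated assumptions: `TTSCache`, `fake_tts` and the in-memory dictionary are hypothetical stand-ins, and the actual cache in the telephony platform may work differently.

```python
import hashlib


class TTSCache:
    """In-memory cache of synthesised audio, keyed by voice and response text."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # callable(text, voice) -> bytes
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text, voice):
        # Collapse whitespace so trivially different strings share one entry.
        normalized = " ".join(text.split())
        return hashlib.sha256(f"{voice}|{normalized}".encode("utf-8")).hexdigest()

    def get_audio(self, text, voice="default"):
        key = self._key(text, voice)
        if key in self._store:
            self.hits += 1            # cached: no TTS delay for this turn
            return self._store[key]
        self.misses += 1              # first occurrence: pay the TTS delay once
        audio = self._synthesize(text, voice)
        self._store[key] = audio
        return audio


def fake_tts(text, voice):
    """Hypothetical stand-in for a real TTS engine call."""
    return f"<audio voice={voice}>{text}</audio>".encode("utf-8")


cache = TTSCache(fake_tts)
for _ in range(1000):  # 1000 calls that follow the same flow
    cache.get_audio("What is your name?")
    cache.get_audio("We have introduced a 0% intro APR offer on both purchases and transfers.")

print(cache.misses, cache.hits)  # 2 1998
```

Only the first call pays the synthesis cost for each message. Responses that embed per-user values, such as a name or a balance, would not benefit in the same way because their text differs on every call.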

2. Configure delays for different use cases

Now that we understand both human dialogue and the voice bot system, let us drill down into designing and configuring the bot in a better way. The system delays described above fall into 3 major parts:

  1. STT
  2. Telephony-Cloud Communication
  3. TTS

The Telephony and Cloud stacks are tightly integrated with each other, and with a reliable and fast NLP engine the delay for this part is negligible. The TTS delay is already well optimised by the caching mechanism explained above.

The clever part of the configuration lies in how we optimise the STT delay. Let's understand that process below:

Understanding STT detection

Use-case: The bot asks the user "Are you sure you want to place this order?" and the user responds with Yes/No.

  1. Initial delay: This delay occurs when the user hears the bot's response and takes time to process it before replying.
  2. Information delay: The time taken for the user to speak the complete response.
  3. Pauses: Pauses taken between words are counted as delay. A Yes/No answer has no pauses, whereas an address has several.
  4. Final delay: After the user has finished speaking, the time the bot waits before deciding that the response is complete and must be processed as a single audio file.

To design a good voice bot you must configure these parameters at the node level based on the question being asked.
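The interaction between these delays can be sketched with a small simulation. This is a simplified model, not the platform's detection logic: the function name, the representation of speech as `(start, end)` segments, and the assumption that all times are measured in seconds from the moment recording starts are hypothetical.

```python
def detect_end_of_response(segments, initial_silence, final_silence, max_duration):
    """Return (stop_time, reason) for one recording.

    segments: ordered (start, end) times, in seconds, during which the user speaks.
    Assumption: every time is measured from the moment recording starts.
    """
    # Nothing was said within the initial silence window: stop waiting.
    if not segments or segments[0][0] > initial_silence:
        return min(initial_silence, max_duration), "no_input"

    last_end = segments[0][1]
    for start, end in segments[1:]:
        if start - last_end >= final_silence:
            break  # the pause was long enough to treat the response as complete
        last_end = end

    stop = last_end + final_silence
    if stop > max_duration:
        return max_duration, "max_duration"  # recording is cut at the hard limit
    return stop, "final_silence"


# A quick "Yes": one short burst of speech.
yes_segments = [(1.0, 1.5)]
print(detect_end_of_response(yes_segments, 5, 0.5, 5))       # (2.0, 'final_silence')

# An address with pauses of 0.75 s, 0.75 s and 0.5 s between its parts.
address_segments = [(1.0, 2.0), (2.75, 4.0), (4.75, 6.0), (6.5, 7.5)]
print(detect_end_of_response(address_segments, 5, 0.5, 60))  # (2.5, 'final_silence')
print(detect_end_of_response(address_segments, 5, 1.5, 60))  # (9.0, 'final_silence')
```

With a final silence of 0.5 seconds the address is cut off after its first part, because the first pause is already longer than the final silence. Raising it to 1.5 seconds captures the whole address, at the cost of 1.5 seconds of extra waiting after the user finishes. This is the trade-off that has to be tuned per question.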

2.1 Configure for Yes/No response

STT engine: Microsoft and STT mode: Streaming

| Parameters | Description | Min Value | Max Value |
|---|---|---|---|
| Recording max duration | Maximum duration after which the recording stops; the user response won't be accepted beyond this time. | 5 seconds | 60 seconds |
| Initial silence duration | Acceptable silence duration before the bot user starts speaking. | 5 seconds | 10 seconds |
| Final silence duration | Acceptable silence duration after the bot user has started speaking, after which the bot processes the response (the final delay must be greater than the expected pauses). | 0.1 seconds | 5 seconds |
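As an illustration of node-level configuration, the sketch below checks a Yes/No node's settings against the minimum and maximum values listed above. The key names, the `validate_node_config` helper and the chosen values are hypothetical; they are not the platform's actual configuration schema or recommended settings.

```python
# Allowed ranges in seconds, copied from the parameters listed above.
LIMITS = {
    "recording_max_duration": (5, 60),
    "initial_silence_duration": (5, 10),
    "final_silence_duration": (0.1, 5),
}


def validate_node_config(config):
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    for name, (low, high) in LIMITS.items():
        if name not in config:
            problems.append(f"{name}: missing")
        elif not low <= config[name] <= high:
            problems.append(f"{name}: {config[name]} is outside {low}-{high} seconds")
    return problems


# Hypothetical settings for a Yes/No question: a short answer with no pauses,
# so the recording window and the final silence are kept small.
yes_no_node = {
    "stt_engine": "Microsoft",
    "stt_mode": "Streaming",
    "recording_max_duration": 5,
    "initial_silence_duration": 5,
    "final_silence_duration": 0.5,
}

print(validate_node_config(yes_no_node))  # []
print(validate_node_config({**yes_no_node, "final_silence_duration": 0.05}))
```

The second call reports the out-of-range final silence instead of silently accepting it, which is useful when many nodes are configured by hand.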