Understand delays in a conversation

A delay is the time between the end of the user's response and the start of the voice agent's response. For example, if the user asks "is there any discount?", the delay is the time the voice agent takes to process that line and respond to it.

The objective of the Telephony and Yellow cloud platform is to reduce the conversational delay and make the conversation more human-like.

Before configuring the voice agent to minimise the conversational delay without cutting off the user mid-sentence, let's understand how normal dialogue works.

When the response is a Yes/No value: The response is expected to be almost instantaneous and is usually very short.
When the response is an address: The response contains multiple pauses/gaps and takes longer to complete.

Ex: Door #1 < pause > Sector-D1 < pause > Kanpur Road < pause > Lucknow

When the response is a phone number: The pauses follow a pattern. There are a few of them, but the whole response is not very long.

Ex: 99-44-32-06-11 or +1-202-795-3213

1. Types of delays

Now that we have understood how delays work in normal dialogue, let's look at the other kinds of delay that are introduced in a voice agent conversation. With a clear idea of how these delays work, we can better optimize the conversation around them.

Delay (or perceived delay to the user) is the amount of time between the user completing the query/response and the agent voicing the next response.

STT delay: Once the user has responded and the telephony platform has received the audio file, the STT engine converts the audio to text. The time this conversion takes is the STT delay. It depends on the number of characters: a simple name takes less time to process than an address.

Ex: Audio-to-text conversion of "My name is Jake, what is my bank balance?"

note

The Yellow cloud platform defines a range of durations for which the telephony platform accepts the response. For example, if the user is responding with a phone number, we can ask the voice agent to record for at most 1 minute by setting the Recording max duration.

Telephony to Yellow cloud platform: The text converted from the audio is transported from the telephony platform to the Yellow cloud platform.
NLP engine response time within cloud platform: The text received on the Yellow cloud platform is sent to the NLP engine, which interprets the user's text response, decides how to continue the flow, and generates the voice agent's response.
Yellow cloud platform to telephony: The text response generated on the Yellow cloud platform is transported to the telephony platform.
TTS delay: This delay occurs while the TTS engine converts the agent's response text to audio so that it can be played to the user.
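
As a rough illustration of how the stages listed above add up, the following minimal Python sketch models one conversational turn. The class name, field names, and the latency values in the usage note are hypothetical and chosen only for illustration; they are not measurements or platform APIs.

```python
from dataclasses import dataclass, asdict


@dataclass
class TurnDelays:
    """Per-turn stage latencies in seconds (hypothetical, for illustration)."""
    stt: float                 # audio-to-text conversion on the STT engine
    telephony_to_cloud: float  # transport of the text to the cloud platform
    nlp: float                 # NLP engine response time
    cloud_to_telephony: float  # transport of the reply text back to telephony
    tts: float                 # text-to-audio conversion on the TTS engine

    def perceived_delay(self) -> float:
        """Sum of all stages: what the user experiences as the delay."""
        return sum(asdict(self).values())

    def slowest_stage(self) -> str:
        """Name of the stage that contributes the most to the delay."""
        stages = asdict(self)
        return max(stages, key=stages.get)
```

For example, `TurnDelays(stt=0.6, telephony_to_cloud=0.05, nlp=0.2, cloud_to_telephony=0.05, tts=0.4)` would report `stt` as the slowest stage, which is the kind of breakdown that helps decide where to optimize first.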

info

TTS delay optimisation using cache memory

Use case: Assume you own a voice agent that converses with 1000 users in a day and the conversation flow is mostly the same. Ex: The agent's first question will be "what is your name" and the next message will inform the user - "we have introduced a 0% intro APR offer on both purchases and transfers".

These repetitive messages, and the small audio files generated from them, stay the same across all calls, so the TTS delay is optimised with a cache. Instead of converting the NLP text response to speech again, the agent response is fetched from the cache database (present in the telephony platform), which skips the TTS delay.

The audio files generated from the first few user conversations are stored in the database and reused for the other calls, reducing the overall latency.
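
The caching idea described above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not the platform's implementation: `TTSCache`, `get_audio`, and the `synthesize` callable are hypothetical names, and the real cache lives in a database on the telephony platform.

```python
import hashlib
from typing import Callable, Dict


class TTSCache:
    """In-memory stand-in for the telephony platform's audio cache (hypothetical)."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize      # the slow TTS call (assumed interface)
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get_audio(self, text: str) -> bytes:
        # Key the cache on the exact response text.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1                 # TTS delay is skipped entirely
            return self._store[key]
        self.misses += 1                   # first occurrence: pay the TTS delay once
        audio = self._synthesize(text)
        self._store[key] = audio
        return audio
```

With a flow that repeats the same prompts across 1000 calls, only the first call for each distinct prompt reaches the TTS engine; every later call is served from the cache.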

2. Configure delays for different use cases

Finally, having understood how human dialogue and the voice agent system work, let us drill down into designing and configuring the agent in a much better way. The system delays have 3 major parts:

STT
Telephony-Cloud Communication
TTS

note

The Telephony and Cloud stacks are very tightly integrated with each other, and with a very reliable and fast NLP engine the delay for this section is negligible. TTS delay is already well optimized by the caching mechanism explained above.

Clever configuration lies in how we optimize the STT delay. Let's understand that process below:

Understanding STT detection

Use-case: The agent asks the user "Are you sure you want to place this order?" and the user responds with Yes/No.

Initial delay: The time the user takes to process the agent's response after hearing it and before replying.
Information delay: The time taken for the user to speak the complete response.
Pauses: The pauses between words count as delay. A Yes/No answer has no pauses, whereas an address has several.
Final delay: The time the agent waits after the user has finished speaking before it treats the response as complete and processes it as a single audio file.
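
To make the interaction between these four quantities concrete, here is a minimal Python simulation of when a recording would stop. It assumes the audio has already been split into alternating "speech" and "silence" segments; the function name, the segment format, and the stop reasons are hypothetical and are not part of the STT engine's API.

```python
from typing import List, Tuple

Segment = Tuple[str, float]  # ("speech" | "silence", duration in seconds)


def end_of_recording(
    segments: List[Segment],
    initial_silence: float,
    final_silence: float,
    max_duration: float,
) -> Tuple[str, float]:
    """Return (reason, time in seconds) at which the recording stops."""
    elapsed = 0.0
    spoken = False
    for kind, seconds in segments:
        if kind == "silence":
            # Before any speech the initial limit applies, afterwards the final one.
            limit = final_silence if spoken else initial_silence
            if seconds >= limit:
                stop = elapsed + limit
                if stop > max_duration:
                    return ("max_duration", max_duration)
                return ("final_silence" if spoken else "no_input", stop)
        else:
            spoken = True
        elapsed += seconds
        if elapsed >= max_duration:
            return ("max_duration", max_duration)
    return ("stream_ended", elapsed)
```

With an address-like input such as `[("silence", 1.0), ("speech", 1.0), ("silence", 0.8), ("speech", 1.5), ("silence", 2.0)]`, a final silence of 0.5 seconds stops the recording during the 0.8-second pause and cuts the user off, while a final silence of 1.0 second waits until the whole address has been spoken. This is why the final delay has to be longer than the pauses expected for the question being asked.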

note

To design a good voice agent you must configure these parameters at the node level based on the question being asked.

2.1 Configure for Yes/ No response

STT engine: Microsoft, STT mode: Streaming

Parameters	Description	Min Value	Max Value
Recording max duration	Maximum duration after which the recording stops - the user response won't be accepted beyond this time.	5 seconds	60 seconds
Initial silence duration	Acceptable silence duration before the user starts speaking.	5 seconds	10 seconds
Final silence duration	Acceptable silence duration after the user has started speaking, after which the agent processes the response (the final silence must be longer than the expected pauses).	0.1 seconds	5 seconds
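
The ranges in the table can be checked before a node is saved. The sketch below is a hypothetical helper, not a platform API: the dictionary keys, the function name, and the example values for a Yes/No node are assumptions chosen to fall inside the documented ranges.

```python
from typing import Dict, List

# (min, max) in seconds, taken from the table above.
LIMITS_SECONDS = {
    "recording_max_duration": (5.0, 60.0),
    "initial_silence_duration": (5.0, 10.0),
    "final_silence_duration": (0.1, 5.0),
}


def validate_node_config(
    config: Dict[str, float], longest_expected_pause: float = 0.0
) -> List[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    problems: List[str] = []
    for name, (low, high) in LIMITS_SECONDS.items():
        if name not in config:
            problems.append(f"{name} is missing")
        elif not low <= config[name] <= high:
            problems.append(f"{name}={config[name]} is outside {low}-{high} seconds")
    final_silence = config.get("final_silence_duration")
    # The final silence must be longer than the pauses expected in the answer.
    if final_silence is not None and final_silence <= longest_expected_pause:
        problems.append("final_silence_duration must exceed the longest expected pause")
    return problems


# Illustrative values for a Yes/No node: a short answer with no pauses.
yes_no_node = {
    "recording_max_duration": 5.0,
    "initial_silence_duration": 5.0,
    "final_silence_duration": 0.5,
}
```

Calling `validate_node_config(yes_no_node)` returns an empty list for these values. Passing a `longest_expected_pause` (for example, for an address question) flags a final silence that would cut the user off mid-response.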
