Detecting speech inputs

Gather’s automatic speech recognition (ASR) feature accepts both unstructured and structured speech input from users. Structured inputs, in the form of keywords and commands, suit use cases with a finite set of distinct operations for users to choose from, such as interactive voice response (IVR). Adding speech detection to DTMF-driven IVR menus can improve conversions by giving users an easier way to navigate menus, as in this first example.

Examples

Structured input with DTMF and speech

This menu accepts either a key press or a spoken command. inputType="dtmf speech" listens for both, and the input detected first is relayed to the action URL. The hints attribute biases the recognizer toward the exact phrases you expect, and speechModel="command_and_search" is tuned for short commands like these.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Gather action="https://your-domain.com/gather/menu"
            method="POST"
            inputType="dtmf speech"
            numDigits="1"
            speechModel="command_and_search"
            hints="New Appointment, Cancel Appointment"
            language="en-US"
            executionTimeout="15"
            speechEndTimeout="auto">
        <Speak>Press 1 or say New Appointment to schedule an appointment. Press 2 or say Cancel Appointment to cancel an existing appointment.</Speak>
    </Gather>
    <Speak>We didn't receive your input. Goodbye!</Speak>
    <Hangup/>
</Response>

On the action URL, read InputType to see what was detected, then branch on Digits (for dtmf) or Speech (for speech):

Action URL parameters

InputType=speech
Speech=New Appointment
SpeechConfidenceScore=0.92
Digits=
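
The branching described above could be sketched in Python as follows. This is a minimal illustration, not a definitive handler: it assumes your web framework has already parsed the POST body into a dictionary, and the routing URLs, the menu-start URL, the helper names, and the 0.6 confidence threshold are all hypothetical. Only InputType, Digits, Speech, and SpeechConfidenceScore come from the parameters listed above.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# Hypothetical follow-up URLs for each menu choice (placeholders).
ROUTES = {
    "new": "https://your-domain.com/appointments/new",
    "cancel": "https://your-domain.com/appointments/cancel",
}
DIGIT_CHOICES = {"1": "new", "2": "cancel"}
SPEECH_CHOICES = {"new appointment": "new", "cancel appointment": "cancel"}
MIN_CONFIDENCE = 0.6  # assumed threshold; tune it against real traffic


def resolve_choice(params):
    """Map the action URL parameters to "new", "cancel", or None."""
    input_type = params.get("InputType", "")
    if input_type == "dtmf":
        return DIGIT_CHOICES.get(params.get("Digits", "").strip())
    if input_type == "speech":
        try:
            confidence = float(params.get("SpeechConfidenceScore", "0"))
        except ValueError:
            confidence = 0.0
        if confidence < MIN_CONFIDENCE:
            return None  # treat low-confidence transcripts as no match
        return SPEECH_CHOICES.get(params.get("Speech", "").strip().lower())
    return None


def build_response(params):
    """Return the XML to send back for one action URL request."""
    response = Element("Response")
    choice = resolve_choice(params)
    if choice is None:
        SubElement(response, "Speak").text = "Sorry, I didn't understand that."
        # Hypothetical URL that serves the menu XML again.
        SubElement(response, "Redirect").text = "https://your-domain.com/gather/menu-start"
    else:
        SubElement(response, "Redirect").text = ROUTES[choice]
    return tostring(response, encoding="unicode")
```

Matching on a normalized, lowercased transcript keeps the speech branch aligned with the phrases passed in hints, while the confidence check sends uncertain transcripts back to the menu instead of guessing.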

Conversational AI with speech input

Real-time transcription of free-form inputs such as complete sentences, on the other hand, supports conversational, AI-driven experiences. Here inputType="speech" collects free-form speech, and interimSpeechResultsCallback streams partial transcripts to your server as the caller talks, which is useful for low-latency AI agents.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Gather action="https://your-domain.com/gather/conversation"
            method="POST"
            inputType="speech"
            speechModel="default"
            language="en-US"
            executionTimeout="30"
            speechEndTimeout="auto"
            interimSpeechResultsCallback="https://your-domain.com/gather/interim"
            profanityFilter="true">
        <Speak>Thanks for calling. How can I help you today?</Speak>
    </Gather>
    <Speak>Sorry, I didn't catch that. Let me connect you to an agent.</Speak>
    <Redirect>https://your-domain.com/gather/transfer</Redirect>
</Response>

An easy way to build conversational AI interfaces is to pass the transcribed speech received through the Gather XML element to an AI chatbot platform such as Google Dialogflow for NLP-based intent extraction. Also read about how the Vobiz Speak XML element’s SSML engine can make your bot’s responses sound natural.
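
One turn of such a conversation might look like the Python sketch below. It is illustrative only: detect_intent is a keyword-matching stand-in for a call to an NLP platform, the intent names and reply texts are invented, and the handler assumes the final transcript arrives in the Speech parameter as in the earlier action URL example. Each response speaks a reply inside a new Gather so the caller can keep talking.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# Gather attributes reused from the conversational example above.
GATHER_ATTRS = {
    "action": "https://your-domain.com/gather/conversation",
    "method": "POST",
    "inputType": "speech",
    "language": "en-US",
    "executionTimeout": "30",
    "speechEndTimeout": "auto",
}

# Hypothetical intents and replies, for illustration only.
REPLIES = {
    "new_appointment": "Sure, I can book that. What day works for you?",
    "cancel_appointment": "Okay. Which appointment would you like to cancel?",
    "unknown": "Sorry, could you say that another way?",
}


def detect_intent(transcript):
    """Stand-in for an NLP service call; uses simple keyword matching."""
    text = transcript.lower()
    if "cancel" in text:
        return "cancel_appointment"
    if "appointment" in text or "book" in text:
        return "new_appointment"
    return "unknown"


def conversation_turn(params):
    """Build the XML for the next turn from the action URL parameters."""
    intent = detect_intent(params.get("Speech", ""))
    response = Element("Response")
    gather = SubElement(response, "Gather", GATHER_ATTRS)
    SubElement(gather, "Speak").text = REPLIES[intent]
    return tostring(response, encoding="unicode")
```

In a real deployment, detect_intent would be replaced by a request to the chatbot platform, and the reply text would come from its response rather than a fixed table.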
