Vobiz AI Voice Agent via WebSockets

This guide is for developers who want to build custom AI voice agents directly on Vobiz. It covers the high-level architecture, the XML protocol, bit-level audio transformations, and the Python reference implementation.

Resources

GitHub repository: Vobiz-Python-Voice-API-Example
Primary language: Python 3.11+
Core frameworks: FastAPI (HTTP), websockets (WSS), pyngrok (tunneling)

Architecture and connectivity

Split-server model

The reference implementation runs two servers concurrently:

FastAPI server (port 5000) - handles HTTP webhooks (/answer, /hangup) and acts as a WebSocket proxy/gateway.
Agent server (port 5001) - a dedicated websockets server that maintains call state and audio stream processing.

Flow overview

The Vobiz Cloud receives a call to your phone number.
An HTTP POST webhook is fired to your /answer route (for example, via an ngrok tunnel).
Your server returns an XML <Stream> response back to Vobiz.
Vobiz initiates a WSS upgrade request to your specified WebSocket URL.
Your FastAPI server accepts the upgrade and proxies the WebSocket connection to the local agent server, which establishes the session.
A bidirectional audio stream begins between Vobiz and your agent server.
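
The proxy step in the flow above can be sketched as two concurrent "pump" loops, one per direction. This is a transport-agnostic illustration, not the code from the reference repository: `pump`, `relay`, and the `recv`/`send` callables are hypothetical names, and in a real proxy they would wrap the receive and send methods of the two WebSocket connections.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# A receiver returns the next frame, or None once its peer has closed.
Recv = Callable[[], Awaitable[Optional[bytes]]]
Send = Callable[[bytes], Awaitable[None]]


async def pump(recv: Recv, send: Send) -> None:
    """Forward frames from one side to the other until the source closes."""
    while True:
        frame = await recv()
        if frame is None:
            return
        await send(frame)


async def relay(vobiz_recv: Recv, vobiz_send: Send,
                agent_recv: Recv, agent_send: Send) -> None:
    """Run both directions concurrently; stop as soon as either side closes."""
    tasks = [
        asyncio.create_task(pump(vobiz_recv, agent_send)),
        asyncio.create_task(pump(agent_recv, vobiz_send)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raise any error from the direction that finished
```

Stopping both directions when either side closes keeps a half-open connection from lingering after the caller hangs up.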

Vobiz XML protocol

Vobiz uses a specialized XML structure to orchestrate calls. Return these responses from your /answer endpoint.

Binary audio stream (primary)

To initiate a bidirectional voice session, return the <Stream> tag from your /answer route. See the Stream XML reference for full details.
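
As a rough illustration, the response body could be built as shown here. The `<Response>` wrapper and the attribute names (`bidirectional`, `keepCallAlive`, `contentType`) are assumptions that this guide does not confirm, and `build_stream_response` is a hypothetical helper. Check every attribute against the Stream XML reference before relying on it.

```python
import xml.etree.ElementTree as ET


def build_stream_response(ws_url: str) -> str:
    """Return XML asking Vobiz to open an audio stream to ws_url.

    The element layout and attribute names are assumptions, not
    confirmed Vobiz API.
    """
    response = ET.Element("Response")  # assumed root element
    stream = ET.SubElement(
        response,
        "Stream",
        {
            "bidirectional": "true",                   # assumed attribute
            "keepCallAlive": "true",                   # assumed attribute
            "contentType": "audio/x-mulaw;rate=8000",  # assumed attribute
        },
    )
    stream.text = ws_url  # the WebSocket URL Vobiz should connect to
    return ET.tostring(response, encoding="unicode")
```

Your /answer route would return this string with an XML content type. Building the document with ElementTree rather than string formatting ensures the URL is escaped correctly.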

Handling hangups

Configure a global hangup URL in your Vobiz application settings, or specify it per-call in the REST API.

Log in to the Vobiz Console.
Navigate to Applications.
Edit your voice application.
Set the Hangup URL to https://<your-ngrok-url>/hangup.
Set Hangup Method to POST.

WebSocket event protocol

Once the WebSocket handshake succeeds, Vobiz and the agent exchange JSON frames.

Inbound events (Vobiz → agent)

Event	Description
`start`	Sent once at the beginning of the stream.
`media`	Sent every 20 ms while the caller is speaking.
`playedStream`	Sent after the agent’s audio reaches a `checkpoint`.
`stop`	Sent when the call ends or the stream is closed by Vobiz.
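
A minimal sketch of how an agent might dispatch these events. The event names come from the table above, but the frame layout (`event` key, base64 audio under `media.payload`) is an assumption, and `handle_inbound` and the `state` keys are hypothetical.

```python
import base64
import json
from typing import Optional


def handle_inbound(raw: str, state: dict) -> Optional[bytes]:
    """Update per-call state from one JSON frame.

    Returns decoded audio bytes for media frames, otherwise None.
    Assumed frame shape (not confirmed by this guide):
      {"event": "<name>", "media": {"payload": "<base64 mu-law audio>"}}
    """
    frame = json.loads(raw)
    event = frame.get("event")
    if event == "start":
        state["active"] = True
    elif event == "media":
        payload = frame.get("media", {}).get("payload", "")  # assumed fields
        return base64.b64decode(payload)
    elif event == "playedStream":
        # A checkpoint was reached; count it so callers can track progress.
        state["checkpoints_played"] = state.get("checkpoints_played", 0) + 1
    elif event == "stop":
        state["active"] = False
    return None
```

Unknown events fall through and return None, so the agent keeps running if Vobiz adds new event types.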

Outbound events (agent → Vobiz)

Event	Description
`playAudio`	Commands Vobiz to play sound to the caller.
`clearAudio`	Immediately stops all pending audio in the Vobiz buffer. Crucial for barge-in (interruption).
`checkpoint`	Inserts a marker in the stream - useful for tracking TTS delivery progress.
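
The three outbound frames might be built as follows. Only the event names are taken from the table above; the field names (`media`, `contentType`, `sampleRate`, `payload`, `name`) are assumptions, and the helper functions are hypothetical.

```python
import base64
import json


def play_audio(mulaw: bytes) -> str:
    """Build a playAudio frame carrying base64-encoded mu-law audio."""
    return json.dumps({
        "event": "playAudio",
        "media": {  # assumed field names
            "contentType": "audio/x-mulaw",
            "sampleRate": 8000,
            "payload": base64.b64encode(mulaw).decode("ascii"),
        },
    })


def clear_audio() -> str:
    """Build a clearAudio frame that drops audio still buffered at Vobiz."""
    return json.dumps({"event": "clearAudio"})


def checkpoint(name: str) -> str:
    """Build a checkpoint frame; "name" is an assumed field."""
    return json.dumps({"event": "checkpoint", "name": name})
```

A typical sequence is one or more `play_audio(...)` frames followed by `checkpoint("greeting-done")`, so that the matching `playedStream` event tells the agent the greeting has finished playing.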

Audio engineering

Telephone networks carry audio as G.711. TTS engines typically produce higher-fidelity PCM audio, which must be downsampled and companded before it is sent to the phone network.

G.711 mu-law (PCMU)

Sample rate: 8000 Hz (8 kHz)
Bit depth: 8-bit
Compression: logarithmic (companding)

Conversion pipeline (`agent.py`)

Synthesis - AI TTS tools often return 16-bit PCM at 24,000 Hz or similar.
Downsampling - use linear interpolation to drop from 24 kHz to 8 kHz.
Companding - each 16-bit linear sample is converted to an 8-bit mu-law byte using bitmasks and shift operations. The logarithmic curve gives quiet samples finer resolution than loud ones, which preserves speech intelligibility in 8 bits.
Packetization - audio is sent in 160-byte chunks, each representing exactly 20 ms of speech (8,000 samples per second at 1 byte per sample).
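
The four steps above can be sketched in pure Python. This is an illustration, not the code in `agent.py`: the function names are hypothetical, the encoder follows the widely used G.711 reference approach (bias 0x84, clip 32635), and padding the final frame with mu-law silence (0xFF) is a design choice made here.

```python
import struct
from typing import Sequence

BIAS = 0x84        # added before encoding so every magnitude has bit 7 set
CLIP = 32635       # largest magnitude that fits after the bias is added
FRAME_BYTES = 160  # 20 ms at 8000 Hz, one byte per sample


def resample_linear(samples: Sequence[int], src_rate: int, dst_rate: int) -> list[int]:
    """Resample by linear interpolation (no anti-aliasing filter is applied)."""
    if not samples:
        return []
    out = []
    for i in range(len(samples) * dst_rate // src_rate):
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


def linear_to_mulaw(sample: int) -> int:
    """Compand one signed 16-bit sample into one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # The exponent is the position of the highest set bit, counted from bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # G.711 inverts all bits


def packetize(mulaw: bytes, frame_bytes: int = FRAME_BYTES) -> list[bytes]:
    """Split into fixed-size frames, padding the last one with silence."""
    return [
        mulaw[start:start + frame_bytes].ljust(frame_bytes, b"\xff")
        for start in range(0, len(mulaw), frame_bytes)
    ]


def pcm16_to_mulaw_frames(pcm: bytes, src_rate: int = 24000) -> list[bytes]:
    """Little-endian 16-bit PCM in, list of 160-byte mu-law frames out."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    narrow = resample_linear(samples, src_rate, 8000)
    return packetize(bytes(linear_to_mulaw(s) for s in narrow))
```

Because 24 kHz is an exact multiple of 8 kHz, the interpolation here reduces to keeping every third sample. Linear interpolation does not remove content above 4 kHz, so a production pipeline may want a low-pass filter before this step.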

Outbound calls

Using `make_call.py`

The outbound script builds the request payload for the Vobiz Account/{id}/Call/ endpoint and sends it to place an outbound call.

# In one terminal
python server.py

# In a second terminal
python make_call.py --to +DestNumber

Number configuration

Ensure your Vobiz number is associated with an application in the portal that points to your public URLs. If you use make_call.py, the answer_url in the request overrides the portal defaults for that specific call.
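
For orientation, a request of this kind might be assembled as in the sketch below. Only the `Account/{id}/Call/` path and the `answer_url` field appear in this guide; the other field names (`from`, `to`, `answer_method`) are assumptions, `build_call_request` is a hypothetical helper, and the base URL and authentication are deliberately left out. Consult `make_call.py` and the API reference for the real payload.

```python
import json


def build_call_request(account_id: str, from_number: str,
                       to_number: str, answer_url: str) -> tuple[str, str]:
    """Return (relative path, JSON body) for an outbound call request."""
    path = f"Account/{account_id}/Call/"
    payload = {
        "from": from_number,      # assumed field name
        "to": to_number,          # assumed field name
        "answer_url": answer_url,  # overrides the portal default for this call
        "answer_method": "POST",  # assumed field name
    }
    return path, json.dumps(payload)
```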

Troubleshooting

The AI is talking over me / not stopping

Check the utterance_end_ms value in the Deepgram configuration in agent.py. A high value delays end-of-utterance detection, so the agent reacts to the caller late. Also ensure clearAudio is sent as soon as caller speech is detected while the agent is still playing audio.
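
One way to structure the barge-in decision is a small guard object, sketched below. `BargeInGuard` and its method names are hypothetical, the `clearAudio` frame shape is an assumption, and the sketch assumes a checkpoint is sent after each utterance so that `playedStream` signals the end of playback.

```python
import json
from typing import Optional


class BargeInGuard:
    """Tracks whether agent audio may still be playing at Vobiz."""

    def __init__(self) -> None:
        self.agent_speaking = False

    def on_agent_audio_queued(self) -> None:
        """Call after sending playAudio frames."""
        self.agent_speaking = True

    def on_played_stream(self) -> None:
        """Call when playedStream arrives for the utterance's final checkpoint."""
        self.agent_speaking = False

    def on_user_speech(self) -> Optional[str]:
        """Call when speech is detected; returns a frame to send, if any."""
        if self.agent_speaking:
            self.agent_speaking = False
            return json.dumps({"event": "clearAudio"})  # assumed frame shape
        return None
```

Send the returned frame before doing any other work, such as cancelling TTS generation, so that buffered audio stops as early as possible.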

"401 Unauthorized" in the logs

Ensure your NGROK_AUTH_TOKEN is set in your environment. pyngrok requires authentication for persistent tunnels and certain advanced features.

Why 20 ms chunks?

20 ms is the conventional packetization interval for G.711 audio in telephony. Larger chunks can cause jitter or robotic-sounding audio; smaller chunks create excessive network overhead for the Vobiz ingress nodes.
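
The trade-off is easy to quantify. The helper below is a hypothetical illustration that computes the raw frame size, the size after base64 encoding (assuming audio travels as base64 inside JSON frames), and the number of frames per second for a given chunk duration.

```python
import math

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1  # 8-bit mu-law


def frame_stats(frame_ms: int) -> dict:
    """Return size and rate figures for one chunk duration in milliseconds."""
    raw_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    return {
        "raw_bytes": raw_bytes,
        "base64_chars": 4 * math.ceil(raw_bytes / 3),  # base64 pads to groups of 4
        "frames_per_second": 1000 / frame_ms,
    }
```

For 20 ms this gives 160 raw bytes, 216 base64 characters, and 50 frames per second. Halving the chunk to 10 ms doubles the message count to 100 per second while each message carries only 80 bytes of audio.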

Next steps

Browse the VobizXML reference for stream and call control verbs.
Configure inbound applications via the Applications API.
Track calls with the recording exports.
