This guide is for developers who want to build custom AI voice agents directly on Vobiz. It covers the high-level architecture, the XML protocol, bit-level audio transformations, and the Python reference implementation.
Resources
- GitHub repository: Vobiz-Python-Voice-API-Example
- Primary language: Python 3.11+
- Core frameworks: FastAPI (HTTP), `websockets` (WSS), `pyngrok` (tunneling)
Architecture and connectivity
Split-server model
The reference implementation runs two servers concurrently:
- FastAPI server (port 5000) - handles HTTP webhooks (`/answer`, `/hangup`) and acts as a WebSocket proxy/gateway.
- Agent server (port 5001) - a dedicated `websockets` server that maintains call state and audio stream processing.
Flow overview
- The Vobiz Cloud receives a call to your phone number.
- An HTTP `POST` webhook is fired to your `/answer` route (for example, via an ngrok tunnel).
- Your server returns an XML `<Stream>` response back to Vobiz.
- Vobiz initiates a WSS upgrade request to your specified WebSocket URL.
- Your FastAPI WebSocket proxy forwards the connection to the local agent server to establish the session.
- A bidirectional audio stream begins between Vobiz and your agent server.
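The proxy step in the flow above can be sketched as two forwarding loops, one per direction. This is a minimal illustration, not the reference implementation: `pump` and `relay` are hypothetical names, and the connection objects are assumed to expose async `recv()` and `send()` methods, with `recv()` returning `None` once the peer has closed. Real WebSocket libraries usually raise an exception on close instead, so adapt the end-of-stream check accordingly.

```python
import asyncio


async def pump(src, dst):
    """Forward frames from src to dst until src signals end of stream."""
    while True:
        frame = await src.recv()
        if frame is None:  # assumption: None marks a closed connection
            return
        await dst.send(frame)


async def relay(vobiz_conn, agent_conn):
    """Run both directions at once and stop as soon as either side closes."""
    tasks = [
        asyncio.create_task(pump(vobiz_conn, agent_conn)),
        asyncio.create_task(pump(agent_conn, vobiz_conn)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the other direction has nothing left to talk to
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raise any error from the direction that finished
```

Stopping both directions when either side closes keeps a dead call from holding an agent session open.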
Vobiz XML protocol
Vobiz uses a specialized XML structure to orchestrate calls. Return these responses from your `/answer` endpoint.
Binary audio stream (primary)
To initiate a bidirectional voice session, return the `<Stream>` tag from your `/answer` route. See the Stream XML reference for full details.
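As a rough sketch, the response body could be built with the standard library as shown here. Only the `<Stream>` tag comes from this guide. The `<Response>` wrapper, the attribute names (`bidirectional`, `keepCallAlive`, `contentType`), and the helper name `build_stream_response` are assumptions for illustration, so check them against the Stream XML reference before relying on them.

```python
import xml.etree.ElementTree as ET


def build_stream_response(ws_url: str) -> str:
    """Return an XML document that asks Vobiz to open a stream to ws_url."""
    response = ET.Element("Response")  # assumed root element
    stream = ET.SubElement(
        response,
        "Stream",
        {
            # Assumed attribute names; verify against the Stream XML reference.
            "bidirectional": "true",
            "keepCallAlive": "true",
            "contentType": "audio/x-mulaw;rate=8000",
        },
    )
    stream.text = ws_url  # ElementTree escapes characters such as "&"
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Return the string from the `/answer` route with an XML content type. Building the document with `ElementTree` rather than string formatting avoids malformed XML when the WebSocket URL contains query parameters.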
Handling hangups
Configure a global hangup URL in your Vobiz application settings, or specify it per call in the REST API.
- Log in to the Vobiz Console.
- Navigate to Applications.
- Edit your voice application.
- Set the Hangup URL to `https://<your-ngrok-url>/hangup`.
- Set Hangup Method to `POST`.
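Once the hangup webhook is configured, the `/hangup` route typically just releases whatever state the agent holds for the call. The sketch below assumes the webhook body is form-encoded and carries a call identifier in a field named `CallUUID`. Both are assumptions, and `handle_hangup` and the `sessions` dictionary are hypothetical names.

```python
from typing import Optional
from urllib.parse import parse_qs


def handle_hangup(body: bytes, sessions: dict) -> Optional[str]:
    """Drop per-call state when Vobiz reports that a call has ended."""
    fields = parse_qs(body.decode("utf-8"))
    # "CallUUID" is a hypothetical field name; inspect a real webhook to confirm.
    call_id = fields.get("CallUUID", [None])[0]
    if call_id is not None:
        sessions.pop(call_id, None)  # tolerate duplicate or late hangups
    return call_id
```

Using `pop` with a default keeps the handler idempotent if the webhook is delivered more than once.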
WebSocket event protocol
Once the WebSocket handshake succeeds, Vobiz and the agent exchange JSON frames.
Inbound events (Vobiz → agent)
| Event | Description |
|---|---|
| `start` | Sent once at the beginning of the stream. |
| `media` | Sent every 20 ms while the caller is speaking. |
| `playedStream` | Sent after the agent’s audio reaches a checkpoint. |
| `stop` | Sent when the call ends or the stream is closed by Vobiz. |
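A dispatcher for these inbound frames might look like the following sketch. The event names come from the table above. The payload layout is an assumption: it presumes `media` frames carry base64 audio under `media.payload` and that `playedStream` frames carry the checkpoint label under `name`. `handle_inbound` and the `state` dictionary are hypothetical.

```python
import base64
import json


def handle_inbound(raw: str, state: dict) -> str:
    """Update per-call state from one inbound JSON frame; return the event name."""
    frame = json.loads(raw)
    event = frame.get("event", "")
    if event == "start":
        state["active"] = True
    elif event == "media":
        # Assumed layout: {"event": "media", "media": {"payload": "<base64>"}}
        payload = frame.get("media", {}).get("payload", "")
        state.setdefault("audio", bytearray()).extend(base64.b64decode(payload))
    elif event == "playedStream":
        # Assumed field: "name" identifies the checkpoint that was reached.
        state.setdefault("played", []).append(frame.get("name"))
    elif event == "stop":
        state["active"] = False
    # Unknown events are ignored so that new event types do not crash the agent.
    return event
```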
Outbound events (agent → Vobiz)
| Event | Description |
|---|---|
| `playAudio` | Commands Vobiz to play sound to the caller. |
| `clearAudio` | Immediately stops all pending audio in the Vobiz buffer. Crucial for barge-in (interruption). |
| `checkpoint` | Inserts a marker in the stream - useful for tracking TTS delivery progress. |
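The outbound frames can be produced by small builder functions, sketched below. The event names are taken from the table above. The remaining field names (`media.contentType`, `media.sampleRate`, `media.payload`, `name`) and the function names are assumptions, so verify them against the Vobiz stream reference.

```python
import base64
import json


def play_audio(mulaw: bytes) -> str:
    """Build a playAudio frame for one chunk of 8 kHz mu-law audio."""
    return json.dumps({
        "event": "playAudio",
        "media": {  # assumed field layout
            "contentType": "audio/x-mulaw",
            "sampleRate": 8000,
            "payload": base64.b64encode(mulaw).decode("ascii"),
        },
    })


def clear_audio() -> str:
    """Build a clearAudio frame that drops audio still queued on the Vobiz side."""
    return json.dumps({"event": "clearAudio"})


def checkpoint(name: str) -> str:
    """Build a checkpoint frame; 'name' is an assumed label field."""
    return json.dumps({"event": "checkpoint", "name": name})
```

Sending a `checkpoint` after the last `playAudio` frame of an utterance lets the agent learn, through the matching `playedStream` event, when the caller has actually heard it.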
Audio engineering
Telephony uses the G.711 standard. Modern AI produces high-fidelity PCM audio, which must be downsampled and companded for the phone network.
G.711 mu-law (PCMU)
- Sample rate: 8000 Hz (8 kHz)
- Bit depth: 8-bit
- Compression: logarithmic (companding)
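To make the companding concrete, here is a sketch of the widely used bias-and-segment formulation of G.711 mu-law encoding. It is not taken from `agent.py`, and the function names are illustrative. Each 16-bit sample becomes a sign bit, a 3-bit exponent (segment), and a 4-bit mantissa, and the byte is inverted before transmission.

```python
import struct

BIAS = 0x84   # 132, added so every magnitude has a set bit at position 7 or above
CLIP = 32635  # largest magnitude that still fits after the bias is added


def linear_to_mulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law value."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # The exponent is the position of the highest set bit, counted from bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_pcm16(pcm: bytes) -> bytes:
    """Encode little-endian 16-bit PCM bytes as mu-law bytes."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return bytes(linear_to_mulaw(s) for s in samples)
```

Digital silence (sample value 0) encodes to `0xFF`, which is why padding bytes in mu-law streams are usually `0xFF` rather than `0x00`. Python versions before 3.13 also ship `audioop.lin2ulaw`, which performs the same conversion.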
Conversion pipeline (agent.py)
- Synthesis - AI TTS tools often return 16-bit PCM at 24,000 Hz or similar.
- Downsampling - use linear interpolation to drop from 24 kHz to 8 kHz.
- Companding - each 16-bit linear sample is converted to an 8-bit mu-law byte using bitmasks and shift operations; the logarithmic curve gives quiet samples finer resolution than loud ones.
- Packetization - audio is sent in 160-byte chunks, representing exactly 20 ms of speech.
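The downsampling and packetization steps can be sketched as follows. This is an illustration rather than the code in `agent.py`, and the function names are hypothetical. Companding sits between the two functions and is omitted here. Note that plain linear interpolation applies no low-pass filter, so content above 4 kHz can alias into the output; a production pipeline may want a proper resampler.

```python
FRAME_MS = 20
SAMPLE_RATE = 8000
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000  # 160 bytes at one byte per sample
MULAW_SILENCE = 0xFF  # mu-law encoding of a zero-valued sample


def downsample_linear(samples, src_rate=24000, dst_rate=8000):
    """Resample a list of integer samples by linear interpolation."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


def packetize(mulaw: bytes, frame_bytes: int = FRAME_BYTES):
    """Split mu-law audio into fixed-size frames, padding the last with silence."""
    frames = []
    for start in range(0, len(mulaw), frame_bytes):
        chunk = mulaw[start:start + frame_bytes]
        if len(chunk) < frame_bytes:
            chunk += bytes([MULAW_SILENCE]) * (frame_bytes - len(chunk))
        frames.append(chunk)
    return frames
```

The frame size follows from the format: 8000 samples per second, one byte per sample, and 20 ms per frame give 8000 × 0.02 = 160 bytes.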
Outbound calls
Using make_call.py
The outbound script builds the request payload for the Vobiz `Account/{id}/Call/` endpoint and submits it to place an outbound call.
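A request of this kind could be assembled as in the sketch below, which is not the contents of `make_call.py`. The path `Account/{id}/Call/` and the `answer_url` field appear in this guide. The base URL argument, the `from` and `to` field names, the `answer_method` field, and the authentication headers are assumptions that the caller must supply from the Vobiz API reference.

```python
import json
import urllib.request


def build_call_request(api_base, account_id, from_number, to_number,
                       answer_url, auth_headers=None):
    """Build (but do not send) a POST request that places an outbound call."""
    url = f"{api_base.rstrip('/')}/Account/{account_id}/Call/"
    payload = {
        "from": from_number,      # assumed field name
        "to": to_number,          # assumed field name
        "answer_url": answer_url,
        "answer_method": "POST",  # assumed field name
    }
    headers = {"Content-Type": "application/json"}
    headers.update(auth_headers or {})  # credentials per the Vobiz API reference
    return urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), headers=headers, method="POST"
    )
```

Pass the result to `urllib.request.urlopen` to actually place the call. Keeping construction separate from sending makes the payload easy to inspect and log.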
Number configuration
Ensure your Vobiz number is associated with an application in the portal that points to your public URLs. If you use `make_call.py`, the `answer_url` in the request overrides the portal defaults for that specific call.
Troubleshooting
The AI is talking over me / not stopping
Check the `utterance_end_ms` value in the Deepgram configuration in `agent.py`. If it’s too high, silence detection is slower. Also ensure `clearAudio` is sent immediately upon detecting user intent.
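The barge-in behavior described here can be sketched as a small controller that drops locally queued audio and sends `clearAudio` the moment caller speech is detected. `BargeInController` and its methods are hypothetical, `send` stands for whatever function writes a text frame to the Vobiz WebSocket, and the bare `{"event": "clearAudio"}` frame shape is an assumption.

```python
import json


class BargeInController:
    """Tracks queued agent audio and interrupts it when the caller speaks."""

    def __init__(self, send):
        self._send = send            # callable that writes one frame to Vobiz
        self.pending = []            # frames generated but not yet sent
        self.agent_speaking = False

    def queue_tts(self, frames):
        """Record that the agent has audio to play."""
        self.pending.extend(frames)
        self.agent_speaking = True

    def on_user_speech(self):
        """Call as soon as speech is detected; returns True if audio was cut."""
        if not self.agent_speaking:
            return False
        self.pending.clear()  # stop feeding more audio from the local side
        self._send(json.dumps({"event": "clearAudio"}))  # flush the Vobiz buffer
        self.agent_speaking = False
        return True
```

Clearing the local queue before sending `clearAudio` matters: otherwise the send loop can refill the Vobiz buffer right after it was flushed.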
“401 Unauthorized” in the logs
Ensure your `NGROK_AUTH_TOKEN` is set in your environment. `pyngrok` requires authentication for persistent tunnels and certain advanced features.
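One way to surface this problem early is to validate the variable before opening the tunnel, as in the sketch below. The helper names are hypothetical. `ngrok.set_auth_token` and `ngrok.connect` are `pyngrok` calls, imported lazily so the check itself has no third-party dependency. Port 5000 matches the FastAPI server described earlier.

```python
import os


def require_ngrok_token(env=None) -> str:
    """Return the ngrok token or fail with a clear message."""
    env = os.environ if env is None else env
    token = env.get("NGROK_AUTH_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "NGROK_AUTH_TOKEN is not set; export it before starting the server"
        )
    return token


def open_tunnel(port: int = 5000) -> str:
    """Open an ngrok tunnel to the local HTTP server and return its public URL."""
    from pyngrok import ngrok  # third-party; imported only when a tunnel is needed

    ngrok.set_auth_token(require_ngrok_token())
    return ngrok.connect(addr=str(port)).public_url
```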
Why 20 ms chunks?
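20 ms is the conventional packetization interval for G.711 telephony audio; at 8 kHz and one byte per sample, it corresponds to 160 bytes per frame. Larger chunks cause jitter or robotic audio; smaller chunks create excessive network overhead for the Vobiz ingress nodes.
Next steps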
- Browse the VobizXML reference for stream and call control verbs.
- Configure inbound applications via the Applications API.
- Track calls with the recording exports.