This guide is for developers who want to build custom AI voice agents directly on Vobiz. It covers the high-level architecture, the XML protocol, bit-level audio transformations, and the Python reference implementation.

Resources

  • GitHub repository: Vobiz-Python-Voice-API-Example
  • Primary language: Python 3.11+
  • Core frameworks: FastAPI (HTTP), websockets (WSS), pyngrok (tunneling)

Architecture and connectivity

Split-server model

The reference implementation runs two servers concurrently:
  • FastAPI server (port 5000) - handles HTTP webhooks (/answer, /hangup) and acts as a WebSocket proxy/gateway.
  • Agent server (port 5001) - a dedicated websockets server that maintains call state and audio stream processing.

Flow overview

  1. The Vobiz Cloud receives a call to your phone number.
  2. An HTTP POST webhook is fired to your /answer route (for example, via an ngrok tunnel).
  3. Your server returns an XML <Stream> response to Vobiz.
  4. Vobiz initiates a WSS upgrade request to your specified WebSocket URL.
  5. Your FastAPI server proxies the WebSocket connection to the agent server running locally, which establishes the session.
  6. A bidirectional audio stream begins between Vobiz and your agent server.

Vobiz XML protocol

Vobiz uses a specialized XML structure to orchestrate calls. Return these responses from your /answer endpoint.

Binary audio stream (primary)

To initiate a bidirectional voice session, return the <Stream> tag from your /answer route. See the Stream XML reference for full details.
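
As a rough sketch of what such a response might look like, the snippet below builds the XML with Python's standard library. The `<Response>` wrapper, the attribute names (`bidirectional`, `keepCallAlive`, `contentType`) and the helper name `build_stream_xml` are assumptions made for illustration and are not confirmed by this guide; the Stream XML reference is the authoritative source.

```python
import xml.etree.ElementTree as ET


def build_stream_xml(ws_url: str) -> str:
    """Return an XML document asking Vobiz to open an audio stream to ws_url."""
    response = ET.Element("Response")  # assumed root element
    stream = ET.SubElement(
        response,
        "Stream",
        {
            # Assumed attribute names and values; confirm in the Stream XML reference.
            "bidirectional": "true",
            "keepCallAlive": "true",
            "contentType": "audio/x-mulaw;rate=8000",
        },
    )
    stream.text = ws_url  # WebSocket URL that Vobiz should connect to
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

An /answer handler would return `build_stream_xml("wss://<your-ngrok-url>/ws")` with an XML content type; the `/ws` path here is a placeholder.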

Handling hangups

Configure a global hangup URL in your Vobiz application settings, or specify one per call through the REST API. To set the global URL:
  1. Log in to the Vobiz Console.
  2. Navigate to Applications.
  3. Edit your voice application.
  4. Set the Hangup URL to https://<your-ngrok-url>/hangup.
  5. Set Hangup Method to POST.

WebSocket event protocol

Once the WebSocket handshake succeeds, Vobiz and the agent exchange JSON frames.

Inbound events (Vobiz → agent)

| Event | Description |
| --- | --- |
| start | Sent once at the beginning of the stream. |
| media | Sent every 20 ms while the caller is speaking. |
| playedStream | Sent after the agent’s audio reaches a checkpoint. |
| stop | Sent when the call ends or the stream is closed by Vobiz. |

Outbound events (agent → Vobiz)

| Event | Description |
| --- | --- |
| playAudio | Commands Vobiz to play sound to the caller. |
| clearAudio | Immediately stops all pending audio in the Vobiz buffer. Crucial for barge-in (interruption). |
| checkpoint | Inserts a marker in the stream - useful for tracking TTS delivery progress. |
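
The outbound frames could be built with small helpers like the ones below. Only the event names are taken from this guide; the field names (`media`, `contentType`, `sampleRate`, `payload`, `streamId`, `name`) and the helper functions themselves are illustrative assumptions, so treat this as a sketch and not as the documented wire format.

```python
import base64
import json


def play_audio(mulaw: bytes) -> str:
    """Build a playAudio frame carrying 8 kHz mu-law audio."""
    return json.dumps({
        "event": "playAudio",
        "media": {  # assumed field layout
            "contentType": "audio/x-mulaw",
            "sampleRate": 8000,
            "payload": base64.b64encode(mulaw).decode("ascii"),
        },
    })


def clear_audio(stream_id: str) -> str:
    """Build a clearAudio frame that drops audio still queued for playback."""
    return json.dumps({"event": "clearAudio", "streamId": stream_id})


def checkpoint(stream_id: str, name: str) -> str:
    """Build a checkpoint frame; name is a label chosen by the agent."""
    return json.dumps({"event": "checkpoint", "streamId": stream_id, "name": name})
```

Each helper returns a JSON string that can be sent as a WebSocket text frame.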

Audio engineering

Telephony uses the G.711 standard. Modern AI speech synthesis produces high-fidelity PCM audio, which must be downsampled and companded for the phone network.

G.711 mu-law (PCMU)

  • Sample rate: 8000 Hz (8 kHz)
  • Bit depth: 8-bit
  • Compression: logarithmic (companding)

Conversion pipeline (agent.py)

  1. Synthesis - AI TTS tools often return 16-bit PCM at 24,000 Hz or similar.
  2. Downsampling - use linear interpolation to drop from 24 kHz to 8 kHz.
  3. Companding - each 16-bit linear sample is converted to an 8-bit mu-law byte using bitmasks and shift operations, giving quiet samples finer resolution than loud ones.
  4. Packetization - audio is sent in 160-byte chunks, representing exactly 20 ms of speech.
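
The steps above can be sketched in pure Python as follows. This is not the code from agent.py: the function names are hypothetical, the input is assumed to be little-endian 16-bit mono PCM, and the encoder follows the commonly used G.711 mu-law segment search. Padding the last frame with mu-law silence (0xFF) is also an assumption.

```python
import struct

MULAW_BIAS = 0x84      # 132, added before locating the segment
MULAW_CLIP = 32635     # largest magnitude that still fits after adding the bias
FRAME_BYTES = 160      # 20 ms at 8000 samples/s, one byte per sample
MULAW_SILENCE = 0xFF   # mu-law encoding of a zero sample


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation between neighbouring input samples."""
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the input signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


def linear_to_mulaw(sample: int) -> int:
    """Compress one signed 16-bit sample into one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), MULAW_CLIP) + MULAW_BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):  # find the highest set bit
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # mu-law bytes are inverted


def pcm16_to_mulaw_frames(pcm: bytes, src_rate: int = 24000) -> list[bytes]:
    """Convert 16-bit PCM into 160-byte mu-law frames of 20 ms each."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    encoded = bytes(linear_to_mulaw(s) for s in resample_linear(samples, src_rate, 8000))
    frames = [encoded[i:i + FRAME_BYTES] for i in range(0, len(encoded), FRAME_BYTES)]
    if frames and len(frames[-1]) < FRAME_BYTES:
        frames[-1] = frames[-1].ljust(FRAME_BYTES, bytes([MULAW_SILENCE]))
    return frames
```

Because 24 kHz is an exact multiple of 8 kHz, the interpolation here reduces to keeping every third sample. Linear interpolation applies no low-pass filter, so content above 4 kHz can alias into the output; a filtered resampler may sound cleaner if that becomes audible.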

Outbound calls

Using make_call.py

The outbound script builds the request payload for the Vobiz Account/{id}/Call/ endpoint and places the call. Start the server first, then run the script in a second terminal:
# In one terminal
python server.py

# In a second terminal
python make_call.py --to +DestNumber

Number configuration

Ensure your Vobiz number is associated with an application in the portal that points to your public URLs. If you use make_call.py, the answer_url in the request overrides the portal defaults for that specific call.

Troubleshooting

The AI is talking over me / not stopping

Check the utterance_end_ms value in the Deepgram configuration in agent.py. If it is too high, end-of-speech detection is slower. Also ensure clearAudio is sent as soon as the caller starts speaking over the agent.

“401 Unauthorized” in the logs

Ensure your NGROK_AUTH_TOKEN is set in your environment. pyngrok requires authentication for persistent tunnels and certain advanced features.

Why 20 ms chunks?

20 ms is the conventional frame length for G.711 audio in telephony. Larger chunks add latency and can cause jitter or robotic audio; smaller chunks create excessive network overhead for the Vobiz ingress nodes.
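
The trade-off is easy to quantify. The sketch below computes, for a given frame length, the raw mu-law bytes per frame, the size of that payload once base64-encoded, and the number of frames per second. It assumes the audio travels base64-encoded inside JSON frames; `frame_stats` is a hypothetical helper.

```python
import math

SAMPLE_RATE = 8000     # G.711 samples per second
BYTES_PER_SAMPLE = 1   # mu-law stores one byte per sample


def frame_stats(frame_ms: int) -> tuple[int, int, float]:
    """Return (raw bytes, base64 characters, frames per second)."""
    raw = SAMPLE_RATE * BYTES_PER_SAMPLE * frame_ms // 1000
    b64 = 4 * math.ceil(raw / 3)  # base64 emits 4 characters per 3 input bytes
    return raw, b64, 1000 / frame_ms
```

For example, `frame_stats(20)` returns `(160, 216, 50.0)`, while `frame_stats(10)` returns `(80, 108, 100.0)`: halving the frame length doubles the number of messages per second, each carrying the same fixed JSON and WebSocket framing cost.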

Next steps