# Vobiz AI Voice Agent via WebSockets

> Custom AI voice agents on Vobiz using raw WebSocket audio streams - XML protocol, PCM transformations, and Python reference server for 130+ countries.

This guide is for developers who want to build custom AI voice agents directly on Vobiz. It covers the high-level architecture, the XML protocol, bit-level audio transformations, and the Python reference implementation.

## Resources

* **GitHub repository:** [Vobiz-Python-Voice-API-Example](https://github.com/vobiz-ai/Vobiz-Python-Voice-API-Example)
* **Primary language:** Python 3.11+
* **Core frameworks:** FastAPI (HTTP), `websockets` (WSS), `pyngrok` (tunneling)

## Architecture and connectivity

### Split-server model

The reference implementation runs two servers concurrently:

* **FastAPI server (port 5000)** - handles HTTP webhooks (`/answer`, `/hangup`) and acts as a WebSocket proxy/gateway.
* **Agent server (port 5001)** - a dedicated `websockets` server that maintains call state and audio stream processing.

### Flow overview

1. The Vobiz Cloud receives a call to your phone number.
2. An HTTP `POST` webhook is fired to your `/answer` route (for example, via an ngrok tunnel).
3. Your server returns an XML `<Stream>` response back to Vobiz.
4. Vobiz initiates a WSS upgrade request to your specified WebSocket URL.
5. The FastAPI server accepts the WebSocket connection and proxies it to the local agent server (port 5001) to establish the session.
6. A bidirectional audio stream begins between Vobiz and your agent server.
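
Step 5 amounts to relaying frames in both directions until either side closes. The sketch below shows one library-agnostic way to do that. The `pump` and `proxy` names are hypothetical, and the receive/send arguments stand in for whatever your WebSocket library exposes; the reference repository may structure this differently.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

Send = Callable[[str], Awaitable[None]]


async def pump(source: AsyncIterator[str], send: Send) -> None:
    """Forward every frame from one side to the other until the source closes."""
    async for frame in source:
        await send(frame)


async def proxy(vobiz_rx: AsyncIterator[str], vobiz_send: Send,
                agent_rx: AsyncIterator[str], agent_send: Send) -> None:
    """Relay frames in both directions; stop when either side finishes."""
    tasks = [
        asyncio.create_task(pump(vobiz_rx, agent_send)),  # Vobiz -> agent
        asyncio.create_task(pump(agent_rx, vobiz_send)),  # agent -> Vobiz
    ]
    _, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # tear down the other direction once one side is gone
```

Running the two directions as separate tasks keeps caller audio flowing to the agent even while the agent is streaming a long response back.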

## Vobiz XML protocol

Vobiz uses a specialized XML structure to orchestrate calls. Return these responses from your `/answer` endpoint.

### Binary audio stream (primary)

To initiate a bidirectional voice session, return the `<Stream>` tag from your `/answer` route. See the [Stream XML reference](/xml/stream) for full details.
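
As a rough illustration, the response body can be built with the standard library. This is a sketch under assumptions: the `<Response>` root element, the WebSocket URL as element text, and the `bidirectional` attribute in the usage note are not confirmed by this guide, so check them against the Stream XML reference. `build_stream_response` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_stream_response(ws_url: str, **stream_attrs: str) -> str:
    """Return an XML document containing a single <Stream> element."""
    # ASSUMPTION: <Stream> is wrapped in a <Response> root and carries the
    # WebSocket URL as its text. Verify against the Stream XML reference.
    response = ET.Element("Response")
    stream = ET.SubElement(response, "Stream", stream_attrs)
    stream.text = ws_url
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

For example, `build_stream_response("wss://example.invalid/ws", bidirectional="true")` produces a document your `/answer` route could return; the attribute name is illustrative only.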

### Handling hangups

Configure a global hangup URL in your Vobiz application settings, or specify it per-call in the REST API.

1. Log in to the [Vobiz Console](https://console.vobiz.ai).
2. Navigate to **Applications**.
3. Edit your voice application.
4. Set the **Hangup URL** to `https://<your-ngrok-url>/hangup`.
5. Set **Hangup Method** to `POST`.

## WebSocket event protocol

Once the WebSocket handshake succeeds, Vobiz and the agent exchange JSON frames.

### Inbound events (Vobiz → agent)

| Event          | Description                                               |
| -------------- | --------------------------------------------------------- |
| `start`        | Sent once at the beginning of the stream.                 |
| `media`        | Sent every 20 ms while the caller is speaking.            |
| `playedStream` | Sent after the agent's audio reaches a `checkpoint`.      |
| `stop`         | Sent when the call ends or the stream is closed by Vobiz. |
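
A handler for these events can be a simple dispatch on the event name. The sketch below assumes each frame is a JSON object with the name under an `event` key and that `media` frames carry base64-encoded audio under `media.payload`. Only the event names come from the table above; the field layout and the `handle_frame` helper are assumptions to verify against real frames.

```python
import base64
import json


def handle_frame(raw: str, state: dict) -> str:
    """Update per-call state from one inbound frame and return the event name."""
    frame = json.loads(raw)
    event = frame.get("event")  # ASSUMPTION: event name lives under "event"
    if event == "start":
        state["started"] = True
    elif event == "media":
        # ASSUMPTION: caller audio is base64-encoded under media.payload
        payload = frame.get("media", {}).get("payload", "")
        state.setdefault("audio", bytearray()).extend(base64.b64decode(payload))
    elif event == "playedStream":
        state["checkpoints_played"] = state.get("checkpoints_played", 0) + 1
    elif event == "stop":
        state["started"] = False
    else:
        event = "ignored"  # unknown events are skipped rather than raising
    return event
```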

### Outbound events (agent → Vobiz)

| Event        | Description                                                                                       |
| ------------ | ------------------------------------------------------------------------------------------------- |
| `playAudio`  | Commands Vobiz to play sound to the caller.                                                       |
| `clearAudio` | Immediately stops all pending audio in the Vobiz buffer. Crucial for **barge-in** (interruption). |
| `checkpoint` | Inserts a marker in the stream - useful for tracking TTS delivery progress.                       |
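
The outbound frames can be produced by small builder functions. In this sketch only the event names are taken from the table above; every other field (`media`, `contentType`, `sampleRate`, `payload`, `name`) is a hypothetical layout chosen for illustration, so confirm the real schema before relying on it.

```python
import base64
import json


def play_audio(mulaw: bytes) -> str:
    """Build a playAudio frame carrying mu-law audio."""
    return json.dumps({
        "event": "playAudio",
        # ASSUMPTION: field names and values below are illustrative only
        "media": {
            "contentType": "audio/x-mulaw",
            "sampleRate": 8000,
            "payload": base64.b64encode(mulaw).decode("ascii"),
        },
    })


def clear_audio() -> str:
    """Build a clearAudio frame to flush audio still queued for playback."""
    return json.dumps({"event": "clearAudio"})


def checkpoint(name: str) -> str:
    """Build a checkpoint frame; the "name" field is an assumption."""
    return json.dumps({"event": "checkpoint", "name": name})
```

A typical sequence would be one or more `play_audio(...)` frames followed by `checkpoint("greeting-done")`, so that the matching `playedStream` event tells the agent when the caller has actually heard the greeting.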

## Audio engineering

Telephone networks carry audio as **G.711**. TTS engines typically produce higher-rate linear PCM, which must be downsampled and companded before it reaches the phone network.

### G.711 mu-law (PCMU)

* **Sample rate:** 8000 Hz (8 kHz)
* **Bit depth:** 8-bit
* **Compression:** logarithmic (companding)

### Conversion pipeline (`agent.py`)

1. **Synthesis** - AI TTS tools often return 16-bit PCM at 24,000 Hz or similar.
2. **Downsampling** - use linear interpolation to drop from 24 kHz to 8 kHz.
3. **Companding** - each 16-bit linear sample is converted to an 8-bit mu-law byte using bitmasks and shift operations. The logarithmic curve gives quiet samples finer resolution than loud ones.
4. **Packetization** - audio is sent in 160-byte chunks, representing exactly 20 ms of speech.
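
Steps 2 to 4 can be sketched in pure Python as follows. This is an illustrative version of the standard G.711 mu-law encoding, not a copy of `agent.py`: the function names are hypothetical, the input is assumed to be little-endian 16-bit mono PCM, and padding a short final frame with `0xFF` (mu-law silence) is an assumption. Plain linear interpolation applies no low-pass filter, so a production resampler would usually filter first to limit aliasing.

```python
import struct
from typing import Iterator, List

MULAW_BIAS = 0x84   # 132, added before locating the segment
MULAW_CLIP = 32635  # keeps magnitude + bias within 15 bits
FRAME_BYTES = 160   # 20 ms at 8000 samples/s, one byte per sample


def resample_linear(samples: List[int], src_rate: int, dst_rate: int) -> List[int]:
    """Resample by interpolating linearly between neighbouring samples."""
    n_out = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] + (samples[right] - samples[left]) * frac))
    return out


def linear_to_mulaw(sample: int) -> int:
    """Compress one signed 16-bit sample into one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), MULAW_CLIP) + MULAW_BIAS
    exponent = magnitude.bit_length() - 8            # segment number, 0..7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F  # 4 bits below the top bit
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # G.711 inverts all bits


def packetize(mulaw: bytes, frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Split mu-law audio into fixed-size frames."""
    for start in range(0, len(mulaw), frame_bytes):
        # ASSUMPTION: a short final frame is padded with mu-law silence (0xFF)
        yield mulaw[start:start + frame_bytes].ljust(frame_bytes, b"\xff")


def pcm16_to_mulaw_frames(pcm: bytes, src_rate: int = 24000) -> List[bytes]:
    """Convert little-endian 16-bit PCM into 160-byte mu-law frames at 8 kHz."""
    count = len(pcm) // 2
    samples = list(struct.unpack(f"<{count}h", pcm[: count * 2]))
    narrow = resample_linear(samples, src_rate, 8000)
    return list(packetize(bytes(linear_to_mulaw(s) for s in narrow)))
```

With a 24 kHz source, 480 input samples (20 ms) become 160 output samples and therefore exactly one 160-byte frame.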

## Outbound calls

### Using `make_call.py`

The outbound script builds the request payload and sends it to the Vobiz `Account/{id}/Call/` endpoint to place an outbound call.

```bash
# In one terminal
python server.py

# In a second terminal
python make_call.py --to +DestNumber
```

### Number configuration

Ensure your Vobiz number is associated with an **application** in the portal that points to your public URLs. If you use `make_call.py`, the `answer_url` in the request overrides the portal defaults for that specific call.
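
To make the override concrete, here is a sketch of how a script like `make_call.py` might assemble its payload. Only the `--to` flag and the `answer_url` field appear in this guide; the `--from` and `--answer-url` flags, the `from`, `to`, and `answer_method` field names, and the placeholder defaults are assumptions. Authentication and the API base URL are deliberately left out, so consult the Vobiz API reference for those.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Place an outbound call (sketch)")
    parser.add_argument("--to", required=True, help="destination number with leading +")
    # ASSUMPTION: the two flags below and their defaults are hypothetical
    parser.add_argument("--from", dest="from_number", default="+10000000000")
    parser.add_argument("--answer-url", default="https://example.invalid/answer")
    return parser.parse_args(argv)


def build_call_payload(to_number: str, from_number: str, answer_url: str) -> dict:
    """Assemble the JSON body for the Call endpoint."""
    return {
        # ASSUMPTION: field names other than answer_url are illustrative
        "from": from_number,
        "to": to_number,
        "answer_url": answer_url,  # overrides the portal default for this call
        "answer_method": "POST",
    }
```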

## Troubleshooting

### The AI is talking over me / not stopping

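Check the `utterance_end_ms` value in the Deepgram configuration in `agent.py`. The higher the value, the longer the agent waits in silence before treating the caller's utterance as finished. Also make sure `clearAudio` is sent as soon as the caller starts speaking over the agent, so that buffered agent audio is flushed immediately.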
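
One way to wire up barge-in is to flush the Vobiz buffer and cancel any in-flight TTS task together. The sketch below is an assumption-laden illustration: `PlaybackController` is a hypothetical class, `send` stands in for your WebSocket send coroutine, and the bare `{"event": "clearAudio"}` frame shape is not confirmed by this guide.

```python
import asyncio
import json
from typing import Awaitable, Callable, Coroutine, Optional


class PlaybackController:
    """Tracks the current TTS task so it can be interrupted on barge-in."""

    def __init__(self, send: Callable[[str], Awaitable[None]]) -> None:
        self._send = send
        self._tts_task: Optional[asyncio.Task] = None

    def start_speaking(self, tts_coro: Coroutine) -> None:
        """Run TTS streaming in the background."""
        self._tts_task = asyncio.create_task(tts_coro)

    async def on_caller_speech(self) -> None:
        """Call this as soon as the speech detector reports caller audio."""
        # Flush first so audio already buffered at Vobiz stops playing.
        # ASSUMPTION: clearAudio needs no fields beyond the event name.
        await self._send(json.dumps({"event": "clearAudio"}))
        if self._tts_task and not self._tts_task.done():
            self._tts_task.cancel()  # stop generating audio nobody will hear
            try:
                await self._tts_task
            except asyncio.CancelledError:
                pass
        self._tts_task = None
```

Sending `clearAudio` before cancelling the task keeps the audible interruption as short as possible, since cancellation may take a moment to propagate.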

### "401 Unauthorized" in the logs

Ensure your `NGROK_AUTH_TOKEN` is set in your environment. `pyngrok` requires authentication for persistent tunnels and certain advanced features.

### Why 20 ms chunks?

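20 ms is the conventional packetization interval for G.711 in VoIP; at 8 kHz and one byte per sample it works out to 160 bytes per frame. Larger chunks add latency and can cause jitter or robotic-sounding audio; smaller chunks create excessive network overhead for the Vobiz ingress nodes.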
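
The trade-off is easy to see with a little arithmetic. This sketch assumes 8 kHz, 8-bit mu-law audio as described above; the helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1  # 8-bit mu-law


def frame_bytes(frame_ms: int) -> int:
    """Payload size of one frame of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000


def frames_per_second(frame_ms: int) -> float:
    """How many frames must be sent each second at this duration."""
    return 1000 / frame_ms


for ms in (10, 20, 40):
    print(f"{ms} ms -> {frame_bytes(ms)} bytes/frame, {frames_per_second(ms):.0f} frames/s")
# 10 ms -> 80 bytes/frame, 100 frames/s
# 20 ms -> 160 bytes/frame, 50 frames/s
# 40 ms -> 320 bytes/frame, 25 frames/s
```

Halving the frame duration doubles the number of messages (and their per-message overhead) for the same audio, while doubling it means each frame holds twice as much audio before it can be sent.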

## Next steps

* Browse the [VobizXML reference](/xml/stream) for stream and call control verbs.
* Configure inbound applications via the [Applications API](/applications/create-application).
* Track calls with the [recording exports](/recording/export-historical-recordings).
