Instead of routing calls through a SIP signaling layer to a platform like LiveKit, you instruct Vobiz to answer the call directly and stream the raw audio over a standard WebSocket connection straight to your server. Your server is the AI pipeline. This is the lowest-level, most direct integration path available: no third-party platform sits between Vobiz and your code. Every byte of audio is yours to process however you choose, which means maximum control, minimum cost, and full responsibility for everything that connects the pieces.

Documentation Index

Fetch the complete documentation index at https://docs.vobiz.ai/llms.txt and use it to discover all available pages before exploring further.
Key Insight: VoiceXML streaming is not a replacement for SIP - it is a different integration layer entirely. SIP handles call routing and control at the telephony layer. WebSocket streaming handles audio delivery at the application layer. You can use both in the same architecture (e.g., SIP to route the call to Vobiz, then VoiceXML streaming to pipe audio to your Python server).
How WebSocket streaming works
When an inbound call arrives at Vobiz, the platform needs to know what to do with it. With VoiceXML streaming, you configure a webhook URL that Vobiz fetches, which returns a VoiceXML document containing a `<Stream>` directive. This tells Vobiz: “Connect this call to my WebSocket server and start streaming audio.”
The VoiceXML directive that starts WebSocket streaming
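As a sketch, a webhook handler might build a response like the one below. The exact element and attribute names (`<Response>`, `<Connect>`, `<Stream url=…>`, `<Parameter>`) are assumptions based on the description above, not a confirmed Vobiz schema - check the Vobiz VoiceXML reference for the real element names.

```python
# Sketch of a webhook response that tells Vobiz to stream call audio to
# your WebSocket server. Element/attribute names are illustrative
# assumptions, not a confirmed Vobiz schema.

def stream_directive(ws_url: str, **params: str) -> str:
    """Build a VoiceXML document containing a <Stream> directive."""
    # Custom parameters are echoed back to you in the `start` event.
    param_tags = "".join(
        f'<Parameter name="{k}" value="{v}"/>' for k, v in params.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Connect><Stream url="{ws_url}">{param_tags}</Stream></Connect>'
        "</Response>"
    )

doc = stream_directive("wss://example.com/media", caller_tier="premium")
```

Whatever the real schema looks like, the shape is the same: your webhook returns a small XML document, and Vobiz opens a WebSocket to the URL it names.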
WebSocket Audio Pipeline Flow
WebSocket message types
All messages over the WebSocket connection are JSON. There are five event types your server will receive, and one type you will send back:

| Event | Direction | Description |
|---|---|---|
| `connected` | Receive | Sent immediately when the WebSocket connection is established. Contains the WebSocket protocol version. No call-specific data yet. |
| `start` | Receive | Sent once when the audio stream begins. Contains the StreamSid (unique stream identifier), the CallSid, and any custom parameters you configured on the Stream directive. This is where you initialize your per-call state. |
| `media` | Receive + Send | The audio data itself. Received continuously while the caller is speaking. Contains a base64-encoded payload of G.711 µ-law audio at 8 kHz. To send audio to the caller, send the same format back with the StreamSid. |
| `dtmf` | Receive | Sent when the caller presses a phone key (0–9, *, #). Contains the digit pressed. Important: DTMF tones are detected by Vobiz and delivered as discrete events - they do NOT appear in the audio stream. You must handle this separately. |
| `stop` | Receive | Sent when the call ends (the caller hangs up, or the call is terminated programmatically). After receiving this, the WebSocket connection will close. Clean up all per-call resources here. |
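The table above maps naturally onto a small dispatcher. The sketch below assumes specific JSON field names (`start.streamSid`, `media.payload`, `dtmf.digit`) that are modeled on the event descriptions, not taken from a Vobiz payload reference - verify them against the actual message schema.

```python
import base64
import json

# Minimal dispatcher for the five inbound event types. Per-call state is
# a plain dict keyed by StreamSid; in production this would live on your
# pipeline object. Field names are assumptions based on the event table.

calls: dict[str, dict] = {}

def handle_message(raw: str) -> None:
    msg = json.loads(raw)
    event = msg["event"]
    if event == "connected":
        pass  # protocol handshake only - no call-specific data yet
    elif event == "start":
        sid = msg["start"]["streamSid"]
        calls[sid] = {"audio": bytearray(), "digits": []}  # init per-call state
    elif event == "media":
        # payload is base64-encoded G.711 u-law at 8 kHz
        calls[msg["streamSid"]]["audio"] += base64.b64decode(msg["media"]["payload"])
    elif event == "dtmf":
        # DTMF arrives as a discrete event, never inside the audio stream
        calls[msg["streamSid"]]["digits"].append(msg["dtmf"]["digit"])
    elif event == "stop":
        calls.pop(msg["streamSid"], None)  # free per-call resources
```

The key structural point survives any schema differences: `start` allocates per-call state, `media` feeds it, and `stop` is your only reliable cleanup hook.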
G.711 µ-law: the audio encoding
Every audio payload arriving from Vobiz (and every payload you send back) is encoded as G.711 µ-law (PCMU). This is not an arbitrary choice - it has been the standard audio encoding of the global telephone network since the 1960s. Understanding what it is, and why your AI models cannot use it directly, is essential.

Sample Rate

8,000 Hz - Telephone-quality (narrowband). Human voice spans roughly 300–3,400 Hz. Sufficient for intelligibility but poor for high-fidelity synthesis.

Bit Depth

8 bits / sample - After companding (non-linear compression). Equivalent to ~12 bits of linear PCM in perceived dynamic range.

Bitrate

64 kbps - 8,000 samples/sec × 8 bits = 64,000 bits/sec. Delivered in precise 20 ms frames = 160 bytes per packet.
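These numbers compose directly, which is worth verifying once so frame-size bugs are easy to spot later:

```python
SAMPLE_RATE = 8_000      # samples per second (narrowband telephony)
BITS_PER_SAMPLE = 8      # after u-law companding
FRAME_MS = 20            # Vobiz delivers audio in 20 ms frames

bitrate = SAMPLE_RATE * BITS_PER_SAMPLE                      # bits per second
samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000           # samples per 20 ms
bytes_per_frame = samples_per_frame * BITS_PER_SAMPLE // 8   # bytes per packet

assert bitrate == 64_000          # 64 kbps
assert samples_per_frame == 160
assert bytes_per_frame == 160     # one byte per sample at 8 bits
```

If a `media` payload base64-decodes to anything other than 160 bytes per 20 ms frame, something upstream is misconfigured.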
The audio conversion pipeline
Every WebSocket voice AI implementation involves this conversion chain on both the inbound and outbound paths. Understanding each step prevents the most common bugs: distorted audio, incorrect volume levels, and desynchronized playback.

Inbound - Caller → JSON → AI Pipeline

- Receive JSON - A `media` event with a base64 payload
- Base64 Decode - String → bytes (µ-law)
- Decode µ-law - 8-bit µ-law → 16-bit PCM (8 kHz)
- Resample - 8 kHz → 16 kHz/24 kHz PCM
- Feed STT - Stream the PCM to your STT engine
Outbound - AI Response → JSON → Caller
- Generate Audio - TTS engine returns 16kHz/24kHz PCM
- Resample - 16kHz/24kHz → 8kHz PCM
- Encode µ-law - 16-bit 8kHz PCM → 8-bit µ-law
- Base64 Encode - µ-law bytes → base64 string
- Send JSON - Transmit `{"event":"media"}` over wss
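The µ-law steps in both chains come down to G.711 companding. You would normally never write this by hand, but seeing the math makes the "8 bits ≈ 12 bits of dynamic range" claim concrete. This is a pure-Python illustration of the standard G.711 algorithm, not production code:

```python
# G.711 u-law companding written out for illustration. In practice, use
# a library (see the note on audioop below) or let Pipecat's serializer
# handle it.

BIAS = 0x84  # 132, added before encoding to linearize behavior near zero

def linear_to_ulaw(sample: int) -> int:
    """Encode one 16-bit signed PCM sample to an 8-bit u-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample) + BIAS, 0x7FFF)
    # Find the segment (exponent): position of the highest set bit.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not magnitude & mask:
        exponent, mask = exponent - 1, mask >> 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # bits inverted on the wire

def ulaw_to_linear(byte: int) -> int:
    """Decode one 8-bit u-law byte back to a 16-bit signed PCM sample."""
    byte = ~byte & 0xFF
    sign, exponent, mantissa = byte & 0x80, (byte >> 4) & 0x07, byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude
```

The encoder is logarithmic: small samples keep fine resolution while large samples are coarsely quantized, which is why the round-trip error grows with amplitude but speech stays intelligible.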
Python’s built-in `audioop` module handles µ-law conversion and resampling natively: `audioop.ulaw2lin()` performs the companding and `audioop.ratecv()` performs sample-rate conversion. (Note that `audioop` was deprecated in Python 3.11 and removed in 3.13; the `audioop-lts` package on PyPI is a drop-in replacement.) Pipecat’s serializers handle all of this automatically.

Advantages
- No Third-Party Platform Cost - There is no LiveKit, VAPI, or Retell layer. You pay Vobiz for the channel and the raw AI API costs (STT, LLM, TTS). At scale, this is a significant cost difference.
- Direct AI Model Integration - Deepgram, OpenAI, Anthropic, and Cartesia all natively consume and produce WebSocket audio streams. Your Vobiz stream connects almost directly without additional SDK abstractions in between.
- Full Pipeline Control - Every byte of audio flows through your code. You can implement custom VAD logic, custom barge-in states, and bespoke routing trees.
- Easiest to Debug Locally - A WebSocket server is just a web server. Test locally with ngrok and inspect messages in browser DevTools. SIP debugging requires specialized tools like Wireshark.
- Familiar Technology Stack - WebSockets are standard web technologies. Any Python, Node.js, or Go developer can work with them without needing to master legacy SIP headers and RTP setups.
- No IP Allowlisting Required - Authentication happens at the connection level via URL parameters or headers - simpler than IP-based ACLs that break when provider networks change.
Disadvantages
- One Stateful Connection Per Call - You own the state. 100 concurrent calls means 100 simultaneously open sockets, each maintaining its own context buffer. Crashing mid-call destroys that call’s state immediately.
- Barge-In Requires Implementation - Getting AI interruptions right requires orchestrating VAD thresholds, aborting in-flight TTS playback, flushing buffers, and resetting logic - all of which is complex to implement correctly.
- Turn Detection is Extremely Hard - Separating sentence pauses from true turn handovers requires locally running machine learning models (such as Silero VAD) to avoid false-positive interruptions triggered by background noise.
- Conversion Chain Complexity - The µ-law → PCM → AI conversion cycle is unforgiving about byte matching. Wrong endianness or sample-rate mismatches produce severely distorted static with no clear error logs.
- TCP vs. UDP Protocol Tradeoffs - WebSockets use reliable TCP, which guarantees delivery but introduces head-of-line blocking. If an audio packet stalls, TCP retransmission delays all subsequent packets, injecting jitter where UDP would simply drop and move on.
- No Enterprise PBX Out-of-the-Box - Large corporate phone infrastructures (Avaya, Teams) do not natively support WebSocket streams. If PBX integration is a hard requirement, you will need a SIP trunk architecture.
Pipecat integration
Pipecat (open-source, from Daily.co) is the leading Python framework for building WebSocket-based voice pipelines. It provides abstractions that make streaming architectures production-viable - handling audio conversion, pipeline orchestration, VAD, and barge-in logic for you.

Pipeline Architecture

Pipecat models a voice call as a linear chain of processors. Each processor receives frames (audio, text, control markers) and passes outputs downstream - a clean mapping of voice-pipeline concepts.

Conceptual Pipecat pipeline
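This is not Pipecat's actual API (real Pipecat processors are async and richly typed), but the processor-chain idea can be sketched generically - each stage consumes frames and emits zero or more frames downstream:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Toy model of a linear processor chain, in the spirit of Pipecat's
# pipeline. Names and structure are illustrative, not Pipecat's API.

@dataclass
class Frame:
    kind: str        # "audio", "text", "control", ...
    payload: object

class Processor:
    def process(self, frame: Frame) -> Iterator[Frame]:
        yield frame  # default: pass the frame through unchanged

class UppercaseText(Processor):
    """Stand-in for a real stage such as STT, LLM, or TTS."""
    def process(self, frame: Frame) -> Iterator[Frame]:
        if frame.kind == "text":
            frame = Frame("text", str(frame.payload).upper())
        yield frame

def run(pipeline: list[Processor], frames: Iterable[Frame]) -> list[Frame]:
    out = list(frames)
    for proc in pipeline:  # each stage transforms the whole stream
        out = [f for frame in out for f in proc.process(frame)]
    return out
```

In a real voice pipeline the chain would be transport-in → STT → LLM → TTS → transport-out, with audio frames flowing in one end and synthesized audio flowing out the other.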
Vobiz Serializer
The `VobizFrameSerializer` handles all base64 decoding, µ-law ↔ PCM conversion, and Vobiz-specific JSON message framing automatically - so your pipeline receives clean audio frames with no manual conversion code.
Built-in VAD
Integrates a pre-calibrated Silero VAD model out of the box. It suppresses false positives and triggers TTS cancellation immediately, producing smooth conversational turn-taking.

Direct Python + Vobiz

The alternative is to run a bare ASGI/WSGI server and own every encoding layer yourself. This gives maximum flexibility but significantly increases implementation complexity.

Framework Heavy Lifting
- Automatic µ-law ↔ PCM conversion
- Base64 decoding of media payloads
- Synchronized streaming into the STT engine
- Silero VAD threshold tuning
- Turn-detection state tracking
Bare-Metal Responsibilities
- Asyncio connection and state management
- Buffer byte-math and frame alignment
- Latency compensation code
- Manually cancelling in-flight TTS playback
- Intercepting and routing DTMF events
When extreme control matters
Highly Custom Pipelines
Building experimental multimodal architectures not yet supported by open-source serialization frameworks.
Existing FastAPI Monolith
Adding WebSocket endpoints directly to a large existing Python API without introducing a separate orchestration tool.
Key implementation factors
Development Complexity
Medium (Pipecat) / High (Bare-Metal) - With Pipecat, a working voice agent can be built in hours. Bare-metal Python requires implementing all byte manipulation and encoding logic manually.
Cost
Minimum Overhead - Your costs are limited to Vobiz channel minutes and your AI API calls (STT, LLM, TTS). There is no intermediary platform fee. This makes the approach attractive at scale.
Time to First Call
2–4 hours (Pipecat) - Using a framework, you can validate end-to-end connectivity and get live audio transcribing within a single afternoon.
Connection Latency
Near-instant link + 20ms audio frames - Without a SIP handshake, WebSocket calls become active faster. End-to-end latency then depends on your AI API response times.
Scaling Model
One stateful connection per call - Each active call holds an open WebSocket connection with its own state. Horizontal scaling under high load is more operationally intensive than a typical SIP architecture.
Common developer pitfalls

- Feeding µ-law bytes straight into an STT model that expects linear PCM - the result is loud static, not an error message.
- Mismatched sample rates or endianness anywhere in the conversion chain, producing severely distorted audio with no clear error logs.
- Expecting DTMF digits to appear in the audio stream - they arrive only as discrete `dtmf` events.
- Forgetting to release per-call state on the `stop` event, leaking memory as call volume grows.
When to choose WebSocket streaming
Strict Cost Objectives
Eliminates intermediary platform fees, keeping costs to Vobiz channel minutes and direct AI API usage.
Custom AI Architecture
Gives you full ownership of every processing step - audio encoding, VAD thresholds, pipeline branching, and state management.
Pipecat-Based Builds
Pipecat is designed around WebSocket transport, making it a natural fit for this integration path.
Rapid Prototyping
A WebSocket server is easy to run locally with ngrok, making it straightforward to get a working demo live quickly.
No Live Transfer Required
If your use case involves single, self-contained conversations without live call transfers, WebSocket streaming is sufficient.
Web-Centric Team
Python and JavaScript developers can work with WebSockets comfortably without needing to learn SIP or RTP.
What developers usually do next
SIP Trunking
Deep dive on the telephony-layer alternative
SIP vs WebSockets
Full decision matrix - 10 factors, real latency numbers
Pipecat Integration
Open-source Python framework for WSS pipelines
Direct WebSocket Setup
Bare-metal Python WebSocket handler against Vobiz