These are the two fundamental architectures for connecting phone calls to AI voice agents. The choice is not about which is “better” - it depends entirely on what you are building, your platform choices, and your priorities between deployment speed and total ownership.
SIP Trunking
Telephony industry standard - robust, enterprise-ready, and required for PBX integration or managed platforms like LiveKit, VAPI, and Retell AI. Call transfer works here without custom code.
WebSocket Streaming
Developer-native path - highly cost-effective at scale, more direct, and ideal for custom AI pipelines built with Pipecat or bare-metal code.
Architectural Comparison
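The two approaches operate at different layers of the stack. SIP is a telephony signaling protocol, with call audio carried separately over RTP. WebSocket streaming is a general-purpose application-layer transport that carries the audio over a single TCP connection.
[Diagrams: SIP Trunking Path, WebSocket Streaming Path]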
Important architectural nuance: These two architectures are not mutually exclusive. You can use a generic SIP trunk provider to route a call to Vobiz, and then use a Vobiz VoiceXML `<Stream>` directive to pipe that same call to your custom WebSocket server. SIP handles the initial routing; WebSocket handles the audio layer.
Full Decision Matrix
| Evaluation Factor | SIP Trunking | WebSocket Streaming |
|---|---|---|
| Developer Profile | Telephony admins, DevOps, or teams comfortable with SIP/RTP concepts | Python / Node.js engineers - WebSocket + async patterns are familiar |
| Setup Complexity | High - trunk config, IP ACLs, SIP URI routing, codec negotiation, firewall rules | Low/Medium - VoiceXML webhook + WebSocket server handling JSON |
| Call Setup Latency | 1–5 seconds (SIP INVITE handshake + PSTN routing overhead) | Near-instant (WebSocket TCP handshake + Vobiz webhook fetch) |
| Audio Transport Latency | Lower - UDP/RTP has no retransmission. Dropped packets are skipped, preserving real-time flow. | Slightly higher - TCP guarantees delivery. Retransmitted packets can add jitter on poor networks. |
| Audio Quality Support | G.711 8kHz or G.722 16kHz (HD wideband, if the chosen carrier supports it) | G.711 µ-law 8kHz (PSTN floor, same as standard SIP) |
| Infrastructure Cost | Vobiz trunk rate + AI platform fee (LiveKit/VAPI/Retell markup) | Vobiz channel rate + raw AI API costs only. No platform markup. |
| Live Call Transfer | Supported - blind and warm transfer via SIP REFER | Supported - Vobiz handles call transfer on the WebSocket path as well |
| Enterprise PBX Integration | Native - Avaya, Cisco UCM, Teams Direct Routing demand SIP | Not applicable - no standard bridge to existing PBX infra |
| Turn-Taking / Interruption | Abstracted - handled completely by the managed platform | Manual - you must build VAD + async pipeline cancellation |
| Horizontal Scaling | Carrier-layer - add trunk channels without touching server infra | Process-layer - you must scale WebSocket workers/containers |
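The Audio Quality Support row lists G.711 µ-law at 8 kHz for the WebSocket path. If a stage of a custom pipeline expects 16-bit linear PCM instead, the payload has to be expanded first. Below is a minimal sketch of standard G.711 µ-law expansion in plain Python. The function names are illustrative, and how Vobiz frames or encodes audio inside its WebSocket messages is not covered here, so treat the input as an already-extracted raw µ-law buffer.

```python
import struct

BIAS = 0x84  # 132, the bias used by G.711 µ-law companding


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one µ-law byte to a signed 16-bit linear PCM sample."""
    inverted = ~byte & 0xFF            # µ-law bytes are stored bit-inverted
    sign = inverted & 0x80
    exponent = (inverted >> 4) & 0x07
    mantissa = inverted & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Decode a µ-law buffer to little-endian 16-bit PCM."""
    samples = [ulaw_byte_to_pcm16(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


def frame_bytes(sample_rate_hz: int = 8000, frame_ms: int = 20) -> int:
    """µ-law uses one byte per sample, so a frame is rate * duration bytes."""
    return sample_rate_hz * frame_ms // 1000
```

At 8 kHz, a 20 ms frame is 160 bytes of µ-law and 320 bytes once expanded to 16-bit PCM. The 20 ms frame length is a common telephony default, not a documented Vobiz value.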
Platform Compatibility Matrix
| Platform | SIP Trunking | WebSocket Streaming | Role in Architecture |
|---|---|---|---|
| LiveKit | ✅ Primary | - | Complete AI voice platform. SIP trunk terminates into LiveKit SIP Service. AI agent runs as LiveKit participant. |
| VAPI | ✅ Primary | - | Managed AI voice platform. BYO SIP trunk or direct SIP URI. PSTN calls route exclusively through SIP trunking. |
| Retell AI | ✅ Primary | - | Managed AI voice platform. Elastic SIP trunk or Register Phone Call API (SIP URI dialing). |
| ElevenLabs | ✅ Primary | - | Conversational AI platform with native SIP integration. Connects directly to PSTN phone calls via SIP trunking. |
| Pipecat | - | ✅ Primary | Open-source Python pipeline framework. Integrates with Vobiz over its WebSocket transport. No native SIP stack. |
| Direct Python (Vobiz) | - | ✅ Primary | Bare-metal WebSocket handler against Vobiz streaming API. Maximum control, maximum ownership. |
| Bolna | Supported | ✅ Primary | Managed voice AI orchestration layer. Can integrate via WebSocket streams or via SIP trunk configuration. |
| Ultravox | Supported | ✅ Primary | Real-time AI voice platform. Primary integration via WebSocket audio; SIP via intermediary transport. |
Cost Analysis
Vobiz charges a per-minute channel rate on both paths. The larger difference comes from whether you add a managed AI platform layer on top (SIP path) or own the pipeline yourself (WebSocket path). All pricing below is in INR.
SIP Trunking Cost Stack
| Item | Cost | Notes |
|---|---|---|
| Vobiz SIP channel | ₹0.45/min | 45 paise per minute, inbound + outbound |
| Phone number (DID) | ₹500/month | Per active Vobiz number |
| Managed AI platform | Their pricing | LiveKit, VAPI, Retell, ElevenLabs each charge their own per-minute or subscription rate. Main cost driver at scale. |
| STT (e.g. Deepgram) | Included or API | VAPI/Retell include STT; LiveKit needs your own API key |
| LLM (e.g. GPT-4o) | API key required | Pass-through or bundled per-minute rate |
| TTS (e.g. ElevenLabs) | API key required | Per character or per minute |
WebSocket Streaming Cost Stack
| Item | Cost | Notes |
|---|---|---|
| Vobiz channel rate | ₹0.65/min | 65 paise per minute, inbound + outbound |
| Phone number (DID) | ₹500/month | Same as SIP |
| No managed AI platform | ₹0 | You build the pipeline yourself. This is the key saving. |
| STT (e.g. Deepgram) | Direct API rate | Pay STT provider directly |
| LLM (e.g. GPT-4o-mini) | Direct API rate | Pay OpenAI / Anthropic / Google directly |
| TTS (e.g. Cartesia / ElevenLabs) | Direct API rate | Choose your TTS provider |
| Server compute | Cloud infra | One process per concurrent call |
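To compare the two stacks at a given call volume, the tables above can be turned into a small calculator. This is a rough sketch: the channel rates and DID fee come from the tables, while `extra_inr_per_min` and `fixed_inr_per_month` are hypothetical parameters standing in for whatever your managed platform, STT/LLM/TTS providers, and cloud hosting actually charge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostStack:
    channel_rate_inr_per_min: float      # Vobiz per-minute rate from the tables
    did_inr_per_month: float = 500.0     # per active Vobiz number
    extra_inr_per_min: float = 0.0       # ASSUMPTION: platform fee or AI API cost
    fixed_inr_per_month: float = 0.0     # ASSUMPTION: e.g. server compute

    def monthly_cost(self, minutes: float, numbers: int = 1) -> float:
        """Total monthly cost in INR for the given minutes and phone numbers."""
        per_minute = self.channel_rate_inr_per_min + self.extra_inr_per_min
        return (
            minutes * per_minute
            + numbers * self.did_inr_per_month
            + self.fixed_inr_per_month
        )


SIP = CostStack(channel_rate_inr_per_min=0.45)
WEBSOCKET = CostStack(channel_rate_inr_per_min=0.65)
```

On telephony charges alone, 10,000 minutes on one number comes to ₹5,000 on the SIP path and ₹7,000 on the WebSocket path. The WebSocket path only comes out ahead once the managed platform's per-minute fee exceeds your own per-minute pipeline costs by more than the ₹0.20 gap in channel rates, so fill in the two assumption fields with real quotes before drawing conclusions.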
Latency Analysis
SIP has lower audio transport latency than WebSocket streaming. SIP uses UDP/RTP - a fire-and-forget protocol that never retransmits dropped packets, keeping audio delivery strictly real-time. WebSocket runs over TCP, which guarantees delivery by retransmitting lost packets - useful for data, but a source of jitter for live audio on poor networks.
If latency is the only factor you care about, SIP wins. But latency is rarely why developers choose WebSocket streaming. They choose it for the ecosystem - direct access to AI frameworks (Pipecat), raw STT/LLM/TTS APIs, full pipeline control, and lower cost.

| Path | Vobiz Telephony Layer | Notes |
|---|---|---|
| Both paths | < 50ms | Audio delivery from PSTN to your server |
| SIP | UDP/RTP | No retransmission; dropped packets skipped |
| WebSocket | TCP | Retransmissions add jitter on poor networks |
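The difference between skipping a dropped packet and retransmitting it can be seen in a toy model. The sketch below is a deliberate simplification, not a measurement: the 20 ms packet interval, 40 ms one-way delay, and 200 ms retransmission timeout are illustrative assumptions, and real TCP stacks often recover faster through fast retransmit.

```python
def udp_delivery_ms(n_packets, lost, interval_ms=20, one_way_ms=40):
    """RTP over UDP: lost packets never arrive; the rest arrive on schedule."""
    return {
        i: i * interval_ms + one_way_ms
        for i in range(n_packets)
        if i not in lost
    }


def tcp_delivery_ms(n_packets, lost, interval_ms=20, one_way_ms=40, rto_ms=200):
    """TCP: a lost segment is retransmitted after rto_ms, and later segments
    wait behind it because bytes reach the application strictly in order."""
    delivered = {}
    previous = 0
    for i in range(n_packets):
        arrival = i * interval_ms + one_way_ms
        if i in lost:
            arrival += rto_ms              # first copy lost; the retry arrives later
        previous = max(previous, arrival)  # head-of-line blocking
        delivered[i] = previous
    return delivered
```

With packet 1 lost out of 5, the UDP model has no entry for packet 1 and every other packet keeps its original timing. The TCP model delivers packets 1 through 4 together at 260 ms, which is the burst of late audio that shows up as jitter.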
When to Choose SIP
- You need to integrate with enterprise PBX (Avaya, Cisco UCM, Microsoft Teams Direct Routing)
- You’re using a managed AI platform (LiveKit, VAPI, Retell, ElevenLabs)
- You need maximum audio quality (G.722 wideband)
- Live call transfer must work without custom code
- You want carrier-layer scaling (add trunk channels without touching infra)
When to Choose WebSockets
- You’re building a custom AI pipeline (Pipecat, direct Python, Node.js)
- You want direct access to STT/LLM/TTS APIs
- Total cost matters more than platform abstraction
- You want full control of the pipeline (interruption, VAD, turn-taking)
- You’re comfortable scaling WebSocket workers yourself
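Owning interruption handling usually means cancelling the agent's outgoing audio the moment the caller starts talking. A minimal `asyncio` sketch of that pattern follows. The `speak` coroutine is a stand-in for a real TTS stream, and the `caller_speaking` event is assumed to be set by whatever VAD you run; neither is a Vobiz or Pipecat API.

```python
import asyncio


async def speak(chunks, sent, delay_s=0.02):
    """Stand-in for streaming TTS audio to the caller, one chunk at a time."""
    for chunk in chunks:
        sent.append(chunk)
        await asyncio.sleep(delay_s)


async def speak_until_interrupted(chunks, caller_speaking: asyncio.Event):
    """Play agent audio, but cancel playback as soon as the caller speaks."""
    sent = []
    playback = asyncio.create_task(speak(chunks, sent))
    barge_in = asyncio.create_task(caller_speaking.wait())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and playback not in done
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    # Let cancelled tasks finish unwinding before returning.
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return sent, interrupted


async def demo():
    caller_speaking = asyncio.Event()
    # Simulate a VAD firing 50 ms into the agent's reply.
    asyncio.get_running_loop().call_later(0.05, caller_speaking.set)
    return await speak_until_interrupted(list(range(100)), caller_speaking)
```

Running `asyncio.run(demo())` returns the chunks that were sent before the simulated barge-in, plus a flag saying playback was cut short. In a real pipeline the same cancellation would also need to stop any in-flight LLM and TTS requests.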
Migration Path
You don’t have to commit to one architecture forever:
- Start with SIP + managed platform for fastest time-to-market
- Validate product-market fit with low engineering investment
- Migrate to WebSocket streaming once volume justifies engineering ownership
- Use the Vobiz `<Stream>` directive to bridge: the SIP trunk routes the call, then WebSocket pipes the audio to your server
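For the bridging step, your webhook has to return XML containing a `<Stream>` element. The sketch below shows one way to generate such a document in Python. Only the `<Stream>` element name comes from this page: the `<Response>` root, the choice to put the URL in the element text, and the example URL are all assumptions for illustration, so check the Vobiz XML reference for the actual schema and attributes.

```python
import xml.etree.ElementTree as ET


def stream_response(websocket_url: str) -> str:
    """Build an XML document that points the call's audio at a WebSocket URL.

    ASSUMPTION: the <Response> root and URL-as-element-text layout are
    illustrative; only the <Stream> element name is taken from this page.
    """
    if not websocket_url.startswith(("ws://", "wss://")):
        raise ValueError("expected a ws:// or wss:// URL")
    root = ET.Element("Response")
    stream = ET.SubElement(root, "Stream")
    stream.text = websocket_url
    return ET.tostring(root, encoding="unicode")
```

Building the XML with `ElementTree` rather than string formatting means characters such as `&` in query strings are escaped correctly.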