Key takeaways
- WebRTC is an open standard for peer-to-peer real-time audio, video, and data directly between browsers and apps, no plugins required.
- A connection = capture media (getUserMedia) → exchange an SDP offer/answer over your own signaling channel → find a network path with ICE/STUN/TURN → stream encrypted media over SRTP.
- The three APIs that matter: RTCPeerConnection (the connection), MediaStream (the mic/camera tracks), and RTCDataChannel (arbitrary data).
- WebRTC mandates encryption: media is always SRTP, key-exchanged over DTLS. There is no unencrypted mode.
- WebRTC does not reach the public phone network (PSTN) by itself, and it needs a TURN relay when peer-to-peer fails. That is exactly the gap an infrastructure layer like Vobiz fills.
What is WebRTC?
WebRTC is a free, open-source project and a set of W3C and IETF standards that give browsers and mobile applications real-time communication over simple APIs. The one distinction that matters: unlike older real-time stacks that needed a Flash plugin or a native SIP softphone, WebRTC ships inside the browser. Any modern browser (Chrome, Firefox, Safari, Edge) can capture a mic and camera and open an encrypted, low-latency media connection to another peer using nothing but JavaScript.
It is two things at once. To web developers it is a JavaScript API surface defined by the W3C, most importantly RTCPeerConnection. To network engineers it is a bundle of IETF protocols (RTP/SRTP for media, ICE for connectivity, DTLS for keying, SDP for negotiation), with the overall architecture described in RFC 8825. When people say “WebRTC,” they usually mean both layers working together to move audio, video, or data peer-to-peer in real time.
How WebRTC works (step by step)
A WebRTC session looks deceptively simple to the user, “click and you’re talking”, but underneath it runs a precise handshake. Here is the full sequence.
1. Capture media with getUserMedia
The browser asks the user for permission and grabs the microphone and/or camera through navigator.mediaDevices.getUserMedia(), which returns a MediaStream. For a voice agent, this is just the mic. Those tracks are added to the connection and become what the other side will hear or see.
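The capture step can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name captureMic and the injectable mediaDevices parameter are assumptions made so the logic can be exercised outside a browser. In a page you would pass navigator.mediaDevices. The constraint names (echoCancellation, noiseSuppression, autoGainControl) are standard media track constraints.

```javascript
// Sketch: request microphone-only capture for a voice session.
// `mediaDevices` is a hypothetical injection point; in a browser pass
// navigator.mediaDevices.
async function captureMic(mediaDevices) {
  const constraints = {
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
    },
    video: false, // a voice agent only needs the mic
  };
  try {
    const stream = await mediaDevices.getUserMedia(constraints);
    return { ok: true, stream };
  } catch (err) {
    // Typical names: NotAllowedError (permission denied),
    // NotFoundError (no microphone available).
    return { ok: false, reason: err.name };
  }
}
```

Returning a result object instead of throwing keeps the permission-denied case, which is common in practice, on the normal control path.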
2. Create an RTCPeerConnection
Each peer creates an RTCPeerConnection, the object that manages the whole call: the media tracks, the encryption, the network path, and the codecs. You configure it with a list of ICE servers (the STUN/TURN servers it can use to find a route; more on these below).
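A minimal sketch of that configuration, assuming one STUN server and one TURN server. The helper name buildRtcConfig, the example.org hostnames, and the credentials are placeholders invented for illustration; the iceServers shape (urls, username, credential) is the standard RTCPeerConnection configuration format.

```javascript
// Build the configuration object passed to `new RTCPeerConnection(config)`.
// Hostnames and credentials are placeholders, not real servers.
function buildRtcConfig({ stunUrls, turn }) {
  const iceServers = [{ urls: stunUrls }];
  if (turn) {
    iceServers.push({
      urls: turn.urls,
      username: turn.username,
      credential: turn.credential,
    });
  }
  return { iceServers };
}

const config = buildRtcConfig({
  stunUrls: ['stun:stun.example.org:3478'],
  turn: {
    urls: [
      'turn:turn.example.org:3478?transport=udp',
      'turns:turn.example.org:5349?transport=tcp',
    ],
    username: 'demo-user',
    credential: 'demo-secret',
  },
});

// In a browser: const pc = new RTCPeerConnection(config);
console.log(JSON.stringify(config, null, 2));
```

Listing both a UDP and a TLS-over-TCP TURN URL is a common choice because some restrictive networks block UDP entirely.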
3. Negotiate with an SDP offer and answer
The two peers have to agree on what they will send (codecs, resolutions, encryption parameters, media directions). They do this by exchanging a Session Description Protocol (SDP) document. The caller generates an offer (createOffer()), the callee replies with an answer (createAnswer()), and each applies the other’s description. SDP is plain text: it lists the codecs the peer supports, its DTLS fingerprint, and any ICE candidates gathered so far.
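Because SDP is plain text, it can be inspected with ordinary string handling. The sketch below pulls codec names and the DTLS fingerprint out of an SDP document. The function name summarizeSdp is an assumption, and the sample SDP is hand-written and heavily trimmed (the fingerprint is shortened), so treat it as an illustration of the a=rtpmap and a=fingerprint line formats rather than a real browser offer.

```javascript
// Extract codec entries and the DTLS fingerprint from SDP text.
function summarizeSdp(sdp) {
  const codecs = [];
  let fingerprint = null;
  for (const line of sdp.split(/\r?\n/)) {
    // a=rtpmap:<payload type> <codec name>/<clock rate>[/<channels>]
    const rtpmap = line.match(/^a=rtpmap:(\d+) ([^/]+)\/(\d+)/);
    if (rtpmap) {
      codecs.push({
        payloadType: Number(rtpmap[1]),
        name: rtpmap[2],
        clockRate: Number(rtpmap[3]),
      });
      continue;
    }
    // a=fingerprint:<hash function> <colon-separated hex digest>
    const fp = line.match(/^a=fingerprint:(\S+) (\S+)/);
    if (fp) fingerprint = { algorithm: fp[1], value: fp[2] };
  }
  return { codecs, fingerprint };
}

// Hand-written, trimmed fragment for illustration only.
const sampleSdp = [
  'v=0',
  'm=audio 9 UDP/TLS/RTP/SAVPF 111 0',
  'a=fingerprint:sha-256 AB:CD:EF:01',
  'a=rtpmap:111 opus/48000/2',
  'a=rtpmap:0 PCMU/8000',
].join('\r\n');

console.log(summarizeSdp(sampleSdp));
```

In a browser, the same function could be applied to pc.localDescription.sdp after the offer has been set.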
4. Signaling, the part WebRTC leaves to you
Here is the most-missed point in every beginner’s WebRTC project: WebRTC does not define how the offer and answer get from one peer to the other. That transport, called signaling, is your job. Most apps push SDP and ICE candidates over a WebSocket, an HTTP endpoint, or a message broker. WebRTC handles the media; you build (or buy) the signaling channel that introduces the two peers.
5. NAT traversal with ICE, STUN, and TURN
Almost no device sits on a public IP; most live behind home routers, corporate firewalls, and carrier-grade NAT. To find a path between two such peers, WebRTC uses Interactive Connectivity Establishment (ICE), which gathers candidate addresses and tests them until one works:
- STUN (Session Traversal Utilities for NAT) lets a peer discover its own public IP and port as seen from the internet, so two peers can try to connect directly. STUN is cheap and works for the majority of connections.
- TURN (Traversal Using Relays around NAT) is the fallback. When two peers genuinely cannot reach each other directly (symmetric NAT, strict firewalls), a TURN server relays all the media between them. TURN works almost everywhere but costs bandwidth and adds a hop, so it is used only when a direct path cannot be established, which is typically a meaningful minority of real-world calls.
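Each address ICE gathers is described by a candidate string whose typ field says where it came from: a local interface, STUN, or TURN. The sketch below classifies candidates by that field. The function name classifyCandidate and the sample lines are assumptions; the sample candidates are hand-written and use documentation-only IP ranges.

```javascript
// Meaning of the ICE candidate "typ" values.
const CANDIDATE_KINDS = {
  host: 'local interface address',
  srflx: 'public address discovered via STUN (server reflexive)',
  prflx: 'address learned during connectivity checks (peer reflexive)',
  relay: 'address allocated on a TURN server',
};

// Classify an ICE candidate string by its "typ" field.
function classifyCandidate(candidateLine) {
  const match = candidateLine.match(/ typ (host|srflx|prflx|relay)\b/);
  if (!match) return null;
  return { type: match[1], meaning: CANDIDATE_KINDS[match[1]] };
}

// Hand-written examples, not captured from a real session.
const examples = [
  'candidate:1 1 udp 2122260223 192.168.1.5 54321 typ host',
  'candidate:2 1 udp 1686052607 203.0.113.7 54321 typ srflx raddr 192.168.1.5 rport 54321',
  'candidate:3 1 udp 41885439 198.51.100.9 3478 typ relay raddr 203.0.113.7 rport 54321',
];

for (const line of examples) {
  const kind = classifyCandidate(line);
  console.log(`${kind.type}: ${kind.meaning}`);
}
```

In a browser, the string to classify is available as event.candidate.candidate inside the connection's icecandidate handler.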
6. Secure the path: DTLS and SRTP
Once ICE picks a route, the peers run a DTLS handshake over it to exchange keys, then encrypt every media packet with Secure Real-time Transport Protocol (SRTP). This is not optional: WebRTC mandates that all media and data are encrypted, and there is no plaintext mode. The DTLS fingerprints exchanged in the SDP are what prevent a man-in-the-middle from hijacking the keying.
7. Stream media (and, optionally, data)
With a route found and keys exchanged, audio and video flow as SRTP-protected RTP packets, peer-to-peer where possible, relayed through TURN where not. A jitter buffer on each side smooths out packet timing, and the connection continuously adapts bitrate to the available network. If the app also opened an RTCDataChannel, arbitrary messages (chat, game state, file chunks, agent control signals) ride the same encrypted transport.
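For the optional data path, a minimal sketch of opening a channel for control messages might look like this. The function name openControlChannel, the channel label, and the message shape are assumptions for illustration; createDataChannel and its ordered and maxRetransmits options are standard RTCPeerConnection API.

```javascript
// Open a data channel for small control messages alongside the media.
// `pc` is expected to be an RTCPeerConnection.
function openControlChannel(pc, onMessage) {
  // ordered: false with maxRetransmits: 0 gives unordered, unreliable
  // delivery, which suits latency-sensitive signals. Omit both options to
  // get the default reliable, ordered channel.
  const channel = pc.createDataChannel('control', {
    ordered: false,
    maxRetransmits: 0,
  });
  channel.onopen = () => channel.send(JSON.stringify({ type: 'hello' }));
  channel.onmessage = (event) => onMessage(JSON.parse(event.data));
  return channel;
}
```

Creating the channel before the offer is generated lets it be negotiated in the initial SDP rather than requiring a renegotiation later.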
The core WebRTC APIs
For all the protocol machinery underneath, the developer-facing surface is small. Three objects do most of the work:
| API | What it does |
|---|---|
| RTCPeerConnection | The heart of WebRTC. Manages the peer connection: SDP negotiation, ICE candidate gathering, encryption (DTLS/SRTP), codec selection, and the flow of media tracks. |
| MediaStream / getUserMedia | Represents the audio and video tracks captured from the user’s mic and camera (or screen). These tracks are what you add to the connection to be sent. |
| RTCDataChannel | A bidirectional channel for arbitrary data over the same encrypted transport, used for chat, file transfer, telemetry, or control messages alongside the call. It runs over SCTP and can be configured reliable or unreliable, ordered or unordered. |
getUserMedia() for the mic → add the track to a new RTCPeerConnection() → createOffer() / createAnswer() → trade SDP and ICE candidates over your signaling channel → media flows. Everything else (NAT traversal, encryption, retransmission) the browser handles for you.
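The flow above can be sketched end to end as follows. This is a hedged outline, not a complete client: the signaling object (with a send(message) method and an onmessage callback) and the message shapes are hypothetical, since WebRTC leaves signaling to the application, and the sketch assumes the signaling channel delivers messages in order. The RTCPeerConnection calls (addTrack, createOffer, createAnswer, setLocalDescription, setRemoteDescription, addIceCandidate) are standard.

```javascript
// Forward each locally gathered ICE candidate to the peer (trickle ICE).
function forwardLocalCandidates(pc, signaling) {
  pc.onicecandidate = (event) => {
    // A null candidate marks the end of gathering; nothing to send.
    if (event.candidate) {
      signaling.send({ type: 'candidate', candidate: event.candidate });
    }
  };
}

// Caller side: attach the mic tracks, create the offer, send it.
async function startCall(pc, stream, signaling) {
  for (const track of stream.getTracks()) pc.addTrack(track, stream);
  forwardLocalCandidates(pc, signaling);

  // Handle the callee's answer and its trickled candidates.
  signaling.onmessage = async (message) => {
    if (message.type === 'answer') {
      await pc.setRemoteDescription(message.description);
    } else if (message.type === 'candidate') {
      await pc.addIceCandidate(message.candidate);
    }
  };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send({ type: 'offer', description: pc.localDescription });
}

// Callee side: apply the received offer, then reply with an answer.
async function answerCall(pc, offerDescription, signaling) {
  forwardLocalCandidates(pc, signaling);
  await pc.setRemoteDescription(offerDescription);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  signaling.send({ type: 'answer', description: pc.localDescription });
}
```

A real callee would also add its own tracks before creating the answer and pass incoming candidate messages to pc.addIceCandidate, mirroring the caller's handler.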
WebRTC codecs
WebRTC peers negotiate codecs in the SDP and pick the best one both sides support. The browser ships a mandatory set so interoperability is guaranteed.
| Type | Codec | Notes |
|---|---|---|
| Audio | Opus | The default and the one that matters for voice. Adaptive 6–510 kbps, wideband/full-band, with built-in forward error correction and packet-loss concealment. Mandatory in WebRTC. |
| Audio | G.711 (PCMU/PCMA) | 64 kbps, narrowband. Mandatory for interop, mainly used when bridging to the legacy phone network. |
| Video | VP8 | Royalty-free, mandatory to implement. The classic WebRTC baseline. |
| Video | H.264 | Mandatory to implement; widely hardware-accelerated; needed for interop with many SIP/telecom systems. |
| Video | VP9 | Better compression than VP8, supports scalable (SVC) encoding. |
| Video | AV1 | Newest, best compression, increasingly supported, heavier to encode. |
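Applications can influence which codec wins the negotiation by reordering the browser's capability list before the offer is created. The sketch below shows only the reordering step; the helper name preferCodec and the sample capability list are assumptions standing in for what RTCRtpReceiver.getCapabilities('video').codecs would return in a browser.

```javascript
// Move codecs matching `mimeType` to the front, keeping the rest in order.
function preferCodec(codecs, mimeType) {
  const wanted = mimeType.toLowerCase();
  const preferred = codecs.filter((c) => c.mimeType.toLowerCase() === wanted);
  const others = codecs.filter((c) => c.mimeType.toLowerCase() !== wanted);
  return [...preferred, ...others];
}

// Stand-in for RTCRtpReceiver.getCapabilities('video').codecs.
const capabilities = [
  { mimeType: 'video/VP8', clockRate: 90000 },
  { mimeType: 'video/H264', clockRate: 90000 },
  { mimeType: 'video/VP9', clockRate: 90000 },
];

console.log(preferCodec(capabilities, 'video/VP9').map((c) => c.mimeType));

// In a browser, before creating the offer:
//   transceiver.setCodecPreferences(
//     preferCodec(RTCRtpReceiver.getCapabilities('video').codecs, 'video/VP9'));
```

The list is reordered rather than filtered so that the peers can still fall back to a mandatory codec if the preferred one is not supported on the other side.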
WebRTC vs SIP and WebSockets
These three get conflated, but they solve different problems and often work together.
| | WebRTC | SIP | WebSockets |
|---|---|---|---|
| What it is | Browser/app media engine (audio, video, data) | Signaling protocol to set up calls | Persistent two-way browser↔server channel |
| Carries media? | Yes (SRTP) | No (signals only; media is RTP) | Not natively (carries any bytes you frame) |
| Reaches the PSTN? | Not by itself | Yes, the telecom standard | No |
| Encryption | Mandatory (DTLS/SRTP) | Optional (TLS + SRTP) | Optional (WSS/TLS) |
| Typical use | In-browser/in-app calling | Trunking, carrier interconnect | Streaming audio frames to a server pipeline |
WebRTC for voice AI
Voice AI is pulling WebRTC into the foreground because the most natural place for many agents to live is in the browser or app the user is already in, with no phone call required. A support widget, an in-app assistant, or a web demo can capture the mic with getUserMedia and stream it straight to the agent’s pipeline. What matters here is different from a classic video call:
- Latency budget. A natural conversational turn has to fit under roughly one second across capture + transport + STT + LLM + TTS. WebRTC’s peer-to-peer media and tight jitter buffers help, but every extra hop (and every TURN relay) eats into that budget.
- Audio fidelity. Opus at wideband/24 kHz gives the speech model more to work with than narrowband telephony audio, which directly improves recognition accuracy.
- Barge-in. A real conversation lets the human interrupt. That requires genuinely bidirectional, streaming media, not record-then-respond.
- The PSTN bridge. Most agents also need to take or place real phone calls. WebRTC handles the web edge; reaching a phone number still requires a SIP trunk or Voice API behind it. The web mic and the phone line have to meet in the middle.
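The latency budget in the first point is simple arithmetic, which a short sketch makes concrete. Every number below is an illustrative assumption, not a measurement, and the function name checkLatencyBudget is hypothetical; the only figure taken from the text is the roughly one-second budget.

```javascript
// Sum per-stage latency estimates and compare against a turn budget.
function checkLatencyBudget(stagesMs, budgetMs = 1000) {
  const totalMs = Object.values(stagesMs).reduce((sum, ms) => sum + ms, 0);
  return {
    totalMs,
    headroomMs: budgetMs - totalMs,
    fits: totalMs <= budgetMs,
  };
}

// Illustrative per-stage estimates in milliseconds (assumed, not measured).
const directPath = { capture: 20, transport: 60, stt: 200, llm: 400, tts: 200 };
// Same pipeline with a slower transport leg, e.g. a distant relay.
const relayedPath = { ...directPath, transport: 200 };

console.log('direct:', checkLatencyBudget(directPath));
console.log('relayed:', checkLatencyBudget(relayedPath));
```

With these assumed numbers the transport leg alone decides whether the turn fits, which is the point the list makes about extra hops.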
The honest trade-offs
WebRTC is powerful, but it is not magic, and a production deployment runs into real limits:
- No PSTN on its own. WebRTC connects browsers and apps to each other. It cannot dial a phone number without a media server or SIP gateway bridging it to the carrier network.
- TURN costs real money. When peer-to-peer fails, all media relays through your TURN servers: that is bandwidth you pay for, plus a hop that adds latency. At scale, TURN is a genuine infrastructure line item, not a footnote.
- Signaling is your problem. WebRTC deliberately leaves signaling undefined. You have to build and operate a reliable channel to exchange SDP and ICE candidates, and keep it up.
- NAT and firewall variability. ICE handles most networks, but strict corporate firewalls and symmetric NAT can still force relays or, rarely, fail, which is why a well-provisioned TURN fleet matters.
- Server-side scaling. Pure peer-to-peer breaks down beyond a couple of participants or when you need recording, transcription, or an AI pipeline in the path. That calls for a media server (SFU/MCU) or a streaming bridge.
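To see how often calls actually fall back to TURN, the statistics from pc.getStats() can be inspected for the selected candidate pair. The sketch below assumes a Map-like stats report and uses standard stats fields (selectedCandidatePairId, localCandidateId, remoteCandidateId, candidateType); the function name selectedPathTypes is hypothetical, and the fallback branch is an assumption for browsers that do not expose a transport entry.

```javascript
// Given a Map-like RTCStatsReport, report the candidate types of the
// selected pair and whether the path goes through a TURN relay.
function selectedPathTypes(stats) {
  let pair = null;
  for (const stat of stats.values()) {
    if (stat.type === 'transport' && stat.selectedCandidatePairId) {
      pair = stats.get(stat.selectedCandidatePairId) || null;
      break;
    }
  }
  if (!pair) {
    // Fallback: look for a nominated pair that succeeded.
    for (const stat of stats.values()) {
      if (stat.type === 'candidate-pair' && stat.nominated && stat.state === 'succeeded') {
        pair = stat;
        break;
      }
    }
  }
  if (!pair) return null;

  const local = stats.get(pair.localCandidateId);
  const remote = stats.get(pair.remoteCandidateId);
  const localType = local ? local.candidateType : undefined;
  const remoteType = remote ? remote.candidateType : undefined;
  return {
    local: localType,
    remote: remoteType,
    relayed: localType === 'relay' || remoteType === 'relay',
  };
}

// In a browser: selectedPathTypes(await pc.getStats());
```

Logging the relayed flag per call gives a rough relay ratio, which is the figure that drives TURN bandwidth cost.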
How Vobiz handles WebRTC
Vobiz is the telephony infrastructure layer under voice AI. It does not build the agent; it powers the agents you build (Vapi, Retell, ElevenLabs, Pipecat, LiveKit, and more). For WebRTC specifically, that means handling the parts WebRTC leaves to you and bridging the web edge to the phone network:
- WebRTC across web, iOS, and Android with live PSTN. Vobiz supports WebRTC application setup on browser and mobile, and bridges those sessions to real phone numbers, so a mic in a web page can talk to (or as) a phone call.
- The PSTN bridge built in. Reach the phone network through Vobiz SIP trunking and the Voice API, with DID provisioning in 130+ countries and outbound connectivity to 190+, so your WebRTC edge connects to actual numbers.
- AI-native media. Bidirectional WebSocket audio streaming at 24 kHz with native noise cancellation and barge-in, piped into your STT → LLM → TTS loop via <Stream> and the audio streams API.
- Built for the latency budget. Sub-80 ms single-hop, event-driven telephony with direct carrier connect (vs 300–400 ms on legacy CPaaS), so the transport leg of the conversation stays small.
- Secure by default. SRTP media encryption and TLS 1.3 signaling, matching WebRTC’s own mandatory-encryption posture end to end.
- It powers your stack, not a locked-in agent. Voice-AI builders like Bolna, fintechs like Razorpay and Acko, and enterprises like KPMG run on Vobiz infrastructure. You keep your agent; Vobiz provides the rails.
Frequently asked questions
What does WebRTC stand for?
WebRTC stands for Web Real-Time Communication. It is an open standard (W3C APIs plus IETF protocols) that lets browsers and mobile apps stream audio, video, and data peer-to-peer in real time without plugins.
Is WebRTC peer-to-peer?
By design, yes: media flows directly between peers whenever the network allows it. When two peers cannot reach each other directly (strict NAT or firewalls), a TURN server relays the media instead. Setup (signaling) always goes through a server you provide.
What is the difference between WebRTC and SIP?
WebRTC is a browser/app media engine that carries the actual audio and video (over SRTP). SIP is a signaling protocol that sets up calls and is the standard for reaching the public phone network. WebRTC alone cannot dial a phone number; it is often bridged to SIP to do so.
Do I need STUN and TURN servers for WebRTC?
Almost always. STUN helps peers discover their public address for a direct connection and is needed for most calls. TURN is the relay fallback for when a direct path is impossible, and you pay for its bandwidth. Production WebRTC needs both configured.
Can WebRTC make phone calls?
Not on its own. WebRTC connects browsers and apps, not the phone network. To call a real number you bridge WebRTC to a SIP trunk or Voice API, which is what an infrastructure layer like Vobiz provides alongside WebRTC support on web, iOS, and Android.
Is WebRTC encrypted?
Yes, always. WebRTC mandates encryption: media uses SRTP with keys exchanged over a DTLS handshake, and there is no unencrypted mode. The DTLS fingerprints in the SDP protect the keying from man-in-the-middle attacks.
Further reading on Vobiz
- What is VoIP? · What is SIP? · What is a Voice API?
- SIP vs WebSockets · Streaming over WebSockets · Audio streaming
- WebRTC application setup · <Stream> element · Voice platform overview
Sources
- W3C, “WebRTC: Real-Time Communication in Browsers”.
- WebRTC project, “Real-time communication for the web”.
- MDN Web Docs, “WebRTC API”.
- IETF, “Overview: Real-Time Protocols for Browser-Based Applications” (RFC 8825).
- IETF, “Interactive Connectivity Establishment (ICE)” (RFC 8445).
- IETF, “The Secure Real-time Transport Protocol (SRTP)” (RFC 3711).
- IETF, “Definition of the Opus Audio Codec” (RFC 6716).
Build on Vobiz
Provision a number and bridge your WebRTC app to the phone network in minutes.