What is WebRTC? How It Works, APIs, Codecs & Voice AI (2026 Guide)

June 18, 2026 · By Piyush Sahoo

WebRTC (Web Real-Time Communication) is the technology that lets a browser or mobile app capture a microphone and camera and stream that audio and video directly to another peer, with no plugin, no download, and sub-second latency. It is the engine under video calls in your browser tab, click-to-call buttons on websites, and the new wave of voice AI agents that pick up a mic in the page instead of a phone line. If you are building real-time voice or video into software, WebRTC is where the call begins.

This guide goes well past the one-line definition: how a WebRTC connection is built step by step, the three core JavaScript APIs, the codecs it negotiates, how it punches through NATs and firewalls with ICE/STUN/TURN, where it differs from SIP and WebSockets, the honest trade-offs (no phone network on its own), and what changes when WebRTC is carrying a voice AI agent that also needs to reach real phone numbers.

Key takeaways

WebRTC is an open standard for peer-to-peer real-time audio, video, and data directly between browsers and apps, no plugins required.
A connection = capture media (getUserMedia) → exchange an SDP offer/answer over your own signaling channel → find a network path with ICE/STUN/TURN → stream encrypted media over SRTP.
The three APIs that matter: RTCPeerConnection (the connection), MediaStream (the mic/camera tracks), and RTCDataChannel (arbitrary data).
WebRTC mandates encryption: media is always SRTP, key-exchanged over DTLS. There is no unencrypted mode.
WebRTC does not reach the public phone network (PSTN) by itself and needs a TURN relay when peer-to-peer fails. That is exactly the gap an infrastructure layer like Vobiz fills.

What is WebRTC?

WebRTC is a free, open-source project and a set of W3C and IETF standards that give browsers and mobile applications real-time communication over simple APIs. The one distinction that matters: unlike older real-time stacks that needed a Flash plugin or a native SIP softphone, WebRTC ships inside the browser. Any modern browser (Chrome, Firefox, Safari, Edge) can capture a mic and camera and open an encrypted, low-latency media connection to another peer using nothing but JavaScript. It is two things at once. To web developers it is a JavaScript API surface defined by the W3C, most importantly RTCPeerConnection. To network engineers it is a bundle of IETF protocols (RTP/SRTP for media, ICE for connectivity, DTLS for keying, SDP for negotiation), with the overall architecture described in RFC 8825. When people say “WebRTC,” they usually mean both layers working together to move audio, video, or data peer-to-peer in real time.

How WebRTC works (step by step)

A WebRTC session looks deceptively simple to the user, “click and you’re talking”, but underneath it runs a precise handshake. Here is the full sequence.

1. Capture media with getUserMedia

The browser asks the user for permission and grabs the microphone and/or camera through navigator.mediaDevices.getUserMedia(), which returns a MediaStream. For a voice agent, this is just the mic. Those tracks are added to the connection and become what the other side will hear or see.
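As a rough illustration of the capture step, the sketch below builds audio-only constraints that a voice agent might pass to getUserMedia. The helper name voiceAgentConstraints and the chosen option values are assumptions for illustration; the constraint keys themselves (echoCancellation, noiseSuppression, autoGainControl, channelCount) are standard media track constraints. The local interface exists only so the snippet compiles without DOM typings.

```typescript
// Minimal local type so the sketch compiles without DOM typings.
// In a browser, this shape can be passed where MediaStreamConstraints is expected.
interface VoiceConstraints {
  audio: {
    echoCancellation: boolean;
    noiseSuppression: boolean;
    autoGainControl: boolean;
    channelCount: number;
  };
  video: boolean;
}

// Hypothetical helper: audio-only constraints for a voice agent.
// The specific values are assumptions, not requirements of WebRTC.
function voiceAgentConstraints(): VoiceConstraints {
  return {
    audio: {
      echoCancellation: true, // helps keep the agent's own speech out of the mic
      noiseSuppression: true,
      autoGainControl: true,
      channelCount: 1, // mono is enough for speech
    },
    video: false, // a voice agent does not need the camera
  };
}

// Browser usage (not executed here):
//   const stream = await navigator.mediaDevices.getUserMedia(voiceAgentConstraints());
//   stream.getAudioTracks().forEach((track) => pc.addTrack(track, stream));
```

Requesting only the mic keeps the permission prompt narrower than a combined mic-and-camera request.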

2. Create an RTCPeerConnection

Each peer creates an RTCPeerConnection, the object that manages the whole call: the media tracks, the encryption, the network path, and the codecs. You configure it with a list of ICE servers (the STUN/TURN servers it can use to find a route, more below).
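A minimal sketch of that configuration, assuming one STUN server and an optional TURN server. The helper buildPeerConfig and the example.org hostnames are hypothetical; the iceServers, urls, username, and credential fields are the ones the RTCPeerConnection constructor accepts.

```typescript
// Local types mirroring the relevant part of the browser's RTCConfiguration.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerConfig {
  iceServers: IceServer[];
}

interface TurnSettings {
  url: string;
  username: string;
  credential: string;
}

// Hypothetical helper: always list STUN, add TURN only when credentials exist.
function buildPeerConfig(stunUrl: string, turn: TurnSettings | null): PeerConfig {
  const iceServers: IceServer[] = [{ urls: stunUrl }];
  if (turn !== null) {
    iceServers.push({
      urls: turn.url,
      username: turn.username,
      credential: turn.credential,
    });
  }
  return { iceServers: iceServers };
}

// Placeholder hostnames and credentials, for illustration only.
const exampleConfig = buildPeerConfig("stun:stun.example.org:3478", {
  url: "turn:turn.example.org:3478",
  username: "demo-user",
  credential: "demo-secret",
});

// Browser usage (not executed here):
//   const pc = new RTCPeerConnection(exampleConfig);
```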

3. Negotiate with an SDP offer and answer

The two peers have to agree on what they will send (codecs, resolutions, encryption parameters, media directions). They do this by exchanging a Session Description Protocol (SDP) document. The caller generates an offer (createOffer()) and sets it as its local description; the callee applies it as the remote description and replies with an answer (createAnswer()), which the caller applies in turn. SDP is plain text and lists the codecs the peer supports, its DTLS fingerprint, and any ICE candidates gathered so far.
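Because SDP is plain text, it is easy to inspect. The sketch below pulls the codec list out of the a=rtpmap lines. The function name listCodecs is hypothetical, and sampleSdp is a hand-written fragment rather than a complete session description.

```typescript
interface RtpCodec {
  payloadType: number;
  name: string;
  clockRate: number;
}

// Collect every codec announced in "a=rtpmap" lines, which have the form:
//   a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
function listCodecs(sdp: string): RtpCodec[] {
  const codecs: RtpCodec[] = [];
  const lines = sdp.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const m = /^a=rtpmap:(\d+) ([^\/]+)\/(\d+)/.exec(lines[i]);
    if (m) {
      codecs.push({
        payloadType: parseInt(m[1], 10),
        name: m[2],
        clockRate: parseInt(m[3], 10),
      });
    }
  }
  return codecs;
}

// Hand-written audio fragment: Opus on a dynamic payload type plus G.711.
const sampleSdp = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0 8",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
  "a=rtpmap:8 PCMA/8000",
].join("\r\n");

const offeredCodecs = listCodecs(sampleSdp);
```

In a real offer the same parsing applies, but the list is longer and includes video codecs and retransmission formats.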

4. Signaling, the part WebRTC leaves to you

Here is the most-missed point in every beginner’s WebRTC project: WebRTC does not define how the offer and answer get from one peer to the other. That transport, called signaling, is your job. Most apps push SDP and ICE candidates over a WebSocket, an HTTP endpoint, or a message broker. WebRTC handles the media; you build (or buy) the signaling channel that introduces the two peers.
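Since the message format is left to the application, one possible shape is a small JSON envelope. The sketch below is an assumption about how such an envelope could look: the SignalMessage type and the encodeSignal and decodeSignal helpers are hypothetical, not part of WebRTC.

```typescript
// One possible application-defined envelope for signaling traffic.
type SignalMessage =
  | { kind: "offer"; sdp: string }
  | { kind: "answer"; sdp: string }
  | { kind: "candidate"; candidate: string; sdpMid: string | null };

function encodeSignal(msg: SignalMessage): string {
  return JSON.stringify(msg);
}

// Validate untrusted input from the signaling channel before using it.
function decodeSignal(raw: string): SignalMessage {
  const data = JSON.parse(raw);
  if (data === null || typeof data !== "object") {
    throw new Error("signal must be a JSON object");
  }
  if (data.kind === "offer" && typeof data.sdp === "string") {
    return { kind: "offer", sdp: data.sdp };
  }
  if (data.kind === "answer" && typeof data.sdp === "string") {
    return { kind: "answer", sdp: data.sdp };
  }
  if (data.kind === "candidate" && typeof data.candidate === "string") {
    const mid = typeof data.sdpMid === "string" ? data.sdpMid : null;
    return { kind: "candidate", candidate: data.candidate, sdpMid: mid };
  }
  throw new Error("unrecognized signal message");
}

// Browser usage over a WebSocket (not executed here):
//   socket.send(encodeSignal({ kind: "offer", sdp: offer.sdp }));
//   socket.onmessage = (event) => handle(decodeSignal(event.data));
```

The same envelope works over an HTTP endpoint or a message broker; only the transport calls change.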

5. NAT traversal with ICE, STUN, and TURN

Almost no device sits on a public IP. Most live behind home routers, corporate firewalls, and carrier-grade NAT. To find a path between two such peers, WebRTC uses Interactive Connectivity Establishment (ICE), which gathers candidate addresses and tests them until one works:

STUN (Session Traversal Utilities for NAT) lets a peer discover its own public IP and port as seen from the internet, so two peers can try to connect directly. STUN is cheap and works for the majority of connections.
TURN (Traversal Using Relays around NAT) is the fallback. When two peers genuinely cannot reach each other directly (symmetric NAT, strict firewalls), a TURN server relays all the media between them. TURN works almost everywhere but costs bandwidth and adds a hop, so ICE falls back to it only when no direct path works, which happens on a meaningful minority of real-world calls.
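To make the candidate types concrete, the sketch below parses the leading fields of an ICE candidate line and computes a candidate priority with the formula and recommended type preferences from RFC 8445. The function names are hypothetical, and the addresses in the usage comment come from documentation-only IP ranges.

```typescript
type CandidateType = "host" | "srflx" | "prflx" | "relay";

interface ParsedCandidate {
  component: number;
  transport: string;
  priority: number;
  address: string;
  port: number;
  type: CandidateType;
}

// Parse the leading fields of a candidate attribute:
//   candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...
function parseCandidate(line: string): ParsedCandidate | null {
  const m = /^(?:a=)?candidate:\S+ (\d+) (\S+) (\d+) (\S+) (\d+) typ (host|srflx|prflx|relay)/.exec(line);
  if (!m) return null;
  return {
    component: parseInt(m[1], 10),
    transport: m[2].toLowerCase(),
    priority: parseInt(m[3], 10),
    address: m[4],
    port: parseInt(m[5], 10),
    type: m[6] as CandidateType,
  };
}

// RFC 8445, section 5.1.2.1:
//   priority = 2^24 * typePreference + 2^8 * localPreference + (256 - componentId)
function candidatePriority(typePref: number, localPref: number, componentId: number): number {
  return Math.pow(2, 24) * typePref + Math.pow(2, 8) * localPref + (256 - componentId);
}

// Type preferences recommended by RFC 8445: direct paths outrank relayed ones.
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

// Example:
//   parseCandidate("candidate:2 1 udp 1694498815 203.0.113.5 61000 typ srflx raddr 192.0.2.10 rport 54321")
//   yields a candidate of type "srflx", the public address discovered through STUN.
```

A relay candidate gets the lowest type preference, which is how ICE ends up trying direct and STUN-discovered paths before falling back to TURN.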

6. Secure the path: DTLS and SRTP

Once ICE picks a route, the peers run a DTLS handshake over it to exchange keys, then encrypt every media packet with Secure Real-time Transport Protocol (SRTP). This is not optional: WebRTC mandates that all media and data are encrypted, and there is no plaintext mode. The DTLS fingerprints exchanged in the SDP are what prevent a man-in-the-middle from hijacking the keying.
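The browser performs the fingerprint check internally, so application code never has to. Purely to illustrate what is being compared, the sketch below extracts the a=fingerprint attribute from an SDP and checks it against a certificate hash observed during the handshake. The function names are hypothetical, and the fingerprint in the usage comment is shortened for readability (a real SHA-256 fingerprint has 32 bytes).

```typescript
interface Fingerprint {
  algorithm: string;
  value: string;
}

// Find the first "a=fingerprint:<hash function> <colon-separated hex>" attribute.
function extractFingerprint(sdp: string): Fingerprint | null {
  const m = /^a=fingerprint:(\S+) ([0-9A-Fa-f:]+)\s*$/m.exec(sdp);
  if (!m) return null;
  return { algorithm: m[1].toLowerCase(), value: m[2].toUpperCase() };
}

// Compare the fingerprint promised in signaling with the one actually
// presented by the remote certificate. Case differences are ignored.
function fingerprintMatches(sdp: string, observedAlgorithm: string, observedHex: string): boolean {
  const promised = extractFingerprint(sdp);
  if (promised === null) return false;
  return (
    promised.algorithm === observedAlgorithm.toLowerCase() &&
    promised.value === observedHex.toUpperCase()
  );
}

// Example (shortened, illustrative fingerprint):
//   fingerprintMatches("v=0\r\na=fingerprint:sha-256 AB:CD:EF\r\n", "sha-256", "ab:cd:ef")
```

If an attacker substitutes their own certificate on the media path, its hash will not match the one carried in the SDP, which is why the integrity of the signaling channel matters.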

7. Stream media (and, optionally, data)

With a route found and keys exchanged, audio and video flow as SRTP-protected RTP packets, peer-to-peer where possible, relayed through TURN where not. A jitter buffer on each side smooths out packet timing, and the connection continuously adapts bitrate to the available network. If the app also opened an RTCDataChannel, arbitrary messages (chat, game state, file chunks, agent control signals) ride the same encrypted transport.
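One input a jitter buffer can use is the interarrival jitter estimate defined for RTP in RFC 3550, a running average of how much packet transit time varies. The sketch below implements that estimator in milliseconds for readability (the RFC works in RTP timestamp units). Real browser jitter buffers use more elaborate adaptive logic, so treat this as an illustration of the idea only; the names are hypothetical.

```typescript
interface PacketTiming {
  sentMs: number; // when the sender stamped the packet
  receivedMs: number; // when the receiver got it
}

// RFC 3550 interarrival jitter: J = J + (|D| - J) / 16,
// where D is the change in transit time between consecutive packets.
function interarrivalJitter(packets: PacketTiming[]): number {
  let jitter = 0;
  for (let i = 1; i < packets.length; i++) {
    const previousTransit = packets[i - 1].receivedMs - packets[i - 1].sentMs;
    const currentTransit = packets[i].receivedMs - packets[i].sentMs;
    const d = Math.abs(currentTransit - previousTransit);
    jitter += (d - jitter) / 16;
  }
  return jitter;
}

// Three packets sent 20 ms apart; the third arrives 16 ms later than expected.
const sampleJitter = interarrivalJitter([
  { sentMs: 0, receivedMs: 50 },
  { sentMs: 20, receivedMs: 70 },
  { sentMs: 40, receivedMs: 106 },
]);
```

The division by 16 smooths the estimate, so a single late packet nudges it rather than spiking it.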

The core WebRTC APIs

For all the protocol machinery underneath, the developer-facing surface is small. Three objects do most of the work:

API	What it does
`RTCPeerConnection`	The heart of WebRTC. Manages the peer connection: SDP negotiation, ICE candidate gathering, encryption (DTLS/SRTP), codec selection, and the flow of media tracks.
`MediaStream` / `getUserMedia`	Represents the audio and video tracks captured from the user’s mic and camera (or screen). These tracks are what you add to the connection to be sent.
`RTCDataChannel`	A bidirectional channel for arbitrary data over the same encrypted transport, used for chat, file transfer, telemetry, or control messages alongside the call. It runs over SCTP and can be configured reliable or unreliable, ordered or unordered.
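The reliable/unreliable and ordered/unordered choices for a data channel come down to three options passed at creation time. The sketch below maps a few illustrative profiles to those options. The profile names and the 500 ms lifetime are assumptions; ordered, maxRetransmits, and maxPacketLifeTime are the standard option names, and the last two cannot be set together.

```typescript
// Local type mirroring the relevant fields of the browser's RTCDataChannelInit.
interface ChannelOptions {
  ordered: boolean;
  maxRetransmits?: number;
  maxPacketLifeTime?: number; // milliseconds
}

type ChannelProfile = "reliable" | "lossy-realtime" | "time-bounded";

// Hypothetical helper mapping a use case to data channel options.
function channelOptions(profile: ChannelProfile): ChannelOptions {
  if (profile === "lossy-realtime") {
    // Never retransmit and never wait for order: suits game state or telemetry.
    return { ordered: false, maxRetransmits: 0 };
  }
  if (profile === "time-bounded") {
    // Retry for up to 500 ms (an assumed value), then give up on the message.
    return { ordered: true, maxPacketLifeTime: 500 };
  }
  // Default behavior: retransmit until delivered, in order (chat, file chunks).
  return { ordered: true };
}

// Browser usage (not executed here):
//   const channel = pc.createDataChannel("control", channelOptions("reliable"));
```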

A minimal voice call is roughly: getUserMedia() for the mic → add the track to a new RTCPeerConnection() → createOffer() / createAnswer() → trade SDP and ICE candidates over your signaling channel → media flows. Everything else (NAT traversal, encryption, retransmission) the browser handles for you.

WebRTC codecs

WebRTC peers negotiate codecs in the SDP and pick the best one both sides support. The browser ships a mandatory set so interoperability is guaranteed.

Type	Codec	Notes
Audio	Opus	The default and the one that matters for voice. Adaptive 6–510 kbps, wideband/full-band, built-in forward error correction and packet-loss concealment. Mandatory in WebRTC.
Audio	G.711 (PCMU/PCMA)	64 kbps, narrowband. Mandatory for interop, mainly used when bridging to the legacy phone network.
Video	VP8	Royalty-free, mandatory to implement. The classic WebRTC baseline.
Video	H.264	Mandatory to implement; widely hardware-accelerated; needed for interop with many SIP/telecom systems.
Video	VP9	Better compression than VP8, supports scalable (SVC) encoding.
Video	AV1	Newest, best compression, increasingly supported, heavier to encode.

For voice and especially voice AI, Opus is the codec to care about: its wideband, adaptive audio gives speech-to-text far more signal than G.711's 8 kHz sampling rate allows, and it degrades gracefully under packet loss.
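The numbers in the table translate directly into packet sizes and audio bandwidth. The sketch below does that arithmetic, assuming 20 ms packets (a common choice, not a requirement) and an illustrative 24 kbps Opus rate. The function names are hypothetical.

```typescript
// Payload bytes carried by one packet at a given bitrate and frame duration.
function payloadBytesPerPacket(bitrateBps: number, frameMs: number): number {
  return Math.ceil((bitrateBps * frameMs) / 1000 / 8);
}

// Nyquist limit: a sampled signal can represent frequencies up to half its sample rate.
function maxAudioBandwidthHz(sampleRateHz: number): number {
  return sampleRateHz / 2;
}

// G.711: 8000 samples/s * 8 bits = 64 kbps, so a 20 ms packet carries 160 bytes.
const g711Bytes = payloadBytesPerPacket(64000, 20);

// Opus at an assumed 24 kbps carries 60 bytes in the same 20 ms.
const opusBytes = payloadBytesPerPacket(24000, 20);

// G.711 cannot represent anything above 4 kHz; 48 kHz Opus can reach 24 kHz.
const g711CeilingHz = maxAudioBandwidthHz(8000);
const opusCeilingHz = maxAudioBandwidthHz(48000);
```

In other words, Opus can deliver several times the audio bandwidth of G.711 while using less than half the bits per packet in this example.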

WebRTC vs SIP and WebSockets

These three get conflated, but they solve different problems and often work together.

	WebRTC	SIP	WebSockets
What it is	Browser/app media engine (audio, video, data)	Signaling protocol to set up calls	Persistent two-way browser↔server channel
Carries media?	Yes (SRTP)	No (signals only; media is RTP)	Not natively (carries any bytes you frame)
Reaches the PSTN?	Not by itself	Yes, the telecom standard	No
Encryption	Mandatory (DTLS/SRTP)	Optional (TLS + SRTP)	Optional (WSS/TLS)
Typical use	In-browser/in-app calling	Trunking, carrier interconnect	Streaming audio frames to a server pipeline

In practice they combine. A common voice-AI architecture: a browser uses WebRTC to capture the mic and stream it, a server bridges that into a SIP trunk to reach a phone number, and a separate WebSocket streams raw audio into the STT → LLM → TTS pipeline. WebRTC is for the edge (browser/app), SIP is for the phone network, and WebSockets are for server-side media transport. For a deeper treatment of the last two, see SIP vs WebSockets and the audio streaming docs.
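For the WebSocket leg of such an architecture, the application has to frame the audio itself. The sketch below splits 16-bit mono PCM into fixed-duration frames ready to send as binary messages. The 20 ms frame size, the 24 kHz rate in the usage comment, and the function names are assumptions for illustration; any real streaming API will define its own format.

```typescript
// Number of samples in one frame of the given duration.
function frameSamples(sampleRateHz: number, frameMs: number): number {
  return Math.floor((sampleRateHz * frameMs) / 1000);
}

// Split mono 16-bit PCM into equal frames. A trailing partial frame is
// dropped in this sketch; a real sender would buffer it for the next call.
function splitIntoFrames(pcm: Int16Array, sampleRateHz: number, frameMs: number): Int16Array[] {
  const size = frameSamples(sampleRateHz, frameMs);
  if (size <= 0) throw new Error("frame size must be positive");
  const frames: Int16Array[] = [];
  for (let start = 0; start + size <= pcm.length; start += size) {
    frames.push(pcm.subarray(start, start + size)); // a view, not a copy
  }
  return frames;
}

// Browser usage (not executed here), assuming an open WebSocket named socket:
//   splitIntoFrames(samples, 24000, 20).forEach((frame) => socket.send(frame));
```

At 24 kHz, a 20 ms frame is 480 samples, or 960 bytes of 16-bit audio.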

WebRTC for voice AI

Voice AI is pulling WebRTC into the foreground because the most natural place for many agents to live is in the browser or app the user is already in, with no phone call required. A support widget, an in-app assistant, or a web demo can capture the mic with getUserMedia and stream it straight to the agent’s pipeline. What matters here is different from a classic video call:

Latency budget. A natural conversational turn has to fit under roughly one second across capture + transport + STT + LLM + TTS. WebRTC’s peer-to-peer media and tight jitter buffers help, but every extra hop (and every TURN relay) eats into that budget.
Audio fidelity. Opus at 16 to 24 kHz (wideband to super-wideband) gives the speech model more to work with than 8 kHz narrowband telephony audio, which directly improves recognition accuracy.
Barge-in. A real conversation lets the human interrupt. That requires genuinely bidirectional, streaming media, not record-then-respond.
The PSTN bridge. Most agents also need to take or place real phone calls. WebRTC handles the web edge; reaching a phone number still requires a SIP trunk or Voice API behind it. The web mic and the phone line have to meet in the middle.

This is why “WebRTC vs telephony” is a false choice for AI builders: you usually need both, bridged, with WebRTC for the app and a carrier path for the phone.
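The latency budget described above is simple addition, which makes it easy to check a design on paper. The sketch below sums per-stage latencies and compares the total with a one-second budget. Every stage value is a made-up placeholder, not a measurement, and the names are hypothetical.

```typescript
interface StageLatency {
  stage: string;
  ms: number;
}

function totalLatencyMs(stages: StageLatency[]): number {
  return stages.reduce(function (sum, s) { return sum + s.ms; }, 0);
}

function fitsBudget(stages: StageLatency[], budgetMs: number): boolean {
  return totalLatencyMs(stages) <= budgetMs;
}

// Placeholder numbers for illustration only.
const directPath: StageLatency[] = [
  { stage: "capture", ms: 20 },
  { stage: "transport", ms: 80 },
  { stage: "stt", ms: 200 },
  { stage: "llm", ms: 350 },
  { stage: "tts", ms: 150 },
];

// The same turn with extra hops added to the transport leg.
const extraHops: StageLatency[] = directPath.concat([{ stage: "extra hops", ms: 250 }]);

const directTotal = totalLatencyMs(directPath); // 800
const directOk = fitsBudget(directPath, 1000); // true
const extraHopsOk = fitsBudget(extraHops, 1000); // false: 1050 ms exceeds the budget
```

With these placeholder values the direct path leaves 200 ms of headroom, and a single slow leg is enough to use it up.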

The honest trade-offs

WebRTC is powerful, but it is not magic, and a production deployment runs into real limits:

No PSTN on its own. WebRTC connects browsers and apps to each other. It cannot dial a phone number without a media server or SIP gateway bridging it to the carrier network.
TURN costs real money. When peer-to-peer fails, all media relays through your TURN servers. That is bandwidth you pay for and a hop that adds latency. At scale, TURN is a genuine infrastructure line item, not a footnote.
Signaling is your problem. WebRTC deliberately leaves signaling undefined. You have to build and operate a reliable channel to exchange SDP and ICE candidates, and keep it up.
NAT and firewall variability. ICE handles most networks, but strict corporate firewalls and symmetric NAT can still force relays or, rarely, fail, which is why a well-provisioned TURN fleet matters.
Server-side scaling. Pure peer-to-peer breaks down beyond a couple of participants or when you need recording, transcription, or an AI pipeline in the path. Those cases call for a media server (SFU/MCU) or a streaming bridge.
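The scaling limit is easy to see with a little arithmetic. In a full mesh every participant sends a separate copy of their media to every other participant, while with an SFU each participant uploads once and the server fans the stream out. The sketch below counts connections and upload bandwidth for both shapes; the function names and the 1,000 kbps per-stream figure are assumptions for illustration.

```typescript
// Total peer connections in a full mesh of n participants.
function meshConnections(n: number): number {
  return (n * (n - 1)) / 2;
}

// Upload bandwidth per participant in a mesh: one copy per remote peer.
function meshUploadKbps(n: number, perStreamKbps: number): number {
  return Math.max(n - 1, 0) * perStreamKbps;
}

// Upload bandwidth per participant with an SFU: one copy, sent to the server.
function sfuUploadKbps(n: number, perStreamKbps: number): number {
  return n > 0 ? perStreamKbps : 0;
}

// Five participants at an assumed 1,000 kbps per stream.
const connectionsAtFive = meshConnections(5); // 10
const meshUploadAtFive = meshUploadKbps(5, 1000); // 4000
const sfuUploadAtFive = sfuUploadKbps(5, 1000); // 1000
```

Mesh upload grows linearly per participant and connection count grows quadratically, which is why group calls usually move to a media server well before they get large.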

How Vobiz handles WebRTC

Vobiz is the telephony infrastructure layer under voice AI. It does not build the agent; it powers the agents you build (Vapi, Retell, ElevenLabs, Pipecat, LiveKit, and more). For WebRTC specifically, that means handling the parts WebRTC leaves to you and bridging the web edge to the phone network:

WebRTC across web, iOS, and Android with live PSTN. Vobiz supports WebRTC application setup on browser and mobile, and bridges those sessions to real phone numbers, so a mic in a web page can talk to (or as) a phone call.
The PSTN bridge built in. Reach the phone network through Vobiz SIP trunking and the Voice API, with DID provisioning in 130+ countries and outbound connectivity to 190+, so your WebRTC edge connects to actual numbers.
AI-native media. Bidirectional WebSocket audio streaming at 24 kHz with native noise cancellation and barge-in, piped into your STT → LLM → TTS loop via <Stream> and the audio streams API.
Built for the latency budget. Sub-80 ms single-hop, event-driven telephony with direct carrier connect (vs 300–400 ms on legacy CPaaS), so the transport leg of the conversation stays small.
Secure by default. SRTP media encryption and TLS 1.3 signaling, matching WebRTC’s own mandatory-encryption posture end to end.
It powers your stack, not a locked-in agent. Voice-AI builders like Bolna, fintechs like Razorpay and Acko, and enterprises like KPMG run on Vobiz infrastructure. You keep your agent; Vobiz provides the rails.

Frequently asked questions

What does WebRTC stand for?

WebRTC stands for Web Real-Time Communication. It is an open standard (W3C APIs plus IETF protocols) that lets browsers and mobile apps stream audio, video, and data peer-to-peer in real time without plugins.

Is WebRTC peer-to-peer?

By design, yes: media flows directly between peers whenever the network allows it. When two peers cannot reach each other directly (strict NAT or firewalls), a TURN server relays the media instead. Setup (signaling) always goes through a server you provide.

What is the difference between WebRTC and SIP?

WebRTC is a browser/app media engine that carries the actual audio and video (over SRTP). SIP is a signaling protocol that sets up calls and is the standard for reaching the public phone network. WebRTC alone cannot dial a phone number; it is often bridged to SIP to do so.

Do I need STUN and TURN servers for WebRTC?

Almost always. STUN helps peers discover their public address for a direct connection and is needed for most calls. TURN is the relay fallback for when a direct path is impossible, and you pay for its bandwidth. Production WebRTC needs both configured.

Can WebRTC make phone calls?

Not on its own. WebRTC connects browsers and apps, not the phone network. To call a real number you bridge WebRTC to a SIP trunk or Voice API, which is what an infrastructure layer like Vobiz provides alongside WebRTC support on web, iOS, and Android.

Is WebRTC encrypted?

Yes, always. WebRTC mandates encryption: media uses SRTP with keys exchanged over a DTLS handshake, and there is no unencrypted mode. The DTLS fingerprints in the SDP protect the keying from man-in-the-middle attacks.

Sources

W3C, “WebRTC: Real-Time Communication in Browsers”.
WebRTC project, “Real-time communication for the web”.
MDN Web Docs, “WebRTC API”.
IETF, “Overview: Real-Time Protocols for Browser-Based Applications” (RFC 8825).
IETF, “Interactive Connectivity Establishment (ICE)” (RFC 8445).
IETF, “The Secure Real-time Transport Protocol (SRTP)” (RFC 3711).
IETF, “Definition of the Opus Audio Codec” (RFC 6716).
