Prerequisites
To use stream events, you must:
- Set bidirectional="true" on the <Stream> element.
- Have an active WebSocket connection established by Vobiz.
- Send events as JSON messages through the WebSocket.
1. Bidirectional XML setup
Bidirectional streaming is the prerequisite for sending any command back to Vobiz. Enable it by setting bidirectional="true" on the <Stream> XML element. The other options are listed under <Stream> attributes.
2. How events flow
Two directions, eight events in total (four each way). Each event is documented in detail in the Typical event sequence below.

App → Vobiz - commands you send to control the call

| Event | Purpose | Details |
|---|---|---|
| playAudio | Queue a 20 ms audio chunk for playback to the caller. | → Typical event sequence |
| checkpoint | Mark end of an utterance; ack arrives as playedStream. | → Typical event sequence |
| clearAudio | Drop queued playback audio on barge-in. | → Typical event sequence |
| stop | Terminate the stream from your side. | → Server-initiated stop |

Vobiz → App - events Vobiz sends to your app

| Event | Purpose |
|---|---|
| start | Stream connection established; carries callId, streamId, mediaFormat. Fires once. |
| media | Inbound 20 ms audio frame. Fires ~50 times per second per track. |
| playedStream | Ack that audio up to your most recent checkpoint finished playing. |
| clearedAudio | Ack that your clearAudio flushed the playback queue. |
3. Typical event sequence
A complete interactive turn - Vobiz opens the stream, your app greets the caller, the caller speaks, your app barges in with a new response.

Vobiz → App · start
Fires once, immediately after Vobiz upgrades the WebSocket. Use it to set up per-call state (transcribers, recording paths, etc.). The call/stream identifiers live inside the nested start object - not at the top level.

| Field | Notes |
|---|---|
| start.callId | Matches the CallUUID returned by the REST call-create response and the hangup_url webhook. |
| start.streamId | Required on every App → Vobiz command (clearAudio, checkpoint, stop; see playAudio note below). |
| start.accountId | Internal numeric Vobiz account ID. Corresponds to the public ParentAuthID (MA_…) that appears on lifecycle webhooks. |
| start.tracks | ["inbound"] when bidirectional="true". With bidirectional="false" you can also receive the outbound leg via audioTrack="both". |
| start.mediaFormat | Mirrors the contentType and rate negotiated on the <Stream> element. Sample rate is one of 8000 / 16000 / 24000. |
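The fields in the table above can be read with a small parser. This is a minimal sketch, not the official client: it assumes the frame carries a top-level event field set to "start" alongside the nested start object, and parseStartEvent is a hypothetical helper name. The encoding field inside mediaFormat is not shown on this page, so the object is passed through untouched.

```javascript
// Sketch: extract per-call state from a `start` frame.
// Assumption: frames look like { event: "start", start: { ... } }.
function parseStartEvent(raw) {
  const msg = JSON.parse(raw);
  if (msg.event !== "start" || typeof msg.start !== "object" || msg.start === null) {
    return null; // some other event; the caller decides what to do with it
  }
  // Identifiers live inside the nested `start` object, not at the top level.
  const { callId, streamId, accountId, tracks, mediaFormat } = msg.start;
  if (!callId || !streamId) {
    throw new Error("start event is missing start.callId or start.streamId");
  }
  return {
    callId,
    streamId, // needed later for clearAudio, checkpoint, and stop
    accountId,
    tracks: tracks ?? [],
    mediaFormat: mediaFormat ?? {}, // kept whole; only sampleRate is documented here
  };
}
```

Keeping the returned object as your per-call state means every later command can read streamId from one place.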
start.mediaFormat mirrors the rate you negotiated on the <Stream> element. The reference capture was taken at 16 kHz; <Stream contentType="audio/x-l16;rate=8000"> would produce "sampleRate": 8000 here, and 24 kHz would produce 24000 - whatever you configured is reflected back verbatim.

App → Vobiz · playAudio (greeting)
Queue a 20 ms chunk of audio for playback. Keep chunks small (~20 ms = 160 bytes of μ-law @ 8 kHz, 320 bytes of L16 @ 8 kHz) so barge-in via clearAudio is responsive. media.contentType and media.sampleRate must match the format Vobiz negotiated for this call (see start.mediaFormat above).
clearAudio, checkpoint, and stop all carry streamId. playAudio does not in any of our captured frames - the WebSocket only ever carries a single stream, so it can be inferred. Including streamId on playAudio is harmless if you prefer consistency across outbound commands.

App → Vobiz · checkpoint
Send right after the last playAudio chunk of an utterance. Vobiz replies with playedStream once it has actually delivered the queued audio to the caller.

Vobiz → App · playedStream
Acknowledgment that the audio queued before the checkpoint finished playing to the caller. The payload is just event + name - there is no streamId field. The name echoes the name you set in the matching checkpoint.

Vobiz → App · media (caller audio)
One frame every 20 ms while the caller is on the line (~50 per second per track). media.payload is base64-encoded raw audio in the encoding declared by start.mediaFormat.

| Field | Notes |
|---|---|
| sequenceNumber | Monotonic across the whole stream - starts at 0 on the start event and increments per message. |
| media.track | inbound (caller) or outbound (callee). |
| media.timestamp | Stream timestamp in ms on the Vobiz clock - not your server clock. |
| media.chunk | Per-stream monotonic chunk index. |
| media.payload | For L16 the bytes are network byte order (big-endian) - swap to little-endian before writing to a WAV file. |
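The byte-order note for media.payload can be sketched as follows. This assumes an L16 stream (μ-law payloads are single bytes and need no swap); decodeL16Payload is a hypothetical helper name, and Buffer.swap16 is the standard Node.js method for swapping 16-bit words in place.

```javascript
// Sketch: turn one base64 L16 payload into little-endian PCM bytes,
// ready to append to the data section of a WAV file.
function decodeL16Payload(base64Payload) {
  // Buffer.from() returns a fresh buffer, so mutating it in place is safe.
  const pcm = Buffer.from(base64Payload, "base64");
  if (pcm.length % 2 !== 0) {
    throw new Error("L16 payload must contain whole 16-bit samples");
  }
  return pcm.swap16(); // big-endian (network order) -> little-endian
}
```

Swapping per frame keeps memory flat; the alternative is to buffer the raw frames and swap once when the recording is flushed.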
App → Vobiz · clearAudio (barge-in)
Drops everything queued in Vobiz that hasn’t been streamed to the caller yet. Use this the moment your VAD detects the caller speaking over the bot.
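A barge-in guard along these lines avoids sending clearAudio repeatedly while the caller keeps talking. It is a sketch under assumptions: the message shape ({ event: "clearAudio", streamId }) is inferred from the streamId note on this page, and createBargeIn, botStarted, and callerSpeechDetected are hypothetical names for hooks you would wire to your own TTS and VAD.

```javascript
// Sketch: send clearAudio at most once per interruption.
// `send` is any function that writes a string to the WebSocket.
function createBargeIn(send, streamId) {
  let botSpeaking = false;  // true after audio is queued, until it is played or cleared
  let clearPending = false; // true between clearAudio and its clearedAudio ack

  return {
    botStarted() { botSpeaking = true; },
    botFinished() { botSpeaking = false; }, // call when playedStream arrives
    callerSpeechDetected() {
      if (!botSpeaking || clearPending) return false; // nothing to interrupt
      send(JSON.stringify({ event: "clearAudio", streamId }));
      clearPending = true;
      return true;
    },
    onEvent(msg) {
      if (msg.event === "clearedAudio") {
        clearPending = false;
        botSpeaking = false; // the playback queue is empty again
      }
    },
  };
}
```

Call callerSpeechDetected() from your VAD callback and feed every parsed inbound frame to onEvent().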
App → Vobiz · playAudio (new response)
Send the fresh response. Repeat steps 2–4 (playAudio → checkpoint → playedStream) for each utterance.

App → Vobiz · stop (end the stream)
When your agent is done, send a stop packet. The stream stops immediately and Vobiz proceeds to the next XML element in your response. If there is no next element, Vobiz hangs up the call with HangupCauseCode=4010 (“End Of XML Instructions”). There is no inbound stop ack - the WebSocket close itself confirms it. Full webhook flow and the Hangup payload are in Server-initiated stop below.

4. Ending the stream
Detecting end of stream
When the call ends, the last media frame arrives and the WebSocket closes. There is no in-band JSON stop event from Vobiz. The end-of-stream signals are, in order of arrival:
- The WebSocket close event - the canonical signal, universal across every termination path.
- (Server-initiated stops only) Event=StopStream POSTed to statusCallbackUrl.
- Event=Hangup POSTed to hangup_url - the authoritative “call is over” signal regardless of who ended the call.
In one captured call, the last media frame (sequenceNumber: 274) was immediately followed by the WebSocket close, which flushed a buffered WAV recording to disk. The payload of that final frame was 0xFF bytes (which is L16 -1, i.e. silence as the carrier wound the call down) - that is a property of the audio at that instant, not an end-of-stream indicator.
Empirical capture
Four independent calls captured across 2026-05-12 and 2026-05-13 with <Stream bidirectional="true">:

| callId | end trigger | start | media | inbound stop | StopStream cb | Hangup cb |
|---|---|---|---|---|---|---|
| 5401fd2e-…92e7 | killed mid-call | 1 | 1040 | 0 | ❌ not observed | ✅ |
| cf8ae0ac-…5353 | caller hangup | 1 | 244 | 0 | ❌ not observed | ✅ |
| ac3490a0-…f2e3 | server-initiated stop | 1 | 605 | 0 | ✅ | ✅ |
| 14ac7f05-…73af | server-initiated stop | 1 | 604 | 0 | ✅ | ✅ |

Event=StopStream is observed only when the server initiates the termination. Event=Hangup is observed in every case and is the safest authoritative end-of-call signal.
For a runnable reference that demonstrates this flow end-to-end (mid-stream frames → final frame → WAV flush on close), see the Bun Media Stream Server.
Server-initiated stop
You can terminate the stream from your side by sending a stop command over the WebSocket. The stream stops immediately and Vobiz proceeds to the next XML element in your response:
- If there is a next XML element (e.g. <Speak>, <Dial>, <Redirect>), Vobiz executes it. The call continues without <Stream>.
- If there is no next element, Vobiz hangs up the call. The Hangup webhook will report HangupCauseCode=4010 (“End Of XML Instructions”) and HangupSource=Vobiz. You do not need to follow <Stream> with <Hangup/> for this - it’s automatic.
Producer snippet
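One way to send stop and wait for its confirmation is sketched below. Because there is no inbound stop ack, the promise resolves on the socket's close event instead. The message shape ({ event: "stop", streamId }) is an assumption based on the streamId note earlier on this page, stopStream is a hypothetical helper, and socket is any object exposing send(data) and once("close", cb), as the ws package's WebSocket does.

```javascript
// Sketch: send `stop`, then treat the WebSocket close as the acknowledgment.
function stopStream(socket, streamId) {
  return new Promise((resolve) => {
    // Register the listener first so a fast close cannot be missed.
    socket.once("close", resolve);
    socket.send(JSON.stringify({ event: "stop", streamId }));
  });
}
```

In production you would likely add a timeout, since this promise never settles if the socket stays open.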
After you send stop:
| Step | What Vobiz does |
|---|---|
| 1 | Stream stops immediately; WebSocket closes |
| 2 | Next XML element executes - or, if there is no next element, the call hangs up automatically |
| 3 | Event=StopStream POSTed to statusCallbackUrl |
| 4 | Event=Hangup POSTed to hangup_url (only if the call ended - i.e. no further XML elements after <Stream>) |
Once you send stop, the WebSocket closes and the lifecycle webhooks (if applicable) follow. There is no matching inbound stop JSON event; the WebSocket close itself is your acknowledgment.
The REST equivalent of this is POST /audio-streams/.../stop.
5. Node.js handler
A minimal reference handler that wires up the four most common code paths: receiving start, queueing playAudio + checkpoint, processing inbound media, and reacting to playedStream. Use it as a skeleton; replace the bodies with your STT/LLM/TTS pipeline.
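A transport-agnostic version of such a skeleton might look like the following. It is a sketch, not the original handler from this page: createStreamHandler and its options are hypothetical, message shapes are inferred from the field tables above, and the audio option ({ contentType, sampleRate }) is supplied by you and must match what the call negotiated.

```javascript
// Sketch: the four code paths, with the socket abstracted as `send(string)`.
function createStreamHandler({ send, audio, greeting, onCallerAudio, onPlayed }) {
  let streamId = null;

  // Queue base64 chunks, then mark the end of the utterance.
  function speak(chunks, name) {
    for (const payload of chunks) {
      // playAudio carries no streamId in the captured frames.
      send(JSON.stringify({ event: "playAudio", media: { ...audio, payload } }));
    }
    send(JSON.stringify({ event: "checkpoint", streamId, name }));
  }

  function handleMessage(raw) {
    const msg = JSON.parse(raw);
    switch (msg.event) {
      case "start": // fires once: remember the stream, greet the caller
        streamId = msg.start.streamId;
        speak(greeting, "greeting");
        break;
      case "media": // ~50 frames per second: hand raw bytes to your STT
        onCallerAudio(Buffer.from(msg.media.payload, "base64"), msg.media);
        break;
      case "playedStream": // ack for the checkpoint with this name
        onPlayed(msg.name);
        break;
      default:
        break; // clearedAudio and anything else: ignored in this skeleton
    }
  }

  return { handleMessage, speak };
}
```

With the ws package you would create one handler per connection, pass send: (data) => ws.send(data), and call handleMessage(data.toString()) from the socket's message listener.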
Sending events from your WebSocket server
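The 20 ms chunk sizes quoted earlier (160 bytes of μ-law or 320 bytes of L16 at 8 kHz) follow from sampleRate × 0.02 × bytesPerSample. The sketch below applies that arithmetic to split a raw audio buffer into playAudio messages. chunkBytes and playAudioMessages are hypothetical helpers, and the contentType string is whatever your call negotiated; no value is assumed here.

```javascript
// Sketch: bytes in one frame of audio (20 ms by default).
function chunkBytes(sampleRate, bytesPerSample, frameMs = 20) {
  return (sampleRate * frameMs / 1000) * bytesPerSample;
}

// Sketch: yield one playAudio JSON string per 20 ms of `audioBytes`.
// `audioBytes` must already be in the negotiated encoding and byte order.
function* playAudioMessages(audioBytes, { contentType, sampleRate, bytesPerSample }) {
  const size = chunkBytes(sampleRate, bytesPerSample);
  for (let offset = 0; offset < audioBytes.length; offset += size) {
    // The final chunk may be shorter than 20 ms; it is sent as-is.
    const payload = audioBytes.subarray(offset, offset + size).toString("base64");
    yield JSON.stringify({ event: "playAudio", media: { contentType, sampleRate, payload } });
  }
}
```

Send each yielded string over the WebSocket, then follow the last one with a checkpoint so that playedStream tells you when the utterance has finished.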
6. Reproduce the capture
The captures referenced throughout this page (call counts, webhook sequence, final-frame log) come from a Bun reference server you can run yourself. The Bun Media Stream Server page has the full walkthrough. In short, a run records the start frame, ~50 media frames per second, an outbound stop packet at the 12-second mark, the WS-CLOSE, and the two lifecycle webhooks (StopStream, Hangup) - all timestamped in server.log.