Prerequisites

To use stream events, you must:
  • Set bidirectional="true" on the <Stream> element.
  • Have an active WebSocket connection established by Vobiz.
  • Send events as JSON messages through the WebSocket.

1. Bidirectional XML setup

Bidirectional streaming is the prerequisite for sending any command back to Vobiz. Configure it on the <Stream> XML element:
Enable bidirectional audio streaming
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Stream
        bidirectional="true"
        keepCallAlive="true"
        contentType="audio/x-l16;rate=8000">
        wss://stream.vobiz.ai/stream
    </Stream>
</Response>
See Initiate a Stream for the full set of <Stream> attributes.
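If your answer URL builds this XML dynamically, the document above can be produced with the standard library. This is a minimal sketch: `build_stream_xml` is a hypothetical helper name, and only the three attributes shown above are set.

```python
import xml.etree.ElementTree as ET


def build_stream_xml(ws_url: str, rate: int = 8000) -> str:
    """Render a <Response><Stream> document that enables bidirectional audio."""
    response = ET.Element("Response")
    stream = ET.SubElement(
        response,
        "Stream",
        {
            "bidirectional": "true",
            "keepCallAlive": "true",
            "contentType": f"audio/x-l16;rate={rate}",
        },
    )
    # The WebSocket URL is the text content of the <Stream> element.
    stream.text = ws_url
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


xml_doc = build_stream_xml("wss://stream.vobiz.ai/stream")
```

Building the element tree instead of concatenating strings keeps attribute values escaped correctly if the URL ever contains characters such as `&`.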

2. How events flow

Two directions, eight events total. Each event is documented in detail in the Typical event sequence below.

App → Vobiz - commands you send to control the call

| Event | Purpose | Details |
| --- | --- | --- |
| playAudio | Queue a 20 ms audio chunk for playback to the caller. | Typical event sequence, step 2 |
| checkpoint | Mark end of an utterance; ack arrives as playedStream. | Typical event sequence, step 3 |
| clearAudio | Drop queued playback audio on barge-in. | Typical event sequence, step 6 |
| stop | Terminate the stream from your side. | Server-initiated stop |

Vobiz → App - events your handler receives

| Event | Purpose |
| --- | --- |
| start | Stream connection established; carries callId, streamId, mediaFormat. Fires once. |
| media | Inbound 20 ms audio frame. Fires ~50 times per second per track. |
| playedStream | Ack that audio up to your most recent checkpoint finished playing. |
| clearedAudio | Ack that your clearAudio flushed the playback queue. |
There is no inbound stop event. Vobiz does not send { "event": "stop" } when the call ends - the WebSocket simply closes. Treat the WebSocket close event as your end-of-stream signal. See Detecting end of stream for the empirical evidence. The stop event exists only as an outbound command - see Server-initiated stop.
playedStream is conditional. It is only emitted if the audio queued before the matching checkpoint played to completion. If playback fails or is interrupted (e.g. by a clearAudio), you will not receive the ack.
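One way to apply the inbound event list above is to route each text frame by its event field and surface anything unexpected. The following is a minimal sketch; `classify_frame` and `INBOUND_EVENTS` are hypothetical names, not part of any Vobiz SDK.

```python
import json

# The four event names Vobiz sends to your handler, per the table above.
INBOUND_EVENTS = {"start", "media", "playedStream", "clearedAudio"}


def classify_frame(raw: str) -> tuple[str, dict]:
    """Parse one WebSocket text frame and return (event_name, frame)."""
    frame = json.loads(raw)
    event = frame.get("event")
    if event not in INBOUND_EVENTS:
        # "stop" is outbound-only, so it lands here along with anything unrecognized.
        return ("unknown", frame)
    return (event, frame)
```

A handler built on this would still rely on the WebSocket close event, not on any frame, to detect end of stream.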

3. Typical event sequence

A complete interactive turn - Vobiz opens the stream, your app greets the caller, the caller speaks, your app barges in with a new response.
Step 1: Vobiz → App · start

Fires once, immediately after Vobiz upgrades the WebSocket. Use it to set up per-call state (transcribers, recording paths, etc). The call/stream identifiers live inside the nested start object - not at the top level.
{
  "sequenceNumber": 0,
  "event": "start",
  "start": {
    "callId": "5401fd2e-6344-40df-a22c-c8ffea7a92e7",
    "streamId": "c4dfd815-a92a-4140-ab85-5ff28c004116",
    "accountId": "500025",
    "tracks": ["inbound"],
    "mediaFormat": {
      "encoding": "audio/x-l16",
      "sampleRate": 16000
    }
  },
  "extra_headers": "{}"
}
| Field | Notes |
| --- | --- |
| start.callId | Matches the CallUUID returned by the REST call-create response and the hangup_url webhook. |
| start.streamId | Required on every App → Vobiz command (clearAudio, checkpoint, stop; see playAudio note below). |
| start.accountId | Internal numeric Vobiz account ID. Corresponds to the public ParentAuthID (MA_…) that appears on lifecycle webhooks. |
| start.tracks | ["inbound"] when bidirectional="true". With bidirectional="false" you can also receive the outbound leg via audioTrack="both". |
| start.mediaFormat | Mirrors the contentType and rate negotiated on the <Stream> element. Sample rate is one of 8000 / 16000 / 24000. |
start.mediaFormat mirrors the rate you negotiated on the <Stream> element. The example above is from a 16 kHz capture; <Stream contentType="audio/x-l16;rate=8000"> would produce "sampleRate": 8000 here, and 24 kHz would produce 24000 - whatever you configured is reflected back verbatim.
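Because the identifiers are nested, per-call state is usually pulled out of the start frame once and kept for the life of the connection. A minimal sketch, assuming the frame shape shown above; `StreamInfo` and `parse_start` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class StreamInfo:
    call_id: str
    stream_id: str
    encoding: str
    sample_rate: int


def parse_start(raw: str) -> StreamInfo:
    """Extract per-call state from a start frame."""
    frame = json.loads(raw)
    start = frame["start"]  # identifiers are nested, not at the top level
    fmt = start["mediaFormat"]
    return StreamInfo(
        call_id=start["callId"],
        stream_id=start["streamId"],
        encoding=fmt["encoding"],
        sample_rate=fmt["sampleRate"],
    )


# The example frame from this step, trimmed to the fields used here.
sample_start = json.dumps({
    "sequenceNumber": 0,
    "event": "start",
    "start": {
        "callId": "5401fd2e-6344-40df-a22c-c8ffea7a92e7",
        "streamId": "c4dfd815-a92a-4140-ab85-5ff28c004116",
        "mediaFormat": {"encoding": "audio/x-l16", "sampleRate": 16000},
    },
})
info = parse_start(sample_start)
```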
Step 2: App → Vobiz · playAudio (greeting)

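Queue a 20 ms chunk of audio for playback. media.contentType and media.sampleRate must match the format Vobiz negotiated for this call (see start.mediaFormat above). The example below shows μ-law at 8 kHz; substitute the encoding and rate reported for your own call.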
{
  "event": "playAudio",
  "media": {
    "contentType": "audio/x-mulaw",
    "sampleRate": 8000,
    "payload": "base64-encoded-audio..."
  }
}
Keep chunks small (~20 ms = 160 bytes of μ-law @ 8 kHz, 320 bytes of L16 @ 8 kHz) so barge-in via clearAudio is responsive.
clearAudio, checkpoint, and stop all carry streamId. playAudio does not in any of our captured frames - the WebSocket only ever carries a single stream, so it can be inferred. Including streamId on playAudio is harmless if you prefer consistency across outbound commands.
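The chunk sizes quoted above follow from sample rate × frame duration × bytes per sample. The sketch below computes them and splits a raw audio buffer into playAudio messages; the helper names are hypothetical, and the final chunk may be shorter than 20 ms if the buffer is not a whole number of frames.

```python
import base64
import json

# Bytes per sample for the two encodings mentioned in this guide.
BYTES_PER_SAMPLE = {"audio/x-mulaw": 1, "audio/x-l16": 2}


def frame_size(content_type: str, sample_rate: int, frame_ms: int = 20) -> int:
    """Bytes in one frame of the given duration."""
    samples = sample_rate * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE[content_type]


def play_audio_messages(audio: bytes, content_type: str, sample_rate: int):
    """Yield one JSON playAudio message per 20 ms chunk of raw audio."""
    size = frame_size(content_type, sample_rate)
    for offset in range(0, len(audio), size):
        chunk = audio[offset:offset + size]
        yield json.dumps({
            "event": "playAudio",
            "media": {
                "contentType": content_type,
                "sampleRate": sample_rate,
                "payload": base64.b64encode(chunk).decode("ascii"),
            },
        })
```

For example, one second of 8 kHz μ-law (8000 bytes) yields 50 messages of 160 bytes each before base64 encoding.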
Step 3: App → Vobiz · checkpoint

Send right after the last playAudio chunk of an utterance. Vobiz replies with playedStream once it has actually delivered the queued audio to the caller.
{
  "event": "checkpoint",
  "streamId": "c4dfd815-a92a-4140-ab85-5ff28c004116",
  "name": "response-3"
}
Step 4: Vobiz → App · playedStream

Acknowledgment that the audio queued before the checkpoint finished playing to the caller. The payload is just event + name - there is no streamId field.
{
  "event": "playedStream",
  "name": "response-3"
}
The name echoes the name you set in the matching checkpoint.
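Since playedStream carries only the checkpoint name, matching acks to utterances is the application's job. A minimal sketch of that bookkeeping follows; `CheckpointTracker` is a hypothetical class, not a Vobiz API.

```python
import json


class CheckpointTracker:
    """Remembers checkpoint names until the matching playedStream arrives."""

    def __init__(self, stream_id: str) -> None:
        self.stream_id = stream_id
        self.pending: set[str] = set()

    def checkpoint(self, name: str) -> str:
        """Record the name and return the JSON command to send."""
        self.pending.add(name)
        return json.dumps({
            "event": "checkpoint",
            "streamId": self.stream_id,
            "name": name,
        })

    def on_played_stream(self, frame: dict) -> bool:
        """Return True if the ack matches a checkpoint we were waiting for."""
        name = frame.get("name")
        if name in self.pending:
            self.pending.discard(name)
            return True
        return False
```

Using a unique name per utterance (as with "response-3" above) keeps the match unambiguous when several checkpoints are in flight.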
Step 5: Vobiz → App · media (caller audio)

One frame every 20 ms while the caller is on the line (~50 per second per track). media.payload is base64-encoded raw audio in the encoding declared by start.mediaFormat.
{
  "sequenceNumber": 2,
  "streamId": "c4dfd815-a92a-4140-ab85-5ff28c004116",
  "event": "media",
  "media": {
    "track": "inbound",
    "timestamp": "1778597597091",
    "chunk": 2,
    "payload": "base64-user-audio..."
  },
  "extra_headers": "{}"
}
| Field | Notes |
| --- | --- |
| sequenceNumber | Monotonic across the whole stream - starts at 0 on the start event and increments per message. |
| media.track | inbound (caller) or outbound (callee). |
| media.timestamp | Stream timestamp in ms on the Vobiz clock - not your server clock. |
| media.chunk | Per-stream monotonic chunk index. |
| media.payload | For L16 the bytes are network byte order (big-endian) - swap to little-endian before writing to a WAV file. |
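The byte-order note above matters because WAV files store 16-bit PCM little-endian. The sketch below decodes L16 payloads, swaps each sample's two bytes, and writes a WAV using the standard library. It assumes a single mono track, and the function names are hypothetical.

```python
import base64
import io
import wave


def l16_be_to_le(raw: bytes) -> bytes:
    """Swap each 16-bit sample from big-endian to little-endian."""
    usable = len(raw) - (len(raw) % 2)  # ignore a trailing odd byte, if any
    swapped = bytearray(usable)
    swapped[0::2] = raw[1:usable:2]
    swapped[1::2] = raw[0:usable:2]
    return bytes(swapped)


def write_wav(payloads: list[str], sample_rate: int) -> bytes:
    """Build a mono 16-bit WAV from base64 media.payload strings."""
    pcm = b"".join(l16_be_to_le(base64.b64decode(p)) for p in payloads)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # assumption: one track (inbound only)
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Pass the sampleRate from start.mediaFormat so the WAV header matches the captured audio.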
Step 6: App → Vobiz · clearAudio (barge-in)

Drops everything queued in Vobiz that hasn’t been streamed to the caller yet. Use this the moment your VAD detects the caller speaking over the bot.
{
  "event": "clearAudio",
  "streamId": "c4dfd815-a92a-4140-ab85-5ff28c004116"
}
Step 7: Vobiz → App · clearedAudio

Acknowledgment that the queued playback audio was flushed.
{
  "event": "clearedAudio",
  "streamId": "c4dfd815-a92a-4140-ab85-5ff28c004116"
}
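Barge-in touches two pieces of state: the clearAudio command itself, and any checkpoints whose playedStream will now never arrive because playback was interrupted. A minimal sketch of that bookkeeping, using a hypothetical `PlaybackState` class:

```python
import json


class PlaybackState:
    """Tracks pending checkpoints and whether a clearAudio ack is outstanding."""

    def __init__(self, stream_id: str) -> None:
        self.stream_id = stream_id
        self.pending_checkpoints: set[str] = set()
        self.awaiting_clear_ack = False

    def barge_in(self) -> str:
        """Return the clearAudio command and reset playback expectations."""
        # Interrupted audio never produces playedStream, so stop waiting for it.
        self.pending_checkpoints.clear()
        self.awaiting_clear_ack = True
        return json.dumps({"event": "clearAudio", "streamId": self.stream_id})

    def on_frame(self, frame: dict) -> None:
        """Update state from an inbound frame."""
        if frame.get("event") == "clearedAudio":
            self.awaiting_clear_ack = False
```

Whether to hold new playAudio frames until clearedAudio arrives is a design choice for your application; this sketch only records the ack.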
Step 8: App → Vobiz · playAudio (new response)

Send the fresh response. Repeat steps 2–4 (playAudio → checkpoint → playedStream) for each utterance.
{
  "event": "playAudio",
  "media": {
    "contentType": "audio/x-mulaw",
    "sampleRate": 8000,
    "payload": "base64-new-audio..."
  }
}
Step 9: App → Vobiz · stop (end the stream)

When your agent is done, send a stop packet. The stream stops immediately and Vobiz proceeds to the next XML element in your response. If there is no next element, Vobiz hangs up the call with HangupCauseCode=4010 (“End Of XML Instructions”).
{
  "event": "stop",
  "streamId": "c4dfd815-a92a-4140-ab85-5ff28c004116"
}
There is no inbound stop ack - the WebSocket close itself confirms it. Full webhook flow and the Hangup payload are in Server-initiated stop below.

4. Ending the stream

Detecting end of stream

When the call ends, the last media frame arrives and the WebSocket closes. There is no in-band JSON stop event from Vobiz. The end-of-stream signals are, in order of arrival:
  1. The WebSocket close event - the canonical signal, universal across every termination path.
  2. (Server-initiated stops only) Event=StopStream POSTed to statusCallbackUrl.
  3. Event=Hangup POSTed to hangup_url - the authoritative “call is over” signal regardless of who ended the call.
Here is a real final frame from a live call (sequenceNumber: 274) immediately followed by the WebSocket close that flushes a buffered WAV recording to disk:
Last media frame, then socket close
{
  sequenceNumber: 274,
  streamId: "7f169b6e-130d-46a3-b135-e1cc342c1ca2",
  event: "media",
  media: {
    track: "inbound",
    timestamp: "1778574098786",
    chunk: 274,
    payload: "/////////////////////////////////////////////////////////8="
  },
  extra_headers: "{}"
}
WAV recording saved: /home/user/bun/recordings/call-2026-05-12T08-21-33-347Z.wav
The payload here happens to be all 0xFF bytes (every 16-bit L16 sample is 0xFFFF, i.e. -1, which is effectively silence as the carrier wound the call down). That is a property of the audio at that instant, not an end-of-stream indicator.
  • Do not detect end-of-call by inspecting media.payload (e.g. looking for all-/ base64 or all-0xFF bytes). That pattern is just silence at 8 kHz L16 - it can appear mid-call during any pause.
  • Do not wait for an inbound { "event": "stop" } on the WebSocket - Vobiz does not emit one.
  • The StopStream status callback is only observed when the server initiates the stop (it does fire reliably in that case). It does not fire when the caller hangs up or the call is killed mid-stream - fall back to the WebSocket close event and the Hangup webhook in those cases.
  • Flush any in-memory recording/transcript buffers from your WebSocket close handler.
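The flush-on-close guidance above can be sketched as a small buffer whose flush is triggered only by the close handler and is safe to call more than once. `CallBuffer` and the `sink` callable are hypothetical; `sink` stands in for whatever writes your recording or transcript.

```python
class CallBuffer:
    """Accumulates decoded media frames until the WebSocket closes."""

    def __init__(self) -> None:
        self.frames: list[bytes] = []
        self.flushed = False

    def on_media(self, audio: bytes) -> None:
        # Silence (for example all-0xFF bytes) is stored like any other
        # frame; it is never treated as an end-of-stream marker.
        self.frames.append(audio)

    def on_close(self, sink) -> int:
        """Flush once from the close handler; repeated calls are no-ops."""
        if self.flushed:
            return 0
        self.flushed = True
        data = b"".join(self.frames)
        sink(data)
        return len(data)
```

Making the flush idempotent keeps it safe if both a close event and an error path end up calling it.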

Empirical capture

Four independent calls captured across 2026-05-12 and 2026-05-13 with <Stream bidirectional="true">:
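| callId | end trigger | start | media | inbound stop | StopStream cb | Hangup cb |
| --- | --- | --- | --- | --- | --- | --- |
| 5401fd2e-…92e7 | killed mid-call | 1 | 1040 | 0 | ❌ not observed | observed |
| cf8ae0ac-…5353 | caller hangup | 1 | 244 | 0 | ❌ not observed | observed |
| ac3490a0-…f2e3 | server-initiated stop | 1 | 605 | 0 | observed | observed |
| 14ac7f05-…73af | server-initiated stop | 1 | 604 | 0 | observed | observed |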
The WebSocket close event is universal. Event=StopStream is observed only when the server initiates the termination. Event=Hangup is observed in every case and is the safest authoritative end-of-call signal. For a runnable reference that demonstrates this flow end-to-end (mid-stream frames → final frame → WAV flush on close), see the Bun Media Stream Server.

Server-initiated stop

You can terminate the stream from your side by sending a stop command over the WebSocket. The stream stops immediately and Vobiz proceeds to the next XML element in your response:
  • If there is a next XML element (e.g. <Speak>, <Dial>, <Redirect>), Vobiz executes it. The call continues without <Stream>.
  • If there is no next element, Vobiz hangs up the call. The Hangup webhook will report HangupCauseCode=4010 (“End Of XML Instructions”) and HangupSource=Vobiz. You do not need to follow <Stream> with <Hangup/> for this - it’s automatic.
{
  "event": "stop",
  "streamId": "c4dfd815-a92a-4140-ab85-5ff28c004116"
}
Producer snippet
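import json

async def send_stop(websocket, stream_id):
    # websocket is assumed to expose an async send_text(), as Starlette/FastAPI WebSockets do.
    stop_event = {
        "event": "stop",
        "streamId": stream_id,
    }
    await websocket.send_text(json.dumps(stop_event))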
What happens after you send the stop:
| Step | What Vobiz does |
| --- | --- |
| 1 | Stream stops immediately; WebSocket closes |
| 2 | Next XML element executes - or, if there is no next element, the call hangs up automatically |
| 3 | Event=StopStream POSTed to statusCallbackUrl |
| 4 | Event=Hangup POSTed to hangup_url (only if the call ended - i.e. no further XML elements after <Stream>) |
You don’t need to wait for any WebSocket reply - once you’ve sent the stop, the WS closes and the lifecycle webhooks (if applicable) follow. There is no matching inbound stop JSON event; the WebSocket close itself is your acknowledgment. The REST equivalent of this is POST /audio-streams/.../stop.
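The lifecycle signals described above can be combined into a single summary of how a call ended. The sketch below assumes the webhooks arrive as form-encoded bodies, which this page does not state, and uses only the field names that appear in this guide (Event, HangupCauseCode). `parse_webhook` and `describe_end` are hypothetical helpers.

```python
from urllib.parse import parse_qs


def parse_webhook(body: str) -> dict:
    """Flatten a form-encoded webhook body (assumed format) into a dict."""
    return {key: values[0] for key, values in parse_qs(body).items()}


def describe_end(ws_closed: bool, events: list[dict]) -> dict:
    """Summarize which end-of-stream signals were seen for one call."""
    names = {event.get("Event") for event in events}
    hangup = next((e for e in events if e.get("Event") == "Hangup"), None)
    return {
        "stream_over": ws_closed,                   # WebSocket close: the universal signal
        "server_initiated": "StopStream" in names,  # only seen after an outbound stop
        "call_over": hangup is not None,            # Hangup: authoritative end of call
        "end_of_xml": hangup is not None
        and hangup.get("HangupCauseCode") == "4010",
    }
```

A caller hangup would show stream_over and call_over without server_initiated, matching the empirical capture.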

5. Node.js handler

A minimal reference handler that wires up the four most common code paths: receiving start, queueing playAudio + checkpoint, processing inbound media, and reacting to playedStream. Use it as a skeleton; replace the bodies with your STT/LLM/TTS pipeline.
Sending events from your WebSocket server
const WebSocket = require('ws');

// Port 8080 is an arbitrary choice for this sketch; expose it at your <Stream> URL.
const wss = new WebSocket.Server({ port: 8080 });

// Placeholder: base64 greeting audio in the format negotiated for the call.
const greetingAudioBase64 = '';

wss.on('connection', (ws) => {
  // Per-connection state: each WebSocket carries a single stream.
  let streamId = null;

  ws.on('message', (message) => {
    const data = JSON.parse(message.toString());

    if (data.event === 'start') {
      // Identifiers live in the nested start object.
      streamId = data.start.streamId;
      console.log('Stream started:', streamId);

      // Play a greeting. For responsive barge-in, split real audio into ~20 ms chunks.
      sendPlayAudio(ws, greetingAudioBase64);

      // Send checkpoint to track when greeting finishes
      sendCheckpoint(ws, streamId, 'greeting-complete');
    }

    if (data.event === 'playedStream') {
      console.log('Checkpoint reached:', data.name);
      // Greeting played successfully - continue with next action
    }

    if (data.event === 'media') {
      // Process incoming audio
      const audioData = Buffer.from(data.media.payload, 'base64');
      // ... analyze, transcribe, etc.
    }
  });

  ws.on('close', () => {
    // End-of-stream signal - flush recordings/transcripts here.
    console.log('Stream ended:', streamId);
  });
});

function sendPlayAudio(ws, audioBase64) {
  ws.send(JSON.stringify({
    event: 'playAudio',
    media: {
      contentType: 'audio/x-l16',
      sampleRate: 8000,
      payload: audioBase64
    }
  }));
}

function sendCheckpoint(ws, streamId, checkpointName) {
  ws.send(JSON.stringify({
    event: 'checkpoint',
    streamId: streamId,
    name: checkpointName
  }));
}

function sendClearAudio(ws, streamId) {
  ws.send(JSON.stringify({
    event: 'clearAudio',
    streamId: streamId
  }));
}

function sendStop(ws, streamId) {
  ws.send(JSON.stringify({
    event: 'stop',
    streamId: streamId
  }));
}

6. Reproduce the capture

The captures referenced throughout this page (call counts, webhook sequence, final-frame log) come from a Bun reference server you can run yourself. The Bun Media Stream Server page has the full walkthrough - in short:
cd emaple/bun
AUTO_STOP_AFTER_MS=12000 bun start
# In another shell, trigger an outbound call via the Vobiz REST API
# with answer_url pointing at the ngrok URL fronting this server.
grep -E 'EVENT-(RECEIVED|SENT)|WEBHOOK-RECEIVED|WS-(OPEN|CLOSE)' server.log
This produces a start frame, ~50 media frames per second, an outbound stop packet at the 12-second mark, the WS-CLOSE, and the two lifecycle webhooks (StopStream, Hangup) - all timestamped in server.log.