

A minimal Bun server that answers a Vobiz call with a <Stream> XML response, accepts the inbound WebSocket, and writes each call’s audio to a local WAV file. Use it as a sink-only reference when bringing up a new agent stack — replace the WAV writer with your own STT/LLM/TTS pipeline once frames are flowing.

Run the example

bun install && bun start — server listens on port 3000 by default.

What it does

The server exposes three endpoints:
| Endpoint | Purpose |
| --- | --- |
| GET / | Returns VobizXML containing a <Stream> element pointing at /stream. |
| WS /stream | Receives start / media / stop JSON frames; writes audio to recordings/call-<timestamp>.wav. |
| POST /webhook | Receives status callbacks (StartStream, MediaError, StopStream, …) and appends each to webhook-events.log. |
It is sink-only — it does not send playAudio / checkpoint / clearAudio / stop packets back to Vobiz. The full bidirectional protocol is documented in Stream Events, Play Audio, Checkpoint Event, and Clear Audio; extend index.js with those once you wire in an agent.
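
All three endpoints can hang off a single Bun.serve call. A minimal sketch of that routing (not the real index.js; answerXml and logWebhookEvent are illustrative helpers, sketched further down):

```javascript
// Minimal routing sketch: three endpoints on one Bun.serve.
const PORT = Number(process.env.PORT ?? 3000);

Bun.serve({
  port: PORT,
  fetch(req, server) {
    const { pathname } = new URL(req.url);

    // WS /stream: hand the socket to the handlers below; return nothing on success.
    if (pathname === "/stream" && server.upgrade(req)) return;

    // GET /: answer the call with VobizXML pointing at /stream.
    if (pathname === "/" && req.method === "GET") {
      return new Response(answerXml(), {
        headers: { "Content-Type": "application/xml" },
      });
    }

    // POST /webhook: append the status callback to webhook-events.log.
    if (pathname === "/webhook" && req.method === "POST") {
      return logWebhookEvent(req);
    }

    return new Response("Not found", { status: 404 });
  },
  websocket: {
    open(ws) { /* allocate recordings/call-<timestamp>.wav */ },
    message(ws, data) { /* handle start / media / stop frames */ },
    close(ws) { /* flush buffered PCM as a WAV */ },
  },
});
```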

Run it

```bash
bun install
bun start
```

To change the port:

```bash
PORT=3333 bun index.js
```

Expose the local server with ngrok and update the two URL constants in index.js:

```bash
ngrok http 3000
```

```javascript
const PUBLIC_WS_URL = "wss://your-ngrok-domain.ngrok-free.app/stream";
const WEBHOOK_URL  = "https://your-ngrok-domain.ngrok-free.app/webhook";
```

VobizXML answer response

GET / returns:
```xml
<Response>
  <Stream
    bidirectional="true"
    statusCallbackUrl="https://your-ngrok-domain.ngrok-free.app/webhook"
    keepCallAlive="true">
    wss://your-ngrok-domain.ngrok-free.app/stream
  </Stream>
</Response>
```
| Attribute | Purpose |
| --- | --- |
| bidirectional="true" | Caller audio streams to your server and your server can stream audio back. |
| keepCallAlive="true" | Keeps the call up while no further XML is executing — required for long-running agent sessions. |
| statusCallbackUrl | Vobiz POSTs lifecycle events here (StartStream, MediaError, StopStream, …). |
See Initiate a Stream for the full set of <Stream> attributes.
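
In index.js this response is just a template string built from the two URL constants set during the ngrok step. A sketch of the hypothetical answerXml helper used in the routing sketch above:

```javascript
// Sketch: build the VobizXML answer from the URL constants defined near the top of index.js.
function answerXml() {
  return `<Response>
  <Stream
    bidirectional="true"
    statusCallbackUrl="${WEBHOOK_URL}"
    keepCallAlive="true">
    ${PUBLIC_WS_URL}
  </Stream>
</Response>`;
}
```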

What the handler does with each frame

  1. GET / returns the <Response><Stream>…</Stream></Response> XML.
  2. On WebSocket /stream open, allocates recordings/call-<ISO timestamp>.wav.
  3. On every media frame, base64-decodes media.payload, byte-swaps L16 → WAV PCM little-endian, and appends to the buffer.
  4. On close, flushes the buffered PCM to disk as a mono 16-bit WAV (steps 3 and 4 are sketched in code after this list).
  5. On POST /webhook, appends each status event to webhook-events.log as one JSON line.
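
Steps 3 and 4 come down to a base64 decode, a 16-bit byte swap, and a standard 44-byte RIFF header on flush. A minimal sketch, assuming one chunk list per call (module-level here for brevity; in practice you would key it by socket):

```javascript
// Sketch of the audio sink: decode each media payload, byte-swap it, buffer it,
// then wrap the buffered PCM in a RIFF/WAV header when the socket closes.
import { writeFileSync } from "node:fs";

const SAMPLE_RATE = 8000;
const chunks = []; // little-endian PCM for the current call

function onMediaFrame(frame) {
  // media.payload is base64 of 16-bit samples in network (big-endian) order;
  // swap16() flips each sample to the little-endian layout WAV PCM expects.
  chunks.push(Buffer.from(frame.media.payload, "base64").swap16());
}

function flushWav(path) {
  const data = Buffer.concat(chunks);
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + data.length, 4);
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);              // fmt chunk size
  header.writeUInt16LE(1, 20);               // audio format: PCM
  header.writeUInt16LE(1, 22);               // channels: mono
  header.writeUInt32LE(SAMPLE_RATE, 24);     // sample rate: 8000 Hz
  header.writeUInt32LE(SAMPLE_RATE * 2, 28); // byte rate: rate * channels * 2 bytes
  header.writeUInt16LE(2, 32);               // block align
  header.writeUInt16LE(16, 34);              // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(data.length, 40);
  writeFileSync(path, Buffer.concat([header, data]));
}
```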

Sample console output

A live call produces one media frame every 20 ms. A typical mid-stream frame looks like:
Mid-stream media frame
```
{
  sequenceNumber: 206,
  streamId: "7f169b6e-130d-46a3-b135-e1cc342c1ca2",
  event: "media",
  media: {
    track: "inbound",
    timestamp: "1778574097426",
    chunk: 206,
    payload: "+P/4/wgACAAIAAgA+P/o//j/+P/4//j/CAAIAAgACAAIAAgACAAIAPj/..."
  },
  extra_headers: "{}"
}
```

The last media frame

When the caller hangs up, the final media frame arrives, the WebSocket closes, and the buffered PCM is flushed to a WAV. There is no special marker inside the last frame — its shape is identical to every other media frame; stream end is signalled by the stop event (and the subsequent socket close), not by the frame contents. In this real example, sequenceNumber: 274 was the final frame — its payload happens to be all 0xFF bytes (L16 -1, i.e. silence as the carrier wound the call down), but that is a property of the audio at that instant, not an end-of-stream indicator:
Final media frame, immediately followed by socket close + WAV flush
```
{
  sequenceNumber: 274,
  streamId: "7f169b6e-130d-46a3-b135-e1cc342c1ca2",
  event: "media",
  media: {
    track: "inbound",
    timestamp: "1778574098786",
    chunk: 274,
    payload: "/////////////////////////////////////////////////////////8="
  },
  extra_headers: "{}"
}

WAV recording saved: /home/user/bun/recordings/call-2026-05-12T08-21-33-347Z.wav
```
Do not detect end-of-call by inspecting media.payload (e.g. looking for an all-/ base64 string or all-0xFF bytes). That pattern is just silence at 8 kHz L16 — it can appear mid-call during any pause. The only reliable end-of-stream signals are:
  1. The stop event on the WebSocket — see Stream events.
  2. The WebSocket close event itself.
  3. The StopStream status callback POSTed to your statusCallbackUrl.
The example server flushes the WAV in its close handler, which is why the “WAV recording saved” line is the last log entry of a call.
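
Concretely, the message handler only dispatches on event and leaves all teardown to close. A sketch using the helpers from the audio-sink sketch above (wavPath stands in for the per-call recording path):

```javascript
// Sketch of the /stream WebSocket handlers: the event field, never the payload bytes,
// drives the call lifecycle.
const streamHandlers = {
  message(ws, raw) {
    const frame = JSON.parse(String(raw));
    switch (frame.event) {
      case "start":
        // First frame of the stream; carries streamId and media metadata.
        break;
      case "media":
        onMediaFrame(frame); // decode, byte-swap, buffer (see the audio-sink sketch)
        break;
      case "stop":
        // Reliable end-of-stream signal; the socket close follows shortly.
        break;
    }
  },
  close(ws) {
    flushWav(wavPath);
    console.log(`WAV recording saved: ${wavPath}`);
  },
};
```

Pass streamHandlers as the websocket option of Bun.serve, as in the routing sketch near the top.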

Audio format

The example’s <Stream> element negotiates:
  • Encoding: L16 (linear 16-bit PCM, network byte order)
  • Sample rate: 8000 Hz
  • Channels: Mono
  • Chunk size: 20 ms (320 bytes per frame at 8 kHz L16)
For agent use cases, switch to μ-law by adding contentType="audio/x-mulaw;rate=8000" to the <Stream> tag — the JSON envelope is unchanged; only the bytes inside media.payload differ (see the snippet after the table below).
| Property | L16 | μ-law |
| --- | --- | --- |
| contentType | audio/l16 (this example) | audio/x-mulaw;rate=8000 |
| Bit depth | 16-bit | 8-bit |
| Bytes per 20 ms frame | 320 | 160 |
| Byte order | Network (big-endian) — swap to little-endian for WAV PCM | n/a |
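
As a concrete example of that change, the answer XML for a μ-law stream only gains the contentType attribute; everything else stays the same:

```xml
<Response>
  <Stream
    bidirectional="true"
    contentType="audio/x-mulaw;rate=8000"
    statusCallbackUrl="https://your-ngrok-domain.ngrok-free.app/webhook"
    keepCallAlive="true">
    wss://your-ngrok-domain.ngrok-free.app/stream
  </Stream>
</Response>
```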

Output files

recordings/call-YYYY-MM-DDTHH-MM-SS-MMMZ.wav
webhook-events.log
Each WebSocket connection produces a timestamped WAV. Webhook events are appended as one JSON line per event with a receivedAt timestamp.
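
A sketch of the hypothetical logWebhookEvent helper referenced in the routing sketch, assuming the request body is the event JSON:

```javascript
// Sketch of the /webhook handler: one JSON line per event, stamped with receivedAt.
import { appendFileSync } from "node:fs";

async function logWebhookEvent(req) {
  const event = await req.json();
  const line = JSON.stringify({ receivedAt: new Date().toISOString(), ...event });
  appendFileSync("webhook-events.log", line + "\n");
  return new Response("OK");
}
```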

Extending it

To turn this sink into a full agent, send these frames back over the same socket — see the linked protocol pages for the exact JSON shape and the hook-point sketch after this list for where the sends belong:
  • playAudio — queue a 20 ms chunk for playback.
  • checkpoint — mark end of an utterance; Vobiz replies with playedStream once it’s actually played.
  • clearAudio — drop queued audio on barge-in; Vobiz replies with clearedAudio.
  • stop — terminate the call leg from the WebSocket side without a second REST round-trip. Equivalent REST option: POST /audio-streams/.../stop.
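
A rough sketch of where those sends would hook in. buildPlayAudioFrame / buildCheckpointFrame / buildClearAudioFrame are placeholders whose exact JSON comes from the linked protocol pages, and speechToText / generateReply / textToSpeechChunks stand in for your own pipeline:

```javascript
// Rough sketch of agent hook points. The build*Frame helpers are placeholders:
// their exact JSON shape is defined by the Play Audio, Checkpoint, and Clear Audio pages.
async function onCallerAudio(ws, frame) {
  const transcript = await speechToText(frame.media.payload); // your STT
  if (!transcript) return;

  const reply = await generateReply(transcript);              // your LLM
  for (const chunk of await textToSpeechChunks(reply)) {      // your TTS, 20 ms chunks
    ws.send(JSON.stringify(buildPlayAudioFrame(chunk)));      // playAudio: queue a chunk
  }
  ws.send(JSON.stringify(buildCheckpointFrame()));            // checkpoint: expect playedStream back
}

function onBargeIn(ws) {
  ws.send(JSON.stringify(buildClearAudioFrame()));            // clearAudio: expect clearedAudio back
}
```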