Bun Media Stream Server

A minimal Bun server that answers a Vobiz call with <Stream> XML, accepts the inbound WebSocket, and writes each call’s audio to a local WAV file. Use it as a sink-only reference when bringing up a new agent stack: once frames are flowing, replace the WAV writer with your own STT/LLM/TTS pipeline.

Run the example

Run bun install && bun start. The server listens on port 3000 by default.

What it does

The server exposes three endpoints:

| Endpoint | Purpose |
| --- | --- |
| `GET /` | Returns VobizXML containing a `<Stream>` element pointing at `/stream`. |
| `WS /stream` | Receives `start` / `media` / `stop` JSON frames; writes audio to `recordings/call-<timestamp>.wav`. |
| `POST /webhook` | Receives status callbacks (`StartStream`, `MediaError`, `StopStream`, …) and appends each to `webhook-events.log`. |

It is sink-only — it does not send playAudio / checkpoint / clearAudio / stop packets back to Vobiz. The full bidirectional protocol is documented in Stream Events, Play Audio, Checkpoint Event, and Clear Audio; extend index.js with those once you wire in an agent.
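The three endpoints can be sketched as a small routing function wired into `Bun.serve`. This is an illustration only, not the contents of the example's index.js: `routeFor`, `startServer`, and the `handlers` object are hypothetical names, and reading the webhook body with `req.text()` is an assumption about how the callback is delivered.

```javascript
// Hypothetical helper: map a request onto one of the three endpoints.
function routeFor(method, pathname, isUpgrade) {
  if (pathname === "/stream" && isUpgrade) return "stream";
  if (method === "GET" && pathname === "/") return "answer";
  if (method === "POST" && pathname === "/webhook") return "webhook";
  return "not-found";
}

// Hypothetical wiring into Bun.serve. Only runs under Bun, and only when called.
// `handlers` is an assumed object supplying answerXml, onWebhook, onOpen,
// onMessage and onClose.
function startServer(handlers) {
  return Bun.serve({
    port: Number(process.env.PORT) || 3000,
    async fetch(req, server) {
      const { pathname } = new URL(req.url);
      const isUpgrade =
        req.headers.get("upgrade")?.toLowerCase() === "websocket";
      switch (routeFor(req.method, pathname, isUpgrade)) {
        case "stream":
          // server.upgrade() returns true when the WebSocket handshake succeeds.
          return server.upgrade(req)
            ? undefined
            : new Response("Upgrade failed", { status: 400 });
        case "answer":
          return new Response(handlers.answerXml(), {
            headers: { "Content-Type": "application/xml" },
          });
        case "webhook":
          await handlers.onWebhook(await req.text());
          return new Response("ok");
        default:
          return new Response("Not found", { status: 404 });
      }
    },
    websocket: {
      open(ws) { handlers.onOpen(ws); },
      message(ws, message) { handlers.onMessage(ws, message); },
      close(ws) { handlers.onClose(ws); },
    },
  });
}
```

Keeping the routing decision in a pure function makes it easy to check without opening a socket; `startServer` is the only part that depends on the Bun runtime.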

Run it

bun install
bun start

To change the port:

PORT=3333 bun index.js

Expose the local server with ngrok and update the two URL constants in index.js:

ngrok http 3000

const PUBLIC_WS_URL = "wss://your-ngrok-domain.ngrok-free.app/stream";
const WEBHOOK_URL  = "https://your-ngrok-domain.ngrok-free.app/webhook";

VobizXML answer response

GET / returns:

<Response>
  <Stream
    bidirectional="true"
    statusCallbackUrl="https://your-ngrok-domain.ngrok-free.app/webhook"
    keepCallAlive="true">
    wss://your-ngrok-domain.ngrok-free.app/stream
  </Stream>
</Response>

| Attribute | Purpose |
| --- | --- |
| `bidirectional="true"` | Caller audio streams to your server and your server can stream audio back. |
| `keepCallAlive="true"` | Keeps the call up while no further XML is executing; required for long-running agent sessions. |
| `statusCallbackUrl` | Vobiz POSTs lifecycle events here (`StartStream`, `MediaError`, `StopStream`, …). |

See Initiate a Stream for the full set of <Stream> attributes.
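As a rough sketch, the answer XML above could be generated from the two URL constants rather than hard-coded. `buildStreamXml` and `escapeXml` are hypothetical helpers introduced here for illustration; only the three attributes shown above are emitted.

```javascript
// Escape characters that are special in XML text and attribute values.
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: build the <Response><Stream> document returned by GET /.
function buildStreamXml({ wsUrl, webhookUrl }) {
  return [
    "<Response>",
    "  <Stream",
    '    bidirectional="true"',
    `    statusCallbackUrl="${escapeXml(webhookUrl)}"`,
    '    keepCallAlive="true">',
    `    ${escapeXml(wsUrl)}`,
    "  </Stream>",
    "</Response>",
  ].join("\n");
}

const xml = buildStreamXml({
  wsUrl: "wss://your-ngrok-domain.ngrok-free.app/stream",
  webhookUrl: "https://your-ngrok-domain.ngrok-free.app/webhook",
});
```

Escaping matters once a URL carries a query string, since a bare `&` would make the XML invalid.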

What the handler does with each frame

GET / returns the <Response><Stream>…</Stream></Response> XML.
On WebSocket /stream open, allocates recordings/call-<ISO timestamp>.wav.
On every media frame, base64-decodes media.payload, byte-swaps L16 → WAV PCM little-endian, and appends to the buffer.
On close, flushes the buffered PCM to disk as a mono 16-bit WAV.
On POST /webhook, appends each status event to webhook-events.log as one JSON line.
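The media-frame step can be sketched as follows. This assumes the payload is big-endian L16, as stated under Audio format; `decodeMediaFrame` is a hypothetical name. It is worth verifying the byte order against a real capture: if recordings come out as noise, the bytes may already be little-endian and the swap should be dropped.

```javascript
// Hypothetical helper: turn one WebSocket message into little-endian PCM bytes.
// Returns null for anything that is not a "media" frame.
function decodeMediaFrame(rawMessage) {
  const frame = JSON.parse(String(rawMessage));
  if (frame.event !== "media") return null;

  // base64 -> raw bytes
  const bytes = Buffer.from(frame.media.payload, "base64");

  // swap16() requires an even length, so drop a trailing odd byte if present.
  const pcm = bytes.subarray(0, bytes.length - (bytes.length % 2));

  // Assumption: payload is big-endian; WAV PCM wants little-endian.
  pcm.swap16(); // swaps each 16-bit pair in place
  return pcm;
}

// Usage: collect decoded chunks per connection and concatenate them on close.
const chunks = [];
const sample = JSON.stringify({
  event: "media",
  media: { payload: Buffer.from([0x00, 0x01, 0xff, 0xfe]).toString("base64") },
});
const decoded = decodeMediaFrame(sample);
if (decoded) chunks.push(decoded);
```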

Sample console output

A live call produces one media frame every 20 ms. A typical mid-stream frame looks like:

Mid-stream media frame

{
  sequenceNumber: 206,
  streamId: "7f169b6e-130d-46a3-b135-e1cc342c1ca2",
  event: "media",
  media: {
    track: "inbound",
    timestamp: "1778574097426",
    chunk: 206,
    payload: "+P/4/wgACAAIAAgA+P/o//j/+P/4//j/CAAIAAgACAAIAAgACAAIAPj/..."
  },
  extra_headers: "{}"
}

The last media frame

When the caller hangs up, the final media frame arrives, the WebSocket closes, and the buffered PCM is flushed to a WAV file. The last frame carries no special marker: its shape is identical to every other media frame, and stream end is signalled by the stop event (and the subsequent socket close), not by the frame contents. In this real example, the frame with sequenceNumber: 274 was the final one. Its payload happens to be all 0xFF bytes (every L16 sample is -1, effectively silence as the carrier wound the call down), but that is a property of the audio at that instant, not an end-of-stream indicator:

Final media frame, immediately followed by socket close + WAV flush

{
  sequenceNumber: 274,
  streamId: "7f169b6e-130d-46a3-b135-e1cc342c1ca2",
  event: "media",
  media: {
    track: "inbound",
    timestamp: "1778574098786",
    chunk: 274,
    payload: "/////////////////////////////////////////////////////////8="
  },
  extra_headers: "{}"
}
WAV recording saved: /home/user/bun/recordings/call-2026-05-12T08-21-33-347Z.wav

Do not detect end-of-call by inspecting media.payload (e.g. looking for all-/ base64 or all-0xFF bytes). That pattern is just silence at 8 kHz L16 — it can appear mid-call during any pause. The only reliable end-of-stream signals are:

The stop event on the WebSocket — see Stream events.
The WebSocket close event itself.
The StopStream status callback POSTed to your statusCallbackUrl.
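A minimal sketch of tracking end-of-stream from the WebSocket signals listed above rather than from payload contents. `createCallState`, `onSocketMessage`, and `onSocketClose` are hypothetical names; the only frame fields assumed are `event` and `media.payload`, as seen in the sample output.

```javascript
// Hypothetical per-connection state.
function createCallState() {
  return { payloads: [], ended: false, endReason: null };
}

// Dispatch on the event name only; never inspect the audio to decide if the call ended.
function onSocketMessage(state, rawMessage) {
  const frame = JSON.parse(String(rawMessage));
  switch (frame.event) {
    case "start":
      break; // stream metadata; nothing to buffer yet
    case "media":
      state.payloads.push(frame.media.payload); // silence is still just audio
      break;
    case "stop":
      state.ended = true;
      state.endReason = "stop-event";
      break;
  }
}

// The socket close is the fallback signal if no stop event was seen.
function onSocketClose(state) {
  if (!state.ended) {
    state.ended = true;
    state.endReason = "socket-close";
  }
}

const state = createCallState();
onSocketMessage(state, JSON.stringify({ event: "media", media: { payload: "////////" } }));
const endedAfterSilence = state.ended; // false: an all-0xFF payload does not end the call
onSocketMessage(state, JSON.stringify({ event: "stop" }));
```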

The example server flushes the WAV in its close handler, which is why the “WAV recording saved” line is the last log entry of a call.
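The flush step amounts to prepending a 44-byte RIFF/WAVE header to the buffered little-endian PCM. The sketch below uses the standard PCM header layout; `buildWav` is a hypothetical name and is not taken from the example's source.

```javascript
// Hypothetical helper: wrap little-endian 16-bit PCM in a canonical WAV container.
function buildWav(pcm, sampleRate = 8000, channels = 1) {
  const bitsPerSample = 16;
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const header = Buffer.alloc(44);

  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16); // fmt chunk size for PCM
  header.writeUInt16LE(1, 20); // audio format 1 = linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * blockAlign, 28); // byte rate
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(pcm.length, 40); // data chunk size

  return Buffer.concat([header, pcm]);
}

// One 20 ms frame of silence at 8 kHz mono: 320 bytes of PCM.
const wav = buildWav(Buffer.alloc(320));
```

In a close handler, the result could be written with something like `await Bun.write(path, buildWav(Buffer.concat(chunks)))`, where `path` and `chunks` are whatever the connection has accumulated.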

Audio format

The example’s <Stream> element negotiates:

Encoding: L16 (linear 16-bit PCM, network byte order)
Sample rate: 8000 Hz
Channels: Mono
Chunk size: 20 ms (320 bytes per frame at 8 kHz L16)

For agent use cases, switch to μ-law by adding contentType="audio/x-mulaw;rate=8000" to the <Stream> tag — the JSON envelope is unchanged, only the bytes inside media.payload differ.

| Property | L16 | μ-law |
| --- | --- | --- |
| `contentType` | `audio/l16` (this example) | `audio/x-mulaw;rate=8000` |
| Bit depth | 16-bit | 8-bit |
| Bytes per 20 ms frame | 320 | 160 |
| Byte order | Network (big-endian); swap to little-endian for WAV PCM | n/a |
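The frame sizes in the table follow directly from sample rate, sample width, and frame duration. The sketch below works through that arithmetic and also checks the 20 ms cadence against the two frames shown under Sample console output, assuming `media.timestamp` is in milliseconds. `bytesPerFrame` and `base64Length` are hypothetical helper names.

```javascript
// Bytes of raw audio in one frame.
function bytesPerFrame({ sampleRate, bytesPerSample, frameMs }) {
  const samples = (sampleRate * frameMs) / 1000;
  return samples * bytesPerSample;
}

// Length of the base64 string that carries `byteCount` bytes (with padding).
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

const l16 = bytesPerFrame({ sampleRate: 8000, bytesPerSample: 2, frameMs: 20 }); // 320
const mulaw = bytesPerFrame({ sampleRate: 8000, bytesPerSample: 1, frameMs: 20 }); // 160

const l16Base64 = base64Length(l16); // 428 characters
const mulawBase64 = base64Length(mulaw); // 216 characters

// Cadence check using the sample frames: chunk 206 and chunk 274.
const msPerFrame = (1778574098786 - 1778574097426) / (274 - 206); // 20
```

A payload whose decoded length differs from the expected frame size is a quick hint that the negotiated encoding is not the one the handler assumes.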

Output files

recordings/call-YYYY-MM-DDTHH-MM-SS-MMMZ.wav
webhook-events.log

Each WebSocket connection produces a timestamped WAV. Webhook events are appended as one JSON line per event with a receivedAt timestamp.
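One way to produce these names and log lines is sketched below. `recordingPath` and `webhookLogLine` are hypothetical helpers, and nesting the callback under an `event` key is an assumption: the source only states that each line is JSON and carries a `receivedAt` timestamp.

```javascript
// Build a filesystem-safe name from an ISO timestamp by replacing ":" and ".".
function recordingPath(date = new Date()) {
  const stamp = date.toISOString().replace(/[:.]/g, "-");
  return `recordings/call-${stamp}.wav`;
}

// Serialise one status callback as a single JSON line.
// Assumption: the event body is nested under "event"; the real layout may differ.
function webhookLogLine(event, now = new Date()) {
  return JSON.stringify({ receivedAt: now.toISOString(), event }) + "\n";
}

const when = new Date(Date.UTC(2026, 4, 12, 8, 21, 33, 347)); // month is 0-based
const path = recordingPath(when);
// "Event" is an illustrative field name, not a documented one.
const line = webhookLogLine({ Event: "StartStream" }, when);
```

Appending with `appendFile` from `node:fs/promises` (available in Bun) keeps one event per line, so the log can be read back with a simple split on newlines.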

Extending it

To turn this sink into a full agent, send these frames back over the same socket — see the linked protocol pages for the exact JSON shape:

playAudio — queue a 20 ms chunk for playback.
checkpoint — mark end of an utterance; Vobiz replies with playedStream once it’s actually played.
clearAudio — drop queued audio on barge-in; Vobiz replies with clearedAudio.
stop — terminate the call leg from the WebSocket side without a second REST round-trip. Equivalent REST option: POST /audio-streams/.../stop.

Related

Initiate a Stream: full <Stream> XML reference.
Stream events: every inbound JSON event.
Vobiz + Pipecat: same idea, Python + Pipecat pipeline instead of Bun.
WebSockets integration: protocol-level overview.
