<Stream> XML, accepts the inbound WebSocket, and writes each call’s audio to a local WAV file. Use it as a sink-only reference when bringing up a new agent stack - replace the WAV writer with your own STT/LLM/TTS pipeline once frames are flowing.
Run the example
bun install && bun start - server listens on port 3000 by default.
What it does
The server exposes three endpoints:
| Endpoint | Purpose |
|---|---|
| GET / | Returns VobizXML containing a <Stream> element pointing at /stream. |
| WS /stream | Receives start / media / stop JSON frames; writes audio to recordings/call-<timestamp>.wav. |
| POST /webhook | Receives status callbacks (StartStream, MediaError, StopStream, …) and appends each to webhook-events.log. |
This example is sink-only: it does not send playAudio / checkpoint / clearAudio / stop packets back to Vobiz. The full bidirectional protocol is documented in Stream Events, Play Audio, Checkpoint Event, and Clear Audio; extend index.js with those once you wire in an agent.
Run it
index.js:
VobizXML answer response
GET / returns:
| Attribute | Purpose |
|---|---|
| bidirectional="true" | Caller audio streams to your server and your server can stream audio back. |
| keepCallAlive="true" | Keeps the call up while no further XML is executing - required for long-running agent sessions. |
| statusCallbackUrl | Vobiz POSTs lifecycle events here (StartStream, MediaError, StopStream, …). |
See Initiate a Stream for the full list of <Stream> attributes.
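As a rough illustration of how such an answer document could be assembled, the sketch below builds the XML string from the three attributes in the table. It is not the example's actual code: buildAnswerXml and escapeXml are hypothetical helper names, the example.invalid hosts are placeholders, and putting the WebSocket URL in the element's text content is an assumption to check against the <Stream> reference.

```javascript
// Escape characters that are unsafe inside XML text or attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: returns a VobizXML answer document as a string.
// Assumption: the WebSocket URL is carried as the <Stream> element's text.
function buildAnswerXml({ streamUrl, statusCallbackUrl }) {
  const attrs = [
    'bidirectional="true"',
    'keepCallAlive="true"',
    `statusCallbackUrl="${escapeXml(statusCallbackUrl)}"`,
  ].join(" ");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Stream ${attrs}>${escapeXml(streamUrl)}</Stream>`,
    "</Response>",
  ].join("\n");
}

// Placeholder hosts; substitute the public URL of your own server.
console.log(
  buildAnswerXml({
    streamUrl: "wss://example.invalid/stream",
    statusCallbackUrl: "https://example.invalid/webhook",
  })
);
```

Escaping the attribute values matters as soon as the callback URL carries a query string, since a bare & would make the document invalid XML.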
What the handler does with each frame
- GET / returns the <Response><Stream>…</Stream></Response> XML.
- On WebSocket /stream open, allocates recordings/call-<ISO timestamp>.wav.
- On every media frame, base64-decodes media.payload, byte-swaps L16 → WAV PCM little-endian, and appends to the buffer.
- On close, flushes the buffered PCM to disk as a mono 16-bit WAV.
- On POST /webhook, appends each status event to webhook-events.log as one JSON line.
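The decode, byte-swap, and flush steps above can be sketched as follows. This is a minimal sketch, not the example's actual index.js: the function names are hypothetical, and it assumes the canonical 44-byte PCM WAV header and the 8000 Hz mono L16 format described on this page. It returns the WAV bytes instead of writing them, so the file I/O stays up to you.

```javascript
// Swap each big-endian (network order) 16-bit sample to little-endian.
// A trailing odd byte, if any, is dropped.
function swapL16ToLittleEndian(bigEndianPcm) {
  const out = Buffer.alloc(bigEndianPcm.length - (bigEndianPcm.length % 2));
  for (let i = 0; i + 1 < bigEndianPcm.length; i += 2) {
    out[i] = bigEndianPcm[i + 1];
    out[i + 1] = bigEndianPcm[i];
  }
  return out;
}

// Build a canonical 44-byte RIFF/WAVE header for mono 16-bit PCM.
function wavHeader(dataBytes, sampleRate = 8000) {
  const channels = 1;
  const bitsPerSample = 16;
  const blockAlign = (channels * bitsPerSample) / 8;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + dataBytes, 4); // file size minus first 8 bytes
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16); // fmt chunk size for plain PCM
  header.writeUInt16LE(1, 20); // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * blockAlign, 28); // byte rate
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(dataBytes, 40);
  return header;
}

// Turn a list of base64 media.payload strings into one WAV file image.
function framesToWav(base64Payloads, sampleRate = 8000) {
  const pcm = Buffer.concat(
    base64Payloads.map((p) => swapL16ToLittleEndian(Buffer.from(p, "base64")))
  );
  return Buffer.concat([wavHeader(pcm.length, sampleRate), pcm]);
}
```

Buffering the PCM and writing the header once at close keeps the data-size fields trivially correct; a streaming writer would instead have to go back and patch bytes 4 and 40 after the last frame.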
Sample console output
A live call produces one media frame every 20 ms. A typical mid-stream frame looks like:
Mid-stream media frame
The last media frame
When the caller hangs up, the final media frame arrives, the WebSocket closes, and the buffered PCM is flushed to a WAV. There is no special marker inside the last frame - its shape is identical to every other media frame; stream end is signalled by the stop event (and the subsequent socket close), not by the frame contents.
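One way to honour that rule in a handler is to key end-of-stream purely on the event type and never on the payload. The sketch below assumes each WebSocket message is a JSON object whose event field is "start", "media", or "stop"; that field name and the helper names are assumptions, so check Stream Events for the exact frame shape.

```javascript
// Hypothetical per-call state: decoded audio chunks plus an end flag.
function createStreamState() {
  return { chunks: [], frames: 0, ended: false };
}

// Process one raw WebSocket message and update the state in place.
// Assumption: the frame type is carried in a top-level `event` field.
function handleFrame(state, rawMessage) {
  const msg = JSON.parse(rawMessage);
  switch (msg.event) {
    case "media":
      // Payload content is never inspected for end-of-stream hints.
      state.chunks.push(Buffer.from(msg.media.payload, "base64"));
      state.frames += 1;
      break;
    case "stop":
      state.ended = true;
      break;
    default:
      // "start" and unrecognized events carry no audio to buffer.
      break;
  }
  return state;
}
```

A frame full of 0xFF bytes is buffered like any other; only a stop event (or the socket closing) should trigger the flush.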
In this real example, sequenceNumber: 274 was the final frame. Its payload happens to be all 0xFF bytes (every L16 sample is 0xFFFF, i.e. -1, effectively silence as the carrier wound the call down), but that is a property of the audio at that instant, not an end-of-stream indicator:
Final media frame, immediately followed by socket close + WAV flush
Audio format
The example’s <Stream> element negotiates:
- Encoding: L16 (linear 16-bit PCM, network byte order)
- Sample rate: 8000 Hz
- Channels: Mono
- Chunk size: 20 ms (320 bytes per frame at 8 kHz L16)
To switch to μ-law, add contentType="audio/x-mulaw;rate=8000" to the <Stream> tag - the JSON envelope is unchanged; only the bytes inside media.payload differ.
| Property | L16 | μ-law |
|---|---|---|
| contentType | audio/l16 (this example) | audio/x-mulaw;rate=8000 |
| Bit depth | 16-bit | 8-bit |
| Bytes per 20 ms frame | 320 | 160 |
| Byte order | Network (big-endian) - swap to little-endian for WAV PCM | n/a |
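The frame sizes in the table follow directly from sample rate, bit depth, and chunk duration. A small helper (hypothetical name, not part of the example) makes the arithmetic explicit and can double as a sanity check on incoming payload lengths:

```javascript
// Bytes in one audio frame: samples per frame x channels x bytes per sample.
function bytesPerFrame(sampleRateHz, bitsPerSample, frameMs = 20, channels = 1) {
  const samplesPerFrame = (sampleRateHz * frameMs) / 1000;
  return samplesPerFrame * channels * (bitsPerSample / 8);
}

console.log(bytesPerFrame(8000, 16)); // L16 at 8 kHz: 320
console.log(bytesPerFrame(8000, 8)); // μ-law at 8 kHz: 160
```

At 8000 Hz a 20 ms chunk holds 160 samples, so L16 needs 160 × 2 = 320 bytes and μ-law needs 160 × 1 = 160 bytes.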
Output files
receivedAt timestamp.
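To illustrate the one-JSON-line-per-event log format, the sketch below serializes a status callback together with a receivedAt timestamp and parses a log back into objects. The wrapper shape (receivedAt plus an event field) and both function names are assumptions made for illustration; the example's real log lines may be laid out differently.

```javascript
// Hypothetical: wrap a status callback body with an ISO receivedAt timestamp
// and return it as a single newline-terminated JSON line.
function toLogLine(event, receivedAt = new Date()) {
  return JSON.stringify({ receivedAt: receivedAt.toISOString(), event }) + "\n";
}

// Parse the text of a JSON-lines log back into an array of objects,
// skipping blank lines such as the one after the final newline.
function parseLog(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}
```

One object per line keeps the log append-only and lets standard line-oriented tools filter it without loading the whole file.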
Extending it
To turn this sink into a full agent, send these frames back over the same socket - see the linked protocol pages for the exact JSON shape:
- playAudio - queue a 20 ms chunk for playback.
- checkpoint - mark end of an utterance; Vobiz replies with playedStream once it’s actually played.
- clearAudio - drop queued audio on barge-in; Vobiz replies with clearedAudio.
- stop - terminate the call leg from the WebSocket side without a second REST round-trip. Equivalent REST option: POST /audio-streams/.../stop.
Related
- Initiate a Stream - full <Stream> XML reference.
- Stream events - every inbound JSON event.
- Vobiz + Pipecat - same idea, Python + Pipecat pipeline instead of Bun.
- WebSockets integration - protocol-level overview.