Testing Voice Agents
Scenario lets you write end-to-end tests for agents that listen to audio, think, and respond with either text or audio. This page shows three common patterns and how to test them.
Overview
- Audio ➜ Text – the agent receives an audio file (e.g. WAV) and replies with a textual answer.
- Audio ➜ Audio – the agent listens and replies in audio (great for voice assistants).
- Voice-to-Voice Conversation – both the user simulator and the agent speak over multiple turns (came out before OpenAI’s real-time audio API and still works today).
Use-case comparison
| Scenario | Input | Expected Output | Typical Judge Model |
|---|---|---|---|
| Audio ➜ Text | file part (audio) + optional prompt | Text | `gpt-4o-audio-preview` or any GPT-4-level text model |
| Audio ➜ Audio | file part (audio) + optional prompt | Audio (voice response) | `gpt-4o-audio-preview` (handles audio) |
| Voice-to-Voice Conversation | Multiple turns, both sides send/receive audio | Audio dialogue | Same as above; judge runs after the conversation |
Prerequisites & Setup
Before running the examples you’ll need a couple of things in place:
- Node.js ≥ 18 – the OpenAI `gpt-4o` voice model uses modern `fetch`.
- `OPENAI_API_KEY` – export an API key that has access to `gpt-4o-audio-preview`:

```bash
export OPENAI_API_KEY="sk-…"
```
Code Walk-through
```typescript
import scenario, {
  AgentAdapter,
  AgentInput,
  AgentRole,
} from "@langwatch/scenario";
import { describe, it, expect } from "vitest";
import OpenAI from "openai";
import { openai } from "@ai-sdk/openai";
import { ChatCompletionMessageParam } from "openai/resources/chat/completions.mjs";
import { encodeAudioToBase64, getFixturePath } from "./helpers";
import { CoreUserMessage } from "ai";
import { convertCoreMessagesToOpenAIMessages } from "./helpers/convert-core-messages-to-openai";

class AudioAgent extends AgentAdapter {
  role: AgentRole = AgentRole.AGENT;
  private openai = new OpenAI();

  call = async (input: AgentInput) => {
    // Convert Core messages → OpenAI shape so the voice model accepts them
    const messages = convertCoreMessagesToOpenAIMessages(input.messages);
    const response = await this.respond(messages);

    // Scenario expects **text**, so we extract the transcript only
    const transcript = response.choices[0].message?.audio?.transcript;
    if (typeof transcript === "string") return transcript;

    throw new Error("Agent failed to generate a response");
  };

  private async respond(messages: ChatCompletionMessageParam[]) {
    return this.openai.chat.completions.create({
      model: "gpt-4o-audio-preview",
      modalities: ["text", "audio"],
      audio: { voice: "alloy", format: "wav" },
      messages,
      store: false,
    });
  }
}

const setId = "multimodal-audio-test";

describe("Multimodal Audio to Text Tests", () => {
  it("should handle audio input", async () => {
    const data = encodeAudioToBase64(
      getFixturePath("male_or_female_voice.wav")
    );

    const audioMessage = {
      role: "user",
      content: [
        { type: "text", text: "Answer the question in the audio…" },
        { type: "file", mimeType: "audio/wav", data },
      ],
    } satisfies CoreUserMessage;

    const audioJudge = scenario.judgeAgent({
      model: openai("gpt-4o-audio-preview"),
      criteria: [
        "Agent correctly guesses it's a male voice",
        "Agent repeats the question",
        "Agent says what format the input was in (audio or text)",
      ],
    });

    const result = await scenario.run({
      name: "multimodal audio analysis",
      description: "User sends audio, agent transcribes & analyses",
      agents: [new AudioAgent(), scenario.userSimulatorAgent(), audioJudge],
      script: [
        scenario.message(audioMessage),
        scenario.agent(),
        scenario.judge(),
      ],
      setId,
    });

    expect(result.success).toBe(true);
  });
});
```
Helper utilities & caveats
The examples above rely on several helper utilities that handle the complexity of working with OpenAI's voice models:
- `encodeAudioToBase64` – converts audio files to base64-encoded strings for transmission in messages. Used to prepare audio fixtures for testing.
- `getFixturePath` – resolves paths to test fixtures (like audio files) relative to the test directory.
- `convertCoreMessagesToOpenAIMessages` – converts Scenario's CoreMessage format to OpenAI's ChatCompletion format, handling audio file detection and transformation into the `input_audio` shape that GPT-4o expects.
- `OpenAiVoiceAgent` – abstract base class that handles the OpenAI voice API calls, message conversion, and response processing. Used by both the agent and the user simulator in voice-to-voice conversations.
- `messageRoleReversal` – swaps user ↔ assistant roles in messages (excluding tool calls) so the voice user simulator can speak as the user.
- `saveConversationAudio` – saves audio responses to files for debugging and analysis.
- `concatenateWavFiles` – combines multiple WAV files into a single conversation recording.
- `getAudioSegments` – extracts individual audio segments from a conversation for analysis.
These helpers take care of audio encoding, file management, message-format conversion, and the OpenAI API's specific requirements for voice models.
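To make those descriptions concrete, here is a minimal sketch of how three of these helpers could be implemented. It is not the code from the example repo: the `tests/fixtures` location, the base64/WAV assumptions, and the simplified role handling are all illustrative.

```typescript
import fs from "node:fs";
import path from "node:path";
import type { CoreMessage } from "ai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions.mjs";

// Read an audio fixture and return it as a base64 string.
export function encodeAudioToBase64(filePath: string): string {
  return fs.readFileSync(filePath).toString("base64");
}

// Resolve a fixture name – assumes fixtures live in tests/fixtures (adjust to your layout).
export function getFixturePath(fileName: string): string {
  return path.resolve(process.cwd(), "tests/fixtures", fileName);
}

// Map Scenario's CoreMessage format onto OpenAI's ChatCompletion format,
// turning audio file parts into the `input_audio` shape gpt-4o-audio-preview expects.
export function convertCoreMessagesToOpenAIMessages(
  messages: CoreMessage[]
): ChatCompletionMessageParam[] {
  return messages.map((message): ChatCompletionMessageParam => {
    // Plain string content (and non-user roles) passes straight through as text.
    // A real implementation would also handle tool messages properly.
    if (message.role !== "user" || typeof message.content === "string") {
      return {
        role: message.role,
        content: typeof message.content === "string" ? message.content : "",
      } as ChatCompletionMessageParam;
    }

    // Multi-part user messages: keep text parts, convert audio file parts.
    const content = message.content.map((part) => {
      if (part.type === "file" && part.mimeType.startsWith("audio/")) {
        return {
          type: "input_audio" as const,
          // Assumes the file part already carries a base64 string (see encodeAudioToBase64).
          input_audio: { data: String(part.data), format: "wav" as const },
        };
      }
      return { type: "text" as const, text: part.type === "text" ? part.text : "" };
    });

    return { role: "user", content };
  });
}
```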
💡 Caveat: When the assistant responds with audio, some judge models may ignore the audio chunk unless the role is set to `"user"`. Pass `forceUserRole: true` to `OpenAiVoiceAgent` when you hit that edge case.
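If you do hit it, enabling the flag might look roughly like this. The option name comes from the caveat above, but the rest of the constructor shape is an assumption, so check the helper's actual signature in the example sources:

```typescript
// Hypothetical usage – assumes OpenAiVoiceAgent takes an options object.
const voiceAgent = new OpenAiVoiceAgent({
  model: "gpt-4o-audio-preview",
  voice: "alloy",
  // Re-label audio replies as "user" messages so judge models don't drop them.
  forceUserRole: true,
});
```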
Troubleshooting & FAQs
**Judge ignores assistant audio**
Some judge models drop the audio chunk when it comes from the assistant role. Pass `forceUserRole: true` to `OpenAiVoiceAgent`—this wraps the audio in a `"user"` message so the judge evaluates it correctly.
**Tests time out or hang**
Voice models are slower than text-only ones. Bump Vitest's timeout (`--timeout 60000`) or reduce concurrency (`VITEST_MAX_WORKERS=1`) when running in CI.
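If you prefer setting the timeout in configuration instead of on the command line, a minimal `vitest.config.ts` could look like the sketch below; the 60-second value is just a starting point.

```typescript
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Voice round-trips regularly exceed Vitest's 5 s default timeout.
    testTimeout: 60_000,
  },
});
```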
**“Unsupported media type” errors**
`convertCoreMessagesToOpenAIMessages` currently supports only WAV and MP3.
**CI machines without audio hardware**
All examples work headlessly—no speakers or microphone required.
**Python equivalent?**
Coming soon!
💸 Cost tip: each voice request is billed as a standard `gpt-4o-audio-preview` call. Keep fixture lengths and turn counts reasonable.
Complete Sources
Browse the full tests in the repo.
Need more? See the fixtures guide and agent docs.