
Testing Voice Agents

Evaluate your voice agent with Scenario

Scenario lets you write end-to-end tests for agents that listen to audio, think, and respond with either text or audio. This page shows three common patterns and how to test them.

Overview

  1. Audio ➜ Text – the agent receives an audio file (e.g. WAV) and replies with a textual answer.
  2. Audio ➜ Audio – the agent listens and replies in audio (great for voice assistants).
  3. Voice-to-Voice Conversation – both the user simulator and the agent speak over multiple turns (this pattern predates OpenAI’s real-time audio API and still works today).

Use-case comparison

| Scenario | Input | Expected Output | Typical Judge Model |
| --- | --- | --- | --- |
| Audio ➜ Text | File part (audio) + optional prompt | Text | gpt-4o-audio-preview or any GPT-4-level text model |
| Audio ➜ Audio | File part (audio) + optional prompt | Audio (voice response) | gpt-4o-audio-preview (handles audio) |
| Voice-to-Voice Conversation | Multiple turns, both sides send/receive audio | Audio dialogue | Same as above; the judge runs after the conversation |

Prerequisites & Setup

Before running the examples you’ll need a couple of things in place:

  1. Node.js ≥ 18 – the OpenAI client relies on the modern fetch API.
  2. OPENAI_API_KEY – export an API key that has access to gpt-4o-audio-preview:
export OPENAI_API_KEY="sk-…"
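
  3. Packages – install the libraries imported in the examples below (assuming an npm-based project; swap in your package manager of choice):
npm install @langwatch/scenario openai @ai-sdk/openai ai
npm install -D vitest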

Code Walk-through

Audio ➜ Text (TypeScript)
import scenario, {
  AgentAdapter,
  AgentInput,
  AgentRole,
} from "@langwatch/scenario";
import { describe, it, expect } from "vitest";
import OpenAI from "openai";
import { openai } from "@ai-sdk/openai";
import { ChatCompletionMessageParam } from "openai/resources/chat/completions.mjs";
import { encodeAudioToBase64, getFixturePath } from "./helpers";
import { CoreUserMessage } from "ai";
import { convertCoreMessagesToOpenAIMessages } from "./helpers/convert-core-messages-to-openai";
 
class AudioAgent extends AgentAdapter {
  role: AgentRole = AgentRole.AGENT;
  private openai = new OpenAI();
 
  call = async (input: AgentInput) => {
    // Convert Core messages → OpenAI shape so the voice-model accepts them
    const messages = convertCoreMessagesToOpenAIMessages(input.messages);
    const response = await this.respond(messages);
 
    // Scenario expects **text**, so we extract the transcript only
    const transcript = response.choices[0].message?.audio?.transcript;
    if (typeof transcript === "string") return transcript;
    throw new Error("Agent failed to generate a response");
  };
 
  private async respond(messages: ChatCompletionMessageParam[]) {
    return this.openai.chat.completions.create({
      model: "gpt-4o-audio-preview",
      modalities: ["text", "audio"],
      audio: { voice: "alloy", format: "wav" },
      messages,
      store: false,
    });
  }
}
 
const setId = "multimodal-audio-test";
 
describe("Multimodal Audio to Text Tests", () => {
  it("should handle audio input", async () => {
    const data = encodeAudioToBase64(
      getFixturePath("male_or_female_voice.wav")
    );
 
    const audioMessage = {
      role: "user",
      content: [
        { type: "text", text: "Answer the question in the audio…" },
        { type: "file", mimeType: "audio/wav", data },
      ],
    } satisfies CoreUserMessage;
 
    const audioJudge = scenario.judgeAgent({
      model: openai("gpt-4o-audio-preview"),
      criteria: [
        "Agent correctly guesses it's a male voice",
        "Agent repeats the question",
        "Agent says what format the input was in (audio or text)",
      ],
    });
 
    const result = await scenario.run({
      name: "multimodal audio analysis",
      description: "User sends audio, agent transcribes & analyses",
      agents: [new AudioAgent(), scenario.userSimulatorAgent(), audioJudge],
      script: [
        scenario.message(audioMessage),
        scenario.agent(),
        scenario.judge(),
      ],
      setId,
    });
 
    expect(result.success).toBe(true);
  });
});
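
The test above imports encodeAudioToBase64 and getFixturePath from a local ./helpers module that is not shown on this page. They are small utilities; a minimal sketch, assuming the audio fixtures sit in a fixtures folder next to the helpers, could look like this:

import { readFileSync } from "node:fs";
import path from "node:path";
import { fileURLToPath } from "node:url";

// Assumed layout: audio fixtures live in a "fixtures" folder next to this module.
const fixturesDir = path.join(
  path.dirname(fileURLToPath(import.meta.url)),
  "fixtures"
);

// Resolve a test fixture (e.g. a WAV file) to an absolute path.
export function getFixturePath(fileName: string): string {
  return path.join(fixturesDir, fileName);
}

// Read an audio file and return it as a base64 string, ready to drop into
// the `data` field of a CoreUserMessage file part.
export function encodeAudioToBase64(filePath: string): string {
  return readFileSync(filePath).toString("base64");
}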

Listen to a real conversation

Below is the actual audio produced by the voice-to-voice conversation test (full source in the repo). Click play to hear the agent and user simulator exchanging ideas.
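
That recording is stitched together with the concatenateWavFiles helper listed in the next section. The real helper lives in the example repo; as a rough sketch, concatenating standard PCM WAV segments (assuming they share the same format and a plain 44-byte header) boils down to stripping the extra headers and patching the chunk sizes:

import { readFileSync, writeFileSync } from "node:fs";

// Rough sketch: merge PCM WAV files that share the same sample rate and
// channel count, assuming the canonical 44-byte header with no extra chunks.
export function concatenateWavFiles(inputPaths: string[], outputPath: string): void {
  const buffers = inputPaths.map((p) => readFileSync(p));

  // Reuse the first file's header, strip the headers of the rest.
  const header = Buffer.from(buffers[0].subarray(0, 44));
  const pcmData = Buffer.concat(buffers.map((b) => b.subarray(44)));

  // Patch the RIFF chunk size and the data chunk size to match the new length.
  header.writeUInt32LE(36 + pcmData.length, 4);
  header.writeUInt32LE(pcmData.length, 40);

  writeFileSync(outputPath, Buffer.concat([header, pcmData]));
}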

Helper utilities & caveats

The examples above rely on several helper utilities that handle the complexity of working with OpenAI's voice models:

  • encodeAudioToBase64 - Converts audio files to base64-encoded strings for transmission in messages. Used to prepare audio fixtures for testing.

  • getFixturePath - Utility to resolve paths to test fixtures (like audio files) relative to the test directory.

  • convertCoreMessagesToOpenAIMessages - Converts Scenario's CoreMessage format to OpenAI's ChatCompletion format, handling audio file detection and transformation into the input_audio shape that GPT-4o expects (a simplified sketch appears after this list).

  • OpenAiVoiceAgent - Abstract base class that handles the OpenAI voice API calls, message conversion, and response processing. Used by both the agent and user simulator in voice-to-voice conversations.

  • messageRoleReversal - Utility function that swaps user ↔ assistant roles in messages (excluding tool calls) so the voice user simulator can speak as the user.

  • saveConversationAudio - Saves audio responses to files for debugging and analysis.

  • concatenateWavFiles - Combines multiple WAV files into a single conversation recording.

  • getAudioSegments - Extracts individual audio segments from a conversation for analysis.

These helpers handle the technical details like audio encoding, file management, message format conversion, and managing the OpenAI API's specific requirements for voice models.
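
The most involved of these is the message conversion. The real convertCoreMessagesToOpenAIMessages lives in the example repo and handles more cases; this simplified sketch only shows the core idea of passing text parts through and turning audio file parts into the input_audio shape:

import type { CoreMessage } from "ai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions.mjs";

// Simplified sketch – maps Scenario's CoreMessage content onto OpenAI's
// chat-completion shape. Tool calls and other part types are omitted here.
export function convertCoreMessagesToOpenAIMessages(
  messages: CoreMessage[]
): ChatCompletionMessageParam[] {
  return messages.map((message) => {
    // Plain string content maps straight across.
    if (typeof message.content === "string") {
      return { role: message.role, content: message.content } as ChatCompletionMessageParam;
    }

    // Structured content: keep text parts, convert audio file parts into
    // the `input_audio` part that gpt-4o-audio-preview expects.
    const content = (message.content as any[]).map((part) => {
      if (part.type === "text") {
        return { type: "text", text: part.text };
      }
      if (part.type === "file" && String(part.mimeType).startsWith("audio/")) {
        return {
          type: "input_audio",
          input_audio: {
            data: String(part.data), // base64 audio, e.g. from encodeAudioToBase64
            format: part.mimeType === "audio/mpeg" ? "mp3" : "wav",
          },
        };
      }
      throw new Error(`Unsupported content part: ${part.type}`);
    });

    return { role: message.role, content } as ChatCompletionMessageParam;
  });
}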

💡 Caveat: When the assistant responds with audio, some judge models may ignore the audio chunk unless the role is set to "user". Pass forceUserRole: true to OpenAiVoiceAgent when you hit that edge case.

Troubleshooting & FAQs

Judge ignores assistant audio: Some judge models drop the audio chunk when it comes from the assistant role. Pass forceUserRole: true to OpenAiVoiceAgent; this wraps the audio in a "user" message so the judge evaluates it correctly.

Tests time out or hang: Voice models are slower than text-only ones. Bump Vitest’s timeout (--testTimeout=60000) or reduce concurrency (VITEST_MAX_WORKERS=1) when running in CI.
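
For a project-wide bump, a minimal vitest.config.ts could look like this (the 60-second value is only a suggestion; tune it to your fixture lengths and turn counts):

import { defineConfig } from "vitest/config";

// Raise the default per-test timeout so slower voice-model calls don't flake in CI.
export default defineConfig({
  test: {
    testTimeout: 60_000,
  },
});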

“Unsupported media type” errors: convertCoreMessagesToOpenAIMessages currently supports WAV and MP3 only.

CI machines without audio hardware: All examples work headlessly; no speakers or microphone required.

Python equivalent? Coming soon!

💸 Cost tip: each voice request is billed as a standard gpt-4o-audio-preview call. Keep fixture lengths and turn counts reasonable.

Complete Sources

Browse the full tests in the repo.

Need more? See the fixtures guide and agent docs.