Audio → Audio Testing

Test agents that listen to audio input and reply with audio responses. This pattern is ideal for voice assistants, conversational AI, and any agent that needs to respond in a natural spoken voice.

Prerequisites & Setup

Common requirements:

  • OPENAI_API_KEY with access to gpt-4o-audio-preview

export OPENAI_API_KEY="sk-…"

Language-specific setup:

Python ≥ 3.10 is required; the example uses PEP 604 union syntax (TextContentPart | FileContentPart) in its type definitions.


Install dependencies:

uv pip install scenario-sdk openai
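
To run the example once dependencies are installed (this assumes pytest and pytest-asyncio are available in your environment, since the test is an async pytest function):

uv run pytest test_audio_to_audio.py -v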

Code Example

test_audio_to_audio.py
# Source: https://github.com/langwatch/scenario/blob/main/python/examples/test_audio_to_audio.py
"""
Multimodal Audio to Audio Tests
 
This test suite demonstrates how to test an agent that:
- Receives audio input (from a WAV file fixture)
- Processes the audio content
- Responds with audio output
 
This is perfect for voice assistants, conversational AI, or any agent
that needs to communicate naturally using voice.
"""
 
import os
from typing import ClassVar, Literal, TypedDict, cast
import pytest
import scenario
from scenario.types import AgentRole
from openai.types.chat import ChatCompletionMessageParam
from helpers import encode_audio_to_base64, wrap_judge_for_audio, OpenAiVoiceAgent
 
 
# Type definitions for multimodal messages with file content
class TextContentPart(TypedDict):
    type: Literal["text"]
    text: str
 
 
class FileContentPart(TypedDict):
    type: Literal["file"]
    mediaType: str
    data: str
 
 
class MultimodalMessage(TypedDict):
    role: Literal["user", "assistant", "system"]
    content: list[TextContentPart | FileContentPart]
 
 
class AudioToAudioAgent(OpenAiVoiceAgent):
    """
    Agent that accepts audio input and responds with audio
 
    Uses OpenAI's gpt-4o-audio-preview model which can:
    - Process audio input
    - Generate audio responses with voice
    - Maintain conversational context
    """
 
    role: ClassVar[AgentRole] = AgentRole.AGENT
 
    def __init__(self):
        super().__init__(
            system_prompt="""You are a helpful assistant that can analyze audio input and respond with audio output.
            You must respond with audio output.
            """,
            voice="alloy",
            force_user_role=True,  # Required for audio responses per OpenAI API
        )
 
 
# Use set_id to group runs together for visualizing in the UI
SET_ID = "multimodal-audio-to-audio-test"
 
 
@pytest.mark.asyncio
async def test_audio_to_audio():
    """
    Test agent that receives audio input and responds with audio
 
    This test:
    1. Loads an audio fixture with a spoken question
    2. Sends the audio to the agent
    3. Agent analyzes the audio and responds with audio
    4. Judge evaluates the audio response (after transcription)
    """
    # Initialize the voice agent
    my_agent = AudioToAudioAgent()
 
    # Get path to audio fixture
    fixture_path = os.path.join(
        os.path.dirname(__file__), "fixtures", "male_or_female_voice.wav"
    )
 
    # Encode audio file to base64 for transmission
    audio_data = encode_audio_to_base64(fixture_path)
 
    # Create multimodal message with text prompt and audio file
    audio_message: MultimodalMessage = {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": """
                Answer the question in the audio.
                If you're not sure, you're required to take a best guess.
                After you've guessed, you must repeat the question and say what format the input was in (audio or text)
                """,
            },
            {
                "type": "file",
                "mediaType": "audio/wav",
                "data": audio_data,
            },
        ],
    }
 
    # Create judge agent to evaluate the response
    # Wrap with audio handler to transcribe audio before judging
    audio_judge = wrap_judge_for_audio(
        scenario.JudgeAgent(
            model="openai/gpt-4o",
            criteria=[
                "The agent correctly guesses it's a male voice",
                "The agent repeats the question",
                "The agent says what format the input was in (audio or text)",
            ],
        )
    )
 
    # Run the scenario
    result = await scenario.run(
        name="multimodal audio to audio",
        description="User sends audio file, agent analyzes and responds with audio",
        agents=[
            my_agent,
            scenario.UserSimulatorAgent(model="openai/gpt-4o"),
            audio_judge,
        ],
        script=[
            # Cast needed: MultimodalMessage is scenario's extension of ChatCompletionMessageParam
            # that supports file content parts, which are handled internally
            scenario.message(cast(ChatCompletionMessageParam, audio_message)),
            scenario.agent(),
            scenario.judge(),
        ],
        set_id=SET_ID,
    )
 
    try:
        print("AUDIO TO AUDIO RESULT:", result)
        assert result.success
    except Exception as error:
        print("Audio to audio test failed:", result)
        raise error
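
The test imports encode_audio_to_base64 from the local helpers module. A minimal sketch of such a helper, assuming it simply base64-encodes the raw file bytes (the repo's actual implementation may differ):

import base64


def encode_audio_to_base64(path: str) -> str:
    """Read a binary audio file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")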

Helper Utilities

  • wrap_judge_for_audio - Wraps a judge agent to automatically transcribe audio messages to text before evaluation using OpenAI Whisper (see the sketch after this list).

  • OpenAiVoiceAgent - Base class for voice-enabled agents that handles API calls and message conversion.

  • save_conversation_audio - Saves audio responses to files for debugging and analysis.
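
These helpers live alongside the test in the examples directory. As a rough illustration of the OpenAI calls they wrap (the function names and structure below are assumptions, not the repo's code), OpenAiVoiceAgent boils down to a chat completion with audio modalities enabled, and the audio judge wrapper to a Whisper transcription step:

# Illustrative sketch only; the real helpers in the repo may differ.
from openai import AsyncOpenAI

client = AsyncOpenAI()


async def generate_audio_reply(messages: list) -> tuple[str, str]:
    """Request a spoken reply from gpt-4o-audio-preview.

    Returns (base64-encoded WAV audio, text transcript).
    """
    completion = await client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],  # ask for audio output alongside text
        audio={"voice": "alloy", "format": "wav"},
        messages=messages,
    )
    audio = completion.choices[0].message.audio
    return audio.data, audio.transcript


async def transcribe_audio(path: str) -> str:
    """Transcribe an audio file with Whisper, as the judge wrapper does before evaluating."""
    with open(path, "rb") as f:
        result = await client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text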

Complete Source

Browse the full test in the repo:
https://github.com/langwatch/scenario/blob/main/python/examples/test_audio_to_audio.py
