Testing Voice Agents
Scenario lets you write end-to-end tests for agents that listen to audio, think, and respond with either text or audio.
Video Demo
This video shows a complete example of a black-box test for a voice-to-voice conversation between an agent and a user simulator.
Testing Approaches
Choose the approach that matches your voice agent's architecture:
Audio → Text
Agent receives audio and replies with text. Perfect for transcription or audio-based Q&A.
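Below is a minimal sketch of this setup, assuming the standard Scenario Python API (scenario.run, scenario.AgentAdapter, scenario.JudgeAgent, script steps) and OpenAI-style input_audio content parts. The fixture path, the load_audio_part helper, the stand-in agent, and the way the audio part is passed to scenario.user() are illustrative assumptions, not the library's official fixture API (see the Fixtures Guide for that):

```python
import base64

import litellm
import pytest
import scenario

scenario.configure(default_model="openai/gpt-4o-audio-preview")


def load_audio_part(path: str, fmt: str = "wav") -> dict:
    """Encode a local fixture as an OpenAI-style input_audio content part."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}


class TranscribingAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Stand-in agent: a single audio-capable completion call. Swap in your
        # real agent; it receives the full message history and returns text.
        response = await litellm.acompletion(
            model="openai/gpt-4o-audio-preview",
            messages=input.messages,
        )
        return response.choices[0].message.content


@pytest.mark.asyncio
async def test_audio_to_text():
    result = await scenario.run(
        name="voice memo transcription",
        description="The user sends a short voice memo and expects a text transcript back.",
        agents=[
            TranscribingAgent(),
            scenario.JudgeAgent(
                criteria=["The agent replies in text with a faithful transcript of the audio"]
            ),
        ],
        script=[
            # Opening user turn: an audio file part plus an optional text prompt.
            # Passing content parts to scenario.user() is an assumption here; see
            # the Fixtures Guide for the supported way to attach audio fixtures.
            scenario.user(
                [
                    {"type": "text", "text": "Please transcribe this clip."},
                    load_audio_part("fixtures/voice_memo.wav"),
                ]
            ),
            scenario.agent(),
            scenario.judge(),
        ],
    )
    assert result.success
```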
Audio → Audio
Agent listens and replies in audio. Ideal for voice assistants and conversational AI.
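Here is a sketch of the agent side for this approach, using gpt-4o-audio-preview through the official OpenAI Python SDK. The modalities and audio parameters are the real OpenAI options for voice output; how Scenario expects the audio to be attached to the returned assistant message is an assumption, and the judge should be wrapped as described under General Troubleshooting below so it can read the spoken reply:

```python
from openai import AsyncOpenAI

import scenario

client = AsyncOpenAI()


class SpeakingAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Request a spoken reply: modalities/audio are the OpenAI parameters
        # for voice output with gpt-4o-audio-preview.
        completion = await client.chat.completions.create(
            model="gpt-4o-audio-preview",
            modalities=["text", "audio"],
            audio={"voice": "alloy", "format": "wav"},
            messages=input.messages,
        )
        message = completion.choices[0].message
        # Return an assistant message carrying the base64 WAV reply. The exact
        # message shape Scenario expects for audio replies is an assumption;
        # adjust it to your SDK version, and wrap the judge (see General
        # Troubleshooting) so the spoken content is transcribed for evaluation.
        return {
            "role": "assistant",
            "content": None,
            "audio": {"data": message.audio.data, "format": "wav"},
        }
```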
Voice-to-Voice Conversation
Full multi-turn conversations where both the user simulator and the agent speak. Use this to test complex dialogues.
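A sketch of the overall test shape, assuming the standard Scenario Python API; SpeakingAgent refers to the audio-in/audio-out adapter sketched above, and how the user simulator's turns are delivered as audio rather than text is left to the voice utilities shown in the video demo:

```python
import pytest
import scenario

scenario.configure(default_model="openai/gpt-4o-audio-preview")


@pytest.mark.asyncio
async def test_voice_support_call():
    result = await scenario.run(
        name="voice support call",
        description=(
            "A customer calls in by voice to ask how to reset their password; "
            "the agent should walk them through it step by step."
        ),
        agents=[
            SpeakingAgent(),  # audio-in / audio-out adapter from the sketch above
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(
                criteria=[
                    "The agent's spoken replies stay on topic",
                    "The agent confirms the problem is solved before ending the call",
                ]
            ),
        ],
        max_turns=6,  # every voice turn is billed, so keep conversations short
    )
    assert result.success
```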
Use-case comparison
| Scenario | Input | Expected Output | Typical Judge Model |
|---|---|---|---|
| Audio → Text | file part (audio) + optional prompt | Text | gpt-4o-audio-preview or any GPT-4-level text model |
| Audio → Audio | file part (audio) + optional prompt | Audio (voice response) | gpt-4o-audio-preview (handles audio) |
| Voice-to-Voice Conversation | Multiple turns, both sides send/receive audio | Audio dialogue | Same as above; judge runs after conversation |
General Troubleshooting
Judge ignores assistant audio
Use the wrapper utilities (wrap_judge_for_audio in Python, wrapJudgeForAudio in TypeScript) so that assistant audio is automatically transcribed before the judge evaluates it.
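For example, in Python (a minimal sketch: the exact import path and call shape of wrap_judge_for_audio may differ between SDK versions, so treat this as illustrative):

```python
import scenario
from scenario import wrap_judge_for_audio  # import path is an assumption

# Wrap the judge so any assistant audio is transcribed before the criteria
# are evaluated; pass `judge` into the agents list as usual.
judge = wrap_judge_for_audio(
    scenario.JudgeAgent(
        criteria=["The spoken reply directly answers the user's question"]
    )
)
```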
Tests time-out or hang
Voice models respond more slowly than text models. For TypeScript, raise the test timeout (--timeout 60000) or run tests serially with VITEST_MAX_WORKERS=1. For Python, set a generous per-test timeout with @pytest.mark.timeout(120).
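On the Python side, a minimal sketch (the timeout marker comes from the pytest-timeout plugin, which must be installed separately):

```python
import pytest


@pytest.mark.asyncio
@pytest.mark.timeout(120)  # voice turns are much slower than text-only calls
async def test_voice_conversation_with_generous_timeout():
    ...  # run your voice scenario here
```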
"Unsupported media type" errors
The audio helpers support only WAV and MP3. Make sure audio fixtures use one of these formats.
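If a fixture arrives in another format, one way to convert it up front (assuming pydub and ffmpeg are installed; the file paths are illustrative):

```python
from pydub import AudioSegment  # pydub shells out to ffmpeg for decoding

# Re-encode an unsupported fixture (e.g. M4A) to WAV before using it in tests.
AudioSegment.from_file("fixtures/voice_memo.m4a").export(
    "fixtures/voice_memo.wav", format="wav"
)
```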
CI machines without audio hardware
All examples work headlessly—no speakers or microphone required.
💸 Cost tip: Each voice request is billed as a standard gpt-4o-audio-preview call. Keep fixture lengths and turn counts reasonable.
Related Resources
- Fixtures Guide - Learn about test fixtures
- Agent Integration - Integrate with different agent frameworks
- Multimodal Overview - Other multimodal testing approaches
