
Testing Voice Agents

Evaluate your voice agent with Scenario

Scenario lets you write end-to-end tests for agents that listen to audio, think, and respond with either text or audio. This page shows three common patterns and how to test them.

Overview

  1. Audio ➜ Text – the agent receives an audio file (e.g. WAV) and replies with a textual answer.
  2. Audio ➜ Audio – the agent listens and replies in audio (great for voice assistants).
  3. Voice-to-Voice Conversation – both the user simulator and the agent speak over multiple turns (this pattern predates OpenAI’s real-time audio API and still works today).

Use-case comparison

| Scenario | Input | Expected Output | Typical Judge Model |
| --- | --- | --- | --- |
| Audio ➜ Text | File part (audio) + optional prompt | Text | gpt-4o-audio-preview or any GPT-4-level text model |
| Audio ➜ Audio | File part (audio) + optional prompt | Audio (voice response) | gpt-4o-audio-preview (handles audio) |
| Voice-to-Voice Conversation | Multiple turns, both sides send/receive audio | Audio dialogue | Same as above; the judge runs after the conversation |

Prerequisites & Setup

Before running the examples you’ll need a couple of things in place:

  1. Node.js ≥ 18 – the OpenAI client relies on the modern fetch API.
  2. OPENAI_API_KEY – export an API key that has access to gpt-4o-audio-preview:
export OPENAI_API_KEY="sk-…"
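
  3. Packages – install the libraries imported in the examples below (assuming an npm-based project; swap in your package manager of choice):
npm install @langwatch/scenario openai @ai-sdk/openai ai
npm install -D vitest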

Code Walk-through

Audio ➜ Text (TypeScript)
import scenario, {
  AgentAdapter,
  AgentInput,
  AgentRole,
} from "@langwatch/scenario";
import { describe, it, expect } from "vitest";
import OpenAI from "openai";
import { openai } from "@ai-sdk/openai";
import { ChatCompletionMessageParam } from "openai/resources/chat/completions.mjs";
import { encodeAudioToBase64, getFixturePath } from "./helpers";
import { CoreUserMessage } from "ai";
import { convertCoreMessagesToOpenAIMessages } from "./helpers/convert-core-messages-to-openai";
 
class AudioAgent extends AgentAdapter {
  role: AgentRole = AgentRole.AGENT;
  private openai = new OpenAI();
 
  call = async (input: AgentInput) => {
    // Convert Core messages → OpenAI shape so the voice-model accepts them
    const messages = convertCoreMessagesToOpenAIMessages(input.messages);
    const response = await this.respond(messages);
 
    // Scenario expects **text**, so we extract the transcript only
    const transcript = response.choices[0].message?.audio?.transcript;
    if (typeof transcript === "string") return transcript;
    throw new Error("Agent failed to generate a response");
  };
 
  private async respond(messages: ChatCompletionMessageParam[]) {
    return this.openai.chat.completions.create({
      model: "gpt-4o-audio-preview",
      modalities: ["text", "audio"],
      audio: { voice: "alloy", format: "wav" },
      messages,
      store: false,
    });
  }
}
 
const setId = "multimodal-audio-test";
 
describe("Multimodal Audio to Text Tests", () => {
  it("should handle audio input", async () => {
    const data = encodeAudioToBase64(
      getFixturePath("male_or_female_voice.wav")
    );
 
    const audioMessage = {
      role: "user",
      content: [
        { type: "text", text: "Answer the question in the audio…" },
        { type: "file", mimeType: "audio/wav", data },
      ],
    } satisfies CoreUserMessage;
 
    const audioJudge = scenario.judgeAgent({
      model: openai("gpt-4o-audio-preview"),
      criteria: [
        "Agent correctly guesses it's a male voice",
        "Agent repeats the question",
        "Agent says what format the input was in (audio or text)",
      ],
    });
 
    const result = await scenario.run({
      name: "multimodal audio analysis",
      description: "User sends audio, agent transcribes & analyses",
      agents: [new AudioAgent(), scenario.userSimulatorAgent(), audioJudge],
      script: [
        scenario.message(audioMessage),
        scenario.agent(),
        scenario.judge(),
      ],
      setId,
    });
 
    expect(result.success).toBe(true);
  });
});
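
The test above imports encodeAudioToBase64 and getFixturePath from a local ./helpers module that is not shown on this page. They are small utilities; a minimal sketch, assuming the audio fixtures sit in a fixtures folder next to the helpers, could look like this:

import { readFileSync } from "node:fs";
import path from "node:path";
import { fileURLToPath } from "node:url";

// Assumed layout: audio fixtures live in a "fixtures" folder next to this module.
const fixturesDir = path.join(
  path.dirname(fileURLToPath(import.meta.url)),
  "fixtures"
);

// Resolve a test fixture (e.g. a WAV file) to an absolute path.
export function getFixturePath(fileName: string): string {
  return path.join(fixturesDir, fileName);
}

// Read an audio file and return it as a base64 string, ready to drop into
// the `data` field of a CoreUserMessage file part.
export function encodeAudioToBase64(filePath: string): string {
  return readFileSync(filePath).toString("base64");
}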

Listen to a real conversation

Below is the actual audio produced by the voice-to-voice conversation test (full source in the repo). Click play to hear the agent and user simulator exchanging ideas.
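
That recording is stitched together with the concatenateWavFiles helper listed in the next section. The real helper lives in the example repo; as a rough sketch, concatenating standard PCM WAV segments (assuming they share the same format and a plain 44-byte header) boils down to stripping the extra headers and patching the chunk sizes:

import { readFileSync, writeFileSync } from "node:fs";

// Rough sketch: merge PCM WAV files that share the same sample rate and
// channel count, assuming the canonical 44-byte header with no extra chunks.
export function concatenateWavFiles(inputPaths: string[], outputPath: string): void {
  const buffers = inputPaths.map((p) => readFileSync(p));

  // Reuse the first file's header, strip the headers of the rest.
  const header = Buffer.from(buffers[0].subarray(0, 44));
  const pcmData = Buffer.concat(buffers.map((b) => b.subarray(44)));

  // Patch the RIFF chunk size and the data chunk size to match the new length.
  header.writeUInt32LE(36 + pcmData.length, 4);
  header.writeUInt32LE(pcmData.length, 40);

  writeFileSync(outputPath, Buffer.concat([header, pcmData]));
}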

Helper utilities & caveats

The examples above rely on several helper utilities that handle the complexity of working with OpenAI's voice models:

  • encodeAudioToBase64 - Converts audio files to base64-encoded strings for transmission in messages. Used to prepare audio fixtures for testing.

  • getFixturePath - Utility to resolve paths to test fixtures (like audio files) relative to the test directory.

  • convertCoreMessagesToOpenAIMessages - Converts Scenario's CoreMessage format to OpenAI's ChatCompletion format, handling audio file detection and transformation into the input_audio shape that GPT-4o expects (a simplified sketch appears after this list).

  • OpenAiVoiceAgent - Abstract base class that handles the OpenAI voice API calls, message conversion, and response processing. Used by both the agent and user simulator in voice-to-voice conversations.

  • messageRoleReversal - Utility function that swaps user ↔ assistant roles in messages (excluding tool calls) so the voice user simulator can speak as the user.

  • saveConversationAudio - Saves audio responses to files for debugging and analysis.

  • concatenateWavFiles - Combines multiple WAV files into a single conversation recording.

  • getAudioSegments - Extracts individual audio segments from a conversation for analysis.

These helpers handle the technical details like audio encoding, file management, message format conversion, and managing the OpenAI API's specific requirements for voice models.
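
The most involved of these is the message conversion. The real convertCoreMessagesToOpenAIMessages lives in the example repo and handles more cases; this simplified sketch only shows the core idea of passing text parts through and turning audio file parts into the input_audio shape:

import type { CoreMessage } from "ai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions.mjs";

// Simplified sketch – maps Scenario's CoreMessage content onto OpenAI's
// chat-completion shape. Tool calls and other part types are omitted here.
export function convertCoreMessagesToOpenAIMessages(
  messages: CoreMessage[]
): ChatCompletionMessageParam[] {
  return messages.map((message) => {
    // Plain string content maps straight across.
    if (typeof message.content === "string") {
      return { role: message.role, content: message.content } as ChatCompletionMessageParam;
    }

    // Structured content: keep text parts, convert audio file parts into
    // the `input_audio` part that gpt-4o-audio-preview expects.
    const content = (message.content as any[]).map((part) => {
      if (part.type === "text") {
        return { type: "text", text: part.text };
      }
      if (part.type === "file" && String(part.mimeType).startsWith("audio/")) {
        return {
          type: "input_audio",
          input_audio: {
            data: String(part.data), // base64 audio, e.g. from encodeAudioToBase64
            format: part.mimeType === "audio/mpeg" ? "mp3" : "wav",
          },
        };
      }
      throw new Error(`Unsupported content part: ${part.type}`);
    });

    return { role: message.role, content } as ChatCompletionMessageParam;
  });
}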

💡 Caveat: When the assistant responds with audio, some judge models may ignore the audio chunk unless the role is set to "user". Pass forceUserRole: true to OpenAiVoiceAgent when you hit that edge case.

Troubleshooting & FAQs

Judge ignores assistant audio: Some judge models drop the audio chunk when it comes from the assistant role. Pass forceUserRole: true to OpenAiVoiceAgent; this wraps the audio in a "user" message so the judge evaluates it correctly.

Tests time out or hang: Voice models are slower than text-only ones. Bump Vitest’s timeout (--testTimeout=60000) or reduce concurrency (VITEST_MAX_WORKERS=1) when running in CI.
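
For a project-wide bump, a minimal vitest.config.ts could look like this (the 60-second value is only a suggestion; tune it to your fixture lengths and turn counts):

import { defineConfig } from "vitest/config";

// Raise the default per-test timeout so slower voice-model calls don't flake in CI.
export default defineConfig({
  test: {
    testTimeout: 60_000,
  },
});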

“Unsupported media type” errors: convertCoreMessagesToOpenAIMessages currently supports WAV and MP3 only.

CI machines without audio hardware: All examples work headlessly; no speakers or microphone required.

Python equivalent? Coming soon!

💸 Cost tip: each voice request is billed as a standard gpt-4o-audio-preview call. Keep fixture lengths and turn counts reasonable.

Complete Sources

Browse the full tests in the repo.

Need more? See the fixtures guide and agent docs.