Scenario Basics
Overview
Scenario is designed to test AI agents through simulation testing - a methodology for testing agents end-to-end by simulating different situations and user interactions, then evaluating the responses against defined criteria or custom assertions.
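For orientation, here is a minimal sketch of what a complete test can look like. It assumes pytest with pytest-asyncio, that a default model is configured for the simulator and judge agents (or passed via their model parameter), and that the returned result exposes a success flag; MyAgent is a placeholder for your own adapter, and each piece is explained below:
import pytest
import scenario

class MyAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Call your real agent here; a plain string is treated as the agent's reply in this sketch
        return "I can help with that billing issue. Could you share your account email?"

@pytest.mark.asyncio
async def test_billing_inquiry():
    result = await scenario.run(
        name="customer support inquiry",
        description="User has a billing issue with their subscription.",
        agents=[
            MyAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["Agent addresses the billing issue"]),
        ],
    )
    assert result.success  # the judge's verdict surfaces on the returned result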
Core Components
1. Scenarios
A scenario defines the test case - the situation, context, and expected behavior you want to validate:
result = await scenario.run(
    name="customer support inquiry",
    description="""
        User has a billing issue with their subscription. They are frustrated
        but not angry. The agent should help resolve the issue professionally
        and escalate if needed.
    """,
    # ... agents and other configuration
)
2. Agents
Three types of agents can participate in a scenario:
- Agent Under Test: Your AI agent that you want to test
- User Simulator Agent: Generates simulated user messages based on the scenario
- Judge Agent: Evaluates the conversation against success criteria
agents=[
    MyAgent(),                      # Your agent
    scenario.UserSimulatorAgent(),  # Simulates user behavior
    scenario.JudgeAgent(criteria=[  # Evaluates success
        "Agent asks for user account number or email",
        "Agent addresses the billing issue",
        "Agent provides a timeline for issue resolution"
    ])
]
3. Evaluation
There are two ways to evaluate a scenario:
- Automatically, by the judge agent
- Manually, by specifying assertions on scripted scenarios
scenario.JudgeAgent(criteria=[
    "Agent asks for user account number or email",
    "Agent addresses the billing issue",
    "Agent provides a timeline for issue resolution"
])
assert state.has_tool_call("get_billing_info") # manual assertion
The Simulation Loop
Understanding how the simulation works helps you write better scenarios:
Step 1: User Simulator Generates Message
Based on the scenario description, the user simulator creates a realistic opening message:
# Scenario description guides the user simulator
description="User is frustrated with slow internet and needs technical help"
# User simulator might generate:
# "my internet is slow"
Step 2: Agent Under Test Responds
Your agent receives the conversation history and generates a response:
class TechSupportAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Agent sees: [{"role": "user", "content": "my internet is slow"}]
        return await my_tech_support_agent.process(input.messages)
Step 3: Judge Evaluates
The judge agent reviews the conversation and decides whether to:
- Continue: The conversation should proceed
- Succeed: All criteria are met, end with success
- Fail: Criteria are not met, end with failure
# Judge considers criteria like:
# - "Agent asks if user has tried to turn it off and on again"
# - "Agent provides specific troubleshooting steps"
Step 4: Next Turn or End
If the judge decides to continue, the next turn starts from Step 1. The user simulator generates a follow-up message based on the agent's response and the ongoing conversation context.
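From the test's point of view, the loop ends when the judge reaches a verdict or max_turns is hit, and the verdict comes back on the returned result. A minimal sketch tying the four steps together, reusing TechSupportAgent from Step 2 and assuming the result exposes a success flag:
result = await scenario.run(
    name="slow internet",
    description="User is frustrated with slow internet and needs technical help",
    agents=[
        TechSupportAgent(),
        scenario.UserSimulatorAgent(),  # Step 1: generates user messages each turn
        scenario.JudgeAgent(criteria=[  # Step 3: evaluates after each agent response
            "Agent provides specific troubleshooting steps",
        ]),
    ],
    max_turns=10,  # hard stop in case the judge never reaches a verdict
)
assert result.success  # set when the judge ends the loop with a success verdict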
Testing Approaches
Scenario supports two main testing approaches:
Automatic Simulation
Let the agents interact naturally until the judge decides the outcome:
result = await scenario.run(
    name="automatic conversation",
    description="User wants help with a technical issue",
    agents=[
        TechSupportAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent resolves the technical issue"])
    ],
    max_turns=10  # Optional limit
)
Scripted Control
Control the exact flow of conversation with custom scripts:
def check_complex_handler(state):
    # Custom assertion: fail the scenario if the tool was never called
    assert state.has_tool_call("complex_handler"), "Complex handler was not used"

result = await scenario.run(
    name="scripted interaction",
    description="Test specific conversation flow",
    agents=[
        MyAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent handles edge case properly"])
    ],
    script=[
        scenario.user("I have a complex request"),
        scenario.agent(),
        check_complex_handler,       # raise is a statement and cannot appear in a lambda
        scenario.proceed(turns=2),
        scenario.succeed("Edge case handled correctly")
    ]
)
Turns vs Steps
Understanding the difference between turns and steps is crucial:
Turns
A turn represents one complete cycle of user → agent → judge evaluation:
# Turn 1: User asks question → Agent responds → Judge evaluates
# Turn 2: User follows up → Agent clarifies → Judge evaluates
# Turn 3: User confirms → Agent concludes → Judge decides success
Steps
A step is any individual action within a turn:
# Within one turn, there might be multiple steps:
# Step 1: User message
# Step 2: Agent makes tool call
# Step 3: Judge decides to continue the conversation
# Step 4: User follows up
# Step 5: Agent responds to user
# Step 6: Judge evaluates
You can control both with on_turn and on_step:
result = await scenario.run(
    name="controlled conversation",
    description="User needs help with account settings",
    agents=[...],
    max_turns=5,  # Limit conversation length
    script=[
        scenario.proceed(
            turns=2,
            on_turn=lambda state: print(f"Completed turn {state.current_turn}"),
            on_step=lambda state: print(f"Completed step {state.current_step}")
        )
    ]
)
The User Simulator Agent
The user simulator is an AI agent that role-plays as a user based on your scenario description.
Default Behavior
By default, the user simulator:
- Writes like a real user would
- Responds to the agent's messages
- Follows the scenario description
# Default user simulator
scenario.UserSimulatorAgent()
Customizing the User Simulator
You can customize the user simulator's behavior:
scenario.UserSimulatorAgent(
    model="openai/o3",  # Use different model
    system_prompt="""
        <role>
        You are pretending to be a user, you are testing an AI Agent (shown as the user role) based on a scenario.
        Approach this naturally, as a human user would, with very short inputs, few words, all lowercase, imperative, not periods, like when they google or talk to chatgpt.
        </role>

        <goal>
        Your goal (assistant) is to interact with the Agent Under Test (user) as if you were a human user.
        </goal>

        <scenario>
        You are trying to get a refund for a purchase you made.
        You are a busy executive who speaks concisely and directly.
        You get impatient with long explanations and prefer bullet points.
        You often interrupt to ask specific questions.
        </scenario>

        <rules>
        - DO NOT carry over any requests yourself, YOU ARE NOT the assistant today, you are the user
        </rules>
    """
)
User Simulator Strategies
The user simulator automatically adapts its strategy based on your scenario description:
# Scenario: "User is confused about their bill"
# → User simulator will ask unclear questions, express confusion
# Scenario: "User is an expert developer reporting a bug"
# → User simulator will use technical language, provide detailed info
# Scenario: "User is elderly and not tech-savvy"
# → User simulator will ask basic questions, need more guidance
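A practical consequence is that you can exercise the same agent against several personas just by varying the description. A sketch, assuming the result exposes a success flag (the persona texts and criterion are illustrative):
personas = [
    "User is an expert developer reporting a bug with detailed reproduction steps",
    "User is elderly and not tech-savvy, needs step-by-step guidance",
]

for persona in personas:
    result = await scenario.run(
        name="persona coverage",
        description=persona,  # only the description changes; the agent under test stays the same
        agents=[
            TechSupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["Agent adapts its explanation to the user"]),
        ],
    )
    assert result.success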
The Judge Agent
The judge agent evaluates conversations against your success criteria.
Writing Effective Criteria
Good criteria are:
- Specific: Clearly describe what success looks like
- Measurable: Can be objectively evaluated
- Relevant: Related to your agent's purpose
- Achievable: Realistic given the agent's capabilities
# Good criteria
scenario.JudgeAgent(criteria=[
    "Agent asks for the user's account number or email",
    "Agent explains the billing issue in simple terms",
    "Agent offers at least two resolution options",
    "Agent provides a timeline for issue resolution"
])

# Avoid vague criteria
scenario.JudgeAgent(criteria=[
    "Agent is helpful",         # Too vague
    "Agent solves everything",  # Too broad
    "Agent is perfect"          # Unrealistic
])
Multiple Evaluation Points
The judge evaluates after each agent response, allowing it to:
- End the conversation early if criteria are met
- Fail immediately if something goes wrong
- Continue if more interaction is needed
# Judge evaluation happens after each agent response:
# Turn 1: Agent asks clarifying question → Judge: "Continue, need more info"
# Turn 2: Agent provides solution → Judge: "Success, all criteria met"
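Because the judge is strict about "do not" and "should not" criteria (see the default rules further below), phrasing a failure condition as a negative criterion lets the judge fail the run on the turn it is violated instead of waiting for max_turns. The criteria here are illustrative:
# A violated "do not" criterion can end the run on the very turn it happens
scenario.JudgeAgent(criteria=[
    "Agent addresses the billing issue",
    "Agent does not promise a refund before verifying the account",
])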
Customizing the Judge
You can customize judge behavior:
scenario.JudgeAgent(
    criteria=["Agent provides accurate information"],
    model="openai/o3",  # Use different model
    system_prompt="""
        <role>
        You are an LLM as a judge watching a simulated conversation as it plays out live to determine if the agent under test meets the criteria or not.
        </role>

        <goal>
        Your goal is to determine if you already have enough information to make a verdict of the scenario below, or if the conversation should continue for longer.
        If you do have enough information, use the finish_test tool to determine if all the criteria have been met, if not, use the continue_test tool to let the next step play out.
        </goal>

        <scenario>
        {description}
        </scenario>

        <criteria>
        {"\n".join(criteria)}
        </criteria>

        <rules>
        - Be strict, do not let the conversation continue if the agent already broke one of the "do not" or "should not" criterias.
        - DO NOT make any judgment calls that are not explicitly listed in the success or failure criteria, withhold judgement if necessary
        </rules>
    """
)
Scenario Organization
Related scenarios can be grouped into sets. Scenario sets are useful for:
- Grouping related tests for better organization
- Filtering events in monitoring and analytics systems
- Running targeted test suites based on categories
- Generating reports for specific areas of functionality
Grouping Your Sets and Batches
While optional, we strongly recommend setting stable identifiers for your scenarios, sets, and batches for better organization and tracking in LangWatch.
- set_id: Groups related scenarios into a test suite. This corresponds to the "Simulation Set" in the UI.
- batch_run_id: Groups all scenarios that were run together in a single execution (e.g., a single CI job). This is automatically generated but can be overridden.
result = await scenario.run(
    name="my first scenario",
    description="A simple test to see if the agent responds.",
    set_id="my-test-suite",
    agents=[
        scenario.Agent(my_agent),
        scenario.UserSimulatorAgent(),
    ]
)
You can also set the batch_run_id using environment variables for CI/CD integration:
import os

# Set batch ID for CI/CD integration
os.environ["SCENARIO_BATCH_RUN_ID"] = os.environ.get("GITHUB_RUN_ID", "local-run")

result = await scenario.run(
    name="my first scenario",
    description="A simple test to see if the agent responds.",
    set_id="my-test-suite",
    agents=[
        scenario.Agent(my_agent),
        scenario.UserSimulatorAgent(),
    ]
)
The batch_run_id is automatically generated for each test run, but you can also set it globally using the SCENARIO_BATCH_RUN_ID environment variable.
Next Steps
Dive deeper into specific aspects of Scenario:
- Writing Scenarios - Master the art of creating effective tests
- Scripted Simulations - Take full control of conversation flow
- Cache - Make your tests deterministic and faster
- Debug Mode - Debug your agents interactively