Scenario Basics
Overview
Scenario is designed to test AI agents through simulation testing - a methodology for testing agents end-to-end by simulating different situations and user interactions, then evaluating the responses against defined criteria or custom assertions.
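For orientation, here is a minimal sketch of what a complete test can look like. It assumes pytest with pytest-asyncio, that a default model is configured for the simulator and judge agents (or passed via their model parameter), and that the returned result exposes a success flag; MyAgent is a placeholder for your own adapter, and each piece is explained below:
import pytest
import scenario

class MyAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Call your real agent here; a plain string is treated as the agent's reply in this sketch
        return "I can help with that billing issue. Could you share your account email?"

@pytest.mark.asyncio
async def test_billing_inquiry():
    result = await scenario.run(
        name="customer support inquiry",
        description="User has a billing issue with their subscription.",
        agents=[
            MyAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["Agent addresses the billing issue"]),
        ],
    )
    assert result.success  # the judge's verdict surfaces on the returned result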
Core Components
1. Scenarios
A scenario defines the test case - the situation, context, and expected behavior you want to validate:
result = await scenario.run(
    name="customer support inquiry",
    description="""
        User has a billing issue with their subscription. They are frustrated
        but not angry. The agent should help resolve the issue professionally
        and escalate if needed.
    """,
    # ... agents and other configuration
)
2. Agents
Three types of agents can participate in a scenario:
- Agent Under Test: Your AI agent that you want to test
- User Simulator Agent: Generates simulated user messages based on the scenario
- Judge Agent: Evaluates the conversation against success criteria
agents=[
    MyAgent(),                      # Your agent
    scenario.UserSimulatorAgent(),  # Simulates user behavior
    scenario.JudgeAgent(criteria=[  # Evaluates success
        "Agent asks for user account number or email",
        "Agent addresses the billing issue",
        "Agent provides a timeline for issue resolution"
    ])
]
3. Evaluation
There are two ways to evaluate a scenario:
- Automatically, by the judge agent
- Manually, by specifying assertions on scripted scenarios
scenario.JudgeAgent(criteria=[
    "Agent asks for user account number or email",
    "Agent addresses the billing issue",
    "Agent provides a timeline for issue resolution"
])
assert state.has_tool_call("get_billing_info") # manual assertion
The Simulation Loop
Understanding how the simulation works helps you write better scenarios:
Step 1: User Simulator Generates Message
Based on the scenario description, the user simulator creates a realistic opening message:
# Scenario description guides the user simulator
description="User is frustrated with slow internet and needs technical help"
# User simulator might generate:
# "my internet is slow"
Step 2: Agent Under Test Responds
Your agent receives the conversation history and generates a response:
class TechSupportAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Agent sees: [{"role": "user", "content": "my internet is slow"}]
        return await my_tech_support_agent.process(input.messages)
Step 3: Judge Evaluates
The judge agent reviews the conversation and decides whether to:
- Continue: The conversation should proceed
- Succeed: All criteria are met, end with success
- Fail: Criteria are not met, end with failure
# Judge considers criteria like:
# - "Agent asks if user has tried to turn it off and on again"
# - "Agent provides specific troubleshooting steps"
Step 4: Next Turn or End
If the judge decides to continue, the next turn starts from Step 1. The user simulator generates a follow-up message based on the agent's response and the ongoing conversation context.
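From the test's point of view, the loop ends when the judge reaches a verdict or max_turns is hit, and the verdict comes back on the returned result. A minimal sketch tying the four steps together, reusing TechSupportAgent from Step 2 and assuming the result exposes a success flag:
result = await scenario.run(
    name="slow internet",
    description="User is frustrated with slow internet and needs technical help",
    agents=[
        TechSupportAgent(),
        scenario.UserSimulatorAgent(),  # Step 1: generates user messages each turn
        scenario.JudgeAgent(criteria=[  # Step 3: evaluates after each agent response
            "Agent provides specific troubleshooting steps",
        ]),
    ],
    max_turns=10,  # hard stop in case the judge never reaches a verdict
)
assert result.success  # set when the judge ends the loop with a success verdict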
Testing Approaches
Scenario supports two main testing approaches:
Automatic Simulation
Let the agents interact naturally until the judge decides the outcome:
result = await scenario.run(
    name="automatic conversation",
    description="User wants help with a technical issue",
    agents=[
        TechSupportAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent resolves the technical issue"])
    ],
    max_turns=10  # Optional limit
)
Scripted Control
Control the exact flow of conversation with custom scripts:
def check_complex_handler(state):
    # Custom assertion: fail the scenario if the tool was never called
    assert state.has_tool_call("complex_handler"), "Complex handler was not used"

result = await scenario.run(
    name="scripted interaction",
    description="Test specific conversation flow",
    agents=[
        MyAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent handles edge case properly"])
    ],
    script=[
        scenario.user("I have a complex request"),
        scenario.agent(),
        check_complex_handler,       # raise is a statement and cannot appear in a lambda
        scenario.proceed(turns=2),
        scenario.succeed("Edge case handled correctly")
    ]
)
Turns vs Steps
Understanding the difference between turns and steps is crucial:
Turns
A turn represents one complete cycle of user → agent → judge evaluation:
# Turn 1: User asks question → Agent responds → Judge evaluates
# Turn 2: User follows up → Agent clarifies → Judge evaluates
# Turn 3: User confirms → Agent concludes → Judge decides success
Steps
A step is any individual action within a turn:
# Within one turn, there might be multiple steps:
# Step 1: User message
# Step 2: Agent makes tool call
# Step 3: Judge decides to continue the conversation
# Step 4: User follows up
# Step 5: Agent responds to user
# Step 6: Judge evaluates
You can control both with on_turn and on_step:
result = await scenario.run(
    name="controlled conversation",
    description="User needs help with account settings",
    agents=[...],
    max_turns=5,  # Limit conversation length
    script=[
        scenario.proceed(
            turns=2,
            on_turn=lambda state: print(f"Completed turn {state.current_turn}"),
            on_step=lambda state: print(f"Completed step {state.current_step}")
        )
    ]
)
The User Simulator Agent
The user simulator is an AI agent that role-plays as a user based on your scenario description.
Default Behavior
By default, the user simulator:
- Writes like a real user would
- Responds to the agent's messages
- Follows the scenario description
# Default user simulator
scenario.UserSimulatorAgent()
Customizing the User Simulator
You can customize the user simulator's behavior:
scenario.UserSimulatorAgent(
    model="openai/o3",  # Use different model
    system_prompt="""
        <role>
        You are pretending to be a user, you are testing an AI Agent (shown as the user role) based on a scenario.
        Approach this naturally, as a human user would, with very short inputs, few words, all lowercase, imperative, not periods, like when they google or talk to chatgpt.
        </role>

        <goal>
        Your goal (assistant) is to interact with the Agent Under Test (user) as if you were a human user.
        </goal>

        <scenario>
        You are trying to get a refund for a purchase you made.
        You are a busy executive who speaks concisely and directly.
        You get impatient with long explanations and prefer bullet points.
        You often interrupt to ask specific questions.
        </scenario>

        <rules>
        - DO NOT carry over any requests yourself, YOU ARE NOT the assistant today, you are the user
        </rules>
    """
)
User Simulator Strategies
The user simulator automatically adapts its strategy based on your scenario description:
# Scenario: "User is confused about their bill"
# → User simulator will ask unclear questions, express confusion
# Scenario: "User is an expert developer reporting a bug"
# → User simulator will use technical language, provide detailed info
# Scenario: "User is elderly and not tech-savvy"
# → User simulator will ask basic questions, need more guidance
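A practical consequence is that you can exercise the same agent against several personas just by varying the description. A sketch, assuming the result exposes a success flag (the persona texts and criterion are illustrative):
personas = [
    "User is an expert developer reporting a bug with detailed reproduction steps",
    "User is elderly and not tech-savvy, needs step-by-step guidance",
]

for persona in personas:
    result = await scenario.run(
        name="persona coverage",
        description=persona,  # only the description changes; the agent under test stays the same
        agents=[
            TechSupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["Agent adapts its explanation to the user"]),
        ],
    )
    assert result.success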
The Judge Agent
The judge agent evaluates conversations against your success criteria.
Writing Effective Criteria
Good criteria are:
- Specific: Clearly describe what success looks like
- Measurable: Can be objectively evaluated
- Relevant: Related to your agent's purpose
- Achievable: Realistic given the agent's capabilities
# Good criteria
scenario.JudgeAgent(criteria=[
    "Agent asks for the user's account number or email",
    "Agent explains the billing issue in simple terms",
    "Agent offers at least two resolution options",
    "Agent provides a timeline for issue resolution"
])

# Avoid vague criteria
scenario.JudgeAgent(criteria=[
    "Agent is helpful",         # Too vague
    "Agent solves everything",  # Too broad
    "Agent is perfect"          # Unrealistic
])
Multiple Evaluation Points
The judge evaluates after each agent response, allowing it to:
- End the conversation early if criteria are met
- Fail immediately if something goes wrong
- Continue if more interaction is needed
# Judge evaluation happens after each agent response:
# Turn 1: Agent asks clarifying question → Judge: "Continue, need more info"
# Turn 2: Agent provides solution → Judge: "Success, all criteria met"
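Because the judge is strict about "do not" and "should not" criteria (see the default rules further below), phrasing a failure condition as a negative criterion lets the judge fail the run on the turn it is violated instead of waiting for max_turns. The criteria here are illustrative:
# A violated "do not" criterion can end the run on the very turn it happens
scenario.JudgeAgent(criteria=[
    "Agent addresses the billing issue",
    "Agent does not promise a refund before verifying the account",
])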
Customizing the Judge
You can customize judge behavior:
scenario.JudgeAgent(
    criteria=["Agent provides accurate information"],
    model="openai/o3",  # Use different model
    system_prompt="""
        <role>
        You are an LLM as a judge watching a simulated conversation as it plays out live to determine if the agent under test meets the criteria or not.
        </role>

        <goal>
        Your goal is to determine if you already have enough information to make a verdict of the scenario below, or if the conversation should continue for longer.
        If you do have enough information, use the finish_test tool to determine if all the criteria have been met, if not, use the continue_test tool to let the next step play out.
        </goal>

        <scenario>
        {description}
        </scenario>

        <criteria>
        {"\n".join(criteria)}
        </criteria>

        <rules>
        - Be strict, do not let the conversation continue if the agent already broke one of the "do not" or "should not" criterias.
        - DO NOT make any judgment calls that are not explicitly listed in the success or failure criteria, withhold judgement if necessary
        </rules>
    """
)
Scenario Organization
Related scenarios can be grouped into sets. Scenario sets are useful for:
- Grouping related tests for better organization
- Filtering events in monitoring and analytics systems
- Running targeted test suites based on categories
- Generating reports for specific areas of functionality
Grouping Your Sets and Batches
While optional, we strongly recommend setting stable identifiers for your scenarios, sets, and batches for better organization and tracking in LangWatch.
- set_id: Groups related scenarios into a test suite. This corresponds to the "Simulation Set" in the UI.
- batch_run_id: Groups all scenarios that were run together in a single execution (e.g., a single CI job). This is automatically generated but can be overridden.
result = await scenario.run(
    name="my first scenario",
    description="A simple test to see if the agent responds.",
    set_id="my-test-suite",
    agents=[
        scenario.Agent(my_agent),
        scenario.UserSimulatorAgent(),
    ]
)
You can also set the batch_run_id using environment variables for CI/CD integration:
import os

# Set batch ID for CI/CD integration
os.environ["SCENARIO_BATCH_RUN_ID"] = os.environ.get("GITHUB_RUN_ID", "local-run")

result = await scenario.run(
    name="my first scenario",
    description="A simple test to see if the agent responds.",
    set_id="my-test-suite",
    agents=[
        scenario.Agent(my_agent),
        scenario.UserSimulatorAgent(),
    ]
)
The batch_run_id is automatically generated for each test run, but you can also set it globally using the SCENARIO_BATCH_RUN_ID environment variable.
Next Steps
Dive deeper into specific aspects of Scenario:
- Writing Scenarios - Master the art of creating effective tests
- Scripted Simulations - Take full control of conversation flow
- Cache - Make your tests deterministic and faster
- Debug Mode - Debug your agents interactively