Package scenario
Scenario: Agent Testing Framework through Simulation Testing
Scenario is a comprehensive testing framework for AI agents that uses simulation testing to validate agent behavior through realistic conversations. It enables testing of both happy paths and edge cases by simulating user interactions and evaluating agent responses against configurable success criteria.
Key Features:
- End-to-end conversation testing with specified scenarios
- Flexible control from fully scripted to completely automated simulations
- Multi-turn evaluation designed for complex conversational agents
- Works with any testing framework (pytest, unittest, etc.)
- Framework-agnostic integration with any LLM or agent architecture
- Built-in caching for deterministic and faster test execution
Basic Usage:
import scenario

# Configure global settings
scenario.configure(default_model="openai/gpt-4.1-mini")

# Create your agent adapter
class MyAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        return my_agent_function(input.last_new_user_message_str())

# Run a scenario test
result = await scenario.run(
    name="customer service test",
    description="Customer asks about billing, agent should help politely",
    agents=[
        MyAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=[
            "Agent is polite and professional",
            "Agent addresses the billing question",
            "Agent provides clear next steps"
        ])
    ]
)

assert result.success
Advanced Usage:
# Script-controlled scenario with custom evaluations
def check_tool_usage(state: scenario.ScenarioState) -> None:
    assert state.has_tool_call("get_customer_info")

result = await scenario.run(
    name="scripted interaction",
    description="Test specific conversation flow",
    agents=[
        MyAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent provides helpful response"])
    ],
    script=[
        scenario.user("I have a billing question"),
        scenario.agent(),
        check_tool_usage,  # Custom assertion
        scenario.proceed(turns=2),  # Let it continue automatically
        scenario.succeed("All requirements met")
    ]
)
Integration with Testing Frameworks:
import pytest

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_weather_agent():
    result = await scenario.run(
        name="weather query",
        description="User asks about weather in a specific city",
        agents=[
            WeatherAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=["Provides accurate weather information"])
        ]
    )
    assert result.success
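Because scenario.run() is async and each call executes in its own isolated thread pool (see run() below), several scenarios can be launched concurrently from one test. A minimal sketch, assuming scenario.configure(...) has been called and MyAgent is the adapter from Basic Usage; the scenario names and criteria here are illustrative:

import asyncio
import scenario

async def run_billing_suite():
    # Each run() drives its own event loop in a worker thread, so the two
    # scenarios below progress in parallel rather than back to back.
    results = await asyncio.gather(
        scenario.run(
            name="billing question",
            description="Customer asks about billing, agent should help politely",
            agents=[
                MyAgent(),
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(criteria=["Agent addresses the billing question"]),
            ],
        ),
        scenario.run(
            name="refund request",
            description="Customer asks for a refund, agent should explain the policy",
            agents=[
                MyAgent(),
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(criteria=["Agent replies with the refund policy"]),
            ],
        ),
    )
    assert all(result.success for result in results)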
For more examples and detailed documentation, visit: https://github.com/langwatch/scenario
Expand source code
"""
Scenario: Agent Testing Framework through Simulation Testing
Scenario is a comprehensive testing framework for AI agents that uses simulation testing
to validate agent behavior through realistic conversations. It enables testing of both
happy paths and edge cases by simulating user interactions and evaluating agent responses
against configurable success criteria.
Key Features:
- End-to-end conversation testing with specified scenarios
- Flexible control from fully scripted to completely automated simulations
- Multi-turn evaluation designed for complex conversational agents
- Works with any testing framework (pytest, unittest, etc.)
- Framework-agnostic integration with any LLM or agent architecture
- Built-in caching for deterministic and faster test execution
Basic Usage:
import scenario
# Configure global settings
scenario.configure(default_model="openai/gpt-4.1-mini")
# Create your agent adapter
class MyAgent(scenario.AgentAdapter):
async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
return my_agent_function(input.last_new_user_message_str())
# Run a scenario test
result = await scenario.run(
name="customer service test",
description="Customer asks about billing, agent should help politely",
agents=[
MyAgent(),
scenario.UserSimulatorAgent(),
scenario.JudgeAgent(criteria=[
"Agent is polite and professional",
"Agent addresses the billing question",
"Agent provides clear next steps"
])
]
)
assert result.success
Advanced Usage:
# Script-controlled scenario with custom evaluations
def check_tool_usage(state: scenario.ScenarioState) -> None:
assert state.has_tool_call("get_customer_info")
result = await scenario.run(
name="scripted interaction",
description="Test specific conversation flow",
agents=[
MyAgent(),
scenario.UserSimulatorAgent(),
scenario.JudgeAgent(criteria=["Agent provides helpful response"])
],
script=[
scenario.user("I have a billing question"),
scenario.agent(),
check_tool_usage, # Custom assertion
scenario.proceed(turns=2), # Let it continue automatically
scenario.succeed("All requirements met")
]
)
Integration with Testing Frameworks:
import pytest
@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_weather_agent():
result = await scenario.run(
name="weather query",
description="User asks about weather in a specific city",
agents=[
WeatherAgent(),
scenario.UserSimulatorAgent(),
scenario.JudgeAgent(criteria=["Provides accurate weather information"])
]
)
assert result.success
For more examples and detailed documentation, visit: https://github.com/langwatch/scenario
"""
# First import non-dependent modules
from .types import ScenarioResult, AgentInput, AgentRole, AgentReturnTypes
from .config import ScenarioConfig
# Then import modules with dependencies
from .scenario_executor import run
from .scenario_state import ScenarioState
from .agent_adapter import AgentAdapter
from .judge_agent import JudgeAgent
from .user_simulator_agent import UserSimulatorAgent
from .cache import scenario_cache
from .script import message, user, agent, judge, proceed, succeed, fail
# Import pytest plugin components
# from .pytest_plugin import pytest_configure, scenario_reporter
configure = ScenarioConfig.configure
default_config = ScenarioConfig.default_config
cache = scenario_cache
__all__ = [
    # Functions
    "run",
    "configure",
    "default_config",
    "cache",
    # Script
    "message",
    "proceed",
    "succeed",
    "fail",
    "judge",
    "agent",
    "user",
    # Types
    "ScenarioResult",
    "AgentInput",
    "AgentRole",
    "ScenarioConfig",
    "AgentReturnTypes",
    # Classes
    "ScenarioState",
    "AgentAdapter",
    "UserSimulatorAgent",
    "JudgeAgent",
]
__version__ = "0.1.0"
Sub-modules
scenario.agent_adapter
    Agent adapter module for integrating custom agents with the Scenario framework …
scenario.config
    Configuration module for Scenario …
scenario.judge_agent
    Judge agent module for evaluating scenario conversations …
scenario.pytest_plugin
    Pytest plugin for Scenario testing library …
scenario.scenario_executor
    Scenario execution engine for agent testing …
scenario.scenario_state
    Scenario state management module …
scenario.script
    Scenario script DSL (Domain Specific Language) module …
scenario.types
scenario.user_simulator_agent
    User simulator agent module for generating realistic user interactions …
Functions
def agent(content: str | openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam | None = None) ‑> Callable[[ScenarioState], None] | Callable[[ScenarioState], ScenarioResult | None] | Callable[[ScenarioState], Awaitable[None]] | Callable[[ScenarioState], Awaitable[ScenarioResult | None]]
Generate or specify an agent response in the conversation.
If content is provided, it will be used as the agent response. If no content is provided, the agent under test will be called to generate its response based on the current conversation state.
Args
    content: Optional agent response content. Can be a string or full message dict. If None, the agent under test will generate content automatically.
Returns
    ScriptStep function that can be used in scenario scripts
Example

result = await scenario.run(
    name="agent response test",
    description="Testing agent responses",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent provides appropriate responses"])
    ],
    script=[
        scenario.user("Hello"),

        # Let agent generate its own response
        scenario.agent(),

        # Or specify exact agent response for testing edge cases
        scenario.agent("I'm sorry, I'm currently unavailable"),
        scenario.user(),  # See how user simulator reacts

        # Structured agent response with tool calls
        scenario.message({
            "role": "assistant",
            "content": "Let me search for that information",
            "tool_calls": [{"id": "call_123", "type": "function", ...}]
        }),

        scenario.succeed()
    ]
)

Expand source code

def agent(
    content: Optional[Union[str, ChatCompletionMessageParam]] = None,
) -> ScriptStep:
    """Generate or specify an agent response in the conversation."""
    return lambda state: state._executor.agent(content)
def cache(ignore=[])
Decorator for caching function calls during scenario execution.
This decorator caches function calls based on the scenario's cache_key, scenario configuration, and function arguments. It enables deterministic testing by ensuring the same inputs always produce the same outputs, making tests repeatable and faster on subsequent runs.
Args
    ignore: List of argument names to exclude from the cache key computation. Commonly used to ignore 'self' for instance methods or other non-deterministic arguments.
Returns
    Decorator function that can be applied to any function or method
Example

import scenario

class MyAgent:
    @scenario.cache(ignore=["self"])
    def invoke(self, message: str, context: dict) -> str:
        # This LLM call will be cached
        response = llm_client.complete(
            model="gpt-4",
            messages=[{"role": "user", "content": message}]
        )
        return response.choices[0].message.content

# Usage in tests
scenario.configure(cache_key="my-test-suite-v1")

# First run: makes actual LLM calls and caches results
result1 = await scenario.run(...)

# Second run: uses cached results, much faster
result2 = await scenario.run(...)
# result1 and result2 will be identical

Note
- Caching only occurs when a cache_key is set in the scenario configuration
- The cache key is computed from scenario config, function arguments, and cache_key
- AgentInput objects are specially handled to exclude thread_id from caching
- Both sync and async functions are supported (a minimal async sketch follows below)
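The same decorator applies unchanged to coroutine methods, since the wrapper detects async functions and routes them through the async cache path. A minimal sketch, where my_async_llm stands in for whatever awaitable completion call the agent already makes (caching still only happens once a cache_key is configured):

import scenario

class MyAsyncAgent:
    @scenario.cache(ignore=["self"])
    async def invoke(self, message: str, context: dict) -> str:
        # The awaited result is cached against the cache_key, scenario config,
        # and arguments, exactly as in the synchronous example above.
        return await my_async_llm(
            messages=[{"role": "user", "content": message}]
        )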
Expand source code
def scenario_cache(ignore=[]):
    """Decorator for caching function calls during scenario execution."""

    @wrapt.decorator
    def wrapper(wrapped: Callable, instance=None, args=[], kwargs={}):
        scenario: "ScenarioExecutor" = context_scenario.get()

        if not scenario.config.cache_key:
            return wrapped(*args, **kwargs)

        sig = inspect.signature(wrapped)
        parameters = list(sig.parameters.values())
        all_args = {
            str(parameter.name): value for parameter, value in zip(parameters, args)
        }
        for arg in ["self"] + ignore:
            if arg in all_args:
                del all_args[arg]

        for key, value in all_args.items():
            if isinstance(value, AgentInput):
                scenario_state = value.scenario_state.model_dump(exclude={"thread_id"})
                all_args[key] = value.model_dump(exclude={"thread_id"})
                all_args[key]["scenario_state"] = scenario_state

        cache_key = json.dumps(
            {
                "cache_key": scenario.config.cache_key,
                "scenario": scenario.config.model_dump(exclude={"agents"}),
                "all_args": all_args,
            },
            cls=SerializableWithStringFallback,
        )

        # if is an async function, we need to wrap it in a sync function
        if inspect.iscoroutinefunction(wrapped):
            return _async_cached_call(wrapped, args, kwargs, cache_key=cache_key)
        else:
            return _cached_call(wrapped, args, kwargs, cache_key=cache_key)

    return wrapper
def configure(default_model: str | None = None, max_turns: int | None = None, verbose: bool | int | None = None, cache_key: str | None = None, debug: bool | None = None) ‑> None
Set global configuration settings for all scenario executions.
This method allows you to configure default behavior that will be applied to all scenarios unless explicitly overridden in individual scenario runs.
Args
    default_model: Default LLM model identifier for user simulator and judge agents
    max_turns: Maximum number of conversation turns before timeout (default: 10)
    verbose: Enable verbose output during scenario execution
    cache_key: Cache key for deterministic scenario behavior across runs
    debug: Enable debug mode for step-by-step execution with user intervention
Example

import scenario

# Set up default configuration
scenario.configure(
    default_model="openai/gpt-4.1-mini",
    max_turns=15,
    verbose=True,
    debug=False
)

# All subsequent scenario runs will use these defaults
result = await scenario.run(
    name="my test",
    description="Test scenario",
    agents=[my_agent, scenario.UserSimulatorAgent(), scenario.JudgeAgent()]
)
def fail(reasoning: str | None = None) ‑> Callable[[ScenarioState], None] | Callable[[ScenarioState], ScenarioResult | None] | Callable[[ScenarioState], Awaitable[None]] | Callable[[ScenarioState], Awaitable[ScenarioResult | None]]
Immediately end the scenario with a failure result.
This function terminates the scenario execution and marks it as failed, bypassing any further agent interactions or judge evaluations.
Args
    reasoning: Optional explanation for why the scenario failed
Returns
    ScriptStep function that can be used in scenario scripts
Example

def safety_check(state: ScenarioState) -> None:
    last_msg = state.last_message()
    content = last_msg.get("content", "")
    if "harmful" in content.lower():
        return scenario.fail("Agent produced harmful content")()

result = await scenario.run(
    name="safety check test",
    description="Test safety boundaries",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent maintains safety guidelines"])
    ],
    script=[
        scenario.user("Tell me something dangerous"),
        scenario.agent(),
        safety_check,

        # Or explicit failure
        scenario.fail("Agent failed to meet safety requirements")
    ]
)

Expand source code

def fail(reasoning: Optional[str] = None) -> ScriptStep:
    """Immediately end the scenario with a failure result."""
    return lambda state: state._executor.fail(reasoning)
def judge(content: str | openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam | None = None) ‑> Callable[[ScenarioState], None] | Callable[[ScenarioState], ScenarioResult | None] | Callable[[ScenarioState], Awaitable[None]] | Callable[[ScenarioState], Awaitable[ScenarioResult | None]]
Invoke the judge agent to evaluate the current conversation state.
This function forces the judge agent to make a decision about whether the scenario should continue or end with a success/failure verdict. The judge will evaluate based on its configured criteria.
Args
    content: Optional message content for the judge. Usually None to let the judge evaluate based on its criteria.
Returns
    ScriptStep function that can be used in scenario scripts
Example

result = await scenario.run(
    name="judge evaluation test",
    description="Testing judge at specific points",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent provides coding help effectively"])
    ],
    script=[
        scenario.user("Can you help me code?"),
        scenario.agent(),

        # Force judge evaluation after first exchange
        scenario.judge(),  # May continue or end scenario

        # If scenario continues...
        scenario.user(),
        scenario.agent(),
        scenario.judge(),  # Final evaluation
    ]
)

Expand source code

def judge(
    content: Optional[Union[str, ChatCompletionMessageParam]] = None,
) -> ScriptStep:
    """Invoke the judge agent to evaluate the current conversation state."""
    return lambda state: state._executor.judge(content)
def message(message: openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam) ‑> Callable[[ScenarioState], None] | Callable[[ScenarioState], ScenarioResult | None] | Callable[[ScenarioState], Awaitable[None]] | Callable[[ScenarioState], Awaitable[ScenarioResult | None]]
Add a specific message to the conversation.
This function allows you to inject any OpenAI-compatible message directly into the conversation at a specific point in the script. Useful for simulating tool responses, system messages, or specific conversational states.
Args
    message: OpenAI-compatible message to add to the conversation
Returns
    ScriptStep function that can be used in scenario scripts
Example

result = await scenario.run(
    name="tool response test",
    description="Testing tool call responses",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent uses weather tool correctly"])
    ],
    script=[
        scenario.user("What's the weather?"),
        scenario.agent(),  # Agent calls weather tool
        scenario.message({
            "role": "tool",
            "tool_call_id": "call_123",
            "content": json.dumps({"temperature": "75°F", "condition": "sunny"})
        }),
        scenario.agent(),  # Agent processes tool response
        scenario.succeed()
    ]
)

Expand source code

def message(message: ChatCompletionMessageParam) -> ScriptStep:
    """Add a specific message to the conversation."""
    return lambda state: state._executor.message(message)
def proceed(turns: int | None = None, on_turn: Callable[[ForwardRef('ScenarioState')], None] | Callable[[ForwardRef('ScenarioState')], Awaitable[None]] | None = None, on_step: Callable[[ForwardRef('ScenarioState')], None] | Callable[[ForwardRef('ScenarioState')], Awaitable[None]] | None = None) ‑> Callable[[ScenarioState], None] | Callable[[ScenarioState], ScenarioResult | None] | Callable[[ScenarioState], Awaitable[None]] | Callable[[ScenarioState], Awaitable[ScenarioResult | None]]
Let the scenario proceed automatically for a specified number of turns.
This function allows the scenario to run automatically with the normal agent interaction flow (user -> agent -> judge evaluation). You can optionally provide callbacks to execute custom logic at each turn or step.
Args
    turns: Number of turns to proceed automatically. If None, proceeds until the judge agent decides to end the scenario or max_turns is reached.
    on_turn: Optional callback function called at the end of each turn
    on_step: Optional callback function called after each agent interaction
Returns
    ScriptStep function that can be used in scenario scripts
Example

def log_progress(state: ScenarioState) -> None:
    print(f"Turn {state.current_turn}: {len(state.messages)} messages")

def check_tool_usage(state: ScenarioState) -> None:
    if state.has_tool_call("dangerous_action"):
        raise AssertionError("Agent used forbidden tool!")

result = await scenario.run(
    name="automatic proceeding test",
    description="Let scenario run with monitoring",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent behaves safely and helpfully"])
    ],
    script=[
        scenario.user("Let's start"),
        scenario.agent(),

        # Let it proceed for 3 turns with monitoring
        scenario.proceed(
            turns=3,
            on_turn=log_progress,
            on_step=check_tool_usage
        ),

        # Then do final evaluation
        scenario.judge()
    ]
)

Expand source code

def proceed(
    turns: Optional[int] = None,
    on_turn: Optional[
        Union[
            Callable[["ScenarioState"], None],
            Callable[["ScenarioState"], Awaitable[None]],
        ]
    ] = None,
    on_step: Optional[
        Union[
            Callable[["ScenarioState"], None],
            Callable[["ScenarioState"], Awaitable[None]],
        ]
    ] = None,
) -> ScriptStep:
    """Let the scenario proceed automatically for a specified number of turns."""
    return lambda state: state._executor.proceed(turns, on_turn, on_step)
async def run(name: str, description: str, agents: List[AgentAdapter] = [], max_turns: int | None = None, verbose: bool | int | None = None, cache_key: str | None = None, debug: bool | None = None, script: List[Callable[[ForwardRef('ScenarioState')], None] | Callable[[ForwardRef('ScenarioState')], ScenarioResult | None] | Callable[[ForwardRef('ScenarioState')], Awaitable[None]] | Callable[[ForwardRef('ScenarioState')], Awaitable[ScenarioResult | None]]] | None = None, set_id: str | None = None) ‑> ScenarioResult
High-level interface for running a scenario test.
This is the main entry point for executing scenario tests. It creates a ScenarioExecutor instance and runs it in an isolated thread pool to support parallel execution and prevent blocking.
Args
    name: Human-readable name for the scenario
    description: Detailed description of what the scenario tests
    agents: List of agent adapters (agent under test, user simulator, judge)
    max_turns: Maximum conversation turns before timeout (default: 10)
    verbose: Show detailed output during execution
    cache_key: Cache key for deterministic behavior
    debug: Enable debug mode for step-by-step execution
    script: Optional script steps to control scenario flow
    set_id: Optional set identifier for grouping related scenarios
Returns
    ScenarioResult containing the test outcome, conversation history, success/failure status, and detailed reasoning
Example

import scenario

# Simple scenario with automatic flow
result = await scenario.run(
    name="help request",
    description="User asks for help with a technical problem",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent provides helpful response"])
    ],
    set_id="customer-support-tests"
)

# Scripted scenario with custom evaluations
result = await scenario.run(
    name="custom interaction",
    description="Test specific conversation flow",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent provides helpful response"])
    ],
    script=[
        scenario.user("Hello"),
        scenario.agent(),
        custom_eval,
        scenario.succeed()
    ],
    set_id="integration-tests"
)

# Results analysis
print(f"Test {'PASSED' if result.success else 'FAILED'}")
print(f"Reasoning: {result.reasoning}")
print(f"Conversation had {len(result.messages)} messages")

Expand source code

async def run(
    name: str,
    description: str,
    agents: List[AgentAdapter] = [],
    max_turns: Optional[int] = None,
    verbose: Optional[Union[bool, int]] = None,
    cache_key: Optional[str] = None,
    debug: Optional[bool] = None,
    script: Optional[List[ScriptStep]] = None,
    set_id: Optional[str] = None,
) -> ScenarioResult:
    """High-level interface for running a scenario test."""
    scenario = ScenarioExecutor(
        name=name,
        description=description,
        agents=agents,
        max_turns=max_turns,
        verbose=verbose,
        cache_key=cache_key,
        debug=debug,
        script=script,
        set_id=set_id,
    )

    # We'll use a thread pool to run the execution logic, we
    # require a separate thread because even though asyncio is
    # being used throughout, any user code on the callback can
    # be blocking, preventing them from running scenarios in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:

        def run_in_thread():
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)

            try:
                return loop.run_until_complete(scenario.run())
            finally:
                scenario.event_bus.drain()
                loop.close()

        # Run the function in the thread pool and await its result
        # This converts the thread's execution into a Future that the current
        # event loop can await without blocking
        loop = asyncio.get_event_loop()
        result = await loop.run_in_executor(executor, run_in_thread)

        return result
def succeed(reasoning: str | None = None) ‑> Callable[[ScenarioState], None] | Callable[[ScenarioState], ScenarioResult | None] | Callable[[ScenarioState], Awaitable[None]] | Callable[[ScenarioState], Awaitable[ScenarioResult | None]]
Immediately end the scenario with a success result.
This function terminates the scenario execution and marks it as successful, bypassing any further agent interactions or judge evaluations.
Args
    reasoning: Optional explanation for why the scenario succeeded
Returns
    ScriptStep function that can be used in scenario scripts
Example

def custom_success_check(state: ScenarioState) -> None:
    last_msg = state.last_message()
    if "solution" in last_msg.get("content", "").lower():
        # Custom success condition met
        return scenario.succeed("Agent provided a solution")()

result = await scenario.run(
    name="custom success test",
    description="Test custom success conditions",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent provides a solution"])
    ],
    script=[
        scenario.user("I need a solution"),
        scenario.agent(),
        custom_success_check,

        # Or explicit success
        scenario.succeed("Agent completed the task successfully")
    ]
)

Expand source code

def succeed(reasoning: Optional[str] = None) -> ScriptStep:
    """Immediately end the scenario with a success result."""
    return lambda state: state._executor.succeed(reasoning)
def user(content: str | openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam | None = None) ‑> Callable[[ScenarioState], None] | Callable[[ScenarioState], ScenarioResult | None] | Callable[[ScenarioState], Awaitable[None]] | Callable[[ScenarioState], Awaitable[ScenarioResult | None]]
Generate or specify a user message in the conversation.
If content is provided, it will be used as the user message. If no content is provided, the user simulator agent will automatically generate an appropriate message based on the scenario context.
Args
    content: Optional user message content. Can be a string or full message dict. If None, the user simulator will generate content automatically.
Returns
    ScriptStep function that can be used in scenario scripts
Example

result = await scenario.run(
    name="user interaction test",
    description="Testing specific user inputs",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent responds helpfully to user"])
    ],
    script=[
        # Specific user message
        scenario.user("I need help with Python"),
        scenario.agent(),

        # Auto-generated user message based on scenario context
        scenario.user(),
        scenario.agent(),

        # Structured user message with multimodal content
        scenario.message({
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/..."}}
            ]
        }),

        scenario.succeed()
    ]
)

Expand source code

def user(
    content: Optional[Union[str, ChatCompletionMessageParam]] = None,
) -> ScriptStep:
    """Generate or specify a user message in the conversation."""
    return lambda state: state._executor.user(content)
Classes
class AgentAdapter
Abstract base class for integrating custom agents with the Scenario framework.
This adapter pattern allows you to wrap any existing agent implementation (LLM calls, agent frameworks, or complex multi-step systems) to work with the Scenario testing framework. The adapter receives structured input about the conversation state and returns responses in a standardized format.
Attributes
    role: The role this agent plays in scenarios (USER, AGENT, or JUDGE)
Example

import scenario
from my_agent import MyCustomAgent

class MyAgentAdapter(scenario.AgentAdapter):
    def __init__(self):
        self.agent = MyCustomAgent()

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Get the latest user message
        user_message = input.last_new_user_message_str()

        # Call your existing agent
        response = await self.agent.process(
            message=user_message,
            history=input.messages,
            thread_id=input.thread_id
        )

        # Return the response (can be string, message dict, or list of messages)
        return response

# Use in a scenario
result = await scenario.run(
    name="test my agent",
    description="User asks for help with a coding problem",
    agents=[
        MyAgentAdapter(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Provides helpful coding advice"])
    ]
)
Note
- The call method must be async
- Return types can be: str, ChatCompletionMessageParam, List[ChatCompletionMessageParam], or ScenarioResult
- For stateful agents, use input.thread_id to maintain conversation context
- For stateless agents, use input.messages for the full conversation history (a minimal sketch of both patterns follows below)
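As a minimal sketch of the two patterns from the note above (the canned replies are placeholders for a real model or agent call):

import scenario

class StatelessAdapter(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # No memory of its own: the full OpenAI-format history arrives on every
        # call, so the adapter can be a pure function of input.messages.
        last_user = input.last_new_user_message_str()
        return f"Stateless reply to: {last_user}"

class StatefulAdapter(scenario.AgentAdapter):
    def __init__(self):
        self.history_by_thread: dict = {}

    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Keeps its own per-thread memory keyed by input.thread_id and only
        # consumes the new user message on each turn.
        history = self.history_by_thread.setdefault(input.thread_id, [])
        history.append(input.last_new_user_message_str())
        return f"Stateful reply after {len(history)} user message(s)"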
Expand source code
class AgentAdapter(ABC):
    """Abstract base class for integrating custom agents with the Scenario framework."""

    role: ClassVar[AgentRole] = AgentRole.AGENT

    @abstractmethod
    async def call(self, input: AgentInput) -> AgentReturnTypes:
        """Process the input and generate a response (see the call() documentation below)."""
        pass
Ancestors
- abc.ABC
Subclasses
- JudgeAgent
- UserSimulatorAgent
Class variables
var role : ClassVar[AgentRole]
    The role this agent plays in scenarios. Defaults to AgentRole.AGENT.
Methods
async def call(self, input: AgentInput) ‑> str | openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam | List[openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam] | ScenarioResult
Process the input and generate a response.
This is the main method that your agent implementation must provide. It receives structured information about the current conversation state and must return a response in one of the supported formats.
Args
    input: AgentInput containing conversation history, thread context, and scenario state
Returns
    AgentReturnTypes: The agent's response, which can be:
    - str: Simple text response
    - ChatCompletionMessageParam: Single OpenAI-format message
    - List[ChatCompletionMessageParam]: Multiple messages for complex responses
    - ScenarioResult: Direct test result (typically only used by judge agents)
Example

async def call(self, input: AgentInput) -> AgentReturnTypes:
    # Simple string response
    user_msg = input.last_new_user_message_str()
    return f"I understand you said: {user_msg}"

    # Or structured message response
    return {
        "role": "assistant",
        "content": "Let me help you with that...",
    }

    # Or multiple messages for complex interactions
    return [
        {"role": "assistant", "content": "Let me search for that information..."},
        {"role": "assistant", "content": "Here's what I found: ..."}
    ]

Expand source code

@abstractmethod
async def call(self, input: AgentInput) -> AgentReturnTypes:
    """Process the input and generate a response."""
    pass
class AgentInput (**data: Any)
Input data structure passed to agent adapters during scenario execution.
This class encapsulates all the information an agent needs to generate its next response, including conversation history, thread context, and scenario state. It provides convenient methods to access the most recent user messages.
Attributes
    thread_id: Unique identifier for the conversation thread
    messages: Complete conversation history as OpenAI-compatible messages
    new_messages: Only the new messages since the agent's last call
    judgment_request: Whether this call is requesting a judgment from a judge agent
    scenario_state: Current state of the scenario execution
Example

class MyAgent(AgentAdapter):
    async def call(self, input: AgentInput) -> str:
        # Get the latest user message
        user_msg = input.last_new_user_message_str()

        # Process with your LLM/agent
        response = await my_llm.complete(
            messages=input.messages,
            prompt=user_msg
        )

        return response
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model. self is explicitly positional-only to allow self as a field name.
Expand source code
class AgentInput(BaseModel):
    """Input data structure passed to agent adapters during scenario execution."""

    thread_id: str

    # Prevent pydantic from validating/parsing the messages and causing issues: https://github.com/pydantic/pydantic/issues/9541
    messages: Annotated[List[ChatCompletionMessageParam], SkipValidation]
    new_messages: Annotated[List[ChatCompletionMessageParam], SkipValidation]
    judgment_request: bool = False
    scenario_state: ScenarioStateType

    def last_new_user_message(self) -> ChatCompletionUserMessageParam:
        """Get the most recent user message from the new messages."""
        user_messages = [m for m in self.new_messages if m["role"] == "user"]
        if not user_messages:
            raise ValueError(
                "No new user messages found, did you mean to call the assistant twice? Perhaps change your adapter to use the full messages list instead."
            )
        return user_messages[-1]

    def last_new_user_message_str(self) -> str:
        """Get the content of the most recent user message as a string."""
        content = self.last_new_user_message()["content"]
        if type(content) != str:
            raise ValueError(
                f"Last user message is not a string: {content.__repr__()}. Please use the full messages list instead."
            )
        return content
Ancestors
- pydantic.main.BaseModel
Class variables
var judgment_request : bool
    Whether this call is requesting a judgment from a judge agent.
var messages : List[openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam]
    Complete conversation history as OpenAI-compatible messages.
var model_config
    Pydantic model configuration.
var new_messages : List[openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam]
    Only the new messages since the agent's last call.
var scenario_state : Any
    Current state of the scenario execution.
var thread_id : str
    Unique identifier for the conversation thread.
Methods
def last_new_user_message(self) ‑> openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam
Get the most recent user message from the new messages.
Returns
    The last user message in OpenAI message format
Raises
    ValueError: If no new user messages are found
Example

user_message = input.last_new_user_message()
content = user_message["content"]

Expand source code

def last_new_user_message(self) -> ChatCompletionUserMessageParam:
    """Get the most recent user message from the new messages."""
    user_messages = [m for m in self.new_messages if m["role"] == "user"]
    if not user_messages:
        raise ValueError(
            "No new user messages found, did you mean to call the assistant twice? Perhaps change your adapter to use the full messages list instead."
        )
    return user_messages[-1]
def last_new_user_message_str(self) ‑> str
Get the content of the most recent user message as a string.
This is a convenience method for getting simple text content from user messages. For multimodal messages or complex content, use last_new_user_message() instead.
Returns
    The text content of the last user message
Raises
    ValueError: If no new user messages are found or if the message content is not a string
Example

user_text = input.last_new_user_message_str()
response = f"You said: {user_text}"

Expand source code

def last_new_user_message_str(self) -> str:
    """Get the content of the most recent user message as a string."""
    content = self.last_new_user_message()["content"]
    if type(content) != str:
        raise ValueError(
            f"Last user message is not a string: {content.__repr__()}. Please use the full messages list instead."
        )
    return content
class AgentRole (*args, **kwds)
Defines the different roles that agents can play in a scenario.
This enum is used to identify the role of each agent during scenario execution, enabling the framework to determine the order and interaction patterns between different types of agents.
Attributes
    USER: Represents a user simulator agent that generates user inputs
    AGENT: Represents the agent under test that responds to user inputs
    JUDGE: Represents a judge agent that evaluates the conversation and determines success/failure
Expand source code
class AgentRole(Enum):
    """Defines the different roles that agents can play in a scenario."""

    USER = "User"
    AGENT = "Agent"
    JUDGE = "Judge"
Ancestors
- enum.Enum
Class variables
var AGENT
    Represents the agent under test that responds to user inputs (value: "Agent").
var JUDGE
    Represents a judge agent that evaluates the conversation and determines success or failure (value: "Judge").
var USER
    Represents a user simulator agent that generates user inputs (value: "User").
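The role is mostly consumed by the framework itself, but it can also be checked in custom assertions or reports. A short sketch, assuming scenario.configure(...) has been called and MyAgent is the adapter from the examples above:

import scenario
from scenario import AgentRole

# Judge agents advertise AgentRole.JUDGE; custom adapters inherit
# AgentRole.AGENT from AgentAdapter unless they override it.
judge = scenario.JudgeAgent(criteria=["Agent is helpful"])
assert judge.role == AgentRole.JUDGE
assert MyAgent().role == AgentRole.AGENT

# The enum values are plain strings, which is handy for logging or reporting
print([role.value for role in AgentRole])  # ['User', 'Agent', 'Judge']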
class JudgeAgent (*, criteria: List[str] | None = None, model: str | None = None, api_key: str | None = None, temperature: float = 0.0, max_tokens: int | None = None, system_prompt: str | None = None)
Agent that evaluates conversations against success criteria.
The JudgeAgent watches conversations in real-time and makes decisions about whether the agent under test is meeting the specified criteria. It can either allow the conversation to continue or end it with a success/failure verdict.
The judge uses function calling to make structured decisions and provides detailed reasoning for its verdicts. It evaluates each criterion independently and provides comprehensive feedback about what worked and what didn't.
Attributes
    role: Always AgentRole.JUDGE for judge agents
    model: LLM model identifier to use for evaluation
    api_key: Optional API key for the model provider
    temperature: Sampling temperature for evaluation consistency
    max_tokens: Maximum tokens for judge reasoning
    criteria: List of success criteria to evaluate against
    system_prompt: Custom system prompt to override default judge behavior
Example

import scenario

# Basic judge agent with criteria
judge = scenario.JudgeAgent(
    criteria=[
        "Agent provides helpful responses",
        "Agent asks relevant follow-up questions",
        "Agent does not provide harmful information"
    ]
)

# Customized judge with specific model and behavior
strict_judge = scenario.JudgeAgent(
    model="openai/gpt-4.1-mini",
    criteria=[
        "Code examples are syntactically correct",
        "Explanations are technically accurate",
        "Security best practices are mentioned"
    ],
    temperature=0.0,  # More deterministic evaluation
    system_prompt="You are a strict technical reviewer evaluating code quality."
)

# Use in scenario
result = await scenario.run(
    name="coding assistant test",
    description="User asks for help with Python functions",
    agents=[
        coding_agent,
        scenario.UserSimulatorAgent(),
        judge
    ]
)

print(f"Passed criteria: {result.passed_criteria}")
print(f"Failed criteria: {result.failed_criteria}")
Note
- Judge agents evaluate conversations continuously, not just at the end
- They can end scenarios early if clear success/failure conditions are met
- They provide detailed reasoning for their decisions
- They support both positive criteria (things that should happen) and negative criteria (things that shouldn't), as sketched below
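Because criteria can be phrased both positively and negatively, a single judge can enforce required and forbidden behaviour in the same run. A small sketch with hypothetical customer-support criteria:
import scenario

support_judge = scenario.JudgeAgent(
    model="openai/gpt-4.1-mini",
    criteria=[
        # Positive criteria: behaviour that must happen
        "Agent acknowledges the customer's complaint",
        "Agent offers a concrete next step",
        # Negative criteria: behaviour that must not happen
        "Agent should not promise refunds it cannot authorize",
        "Agent should not share another customer's personal information",
    ],
)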
Initialize a judge agent with evaluation criteria.
Args
criteria
- List of success criteria to evaluate the conversation against. Can include both positive requirements ("Agent provides helpful responses") and negative constraints ("Agent should not provide personal information").
model
- LLM model identifier (e.g., "openai/gpt-4.1-mini"). If not provided, uses the default model from global configuration.
api_key
- API key for the model provider. If not provided, uses the key from global configuration or environment.
temperature
- Sampling temperature for evaluation (0.0-1.0). Lower values (0.0-0.2) recommended for consistent evaluation.
max_tokens
- Maximum number of tokens for judge reasoning and explanations.
system_prompt
- Custom system prompt to override default judge behavior. Use this to create specialized evaluation perspectives.
Raises
Exception
- If no model is configured either in parameters or global config
Example
# Customer service judge
cs_judge = JudgeAgent(
    criteria=[
        "Agent replies with the refund policy",
        "Agent offers next steps for the customer",
    ],
    temperature=0.1
)

# Technical accuracy judge
tech_judge = JudgeAgent(
    criteria=[
        "Agent adds a code review pointing out the code compilation errors",
        "Agent adds a code review about the missing security headers"
    ],
    system_prompt="You are a senior software engineer reviewing code for production use."
)
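As noted under Args, the model parameter is optional when a global default has been configured; a minimal sketch of relying on that fallback (the criterion is hypothetical):
import scenario

# Register a default model once; judges created afterwards can omit `model`
scenario.configure(default_model="openai/gpt-4.1-mini")

# No `model` argument: falls back to the global default above; without it,
# construction raises the Exception documented under "Raises"
billing_judge = scenario.JudgeAgent(
    criteria=["Agent resolves the billing question"],  # hypothetical criterion
)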
Expand source code
class JudgeAgent(AgentAdapter): """ Agent that evaluates conversations against success criteria. The JudgeAgent watches conversations in real-time and makes decisions about whether the agent under test is meeting the specified criteria. It can either allow the conversation to continue or end it with a success/failure verdict. The judge uses function calling to make structured decisions and provides detailed reasoning for its verdicts. It evaluates each criterion independently and provides comprehensive feedback about what worked and what didn't. Attributes: role: Always AgentRole.JUDGE for judge agents model: LLM model identifier to use for evaluation api_key: Optional API key for the model provider temperature: Sampling temperature for evaluation consistency max_tokens: Maximum tokens for judge reasoning criteria: List of success criteria to evaluate against system_prompt: Custom system prompt to override default judge behavior Example: ``` import scenario # Basic judge agent with criteria judge = scenario.JudgeAgent( criteria=[ "Agent provides helpful responses", "Agent asks relevant follow-up questions", "Agent does not provide harmful information" ] ) # Customized judge with specific model and behavior strict_judge = scenario.JudgeAgent( model="openai/gpt-4.1-mini", criteria=[ "Code examples are syntactically correct", "Explanations are technically accurate", "Security best practices are mentioned" ], temperature=0.0, # More deterministic evaluation system_prompt="You are a strict technical reviewer evaluating code quality." ) # Use in scenario result = await scenario.run( name="coding assistant test", description="User asks for help with Python functions", agents=[ coding_agent, scenario.UserSimulatorAgent(), judge ] ) print(f"Passed criteria: {result.passed_criteria}") print(f"Failed criteria: {result.failed_criteria}") ``` Note: - Judge agents evaluate conversations continuously, not just at the end - They can end scenarios early if clear success/failure conditions are met - Provide detailed reasoning for their decisions - Support both positive criteria (things that should happen) and negative criteria (things that shouldn't) """ role = AgentRole.JUDGE model: str api_key: Optional[str] temperature: float max_tokens: Optional[int] criteria: List[str] system_prompt: Optional[str] def __init__( self, *, criteria: Optional[List[str]] = None, model: Optional[str] = None, api_key: Optional[str] = None, temperature: float = 0.0, max_tokens: Optional[int] = None, system_prompt: Optional[str] = None, ): """ Initialize a judge agent with evaluation criteria. Args: criteria: List of success criteria to evaluate the conversation against. Can include both positive requirements ("Agent provides helpful responses") and negative constraints ("Agent should not provide personal information"). model: LLM model identifier (e.g., "openai/gpt-4.1-mini"). If not provided, uses the default model from global configuration. api_key: API key for the model provider. If not provided, uses the key from global configuration or environment. temperature: Sampling temperature for evaluation (0.0-1.0). Lower values (0.0-0.2) recommended for consistent evaluation. max_tokens: Maximum number of tokens for judge reasoning and explanations. system_prompt: Custom system prompt to override default judge behavior. Use this to create specialized evaluation perspectives. 
Raises: Exception: If no model is configured either in parameters or global config Example: ``` # Customer service judge cs_judge = JudgeAgent( criteria=[ "Agent replies with the refund policy", "Agent offers next steps for the customer", ], temperature=0.1 ) # Technical accuracy judge tech_judge = JudgeAgent( criteria=[ "Agent adds a code review pointing out the code compilation errors", "Agent adds a code review about the missing security headers" ], system_prompt="You are a senior software engineer reviewing code for production use." ) ``` """ # Override the default system prompt for the judge agent self.criteria = criteria or [] self.api_key = api_key self.temperature = temperature self.max_tokens = max_tokens self.system_prompt = system_prompt if model: self.model = model if ScenarioConfig.default_config is not None and isinstance( ScenarioConfig.default_config.default_model, str ): self.model = model or ScenarioConfig.default_config.default_model elif ScenarioConfig.default_config is not None and isinstance( ScenarioConfig.default_config.default_model, ModelConfig ): self.model = model or ScenarioConfig.default_config.default_model.model self.api_key = ( api_key or ScenarioConfig.default_config.default_model.api_key ) self.temperature = ( temperature or ScenarioConfig.default_config.default_model.temperature ) self.max_tokens = ( max_tokens or ScenarioConfig.default_config.default_model.max_tokens ) if not hasattr(self, "model"): raise Exception(agent_not_configured_error_message("TestingAgent")) @scenario_cache() async def call( self, input: AgentInput, ) -> AgentReturnTypes: """ Evaluate the current conversation state against the configured criteria. This method analyzes the conversation history and determines whether the scenario should continue or end with a verdict. It uses function calling to make structured decisions and provides detailed reasoning. Args: input: AgentInput containing conversation history and scenario context Returns: AgentReturnTypes: Either an empty list (continue scenario) or a ScenarioResult (end scenario with verdict) Raises: Exception: If the judge cannot make a valid decision or if there's an error in the evaluation process Note: - Returns empty list [] to continue the scenario - Returns ScenarioResult to end with success/failure - Provides detailed reasoning for all decisions - Evaluates each criterion independently - Can end scenarios early if clear violation or success is detected """ scenario = input.scenario_state criteria_str = "\n".join( [f"{idx + 1}. {criterion}" for idx, criterion in enumerate(self.criteria)] ) messages = [ { "role": "system", "content": self.system_prompt or f""" <role> You are an LLM as a judge watching a simulated conversation as it plays out live to determine if the agent under test meets the criteria or not. </role> <goal> Your goal is to determine if you already have enough information to make a verdict of the scenario below, or if the conversation should continue for longer. If you do have enough information, use the finish_test tool to determine if all the criteria have been met, if not, use the continue_test tool to let the next step play out. </goal> <scenario> {scenario.description} </scenario> <criteria> {criteria_str} </criteria> <rules> - Be strict, do not let the conversation continue if the agent already broke one of the "do not" or "should not" criterias. 
- DO NOT make any judgment calls that are not explicitly listed in the success or failure criteria, withhold judgement if necessary </rules> """, }, *input.messages, ] is_last_message = ( input.scenario_state.current_turn == input.scenario_state.config.max_turns ) if is_last_message: messages.append( { "role": "user", "content": """ System: <finish_test> This is the last message, conversation has reached the maximum number of turns, give your final verdict, if you don't have enough information to make a verdict, say inconclusive with max turns reached. </finish_test> """, } ) # Define the tools criteria_names = [ re.sub( r"[^a-zA-Z0-9]", "_", criterion.replace(" ", "_").replace("'", "").lower(), )[:70] for criterion in self.criteria ] tools = [ { "type": "function", "function": { "name": "continue_test", "description": "Continue the test with the next step", "strict": True, "parameters": { "type": "object", "properties": {}, "required": [], "additionalProperties": False, }, }, }, { "type": "function", "function": { "name": "finish_test", "description": "Complete the test with a final verdict", "strict": True, "parameters": { "type": "object", "properties": { "criteria": { "type": "object", "properties": { criteria_names[idx]: { "enum": [True, False, "inconclusive"], "description": criterion, } for idx, criterion in enumerate(self.criteria) }, "required": criteria_names, "additionalProperties": False, "description": "Strict verdict for each criterion", }, "reasoning": { "type": "string", "description": "Explanation of what the final verdict should be", }, "verdict": { "type": "string", "enum": ["success", "failure", "inconclusive"], "description": "The final verdict of the test", }, }, "required": ["criteria", "reasoning", "verdict"], "additionalProperties": False, }, }, }, ] enforce_judgment = input.judgment_request has_criteria = len(self.criteria) > 0 if enforce_judgment and not has_criteria: return ScenarioResult( success=False, messages=[], reasoning="TestingAgent was called as a judge, but it has no criteria to judge against", ) response = cast( ModelResponse, completion( model=self.model, messages=messages, temperature=self.temperature, max_tokens=self.max_tokens, tools=tools, tool_choice=( {"type": "function", "function": {"name": "finish_test"}} if (is_last_message or enforce_judgment) and has_criteria else "required" ), ), ) # Extract the content from the response if hasattr(response, "choices") and len(response.choices) > 0: message = cast(Choices, response.choices[0]).message # Check if the LLM chose to use the tool if message.tool_calls: tool_call = message.tool_calls[0] if tool_call.function.name == "continue_test": return [] if tool_call.function.name == "finish_test": # Parse the tool call arguments try: args = json.loads(tool_call.function.arguments) verdict = args.get("verdict", "inconclusive") reasoning = args.get("reasoning", "No reasoning provided") criteria = args.get("criteria", {}) passed_criteria = [ self.criteria[idx] for idx, criterion in enumerate(criteria.values()) if criterion == True ] failed_criteria = [ self.criteria[idx] for idx, criterion in enumerate(criteria.values()) if criterion == False ] # Return the appropriate ScenarioResult based on the verdict return ScenarioResult( success=verdict == "success" and len(failed_criteria) == 0, messages=messages, reasoning=reasoning, passed_criteria=passed_criteria, failed_criteria=failed_criteria, ) except json.JSONDecodeError: raise Exception( f"Failed to parse tool call arguments from judge agent: 
{tool_call.function.arguments}" ) else: raise Exception( f"Invalid tool call from judge agent: {tool_call.function.name}" ) else: raise Exception( f"Invalid response from judge agent, tool calls not found: {message.__repr__()}" ) else: raise Exception( f"Unexpected response format from LLM: {response.__repr__()}" )
Ancestors
- AgentAdapter
- abc.ABC
Class variables
var api_key : str | None
- Optional API key for the model provider
var criteria : List[str]
- List of success criteria to evaluate against
var max_tokens : int | None
- Maximum tokens for judge reasoning
var model : str
- LLM model identifier to use for evaluation
var system_prompt : str | None
- Custom system prompt to override default judge behavior
var temperature : float
- Sampling temperature for evaluation consistency
Methods
async def call(self, input: AgentInput) ‑> str | openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam | List[openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam] | ScenarioResult
-
Evaluate the current conversation state against the configured criteria.
This method analyzes the conversation history and determines whether the scenario should continue or end with a verdict. It uses function calling to make structured decisions and provides detailed reasoning.
Args
input
- AgentInput containing conversation history and scenario context
Returns
AgentReturnTypes
- Either an empty list (continue scenario) or a ScenarioResult (end scenario with verdict)
Raises
Exception
- If the judge cannot make a valid decision or if there's an error in the evaluation process
Note
- Returns an empty list [] to continue the scenario
- Returns a ScenarioResult to end with success/failure (see the sketch below)
- Provides detailed reasoning for all decisions
- Evaluates each criterion independently
- Can end scenarios early if a clear violation or success is detected
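The return type alone distinguishes the two outcomes: an empty list means the judge chose to continue, while a ScenarioResult carries a final verdict. Normally the framework drives this call itself; the sketch below only illustrates how a caller could branch on the two return shapes, assuming an AgentInput supplied by the framework:
import scenario

async def evaluate_step(judge: scenario.JudgeAgent, input: scenario.AgentInput) -> bool:
    """Return True if the judge ended the scenario, False to keep going."""
    outcome = await judge.call(input)
    if isinstance(outcome, scenario.ScenarioResult):
        status = "success" if outcome.success else "failure"
        print(f"Judge verdict: {status} - {outcome.reasoning}")
        return True
    # An empty list means the judge chose the continue_test tool
    return False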
Expand source code
@scenario_cache() async def call( self, input: AgentInput, ) -> AgentReturnTypes: """ Evaluate the current conversation state against the configured criteria. This method analyzes the conversation history and determines whether the scenario should continue or end with a verdict. It uses function calling to make structured decisions and provides detailed reasoning. Args: input: AgentInput containing conversation history and scenario context Returns: AgentReturnTypes: Either an empty list (continue scenario) or a ScenarioResult (end scenario with verdict) Raises: Exception: If the judge cannot make a valid decision or if there's an error in the evaluation process Note: - Returns empty list [] to continue the scenario - Returns ScenarioResult to end with success/failure - Provides detailed reasoning for all decisions - Evaluates each criterion independently - Can end scenarios early if clear violation or success is detected """ scenario = input.scenario_state criteria_str = "\n".join( [f"{idx + 1}. {criterion}" for idx, criterion in enumerate(self.criteria)] ) messages = [ { "role": "system", "content": self.system_prompt or f""" <role> You are an LLM as a judge watching a simulated conversation as it plays out live to determine if the agent under test meets the criteria or not. </role> <goal> Your goal is to determine if you already have enough information to make a verdict of the scenario below, or if the conversation should continue for longer. If you do have enough information, use the finish_test tool to determine if all the criteria have been met, if not, use the continue_test tool to let the next step play out. </goal> <scenario> {scenario.description} </scenario> <criteria> {criteria_str} </criteria> <rules> - Be strict, do not let the conversation continue if the agent already broke one of the "do not" or "should not" criterias. - DO NOT make any judgment calls that are not explicitly listed in the success or failure criteria, withhold judgement if necessary </rules> """, }, *input.messages, ] is_last_message = ( input.scenario_state.current_turn == input.scenario_state.config.max_turns ) if is_last_message: messages.append( { "role": "user", "content": """ System: <finish_test> This is the last message, conversation has reached the maximum number of turns, give your final verdict, if you don't have enough information to make a verdict, say inconclusive with max turns reached. 
</finish_test> """, } ) # Define the tools criteria_names = [ re.sub( r"[^a-zA-Z0-9]", "_", criterion.replace(" ", "_").replace("'", "").lower(), )[:70] for criterion in self.criteria ] tools = [ { "type": "function", "function": { "name": "continue_test", "description": "Continue the test with the next step", "strict": True, "parameters": { "type": "object", "properties": {}, "required": [], "additionalProperties": False, }, }, }, { "type": "function", "function": { "name": "finish_test", "description": "Complete the test with a final verdict", "strict": True, "parameters": { "type": "object", "properties": { "criteria": { "type": "object", "properties": { criteria_names[idx]: { "enum": [True, False, "inconclusive"], "description": criterion, } for idx, criterion in enumerate(self.criteria) }, "required": criteria_names, "additionalProperties": False, "description": "Strict verdict for each criterion", }, "reasoning": { "type": "string", "description": "Explanation of what the final verdict should be", }, "verdict": { "type": "string", "enum": ["success", "failure", "inconclusive"], "description": "The final verdict of the test", }, }, "required": ["criteria", "reasoning", "verdict"], "additionalProperties": False, }, }, }, ] enforce_judgment = input.judgment_request has_criteria = len(self.criteria) > 0 if enforce_judgment and not has_criteria: return ScenarioResult( success=False, messages=[], reasoning="TestingAgent was called as a judge, but it has no criteria to judge against", ) response = cast( ModelResponse, completion( model=self.model, messages=messages, temperature=self.temperature, max_tokens=self.max_tokens, tools=tools, tool_choice=( {"type": "function", "function": {"name": "finish_test"}} if (is_last_message or enforce_judgment) and has_criteria else "required" ), ), ) # Extract the content from the response if hasattr(response, "choices") and len(response.choices) > 0: message = cast(Choices, response.choices[0]).message # Check if the LLM chose to use the tool if message.tool_calls: tool_call = message.tool_calls[0] if tool_call.function.name == "continue_test": return [] if tool_call.function.name == "finish_test": # Parse the tool call arguments try: args = json.loads(tool_call.function.arguments) verdict = args.get("verdict", "inconclusive") reasoning = args.get("reasoning", "No reasoning provided") criteria = args.get("criteria", {}) passed_criteria = [ self.criteria[idx] for idx, criterion in enumerate(criteria.values()) if criterion == True ] failed_criteria = [ self.criteria[idx] for idx, criterion in enumerate(criteria.values()) if criterion == False ] # Return the appropriate ScenarioResult based on the verdict return ScenarioResult( success=verdict == "success" and len(failed_criteria) == 0, messages=messages, reasoning=reasoning, passed_criteria=passed_criteria, failed_criteria=failed_criteria, ) except json.JSONDecodeError: raise Exception( f"Failed to parse tool call arguments from judge agent: {tool_call.function.arguments}" ) else: raise Exception( f"Invalid tool call from judge agent: {tool_call.function.name}" ) else: raise Exception( f"Invalid response from judge agent, tool calls not found: {message.__repr__()}" ) else: raise Exception( f"Unexpected response format from LLM: {response.__repr__()}" )
Inherited members
class ScenarioConfig (**data: Any)
-
Global configuration class for the Scenario testing framework.
This class allows users to set default behavior and parameters that apply to all scenario executions, including the LLM model to use for simulator and judge agents, execution limits, and debugging options.
Attributes
default_model
- Default LLM model configuration for agents (can be string or ModelConfig)
max_turns
- Maximum number of conversation turns before scenario times out
verbose
- Whether to show detailed output during execution (True/False or verbosity level)
cache_key
- Key for caching scenario results to ensure deterministic behavior
debug
- Whether to enable debug mode with step-by-step interaction
Example
# Configure globally for all scenarios
scenario.configure(
    default_model="openai/gpt-4.1-mini",
    max_turns=15,
    verbose=True,
    cache_key="my-test-suite-v1",
    debug=False
)

# Or create a specific config instance
config = ScenarioConfig(
    default_model=ModelConfig(
        model="openai/gpt-4.1-mini",
        temperature=0.2
    ),
    max_turns=20
)
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Expand source code
class ScenarioConfig(BaseModel): """ Global configuration class for the Scenario testing framework. This class allows users to set default behavior and parameters that apply to all scenario executions, including the LLM model to use for simulator and judge agents, execution limits, and debugging options. Attributes: default_model: Default LLM model configuration for agents (can be string or ModelConfig) max_turns: Maximum number of conversation turns before scenario times out verbose: Whether to show detailed output during execution (True/False or verbosity level) cache_key: Key for caching scenario results to ensure deterministic behavior debug: Whether to enable debug mode with step-by-step interaction Example: ``` # Configure globally for all scenarios scenario.configure( default_model="openai/gpt-4.1-mini", max_turns=15, verbose=True, cache_key="my-test-suite-v1", debug=False ) # Or create a specific config instance config = ScenarioConfig( default_model=ModelConfig( model="openai/gpt-4.1-mini", temperature=0.2 ), max_turns=20 ) ``` """ default_model: Optional[Union[str, ModelConfig]] = None max_turns: Optional[int] = 10 verbose: Optional[Union[bool, int]] = True cache_key: Optional[str] = None debug: Optional[bool] = False default_config: ClassVar[Optional["ScenarioConfig"]] = None @classmethod def configure( cls, default_model: Optional[str] = None, max_turns: Optional[int] = None, verbose: Optional[Union[bool, int]] = None, cache_key: Optional[str] = None, debug: Optional[bool] = None, ) -> None: """ Set global configuration settings for all scenario executions. This method allows you to configure default behavior that will be applied to all scenarios unless explicitly overridden in individual scenario runs. Args: default_model: Default LLM model identifier for user simulator and judge agents max_turns: Maximum number of conversation turns before timeout (default: 10) verbose: Enable verbose output during scenario execution cache_key: Cache key for deterministic scenario behavior across runs debug: Enable debug mode for step-by-step execution with user intervention Example: ``` import scenario # Set up default configuration scenario.configure( default_model="openai/gpt-4.1-mini", max_turns=15, verbose=True, debug=False ) # All subsequent scenario runs will use these defaults result = await scenario.run( name="my test", description="Test scenario", agents=[my_agent, scenario.UserSimulatorAgent(), scenario.JudgeAgent()] ) ``` """ existing_config = cls.default_config or ScenarioConfig() cls.default_config = existing_config.merge( ScenarioConfig( default_model=default_model, max_turns=max_turns, verbose=verbose, cache_key=cache_key, debug=debug, ) ) def merge(self, other: "ScenarioConfig") -> "ScenarioConfig": """ Merge this configuration with another configuration. Values from the other configuration will override values in this configuration where they are not None. Args: other: Another ScenarioConfig instance to merge with Returns: A new ScenarioConfig instance with merged values Example: ``` base_config = ScenarioConfig(max_turns=10, verbose=True) override_config = ScenarioConfig(max_turns=20) merged = base_config.merge(override_config) # Result: max_turns=20, verbose=True ``` """ return ScenarioConfig( **{ **self.items(), **other.items(), } ) def items(self): """ Get configuration items as a dictionary. 
Returns: Dictionary of configuration key-value pairs, excluding None values Example: ``` config = ScenarioConfig(max_turns=15, verbose=True) items = config.items() # Result: {"max_turns": 15, "verbose": True} ``` """ return {k: getattr(self, k) for k in self.model_dump(exclude_none=True).keys()}
Ancestors
- pydantic.main.BaseModel
Class variables
var cache_key : str | None
- Key for caching scenario results to ensure deterministic behavior
var debug : bool | None
- Whether to enable debug mode with step-by-step interaction
var default_config : ClassVar[ScenarioConfig | None]
- The globally shared configuration set via configure(), if any
var default_model : str | ModelConfig | None
- Default LLM model configuration for agents (can be string or ModelConfig)
var max_turns : int | None
- Maximum number of conversation turns before scenario times out
var model_config
- Pydantic model configuration
var verbose : bool | int | None
- Whether to show detailed output during execution (True/False or verbosity level)
Static methods
def configure(default_model: str | None = None, max_turns: int | None = None, verbose: bool | int | None = None, cache_key: str | None = None, debug: bool | None = None) ‑> None
-
Set global configuration settings for all scenario executions.
This method allows you to configure default behavior that will be applied to all scenarios unless explicitly overridden in individual scenario runs.
Args
default_model
- Default LLM model identifier for user simulator and judge agents
max_turns
- Maximum number of conversation turns before timeout (default: 10)
verbose
- Enable verbose output during scenario execution
cache_key
- Cache key for deterministic scenario behavior across runs
debug
- Enable debug mode for step-by-step execution with user intervention
Example
import scenario

# Set up default configuration
scenario.configure(
    default_model="openai/gpt-4.1-mini",
    max_turns=15,
    verbose=True,
    debug=False
)

# All subsequent scenario runs will use these defaults
result = await scenario.run(
    name="my test",
    description="Test scenario",
    agents=[my_agent, scenario.UserSimulatorAgent(), scenario.JudgeAgent()]
)
Methods
def items(self)
-
Get configuration items as a dictionary.
Returns
Dictionary of configuration key-value pairs, excluding None values
Example
config = ScenarioConfig(max_turns=15, verbose=True)
items = config.items()
# Result: {"max_turns": 15, "verbose": True}
Expand source code
def items(self):
    """
    Get configuration items as a dictionary.

    Returns:
        Dictionary of configuration key-value pairs, excluding None values

    Example:
        ```
        config = ScenarioConfig(max_turns=15, verbose=True)
        items = config.items()
        # Result: {"max_turns": 15, "verbose": True}
        ```
    """
    return {k: getattr(self, k) for k in self.model_dump(exclude_none=True).keys()}
def merge(self, other: ScenarioConfig) ‑> ScenarioConfig
-
Merge this configuration with another configuration.
Values from the other configuration will override values in this configuration where they are not None.
Args
other
- Another ScenarioConfig instance to merge with
Returns
A new ScenarioConfig instance with merged values
Example
base_config = ScenarioConfig(max_turns=10, verbose=True)
override_config = ScenarioConfig(max_turns=20)

merged = base_config.merge(override_config)
# Result: max_turns=20, verbose=True
Expand source code
def merge(self, other: "ScenarioConfig") -> "ScenarioConfig":
    """
    Merge this configuration with another configuration.

    Values from the other configuration will override values in this
    configuration where they are not None.

    Args:
        other: Another ScenarioConfig instance to merge with

    Returns:
        A new ScenarioConfig instance with merged values

    Example:
        ```
        base_config = ScenarioConfig(max_turns=10, verbose=True)
        override_config = ScenarioConfig(max_turns=20)

        merged = base_config.merge(override_config)
        # Result: max_turns=20, verbose=True
        ```
    """
    return ScenarioConfig(
        **{
            **self.items(),
            **other.items(),
        }
    )
class ScenarioResult (**data: Any)
-
Represents the final result of a scenario test execution.
This class contains all the information about how a scenario performed, including whether it succeeded, the conversation that took place, and detailed reasoning about which criteria were met or failed.
Attributes
success
- Whether the scenario passed all criteria and completed successfully
messages
- Complete conversation history that occurred during the scenario
reasoning
- Detailed explanation of why the scenario succeeded or failed
passed_criteria
- List of success criteria that were satisfied
failed_criteria
- List of success criteria that were not satisfied
total_time
- Total execution time in seconds (if measured)
agent_time
- Time spent in agent calls in seconds (if measured)
Example
result = await scenario.run(
    name="weather query",
    description="User asks about weather",
    agents=[
        weather_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent provides helpful weather information"])
    ]
)

print(f"Test {'PASSED' if result.success else 'FAILED'}")
print(f"Reasoning: {result.reasoning}")

if not result.success:
    print("Failed criteria:")
    for criteria in result.failed_criteria:
        print(f"  - {criteria}")
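Because total_time and agent_time are populated when timing is measured, the same result object can also be used for lightweight performance reporting. A small follow-up sketch on the result above:
# Criteria breakdown plus timing, when the framework measured it
for criterion in result.passed_criteria:
    print(f"  PASSED: {criterion}")
for criterion in result.failed_criteria:
    print(f"  FAILED: {criterion}")

if result.total_time is not None and result.agent_time is not None:
    overhead = result.total_time - result.agent_time
    print(f"Total {result.total_time:.2f}s, agent calls {result.agent_time:.2f}s, overhead {overhead:.2f}s")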
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Expand source code
class ScenarioResult(BaseModel):
    """
    Represents the final result of a scenario test execution.

    This class contains all the information about how a scenario performed,
    including whether it succeeded, the conversation that took place, and
    detailed reasoning about which criteria were met or failed.

    Attributes:
        success: Whether the scenario passed all criteria and completed successfully
        messages: Complete conversation history that occurred during the scenario
        reasoning: Detailed explanation of why the scenario succeeded or failed
        passed_criteria: List of success criteria that were satisfied
        failed_criteria: List of success criteria that were not satisfied
        total_time: Total execution time in seconds (if measured)
        agent_time: Time spent in agent calls in seconds (if measured)

    Example:
        ```
        result = await scenario.run(
            name="weather query",
            description="User asks about weather",
            agents=[
                weather_agent,
                scenario.UserSimulatorAgent(),
                scenario.JudgeAgent(criteria=["Agent provides helpful weather information"])
            ]
        )

        print(f"Test {'PASSED' if result.success else 'FAILED'}")
        print(f"Reasoning: {result.reasoning}")

        if not result.success:
            print("Failed criteria:")
            for criteria in result.failed_criteria:
                print(f"  - {criteria}")
        ```
    """

    success: bool
    # Prevent issues with slightly inconsistent message types (for example when
    # coming from Gemini) right at the result level
    messages: Annotated[List[ChatCompletionMessageParam], SkipValidation]
    reasoning: Optional[str] = None
    passed_criteria: List[str] = []
    failed_criteria: List[str] = []
    total_time: Optional[float] = None
    agent_time: Optional[float] = None

    def __repr__(self) -> str:
        """
        Provide a concise representation for debugging and logging.

        Returns:
            A string representation showing success status and reasoning
        """
        status = "PASSED" if self.success else "FAILED"
        return f"ScenarioResult(success={self.success}, status={status}, reasoning='{self.reasoning or 'None'}')"
Ancestors
- pydantic.main.BaseModel
Class variables
var agent_time : float | None
- Time spent in agent calls in seconds (if measured)
var failed_criteria : List[str]
- List of success criteria that were not satisfied
var messages : List[openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam]
- Complete conversation history that occurred during the scenario
var model_config
- Pydantic model configuration
var passed_criteria : List[str]
- List of success criteria that were satisfied
var reasoning : str | None
- Detailed explanation of why the scenario succeeded or failed
var success : bool
- Whether the scenario passed all criteria and completed successfully
var total_time : float | None
- Total execution time in seconds (if measured)
class ScenarioState (**data: Any)
-
Represents the current state of a scenario execution.
This class provides access to the conversation history, turn information, and utility methods for inspecting messages and tool calls. It's passed to script step functions and available through AgentInput.scenario_state.
Attributes
description
- The scenario description that guides the simulation
messages
- Complete conversation history as OpenAI-compatible messages
thread_id
- Unique identifier for this conversation thread
current_turn
- Current turn number in the conversation
config
- Configuration settings for this scenario execution
Example
def check_agent_behavior(state: ScenarioState) -> None:
    # Check if the agent called a specific tool
    if state.has_tool_call("get_weather"):
        print("Agent successfully called weather tool")

    # Get the last user message
    last_user = state.last_user_message()
    print(f"User said: {last_user['content']}")

    # Check conversation length
    if len(state.messages) > 10:
        print("Conversation is getting long")

# Use in scenario script
result = await scenario.run(
    name="tool usage test",
    description="Test that agent uses the correct tools",
    agents=[
        my_agent,
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=["Agent provides helpful response"])
    ],
    script=[
        scenario.user("What's the weather like?"),
        scenario.agent(),
        check_agent_behavior,  # Custom inspection function
        scenario.succeed()
    ]
)
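The state also carries current_turn and thread_id, so script steps can make turn-dependent assertions. A hedged sketch using a hypothetical escalation tool name:
def no_early_escalation(state: ScenarioState) -> None:
    # Hypothetical rule: the agent must not call the escalation tool
    # within the first two turns of the conversation
    if state.current_turn <= 2:
        assert not state.has_tool_call("escalate_to_human"), (
            f"Escalated too early in thread {state.thread_id}"
        )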
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Expand source code
class ScenarioState(BaseModel): """ Represents the current state of a scenario execution. This class provides access to the conversation history, turn information, and utility methods for inspecting messages and tool calls. It's passed to script step functions and available through AgentInput.scenario_state. Attributes: description: The scenario description that guides the simulation messages: Complete conversation history as OpenAI-compatible messages thread_id: Unique identifier for this conversation thread current_turn: Current turn number in the conversation config: Configuration settings for this scenario execution Example: ``` def check_agent_behavior(state: ScenarioState) -> None: # Check if the agent called a specific tool if state.has_tool_call("get_weather"): print("Agent successfully called weather tool") # Get the last user message last_user = state.last_user_message() print(f"User said: {last_user['content']}") # Check conversation length if len(state.messages) > 10: print("Conversation is getting long") # Use in scenario script result = await scenario.run( name="tool usage test", description="Test that agent uses the correct tools", agents=[ my_agent, scenario.UserSimulatorAgent(), scenario.JudgeAgent(criteria=["Agent provides helpful response"]) ], script=[ scenario.user("What's the weather like?"), scenario.agent(), check_agent_behavior, # Custom inspection function scenario.succeed() ] ) ``` """ description: str messages: List[ChatCompletionMessageParam] thread_id: str current_turn: int config: ScenarioConfig _executor: "ScenarioExecutor" def add_message(self, message: ChatCompletionMessageParam): """ Add a message to the conversation history. This method delegates to the scenario executor to properly handle message broadcasting and state updates. Args: message: OpenAI-compatible message to add to the conversation Example: ``` def inject_system_message(state: ScenarioState) -> None: state.add_message({ "role": "system", "content": "The user is now in a hurry" }) ``` """ self._executor.add_message(message) def last_message(self) -> ChatCompletionMessageParam: """ Get the most recent message in the conversation. Returns: The last message in the conversation history Raises: ValueError: If no messages exist in the conversation Example: ``` def check_last_response(state: ScenarioState) -> None: last = state.last_message() if last["role"] == "assistant": content = last.get("content", "") assert "helpful" in content.lower() ``` """ if len(self.messages) == 0: raise ValueError("No messages found") return self.messages[-1] def last_user_message(self) -> ChatCompletionUserMessageParam: """ Get the most recent user message in the conversation. Returns: The last user message in the conversation history Raises: ValueError: If no user messages exist in the conversation Example: ``` def analyze_user_intent(state: ScenarioState) -> None: user_msg = state.last_user_message() content = user_msg["content"] if isinstance(content, str): if "urgent" in content.lower(): print("User expressed urgency") ``` """ user_messages = [m for m in self.messages if m["role"] == "user"] if not user_messages: raise ValueError("No user messages found") return user_messages[-1] def last_tool_call( self, tool_name: str ) -> Optional[ChatCompletionMessageToolCallParam]: """ Find the most recent call to a specific tool in the conversation. Searches through the conversation history in reverse order to find the last time the specified tool was called by an assistant. 
Args: tool_name: Name of the tool to search for Returns: The tool call object if found, None otherwise Example: ``` def verify_weather_call(state: ScenarioState) -> None: weather_call = state.last_tool_call("get_current_weather") if weather_call: args = json.loads(weather_call["function"]["arguments"]) assert "location" in args print(f"Weather requested for: {args['location']}") ``` """ for message in reversed(self.messages): if message["role"] == "assistant" and "tool_calls" in message: for tool_call in message["tool_calls"]: if tool_call["function"]["name"] == tool_name: return tool_call return None def has_tool_call(self, tool_name: str) -> bool: """ Check if a specific tool has been called in the conversation. This is a convenience method that returns True if the specified tool has been called at any point in the conversation. Args: tool_name: Name of the tool to check for Returns: True if the tool has been called, False otherwise Example: ``` def ensure_tool_usage(state: ScenarioState) -> None: # Verify the agent used required tools assert state.has_tool_call("search_database") assert state.has_tool_call("format_results") # Check it didn't use forbidden tools assert not state.has_tool_call("delete_data") ``` """ return self.last_tool_call(tool_name) is not None
Ancestors
- pydantic.main.BaseModel
Class variables
var config : ScenarioConfig
- Configuration settings for this scenario execution
var current_turn : int
- Current turn number in the conversation
var description : str
- The scenario description that guides the simulation
var messages : List[openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam]
- Complete conversation history as OpenAI-compatible messages
var model_config
- Pydantic model configuration
var thread_id : str
- Unique identifier for this conversation thread
Methods
def add_message(self, message: openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam)
-
Add a message to the conversation history.
This method delegates to the scenario executor to properly handle message broadcasting and state updates.
Args
message
- OpenAI-compatible message to add to the conversation
Example
def inject_system_message(state: ScenarioState) -> None:
    state.add_message({
        "role": "system",
        "content": "The user is now in a hurry"
    })
Expand source code
def add_message(self, message: ChatCompletionMessageParam):
    """
    Add a message to the conversation history.

    This method delegates to the scenario executor to properly handle
    message broadcasting and state updates.

    Args:
        message: OpenAI-compatible message to add to the conversation

    Example:
        ```
        def inject_system_message(state: ScenarioState) -> None:
            state.add_message({
                "role": "system",
                "content": "The user is now in a hurry"
            })
        ```
    """
    self._executor.add_message(message)
def has_tool_call(self, tool_name: str) ‑> bool
-
Check if a specific tool has been called in the conversation.
This is a convenience method that returns True if the specified tool has been called at any point in the conversation.
Args
tool_name
- Name of the tool to check for
Returns
True if the tool has been called, False otherwise
Example
def ensure_tool_usage(state: ScenarioState) -> None:
    # Verify the agent used required tools
    assert state.has_tool_call("search_database")
    assert state.has_tool_call("format_results")

    # Check it didn't use forbidden tools
    assert not state.has_tool_call("delete_data")
Expand source code
def has_tool_call(self, tool_name: str) -> bool:
    """
    Check if a specific tool has been called in the conversation.

    This is a convenience method that returns True if the specified
    tool has been called at any point in the conversation.

    Args:
        tool_name: Name of the tool to check for

    Returns:
        True if the tool has been called, False otherwise

    Example:
        ```
        def ensure_tool_usage(state: ScenarioState) -> None:
            # Verify the agent used required tools
            assert state.has_tool_call("search_database")
            assert state.has_tool_call("format_results")

            # Check it didn't use forbidden tools
            assert not state.has_tool_call("delete_data")
        ```
    """
    return self.last_tool_call(tool_name) is not None
def last_message(self) ‑> openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam | openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam | openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam | openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam | openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam | openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam
-
Get the most recent message in the conversation.
Returns
The last message in the conversation history
Raises
ValueError
- If no messages exist in the conversation
Example
def check_last_response(state: ScenarioState) -> None:
    last = state.last_message()
    if last["role"] == "assistant":
        content = last.get("content", "")
        assert "helpful" in content.lower()
Expand source code
def last_message(self) -> ChatCompletionMessageParam:
    """
    Get the most recent message in the conversation.

    Returns:
        The last message in the conversation history

    Raises:
        ValueError: If no messages exist in the conversation

    Example:
        ```
        def check_last_response(state: ScenarioState) -> None:
            last = state.last_message()
            if last["role"] == "assistant":
                content = last.get("content", "")
                assert "helpful" in content.lower()
        ```
    """
    if len(self.messages) == 0:
        raise ValueError("No messages found")
    return self.messages[-1]
def last_tool_call(self, tool_name: str) ‑> openai.types.chat.chat_completion_message_tool_call_param.ChatCompletionMessageToolCallParam | None
-
Find the most recent call to a specific tool in the conversation.
Searches through the conversation history in reverse order to find the last time the specified tool was called by an assistant.
Args
tool_name
- Name of the tool to search for
Returns
The tool call object if found, None otherwise
Example
def verify_weather_call(state: ScenarioState) -> None:
    weather_call = state.last_tool_call("get_current_weather")
    if weather_call:
        args = json.loads(weather_call["function"]["arguments"])
        assert "location" in args
        print(f"Weather requested for: {args['location']}")
Expand source code
def last_tool_call(
    self, tool_name: str
) -> Optional[ChatCompletionMessageToolCallParam]:
    """
    Find the most recent call to a specific tool in the conversation.

    Searches through the conversation history in reverse order to find
    the last time the specified tool was called by an assistant.

    Args:
        tool_name: Name of the tool to search for

    Returns:
        The tool call object if found, None otherwise

    Example:
        ```
        def verify_weather_call(state: ScenarioState) -> None:
            weather_call = state.last_tool_call("get_current_weather")
            if weather_call:
                args = json.loads(weather_call["function"]["arguments"])
                assert "location" in args
                print(f"Weather requested for: {args['location']}")
        ```
    """
    for message in reversed(self.messages):
        if message["role"] == "assistant" and "tool_calls" in message:
            for tool_call in message["tool_calls"]:
                if tool_call["function"]["name"] == tool_name:
                    return tool_call
    return None
def last_user_message(self) ‑> openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam
-
Get the most recent user message in the conversation.
Returns
The last user message in the conversation history
Raises
ValueError
- If no user messages exist in the conversation
Example
def analyze_user_intent(state: ScenarioState) -> None:
    user_msg = state.last_user_message()
    content = user_msg["content"]

    if isinstance(content, str):
        if "urgent" in content.lower():
            print("User expressed urgency")
Expand source code
def last_user_message(self) -> ChatCompletionUserMessageParam:
    """
    Get the most recent user message in the conversation.

    Returns:
        The last user message in the conversation history

    Raises:
        ValueError: If no user messages exist in the conversation

    Example:
        ```
        def analyze_user_intent(state: ScenarioState) -> None:
            user_msg = state.last_user_message()
            content = user_msg["content"]

            if isinstance(content, str):
                if "urgent" in content.lower():
                    print("User expressed urgency")
        ```
    """
    user_messages = [m for m in self.messages if m["role"] == "user"]
    if not user_messages:
        raise ValueError("No user messages found")
    return user_messages[-1]
def model_post_init(self: BaseModel, context: Any, /) ‑> None
-
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that's what pydantic-core passes when calling it.
Args
self
- The BaseModel instance.
context
- The context.
Expand source code
def init_private_attributes(self: BaseModel, context: Any, /) -> None:
    """This function is meant to behave like a BaseModel method to initialise private attributes.

    It takes context as an argument since that's what pydantic-core passes when calling it.

    Args:
        self: The BaseModel instance.
        context: The context.
    """
    if getattr(self, '__pydantic_private__', None) is None:
        pydantic_private = {}
        for name, private_attr in self.__private_attributes__.items():
            default = private_attr.get_default()
            if default is not PydanticUndefined:
                pydantic_private[name] = default
        object_setattr(self, '__pydantic_private__', pydantic_private)
class UserSimulatorAgent (*, model: str | None = None, api_key: str | None = None, temperature: float = 0.0, max_tokens: int | None = None, system_prompt: str | None = None)
-
Agent that simulates realistic user behavior in scenario conversations.
This agent generates user messages that are appropriate for the given scenario context, simulating how a real human user would interact with the agent under test. It uses an LLM to generate natural, contextually relevant user inputs that help drive the conversation forward according to the scenario description.
Attributes
role
- Always AgentRole.USER for user simulator agents
model
- LLM model identifier to use for generating user messages
api_key
- Optional API key for the model provider
temperature
- Sampling temperature for response generation
max_tokens
- Maximum tokens to generate in user messages
system_prompt
- Custom system prompt to override default user simulation behavior
Example
import scenario

# Basic user simulator with default behavior
user_sim = scenario.UserSimulatorAgent(
    model="openai/gpt-4.1-mini"
)

# Customized user simulator
custom_user_sim = scenario.UserSimulatorAgent(
    model="openai/gpt-4.1-mini",
    temperature=0.3,
    system_prompt="You are a technical user who asks detailed questions"
)

# Use in scenario
result = await scenario.run(
    name="user interaction test",
    description="User seeks help with Python programming",
    agents=[
        my_programming_agent,
        user_sim,
        scenario.JudgeAgent(criteria=["Provides helpful code examples"])
    ]
)
Note
- The user simulator automatically generates short, natural user messages
- It follows the scenario description to stay on topic
- Messages are generated in a casual, human-like style (lowercase, brief, etc.)
- The simulator will not act as an assistant - it only generates user inputs
Initialize a user simulator agent.
Args
model
- LLM model identifier (e.g., "openai/gpt-4.1-mini"). If not provided, uses the default model from global configuration.
api_key
- API key for the model provider. If not provided, uses the key from global configuration or environment.
temperature
- Sampling temperature for message generation (0.0-1.0). Lower values make responses more deterministic.
max_tokens
- Maximum number of tokens to generate in user messages. If not provided, uses model defaults.
system_prompt
- Custom system prompt to override default user simulation behavior. Use this to create specialized user personas or behaviors.
Raises
Exception
- If no model is configured either in parameters or global config
Example
# Basic user simulator
user_sim = UserSimulatorAgent(model="openai/gpt-4.1-mini")

# User simulator with custom persona
expert_user = UserSimulatorAgent(
    model="openai/gpt-4.1-mini",
    temperature=0.2,
    system_prompt='''
        You are an expert software developer testing an AI coding assistant.
        Ask challenging, technical questions and be demanding about code quality.
    '''
)
Expand source code
class UserSimulatorAgent(AgentAdapter): """ Agent that simulates realistic user behavior in scenario conversations. This agent generates user messages that are appropriate for the given scenario context, simulating how a real human user would interact with the agent under test. It uses an LLM to generate natural, contextually relevant user inputs that help drive the conversation forward according to the scenario description. Attributes: role: Always AgentRole.USER for user simulator agents model: LLM model identifier to use for generating user messages api_key: Optional API key for the model provider temperature: Sampling temperature for response generation max_tokens: Maximum tokens to generate in user messages system_prompt: Custom system prompt to override default user simulation behavior Example: ``` import scenario # Basic user simulator with default behavior user_sim = scenario.UserSimulatorAgent( model="openai/gpt-4.1-mini" ) # Customized user simulator custom_user_sim = scenario.UserSimulatorAgent( model="openai/gpt-4.1-mini", temperature=0.3, system_prompt="You are a technical user who asks detailed questions" ) # Use in scenario result = await scenario.run( name="user interaction test", description="User seeks help with Python programming", agents=[ my_programming_agent, user_sim, scenario.JudgeAgent(criteria=["Provides helpful code examples"]) ] ) ``` Note: - The user simulator automatically generates short, natural user messages - It follows the scenario description to stay on topic - Messages are generated in a casual, human-like style (lowercase, brief, etc.) - The simulator will not act as an assistant - it only generates user inputs """ role = AgentRole.USER model: str api_key: Optional[str] temperature: float max_tokens: Optional[int] system_prompt: Optional[str] def __init__( self, *, model: Optional[str] = None, api_key: Optional[str] = None, temperature: float = 0.0, max_tokens: Optional[int] = None, system_prompt: Optional[str] = None, ): """ Initialize a user simulator agent. Args: model: LLM model identifier (e.g., "openai/gpt-4.1-mini"). If not provided, uses the default model from global configuration. api_key: API key for the model provider. If not provided, uses the key from global configuration or environment. temperature: Sampling temperature for message generation (0.0-1.0). Lower values make responses more deterministic. max_tokens: Maximum number of tokens to generate in user messages. If not provided, uses model defaults. system_prompt: Custom system prompt to override default user simulation behavior. Use this to create specialized user personas or behaviors. Raises: Exception: If no model is configured either in parameters or global config Example: ``` # Basic user simulator user_sim = UserSimulatorAgent(model="openai/gpt-4.1-mini") # User simulator with custom persona expert_user = UserSimulatorAgent( model="openai/gpt-4.1-mini", temperature=0.2, system_prompt=''' You are an expert software developer testing an AI coding assistant. Ask challenging, technical questions and be demanding about code quality. 
''' ) ``` """ # Override the default system prompt for the user simulator agent self.api_key = api_key self.temperature = temperature self.max_tokens = max_tokens self.system_prompt = system_prompt if model: self.model = model if ScenarioConfig.default_config is not None and isinstance( ScenarioConfig.default_config.default_model, str ): self.model = model or ScenarioConfig.default_config.default_model elif ScenarioConfig.default_config is not None and isinstance( ScenarioConfig.default_config.default_model, ModelConfig ): self.model = model or ScenarioConfig.default_config.default_model.model self.api_key = ( api_key or ScenarioConfig.default_config.default_model.api_key ) self.temperature = ( temperature or ScenarioConfig.default_config.default_model.temperature ) self.max_tokens = ( max_tokens or ScenarioConfig.default_config.default_model.max_tokens ) if not hasattr(self, "model"): raise Exception(agent_not_configured_error_message("TestingAgent")) @scenario_cache() async def call( self, input: AgentInput, ) -> AgentReturnTypes: """ Generate the next user message in the conversation. This method analyzes the current conversation state and scenario context to generate an appropriate user message that moves the conversation forward in a realistic, human-like manner. Args: input: AgentInput containing conversation history and scenario context Returns: AgentReturnTypes: A user message in OpenAI format that continues the conversation Note: - Messages are generated in a casual, human-like style - The simulator follows the scenario description to stay contextually relevant - Uses role reversal internally to work around LLM biases toward assistant roles - Results are cached when cache_key is configured for deterministic testing """ scenario = input.scenario_state messages = [ { "role": "system", "content": self.system_prompt or f""" <role> You are pretending to be a user, you are testing an AI Agent (shown as the user role) based on a scenario. Approach this naturally, as a human user would, with very short inputs, few words, all lowercase, imperative, not periods, like when they google or talk to chatgpt. </role> <goal> Your goal (assistant) is to interact with the Agent Under Test (user) as if you were a human user to see if it can complete the scenario successfully. </goal> <scenario> {scenario.description} </scenario> <rules> - DO NOT carry over any requests yourself, YOU ARE NOT the assistant today, you are the user </rules> """, }, {"role": "assistant", "content": "Hello, how can I help you today?"}, *input.messages, ] # User to assistant role reversal # LLM models are biased to always be the assistant not the user, so we need to do this reversal otherwise models like GPT 4.5 is # super confused, and Claude 3.7 even starts throwing exceptions. messages = reverse_roles(messages) response = cast( ModelResponse, completion( model=self.model, messages=messages, temperature=self.temperature, max_tokens=self.max_tokens, tools=[], ), ) # Extract the content from the response if hasattr(response, "choices") and len(response.choices) > 0: message = cast(Choices, response.choices[0]).message message_content = message.content if message_content is None: raise Exception(f"No response from LLM: {response.__repr__()}") return {"role": "user", "content": message_content} else: raise Exception( f"Unexpected response format from LLM: {response.__repr__()}" )
Ancestors
- AgentAdapter
- abc.ABC
Class variables
var api_key : str | None
-
Optional API key for the model provider.
var max_tokens : int | None
-
Maximum number of tokens to generate in user messages.
var model : str
-
LLM model identifier used to generate user messages.
var system_prompt : str | None
-
Custom system prompt that overrides the default user simulation behavior.
var temperature : float
-
Sampling temperature for message generation; lower values are more deterministic.
Methods
async def call(self, input: AgentInput) ‑> AgentReturnTypes (str | ChatCompletionMessageParam | List[ChatCompletionMessageParam] | ScenarioResult)
-
Generate the next user message in the conversation.
This method analyzes the current conversation state and scenario context to generate an appropriate user message that moves the conversation forward in a realistic, human-like manner.
Args
input
- AgentInput containing conversation history and scenario context
Returns
AgentReturnTypes
- A user message in OpenAI format that continues the conversation
Note
- Messages are generated in a casual, human-like style
- The simulator follows the scenario description to stay contextually relevant
- Uses role reversal internally to work around LLM biases toward assistant roles
- Results are cached when cache_key is configured for deterministic testing
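A minimal sketch of enabling that caching globally, assuming `scenario.configure` accepts a `cache_key` parameter as the note above implies; the key value itself is arbitrary.

```
import scenario

# With a stable cache_key, repeated runs reuse previously generated simulator
# messages, which keeps multi-turn tests deterministic and faster.
scenario.configure(
    default_model="openai/gpt-4.1-mini",
    cache_key="billing-scenarios-v1",  # hypothetical key; any stable string works
)
```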
Expand source code
@scenario_cache()
async def call(
    self,
    input: AgentInput,
) -> AgentReturnTypes:
    """
    Generate the next user message in the conversation.

    This method analyzes the current conversation state and scenario context to
    generate an appropriate user message that moves the conversation forward in
    a realistic, human-like manner.

    Args:
        input: AgentInput containing conversation history and scenario context

    Returns:
        AgentReturnTypes: A user message in OpenAI format that continues the conversation

    Note:
        - Messages are generated in a casual, human-like style
        - The simulator follows the scenario description to stay contextually relevant
        - Uses role reversal internally to work around LLM biases toward assistant roles
        - Results are cached when cache_key is configured for deterministic testing
    """
    scenario = input.scenario_state

    messages = [
        {
            "role": "system",
            "content": self.system_prompt
            or f"""
<role>
You are pretending to be a user, you are testing an AI Agent (shown as the user role) based on a scenario.
Approach this naturally, as a human user would, with very short inputs, few words, all lowercase, imperative, not periods, like when they google or talk to chatgpt.
</role>

<goal>
Your goal (assistant) is to interact with the Agent Under Test (user) as if you were a human user to see if it can complete the scenario successfully.
</goal>

<scenario>
{scenario.description}
</scenario>

<rules>
- DO NOT carry over any requests yourself, YOU ARE NOT the assistant today, you are the user
</rules>
""",
        },
        {"role": "assistant", "content": "Hello, how can I help you today?"},
        *input.messages,
    ]

    # User-to-assistant role reversal.
    # LLM models are biased to always be the assistant, not the user, so we need this
    # reversal; otherwise models like GPT-4.5 get super confused, and Claude 3.7 even
    # starts throwing exceptions.
    messages = reverse_roles(messages)

    response = cast(
        ModelResponse,
        completion(
            model=self.model,
            messages=messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
            tools=[],
        ),
    )

    # Extract the content from the response
    if hasattr(response, "choices") and len(response.choices) > 0:
        message = cast(Choices, response.choices[0]).message

        message_content = message.content
        if message_content is None:
            raise Exception(f"No response from LLM: {response.__repr__()}")

        return {"role": "user", "content": message_content}
    else:
        raise Exception(
            f"Unexpected response format from LLM: {response.__repr__()}"
        )
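The `reverse_roles` helper called above is defined elsewhere in the package and is not shown in this listing. Conceptually it swaps the `user` and `assistant` roles so the LLM generates the user side of the dialogue. A rough, illustrative sketch of that idea (not the library's actual implementation):

```
from typing import Any, Dict, List

def reverse_roles_sketch(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Illustrative only: swap user/assistant roles so the LLM 'plays' the user side."""
    flipped = {"user": "assistant", "assistant": "user"}
    reversed_messages = []
    for message in messages:
        role = message.get("role")
        # System and tool messages keep their original role.
        reversed_messages.append({**message, "role": flipped.get(role, role)})
    return reversed_messages
```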
Inherited members