Testing Tool Calls in Scenarios
Tool calls are a core part of modern agent workflows. This guide covers how to write scenario tests that verify tool usage, how to assert on tool call behavior, and how to mock or script tool call results for robust, deterministic tests.
Checking for Tool Calls
To verify that your agent makes the correct tool call, use the state.has_tool_call("tool_name") API in an assertion function. In this framework, assertion functions are placed directly in the script list as steps, after the relevant agent and user turns; this is the idiomatic and supported way to check for tool calls in your scenario tests.
import pytest
import scenario


@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_weather_agent_tool_call():
    class WeatherAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            # Your agent logic that triggers a tool call
            return await my_weather_agent.process(input.messages)

    # Define a custom assertion for the tool call
    def check_for_weather_tool_call(state: scenario.ScenarioState):
        assert state.has_tool_call("get_current_weather")

    result = await scenario.run(
        name="weather tool call",
        description="User asks for the weather in Paris.",
        agents=[
            WeatherAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent should call the weather tool with the correct location"
            ]),
        ],
        script=[
            scenario.user(),
            scenario.agent(),
            check_for_weather_tool_call,  # Assertion function as a script step
            scenario.succeed(),
        ],
    )

    assert result.success
Asserting Tool Call Arguments
Sometimes you need to check not just that a tool was called, but that it was called with the correct arguments.
def check_tool_call_args(state):
    tool_calls = state.latest_agent_message().tool_calls
    assert tool_calls, "No tool calls found"
    assert tool_calls[0].function.name == "get_current_weather"
    assert "Paris" in tool_calls[0].function.arguments


result = await scenario.run(
    ...,
    script=[
        scenario.user("What's the weather in Paris?"),
        scenario.agent(),
        check_tool_call_args,
        scenario.succeed(),
    ],
)
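Because function.arguments is a JSON-encoded string, a substring check like the one above can pass on partial matches. For exact matching, a minimal sketch (reusing the same state accessors as the example above, and assuming the tool takes a single location argument) is to parse the arguments first:

import json


def check_tool_call_args_strict(state):
    tool_calls = state.latest_agent_message().tool_calls
    assert tool_calls, "No tool calls found"
    call = tool_calls[0]
    assert call.function.name == "get_current_weather"
    # Parse the JSON-encoded arguments and compare exact values
    args = json.loads(call.function.arguments)
    assert args.get("location") == "Paris", f"Unexpected arguments: {args}"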
Evaluating Tool Calls with JudgeAgent Criteria
For more nuanced or subjective tool call evaluation, use JudgeAgent with natural language criteria.
result = await scenario.run(
    ...,
    agents=[
        WeatherAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=[
            "The agent should call the get_current_weather tool with the correct location (Rome) and use the result to answer the user's question."
        ]),
    ],
    script=[
        scenario.user("Should I bring an umbrella to Rome?"),
        scenario.agent(),
        scenario.judge(),
    ],
)
Mocking or Scripting Tool Call Results
To make your tests deterministic and avoid backend setup, inject a tool response directly using scenario.message().
import pytest
import scenario
import litellm
from function_schema import get_function_schema


@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_mocked_weather_agent_tool():
    # Integrate with your agent
    class WeatherAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            return weather_agent(input.messages)

    # Run the scenario
    result = await scenario.run(
        name="checking the weather",
        description="""
            The user is planning a boat trip from Barcelona to Rome,
            and is wondering what the weather will be like.
        """,
        agents=[
            WeatherAgent(),
            scenario.UserSimulatorAgent(model="openai/gpt-4.1"),
        ],
        script=[
            scenario.message(
                {"role": "user", "content": "What's the weather in Paris?"}
            ),
            scenario.message(
                {
                    "role": "assistant",
                    "content": None,
                    "tool_calls": [
                        {
                            "id": "call_123",
                            "function": {
                                "name": "get_current_weather",
                                "arguments": '{"location": "Paris"}',
                            },
                            "type": "function",
                        }
                    ],
                }
            ),
            scenario.message(
                {
                    "role": "tool",
                    "tool_call_id": "call_123",
                    "content": "The weather in Paris is sunny and 75°F.",
                }
            ),
            scenario.agent(),
            scenario.succeed(),
        ],
        set_id="python-examples",
    )

    # Assert the simulation was successful
    assert result.success
# Example agent implementation, without any frameworks
import random

import litellm
import scenario
from function_schema import get_function_schema


def get_current_weather(city: str) -> str:
    """
    Get the current weather in a given city.

    Args:
        city: The city to get the weather for.

    Returns:
        The current weather in the given city.
    """
    choices = [
        "sunny",
        "cloudy",
        "rainy",
        "snowy",
    ]
    temperature = random.randint(0, 30)
    return f"The weather in {city} is {random.choice(choices)} with a temperature of {temperature}°C."


@scenario.cache()
def weather_agent(messages, response_messages=[]) -> scenario.AgentReturnTypes:
    tools = [
        get_current_weather,
    ]
    response = litellm.completion(
        model="openai/gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": """
                    You are a helpful assistant that may help the user with weather information.
                    Do not guess the city if they don't provide it.
                """,
            },
            *messages,
            *response_messages,
        ],
        tools=[
            {"type": "function", "function": get_function_schema(tool)}
            for tool in tools
        ],
        tool_choice="auto",
    )
    message = response.choices[0].message  # type: ignore
    return [*response_messages, message]  # type: ignore
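The same scripting approach can simulate a failing tool, which is useful for testing how the agent recovers from errors. The sketch below is illustrative rather than prescriptive: it reuses the WeatherAgent adapter and the scripting primitives shown above, and the scenario name, tool call id, and judge criterion wording are assumptions for the example.

# A minimal sketch: script an error result for the tool call and let the agent respond to it
result = await scenario.run(
    name="weather tool failure",
    description="The user asks for the weather, but the weather service is down.",
    agents=[
        WeatherAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=[
            "The agent should acknowledge that the weather lookup failed instead of inventing a forecast."
        ]),
    ],
    script=[
        scenario.message({"role": "user", "content": "What's the weather in Paris?"}),
        scenario.message(
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_err_1",
                        "function": {
                            "name": "get_current_weather",
                            "arguments": '{"location": "Paris"}',
                        },
                        "type": "function",
                    }
                ],
            }
        ),
        scenario.message(
            {
                "role": "tool",
                "tool_call_id": "call_err_1",
                "content": "Error: weather service unavailable (503).",
            }
        ),
        scenario.agent(),  # The agent should handle the error gracefully
        scenario.judge(),
    ],
)
assert result.success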
Summary of Tool Call Testing Approaches
Designing robust scenario tests for tool-using agents requires more than checking whether a tool was called. The following practices help keep your tests reliable, maintainable, and faithful to real-world agent behavior, covering everything from argument validation to error handling and deterministic scripting.
- Add assertion functions directly to your script to check tool call behavior at the right moment.
- Check tool call arguments to verify the agent is using tools correctly, not just that a call was made.
- Simulate tool failures or edge cases by scripting tool responses, making your tests robust to error handling.
- Use JudgeAgent with natural language criteria for nuanced or subjective tool call evaluation.
- Script tool responses with scenario.message() to make tests deterministic and avoid external dependencies.