Testing Tool Calls in Scenarios
Tool calls are a core part of modern agent workflows. This guide covers how to write scenario tests that verify tool usage, how to assert on tool call behavior, and how to mock or script tool call results for robust, deterministic tests.
Checking for Tool Calls
To verify that your agent makes the correct tool call, use the state.has_tool_call("tool_name") API in an assertion function. In this framework, assertion functions are placed directly in the script list as steps, after the relevant agent and user turns; this is the idiomatic and supported way to check for tool calls in your scenario tests.
import pytest
import scenario


@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_weather_agent_tool_call():
    class WeatherAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            # Your agent logic that triggers a tool call
            return await my_weather_agent.process(input.messages)

    # Define a custom assertion for the tool call
    def check_for_weather_tool_call(state: scenario.ScenarioState):
        assert state.has_tool_call("get_current_weather")

    result = await scenario.run(
        name="weather tool call",
        description="User asks for the weather in Paris.",
        agents=[
            WeatherAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent should call the weather tool with the correct location"
            ]),
        ],
        script=[
            scenario.user(),
            scenario.agent(),
            check_for_weather_tool_call,  # Assertion function as a script step
            scenario.succeed(),
        ],
    )

    assert result.success
Asserting Tool Call Arguments
Sometimes you need to check not just that a tool was called, but that it was called with the correct arguments.
def check_tool_call_args(state):
    tool_calls = state.latest_agent_message().tool_calls
    assert tool_calls, "No tool calls found"
    assert tool_calls[0].function.name == "get_current_weather"
    assert "Paris" in tool_calls[0].function.arguments


result = await scenario.run(
    ...,
    script=[
        scenario.user("What's the weather in Paris?"),
        scenario.agent(),
        check_tool_call_args,
        scenario.succeed(),
    ],
)
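Because function.arguments is a JSON-encoded string, a substring check like the one above can pass on partial matches. For exact matching, a minimal sketch (reusing the same state accessors as the example above, and assuming the tool takes a single location argument) is to parse the arguments first:

import json


def check_tool_call_args_strict(state):
    tool_calls = state.latest_agent_message().tool_calls
    assert tool_calls, "No tool calls found"
    call = tool_calls[0]
    assert call.function.name == "get_current_weather"
    # Parse the JSON-encoded arguments and compare exact values
    args = json.loads(call.function.arguments)
    assert args.get("location") == "Paris", f"Unexpected arguments: {args}"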
Evaluating Tool Calls with JudgeAgent Criteria
For more nuanced or subjective tool call evaluation, use JudgeAgent with natural language criteria.
result = await scenario.run(
    ...,
    agents=[
        WeatherAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=[
            "The agent should call the get_current_weather tool with the correct location (Rome) and use the result to answer the user's question."
        ]),
    ],
    script=[
        scenario.user("Should I bring an umbrella to Rome?"),
        scenario.agent(),
        scenario.judge(),
    ],
)
Mocking or Scripting Tool Call Results
To make your tests deterministic and avoid backend setup, inject a tool response directly using scenario.message().
import pytest
import scenario
import litellm
from function_schema import get_function_schema


@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_mocked_weather_agent_tool():
    # Integrate with your agent
    class WeatherAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            return weather_agent(input.messages)

    # Run the scenario
    result = await scenario.run(
        name="checking the weather",
        description="""
            The user is planning a boat trip from Barcelona to Rome,
            and is wondering what the weather will be like.
        """,
        agents=[
            WeatherAgent(),
            scenario.UserSimulatorAgent(model="openai/gpt-4.1"),
        ],
        script=[
            scenario.message(
                {"role": "user", "content": "What's the weather in Paris?"}
            ),
            scenario.message(
                {
                    "role": "assistant",
                    "content": None,
                    "tool_calls": [
                        {
                            "id": "call_123",
                            "function": {
                                "name": "get_current_weather",
                                "arguments": '{"location": "Paris"}',
                            },
                            "type": "function",
                        }
                    ],
                }
            ),
            scenario.message(
                {
                    "role": "tool",
                    "tool_call_id": "call_123",
                    "content": "The weather in Paris is sunny and 75°F.",
                }
            ),
            scenario.agent(),
            scenario.succeed(),
        ],
        set_id="python-examples",
    )

    # Assert the simulation was successful
    assert result.success
# Example agent implementation, without any frameworks
import random

import litellm
import scenario
from function_schema import get_function_schema


def get_current_weather(city: str) -> str:
    """
    Get the current weather in a given city.

    Args:
        city: The city to get the weather for.

    Returns:
        The current weather in the given city.
    """
    choices = [
        "sunny",
        "cloudy",
        "rainy",
        "snowy",
    ]
    temperature = random.randint(0, 30)
    return f"The weather in {city} is {random.choice(choices)} with a temperature of {temperature}°C."


@scenario.cache()
def weather_agent(messages, response_messages=[]) -> scenario.AgentReturnTypes:
    tools = [
        get_current_weather,
    ]
    response = litellm.completion(
        model="openai/gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": """
                    You are a helpful assistant that may help the user with weather information.
                    Do not guess the city if they don't provide it.
                """,
            },
            *messages,
            *response_messages,
        ],
        tools=[
            {"type": "function", "function": get_function_schema(tool)}
            for tool in tools
        ],
        tool_choice="auto",
    )
    message = response.choices[0].message  # type: ignore
    return [*response_messages, message]  # type: ignore
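The same scripting approach can simulate a failing tool, which is useful for testing how the agent recovers from errors. The sketch below is illustrative rather than prescriptive: it reuses the WeatherAgent adapter and the scripting primitives shown above, and the scenario name, tool call id, and judge criterion wording are assumptions for the example.

# A minimal sketch: script an error result for the tool call and let the agent respond to it
result = await scenario.run(
    name="weather tool failure",
    description="The user asks for the weather, but the weather service is down.",
    agents=[
        WeatherAgent(),
        scenario.UserSimulatorAgent(),
        scenario.JudgeAgent(criteria=[
            "The agent should acknowledge that the weather lookup failed instead of inventing a forecast."
        ]),
    ],
    script=[
        scenario.message({"role": "user", "content": "What's the weather in Paris?"}),
        scenario.message(
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_err_1",
                        "function": {
                            "name": "get_current_weather",
                            "arguments": '{"location": "Paris"}',
                        },
                        "type": "function",
                    }
                ],
            }
        ),
        scenario.message(
            {
                "role": "tool",
                "tool_call_id": "call_err_1",
                "content": "Error: weather service unavailable (503).",
            }
        ),
        scenario.agent(),  # The agent should handle the error gracefully
        scenario.judge(),
    ],
)
assert result.success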
Summary of Tool Call Testing Approaches
Designing robust scenario tests for tool-using agents requires more than checking whether a tool was called. The following practices help keep your tests reliable, maintainable, and faithful to real-world agent behavior, covering everything from argument validation to error handling and deterministic scripting.
- Add assertion functions directly to your script to check tool call behavior at the right moment.
- Check tool call arguments to verify the agent is using tools correctly, not just that a call was made.
- Simulate tool failures or edge cases by scripting tool responses, making your tests robust to error handling.
- Use JudgeAgent with natural language criteria for nuanced or subjective tool call evaluation.
- Script tool responses with scenario.message() to make tests deterministic and avoid external dependencies.