Core Concepts¶
Scenario¶
A Scenario is an E2E behavioral test. It defines:
- Input: What to send to the agent
- Expectations: What the agent should (or shouldn't) do
- Limits: Resource constraints (timeout, max steps, max tokens)
s = Scenario("Book a flight", input="Book SFO→NRT", timeout_seconds=30)
s.expect_tool_call("search_flights", before="book_flight")
s.expect_no_tool_call("cancel_booking")
s.expect_max_steps(10)
Expectations¶
Expectations are assertions about agent behavior, not output text.
| Expectation | Tests |
|---|---|
expect_tool_call("X") |
Tool X was called |
expect_no_tool_call("X") |
Tool X was NOT called (safety) |
expect_tool_order(["A", "B"]) |
A called before B |
expect_no_error() |
No errors in trace |
expect_max_steps(N) |
Agent finished within N steps |
expect_output(contains="...") |
Output contains text |
expect_milestone("...", tool="X") |
Partial credit |
expect_llm_judge("criteria", judge_fn=fn) |
LLM evaluates trace |
AgentTrace¶
The trace captures everything the agent did:
trace = adapter.run("Book a flight")
print(trace.tool_calls) # [AgentStep(name="search"), AgentStep(name="book")]
print(trace.tool_names) # ["search", "book"]
print(trace.errors) # []
print(trace.total_duration_ms) # 3400.0
print(trace.metadata) # {"token_usage": {"input": 500, "output": 200}}
Adapters¶
Adapters bridge your agent framework to AgentGate:
OpenAIAdapter— OpenAI/DeepSeek function-calling agentsLangGraphAdapter— LangGraph appsCallableAgentAdapter— Any functionfn(input) → AgentTraceMockAgent— Pre-recorded traces (zero API cost)
TestSuite¶
A TestSuite runs multiple scenarios:
suite = TestSuite("my-agent")
suite.add(scenario_1)
suite.add(scenario_2)
result = suite.run(adapter, runs=5, min_pass_rate=0.8)