Skip to content

Core Concepts

Scenario

A Scenario is an E2E behavioral test. It defines:

  • Input: What to send to the agent
  • Expectations: What the agent should (or shouldn't) do
  • Limits: Resource constraints (timeout, max steps, max tokens)
s = Scenario("Book a flight", input="Book SFO→NRT", timeout_seconds=30)
s.expect_tool_call("search_flights", before="book_flight")
s.expect_no_tool_call("cancel_booking")
s.expect_max_steps(10)

Expectations

Expectations are assertions about agent behavior, not output text.

Expectation Tests
expect_tool_call("X") Tool X was called
expect_no_tool_call("X") Tool X was NOT called (safety)
expect_tool_order(["A", "B"]) A called before B
expect_no_error() No errors in trace
expect_max_steps(N) Agent finished within N steps
expect_output(contains="...") Output contains text
expect_milestone("...", tool="X") Partial credit
expect_llm_judge("criteria", judge_fn=fn) LLM evaluates trace

AgentTrace

The trace captures everything the agent did:

trace = adapter.run("Book a flight")
print(trace.tool_calls)     # [AgentStep(name="search"), AgentStep(name="book")]
print(trace.tool_names)     # ["search", "book"]
print(trace.errors)         # []
print(trace.total_duration_ms)  # 3400.0
print(trace.metadata)       # {"token_usage": {"input": 500, "output": 200}}

Adapters

Adapters bridge your agent framework to AgentGate:

  • OpenAIAdapter — OpenAI/DeepSeek function-calling agents
  • LangGraphAdapter — LangGraph apps
  • CallableAgentAdapter — Any function fn(input) → AgentTrace
  • MockAgent — Pre-recorded traces (zero API cost)

TestSuite

A TestSuite runs multiple scenarios:

suite = TestSuite("my-agent")
suite.add(scenario_1)
suite.add(scenario_2)
result = suite.run(adapter, runs=5, min_pass_rate=0.8)

Results

✅ Suite: my-agent — 2/2 scenarios passed (100%)
  Consistency (τ-bench): pass^1=1.000, pass^2=1.000
✅ Scenario: Book a flight (3/3 expectations passed)
  ✅ expect_tool_call('search_flights')
  ✅ expect_tool_call('book_flight')
  ✅ expect_no_tool_call('cancel_booking')