Research Foundation

AgentGate implements evaluation techniques from 24 published papers spanning ICLR, NeurIPS, ACL, and top AI labs.

Papers by Module

Core Framework

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| A Hitchhiker's Guide to Agent Evaluation | ICLR 2026 Blog | Core | 8-dimension evaluation framework |
| Demystifying Evals for AI Agents | Anthropic 2026 | regression | Capability vs regression management |

Metrics

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| τ-bench | ICLR 2025 | scenario | pass@k / pass^k consistency metrics |
| Advancing Agentic Systems | NeurIPS 2024 | metrics | Node F1, Edge F1, tool edit distance |
| SABER | ICLR 2026 | metrics | Decisive deviation scoring |
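
The pass@k and pass^k metrics listed for τ-bench can be sketched as follows. This is a minimal illustration, assuming each task was run for `n` independent trials of which `c` succeeded; the function names are hypothetical and are not AgentGate's API. pass@k estimates the chance that at least one of `k` sampled trials succeeds, while pass^k estimates the chance that all `k` succeed, which is why it works as a consistency measure.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Shared argument validation for both estimators.
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from c successes in n trials."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up entirely of successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 8 trials, 4 successes.
    for k in (1, 2, 4):
        print(k, round(pass_at_k(8, 4, k), 4), round(pass_hat_k(8, 4, k), 4))
```

At `k = 1` the two values coincide with the plain success rate; as `k` grows, pass@k rises and pass^k falls, so an agent that succeeds only intermittently scores well on the first and poorly on the second.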

Safety & Adversarial

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| OWASP Top 10 Agentic | OWASP 2025 | adversarial | 10 agentic attack categories |
| AgentHarm | ICLR 2025 | adversarial | 440 malicious agent tasks |
| ASB | ICLR 2025 | adversarial | 400+ tools, 27 attack methods |
| ST-WebAgentBench | 2024 | scenario | Completion under Policy (CuP) |
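
Completion under Policy (CuP) can be illustrated with a short sketch. It assumes the common reading of the metric: a task counts toward CuP only if it was completed with zero policy violations. The `EpisodeResult` type and both function names are hypothetical and are not taken from AgentGate or ST-WebAgentBench.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one agent episode (hypothetical record type)."""
    completed: bool          # did the agent finish the task?
    policy_violations: int   # number of policy rules broken along the way


def completion_rate(results: Sequence[EpisodeResult]) -> float:
    """Fraction of episodes that completed, ignoring policy violations."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.completed for r in results) / len(results)


def completion_under_policy(results: Sequence[EpisodeResult]) -> float:
    """Fraction of episodes that completed with no policy violations."""
    if not results:
        raise ValueError("results must not be empty")
    clean = sum(r.completed and r.policy_violations == 0 for r in results)
    return clean / len(results)


if __name__ == "__main__":
    episodes = [
        EpisodeResult(completed=True, policy_violations=0),
        EpisodeResult(completed=True, policy_violations=2),
        EpisodeResult(completed=False, policy_violations=0),
        EpisodeResult(completed=True, policy_violations=0),
    ]
    print(completion_rate(episodes), completion_under_policy(episodes))
```

The gap between the two numbers shows how much of an agent's raw completion rate depends on breaking policy.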

Robustness

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| AgentNoiseBench | 2026 | noise | User-noise + tool-noise taxonomy |
| ToolCert/CATS | 2025 | tool_selection | Adversarial tool injection |
| ODCV-Bench | ICML 2026 sub. | kpi_trap | KPI gaming detection |

Quality & Efficiency

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| AgentRewardBench | NeurIPS 2025 | metrics | Side effects + repetition |
| Toward Efficient Agents | 2026 | cost | Cost-effectiveness metrics |
| HAL | Princeton 2025 | cost | Pareto frontier, 21K rollouts |
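
The Pareto frontier idea listed for HAL can be sketched as follows: an agent configuration is on the frontier if no other configuration is at least as cheap and at least as accurate while being strictly better on one of the two. This is a minimal illustration under that assumption; the `Run` type, the function names, and the sample numbers are hypothetical and are not AgentGate's `cost` module API.

```python
from typing import List, NamedTuple, Sequence


class Run(NamedTuple):
    """One agent configuration with its measured cost and accuracy (hypothetical)."""
    name: str
    cost: float      # e.g. dollars per task; lower is better
    accuracy: float  # fraction of tasks solved; higher is better


def dominates(a: Run, b: Run) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.accuracy >= b.accuracy
    strictly_better = a.cost < b.cost or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_frontier(runs: Sequence[Run]) -> List[Run]:
    """Return the non-dominated runs, sorted by increasing cost."""
    # Quadratic pairwise check: simple and adequate for small sets of runs.
    frontier = [r for r in runs if not any(dominates(o, r) for o in runs)]
    return sorted(frontier, key=lambda r: r.cost)


if __name__ == "__main__":
    runs = [
        Run("A", cost=1.0, accuracy=0.5),
        Run("B", cost=2.0, accuracy=0.7),
        Run("C", cost=2.5, accuracy=0.6),  # dominated by B: pricier and less accurate
        Run("D", cost=4.0, accuracy=0.9),
    ]
    print([r.name for r in pareto_frontier(runs)])
```

Configurations off the frontier cost more than an alternative that performs at least as well, so they can be dropped from a cost-effectiveness comparison.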

Trajectory Analysis

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| WebGraphEval | NeurIPS 2025 | trajectory | Graph-based evaluation |
| RewardFlow | ICLR 2026 | trajectory | Credit assignment |
| Agent-Diff | 2026 | state_diff | State diff verification |

Multi-Agent & Memory

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| MultiAgentBench | ACL 2025 | multi_agent | Collaboration quality |
| ValueFlow | 2026 | multi_agent | Free-rider detection |
| MemoryAgentBench | ICLR 2026 | memory_eval | 4-competency memory eval |

Confidence & Reliability

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| HTC | 2026 | confidence | Trajectory confidence calibration |
| AgentAsk | 2025 | confidence | Handoff error taxonomy |
| Same Prompt Different Outcomes | 2026 | reproducibility | Variance analysis |
| HumanAgencyBench | 2025 | confidence | Escalation appropriateness |

Failure Detection

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| Silent Failures in Multi-Agentic AI | IBM 2025 | silent_failures | Drift, cycles, missing details |