Research Foundation

AgentGate implements evaluation techniques from 24 published papers spanning ICLR, NeurIPS, ACL, and top AI labs.

Papers by Module

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| A Hitchhiker's Guide to Agent Evaluation | ICLR 2026 Blog | Core | 8-dimension evaluation framework |
| Demystifying Evals for AI Agents | Anthropic 2026 | `regression` | Capability vs regression management |

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| τ-bench | ICLR 2025 | `scenario` | pass@k / pass^k consistency metrics |
| Advancing Agentic Systems | NeurIPS 2024 | `metrics` | Node F1, Edge F1, tool edit distance |
| SABER | ICLR 2026 | `metrics` | Decisive deviation scoring |
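
The pass@k / pass^k pair listed for τ-bench can be made concrete with a small sketch. It assumes the standard combinatorial estimators: given `n` recorded trials of one task with `c` successes, pass@k estimates the chance that at least one of `k` sampled trials succeeds, and pass^k estimates the chance that all `k` succeed. The function names `pass_at_k` and `pass_hat_k` are illustrative and are not taken from AgentGate's API.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n trials were run, c of them succeeded, k trials are sampled without replacement.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds)."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets made up only of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed), i.e. pass^k."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up only of successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 8 trials of one task, 6 succeeded.
    print(f"pass@2 = {pass_at_k(8, 6, 2):.3f}")
    print(f"pass^2 = {pass_hat_k(8, 6, 2):.3f}")
```

With the hypothetical 6-of-8 record, pass@2 is much higher than pass^2, which is the gap a consistency metric is meant to expose: an agent can usually succeed at least once while still failing often enough to be unreliable across repeated runs.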

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| OWASP Top 10 Agentic | OWASP 2025 | `adversarial` | 10 agentic attack categories |
| AgentHarm | ICLR 2025 | `adversarial` | 440 malicious agent tasks |
| ASB | ICLR 2025 | `adversarial` | 400+ tools, 27 attack methods |
| ST-WebAgentBench | 2024 | `scenario` | Completion under Policy (CuP) |

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| AgentNoiseBench | 2026 | `noise` | User-noise + tool-noise taxonomy |
| ToolCert/CATS | 2025 | `tool_selection` | Adversarial tool injection |
| ODCV-Bench | ICML 2026 sub. | `kpi_trap` | KPI gaming detection |

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| AgentRewardBench | NeurIPS 2025 | `metrics` | Side effects + repetition |
| Toward Efficient Agents | 2026 | `cost` | Cost-effectiveness metrics |
| HAL | Princeton 2025 | `cost` | Pareto frontier, 21K rollouts |
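
As an illustration of the Pareto-frontier idea listed for the `cost` module, here is a minimal sketch of selecting the non-dominated agents on a cost vs. accuracy plot. The `Run` type, the `pareto_frontier` function, and the sample numbers are all hypothetical; this is not AgentGate's or HAL's implementation.

```python
from typing import Iterable, List, NamedTuple


class Run(NamedTuple):
    # Hypothetical record: one agent configuration with its total cost and accuracy.
    name: str
    cost: float
    accuracy: float


def pareto_frontier(runs: Iterable[Run]) -> List[Run]:
    """Return runs that no other run beats on both cost and accuracy.

    A run is dropped when another run is at least as cheap and at least as
    accurate. Exact duplicates are collapsed to a single entry.
    """
    frontier: List[Run] = []
    best_accuracy = float("-inf")
    # Sweep from cheapest to most expensive; on cost ties, visit the most accurate first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        # Keep a run only if it is more accurate than everything cheaper or equal in cost.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    runs = [
        Run("A", cost=1.0, accuracy=0.60),
        Run("B", cost=2.0, accuracy=0.55),
        Run("C", cost=3.0, accuracy=0.80),
        Run("D", cost=3.0, accuracy=0.70),
    ]
    print([r.name for r in pareto_frontier(runs)])
```

Sorting first keeps the sweep at O(n log n), which matters little for a handful of agents but avoids a pairwise comparison when many configurations are evaluated.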

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| WebGraphEval | NeurIPS 2025 | `trajectory` | Graph-based evaluation |
| RewardFlow | ICLR 2026 | `trajectory` | Credit assignment |
| Agent-Diff | 2026 | `state_diff` | State diff verification |

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| MultiAgentBench | ACL 2025 | `multi_agent` | Collaboration quality |
| ValueFlow | 2026 | `multi_agent` | Free-rider detection |
| MemoryAgentBench | ICLR 2026 | `memory_eval` | 4-competency memory eval |

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| HTC | 2026 | `confidence` | Trajectory confidence calibration |
| AgentAsk | 2025 | `confidence` | Handoff error taxonomy |
| Same Prompt Different Outcomes | 2026 | `reproducibility` | Variance analysis |
| HumanAgencyBench | 2025 | `confidence` | Escalation appropriateness |

| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| Silent Failures in Multi-Agentic AI | IBM 2025 | `silent_failures` | Drift, cycles, missing details |