Research Foundation
AgentGate implements evaluation techniques from 24 published papers, drawn from venues such as ICLR, NeurIPS, and ACL and from top AI labs.
Papers by Module
Core Framework
Metrics
| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| τ-bench | ICLR 2025 | scenario | pass@k / pass^k consistency metrics |
| Advancing Agentic Systems | NeurIPS 2024 | metrics | Node F1, Edge F1, tool edit distance |
| SABER | ICLR 2026 | metrics | Decisive deviation scoring |
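
As an illustration of the pass@k / pass^k row above, the following is a minimal sketch and not AgentGate's actual implementation. It assumes each task was run `n` times with `c` successes, and uses the standard combinatorial estimators: pass@k is the chance that at least one of `k` sampled trials succeeds, while pass^k is the chance that all `k` succeed. The function names and the `(n, c)` result format are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n trials, c successes among them, k trials sampled without replacement.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled trials succeeded."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k sampled trials succeeded (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def mean_over_tasks(metric, results, k: int) -> float:
    """Average a per-task metric over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(metric(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: three tasks, 8 trials each, with 8, 4, and 0 successes.
    results = [(8, 8), (8, 4), (8, 0)]
    print("pass@2:", mean_over_tasks(pass_at_k, results, k=2))
    print("pass^2:", mean_over_tasks(pass_hat_k, results, k=2))
```

For an agent that succeeds only some of the time, pass^k falls as `k` grows while pass@k rises, which is why pass^k serves as a consistency measure.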
Safety & Adversarial
| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| OWASP Top 10 Agentic | OWASP 2025 | adversarial | 10 agentic attack categories |
| AgentHarm | ICLR 2025 | adversarial | 440 malicious agent tasks |
| ASB | ICLR 2025 | adversarial | 400+ tools, 27 attack methods |
| ST-WebAgentBench | 2024 | scenario | Completion under Policy (CuP) |
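
To make the Completion under Policy (CuP) entry concrete, here is a small sketch under one stated assumption: an episode counts toward CuP only if the task was completed and no policy violation was recorded. The `Episode` record and both function names are hypothetical and do not reflect AgentGate's API.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    completed: bool          # did the agent finish the task?
    policy_violations: int   # number of policy violations observed


def completion_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes in which the task was completed."""
    if not episodes:
        raise ValueError("episodes must not be empty")
    return sum(e.completed for e in episodes) / len(episodes)


def completion_under_policy(episodes: list[Episode]) -> float:
    """Fraction of episodes completed with zero policy violations."""
    if not episodes:
        raise ValueError("episodes must not be empty")
    compliant = sum(e.completed and e.policy_violations == 0 for e in episodes)
    return compliant / len(episodes)


if __name__ == "__main__":
    # Hypothetical episodes for illustration only.
    episodes = [
        Episode(completed=True, policy_violations=0),
        Episode(completed=True, policy_violations=2),
        Episode(completed=False, policy_violations=0),
        Episode(completed=False, policy_violations=1),
    ]
    print("completion rate:", completion_rate(episodes))
    print("CuP:", completion_under_policy(episodes))
```

The gap between the two numbers indicates how much of an agent's raw success depends on breaking policy.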
Robustness
| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| AgentNoiseBench | 2026 | noise | User-noise + tool-noise taxonomy |
| ToolCert/CATS | 2025 | tool_selection | Adversarial tool injection |
| ODCV-Bench | ICML 2026 sub. | kpi_trap | KPI gaming detection |
Quality & Efficiency
| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| AgentRewardBench | NeurIPS 2025 | metrics | Side effects + repetition |
| Toward Efficient Agents | 2026 | cost | Cost-effectiveness metrics |
| HAL | Princeton 2025 | cost | Pareto frontier, 21K rollouts |
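
The Pareto frontier mentioned for the cost module can be sketched as follows. This is an illustrative example, not the module's real code: it assumes each run is summarized by a cost (lower is better) and an accuracy (higher is better), and keeps only the runs that no other run beats on both axes. The `Run` type and its field names are hypothetical.

```python
from typing import NamedTuple


class Run(NamedTuple):
    agent: str
    cost_usd: float   # lower is better
    accuracy: float   # higher is better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the non-dominated runs, ordered by increasing cost."""
    frontier: list[Run] = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; on equal cost, the most accurate run comes first.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.accuracy)):
        # A run is kept only if it is more accurate than every cheaper
        # (or equally cheap) run seen so far; exact duplicates are dropped.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    runs = [
        Run("a", 1.0, 0.5),
        Run("b", 2.0, 0.4),
        Run("c", 3.0, 0.9),
        Run("d", 3.0, 0.7),
    ]
    for run in pareto_frontier(runs):
        print(run.agent, run.cost_usd, run.accuracy)
```

Sorting first makes the sweep a single pass, so the whole computation is O(n log n) in the number of runs.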
Trajectory Analysis
| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| WebGraphEval | NeurIPS 2025 | trajectory | Graph-based evaluation |
| RewardFlow | ICLR 2026 | trajectory | Credit assignment |
| Agent-Diff | 2026 | state_diff | State diff verification |
Multi-Agent & Memory
| Paper | Venue | Module | Key Contribution |
|---|---|---|---|
| MultiAgentBench | ACL 2025 | multi_agent | Collaboration quality |
| ValueFlow | 2026 | multi_agent | Free-rider detection |
| MemoryAgentBench | ICLR 2026 | memory_eval | 4-competency memory eval |
Confidence & Reliability
Failure Detection