15 May 2026

How can we best evaluate agentic AI?

On October 14, 2025, a workshop convened experts to address critical gaps in evaluating agentic AI — systems that operate autonomously, interact with environments, and pursue open-ended goals, unlike static or narrowly scoped models. Developing a research roadmap for measuring such systems is essential for building evidence-based governance frameworks.

A first theme was definitional: there is no shared definition of "agentic AI," and participants suggested treating agency as a spectrum rather than a binary property. A second theme concerned measurement: agentic systems behave stochastically, and their performance cannot be fully characterized by contained benchmarks, so real-world field testing and domain-specific assessments are needed. Challenges inherited from large language model evaluation, such as training data contamination and overfitting to benchmark tasks, are exacerbated in the agentic setting.

Looking ahead, future research must apply measurement science to AI, simulate human-agent interaction, and evaluate memory-enabled personalized agents, long-horizon tasks, and multi-agent systems to support effective governance.
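One practical consequence of the stochasticity point is that a single run of an agent tells you little; evaluations instead aggregate many rollouts and report uncertainty alongside the point estimate. The sketch below illustrates this with a simulated agent — the `run_agent` function and its 70% underlying success rate are hypothetical stand-ins, not part of any benchmark discussed at the workshop:

```python
import random
import statistics

def run_agent(task_seed: int) -> bool:
    # Stand-in for one agent rollout; a real evaluation would execute
    # the agent in its environment. Success is simulated as stochastic
    # with a hypothetical 70% underlying success rate.
    rng = random.Random(task_seed)
    return rng.random() < 0.7

def evaluate(n_trials: int = 200, n_boot: int = 1000, seed: int = 0):
    """Estimate success rate over repeated rollouts, with a
    bootstrap 95% confidence interval to quantify uncertainty."""
    rng = random.Random(seed)
    outcomes = [run_agent(rng.randrange(10**9)) for _ in range(n_trials)]
    mean = statistics.fmean(outcomes)
    # Resample the observed outcomes to estimate sampling variability.
    boots = sorted(
        statistics.fmean(rng.choices(outcomes, k=n_trials))
        for _ in range(n_boot)
    )
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
    return mean, (lo, hi)

mean, (lo, hi) = evaluate()
print(f"success rate: {mean:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Reporting an interval rather than a bare score is one small step toward the measurement-science rigor the workshop called for; it does not address the harder problems of field testing or multi-agent evaluation.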
