AI Evaluation: Better Benchmarks and Beyond

Amelia Hardy

Bryan Maygers

Researchers, tech companies, and news media outlets frequently discuss AI model performance in terms of benchmarks. But what do these benchmarks really tell us?

Amelia Hardy, a researcher at Stanford University, will examine how benchmarks are used to inform real-world decisions. Hardy will explore their strengths and limitations, introduce best practices for building more reliable and informative benchmarks, and discuss alternative evaluation methods that go beyond benchmarking. SSIR deputy digital editor Bryan Maygers will moderate a short Q&A after Hardy’s presentation.