
How to build a better AI benchmark
Russell Brandom
May 8, 2025 (updated May 13, 2025)
Developers of these coding agents aren’t necessarily doing anything as straightforward as cheating, but they’re crafting approaches that are too neatly tailored to the specifics of the benchmark.
Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value.
OpenAI cofounder Andrej Karpathy recently described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring AI capabilities than it once did, and no clear path to better ones.
“Historically, benchmarks were the way we evaluated AI systems,” says Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI.
“I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”