Can we fix AI’s evaluation crisis?

Caiwei Chen

created: June 24, 2025, 8:50 a.m. | updated: June 24, 2025, 9:21 p.m.

A growing number of teams around the world are trying to address the AI evaluation crisis. One new benchmark draws its problems from international algorithmic olympiads, competitions for elite high school and university programmers in which participants solve challenging problems without external tools. The top AI models currently manage only about 53% at first pass on its medium-difficulty problems and 0% on the hardest ones. Another evaluation effort is split into two tracks. The first evaluates technical reasoning skills by testing a model’s STEM knowledge and ability to carry out Chinese-language research. The second aims to assess practical usefulness: how well a model performs on tasks in fields like recruitment and marketing.

5 months, 2 weeks ago: MIT Technology Review