Humanity’s Last Exam, a Groundbreaking New Benchmark

Accepted submission by AnonTechie at 2025-01-29 09:15:45
/dev/random

Scale AI and CAIS Unveil Results of Humanity’s Last Exam, a Groundbreaking New Benchmark

Scale AI and the Center for AI Safety (CAIS) are proud to publish the results of Humanity's Last Exam, a groundbreaking new AI benchmark designed to test the limits of AI knowledge at the frontiers of human expertise. The results demonstrated a significant improvement over the reasoning capabilities of earlier models, but current models still answered fewer than 10 percent of the expert questions correctly. The paper can be read here [arxiv.org].

The new benchmark, called "Humanity's Last Exam," evaluated whether AI systems have achieved world-class, expert-level reasoning and knowledge capabilities across a wide range of fields, including math, the humanities, and the natural sciences. Throughout the fall, CAIS and Scale AI crowdsourced questions from experts to assemble the hardest and broadest set of problems capable of stumping AI models. The exam was developed to address the challenge of "benchmark saturation": models regularly achieve near-perfect scores on existing tests, yet may not be able to answer questions outside of those tests. Saturation reduces the utility of a benchmark as a precise measurement of future model progress.

[Source]: Scale AI [scale.com]


Original Submission