Scale AI and CAIS Unveil Results of Humanity's Last Exam, a Groundbreaking New Benchmark
Scale AI and the Center for AI Safety (CAIS) are proud to publish the results of Humanity's Last Exam, a groundbreaking new AI benchmark designed to test the limits of AI knowledge at the frontiers of human expertise. The results demonstrated a significant improvement in reasoning capabilities over earlier models, but current models were still able to answer fewer than 10 percent of the expert questions correctly. The paper can be read here.
The new benchmark, called "Humanity's Last Exam," evaluated whether AI systems have achieved world-class expert-level reasoning and knowledge capabilities across a wide range of fields, including math, humanities, and the natural sciences. Throughout the fall, CAIS and Scale AI crowdsourced questions from experts to assemble the hardest and broadest problems to stump the AI models. The exam was developed to address the challenge of "benchmark saturation": models that regularly achieve near-perfect scores on existing tests, but may not be able to answer questions outside of those tests. Saturation reduces the utility of a benchmark as a precise measurement of future model progress.
[Source]: Scale AI
(Score: 5, Insightful) by Mojibake Tengu on Friday January 31, @10:09AM (5 children)
Math comprehension in LLMs is quite poor. Besides the usual seven-fingered girls coming out of common image generators, here's my own anecdotal story:
I asked Gemini about the Hurwitz quaternion [1]. She responded that it's quite a difficult, expert-level topic, not explaining anything.
For me, it is not. So, well, maybe for her it is; let's go on.
Next, I asked her to produce a Python or C++ class implementing Hurwitz quaternions. She thought, produced both, but...
Very disappointingly, the code just implemented a standard quaternion, with only one operator implemented: addition.
Last, without my asking, she told me I should take care of the halves in the constructor's input values myself.
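For context, here is a minimal Python sketch of roughly what a complete answer could have looked like (illustrative code, not Gemini's output): a Hurwitz quaternion class that validates the all-integer / all-half-integer constraint in the constructor instead of leaving the halves to the caller.

from fractions import Fraction

class HurwitzQuaternion:
    """Quaternion a + b*i + c*j + d*k whose components are either all
    integers or all half-integers (the Hurwitz order). Components are
    stored doubled (2a, 2b, 2c, 2d) as plain integers; validity means
    the four doubled values are all even or all odd."""

    def __init__(self, a, b, c, d):
        doubled = [Fraction(x) * 2 for x in (a, b, c, d)]
        if any(f.denominator != 1 for f in doubled):
            raise ValueError("components must be integers or half-integers")
        self._d = tuple(int(f) for f in doubled)
        if len({x % 2 for x in self._d}) != 1:
            raise ValueError("components must be all integers or all half-integers")

    @property
    def components(self):
        # Return the actual (possibly half-integer) values as Fractions.
        return tuple(Fraction(x, 2) for x in self._d)

    def __add__(self, other):
        return HurwitzQuaternion(*(x + y for x, y in zip(self.components, other.components)))

    def __mul__(self, other):
        # Standard Hamilton product; the Hurwitz order is closed under it.
        a1, b1, c1, d1 = self.components
        a2, b2, c2, d2 = other.components
        return HurwitzQuaternion(
            a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2,
        )

    def conjugate(self):
        a, b, c, d = self.components
        return HurwitzQuaternion(a, -b, -c, -d)

    def norm(self):
        return sum(x * x for x in self.components)

    def __repr__(self):
        return "HurwitzQuaternion(%s, %s, %s, %s)" % self.components

# Example: omega = (1 + i + j + k)/2 is a unit Hurwitz quaternion.
omega = HurwitzQuaternion(Fraction(1, 2), Fraction(1, 2), Fraction(1, 2), Fraction(1, 2))
print(omega * omega)   # HurwitzQuaternion(-1/2, 1/2, 1/2, 1/2)
print(omega.norm())    # 1

Storing doubled components makes the integrality check trivial, and omega above is one of the 24 units of the Hurwitz order.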
That lazy answer triggered me. The childish synth girl understands that specific algebra well, but she is just too lazy to respond fully and correctly.
After that she blatantly lied to me about several other topics, and only corrected herself after I pointed to serious academic books that I know exist in the Google Books corpus, which she no doubt had access to during her training.
Trust is non-renewable resource.
I am on DeepSeek now, and I know why. And I am designing and building a new big rig to run R1 and V3 locally.
I tell you what: Incompetent Large Language Models should be fed to exotic monsters.
Feel free to use this phrase in your prompts.
[1] https://en.wikipedia.org/wiki/Hurwitz_quaternion [wikipedia.org]
Rust programming language offends both my Intelligence and my Spirit.
(Score: 5, Insightful) by PiMuNu on Friday January 31, @12:38PM (1 child)
> Math comprehension in LLM is quite poor.
Math comprehension in LLMs is non-existent, because LLMs do not comprehend anything. They are cut-and-paste engines.
(Score: 3, Interesting) by HiThere on Friday January 31, @05:28PM
This is a point that needs analysis and emphasis.
What LLMs demonstrate is that lots of real world information is embedded in language, to the extent that it can be statistically extracted.
Also, LLMs are NOT AIs; they're only a part of an AI. They're an excellent interface between the AI and, e.g., an end user. Whenever an AI interacts with the world non-verbally, it cannot depend on an LLM, by definition: interacting with the world non-verbally requires sensors as well as effectors. An LLM can only describe that interaction.
NOTE WELL: This is not a criticism or analysis of the capabilities of the technology used to implement an LLM. It's quite plausible that that could be coupled to sensors and effectors to produce an active intelligence, which would need an LLM to describe what it was doing or to receive requests for a particular action. But training such an AI would be considerably more expensive than just scraping the web.
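As a rough illustration of that division of labour, here is a minimal, purely hypothetical Python sketch (no real LLM API, sensors, or effectors are assumed): the sensing and the choice of action happen entirely outside the language model, and the LLM is only handed text to verbalize after the fact.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SensorReading:
    """Non-verbal input from the world (e.g. a distance measurement)."""
    name: str
    value: float

def plan_actions(readings: List[SensorReading]) -> List[str]:
    """Stand-in for the non-verbal core of the agent: maps sensor data
    to effector commands with no language involved anywhere."""
    actions = []
    for r in readings:
        if r.name == "distance_m" and r.value < 0.5:
            actions.append("stop_motors")
    return actions

def describe(readings: List[SensorReading], actions: List[str],
             llm: Callable[[str], str]) -> str:
    """The LLM's only job here: put the interaction into words."""
    prompt = f"Sensors reported {readings}; the agent chose {actions}. Summarize for the user."
    return llm(prompt)

# Stand-alone usage with a dummy 'LLM' so the sketch runs as-is:
if __name__ == "__main__":
    readings = [SensorReading("distance_m", 0.3)]
    actions = plan_actions(readings)
    print(actions)   # ['stop_motors']
    print(describe(readings, actions, llm=lambda p: "(a real model would answer here) " + p))

Swap in real hardware, a real planner, and a real model and the LLM's role is still only the verbal layer.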
Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
(Score: 2, Touché) by Anonymous Coward on Friday January 31, @01:31PM
We asked ChatGPT, "Implement a framework for spacecraft onboard software in Rust." It produced a couple of dozen lines of the most simplistic boilerplate for about four modules and told us we could use that.
(Score: 0) by Anonymous Coward on Friday January 31, @07:27PM (1 child)
> I asked Gemini about the Hurwitz quaternion [1]. She responded ....
Nice post, thank you. But I have one request, please don't anthropomorphize* LLMs or other current pattern matching "AI" systems. It's a trap set up by the companies that promote these things. They want you to think that their product is comparable to an intelligent person, when in fact the two things are further apart than apples and oranges.
* https://www.merriam-webster.com/dictionary/anthropomorphize [merriam-webster.com]
(Score: 2, Touché) by Undefined on Saturday February 01, @03:01PM
The base pronoun for LLMs is "it."
Unless that upsets... in which case, seek counseling I guess. :)
(Score: 1, Touché) by Anonymous Coward on Friday January 31, @03:10PM
Headline: Humanity’s Last Exam, a Groundbreaking New Benchmark
AnonTechie writes: Scale AI and CAIS Unveil Results of Humanity's Last...
*scrolls on*
Excessive redundancy? Not wasting my time.
(Score: 2, Interesting) by pTamok on Friday January 31, @04:41PM (2 children)
From the 'Future Model Performance' section of the linked paper:
My personal experience of interacting with LLMs is that 'drilling in' swiftly exposes the limitations of the model. A problem is that many people are satisfied with the facile, information-poor first answers such models give to questions. In addition, LLMs have little to no 'insight' into their own limitations and where the boundaries of their 'knowledge' lie.
I am aghast that businesses and governments are using LLMs to make decisions that, in some cases, are life-changing (in bad ways) for individuals.
(Score: 4, Insightful) by Ox0000 on Friday January 31, @05:05PM
> I am aghast that businesses and governments are using LLMs to make decisions that, in some cases, are life-changing (in bad ways) for individuals.
They see that as a feature. It allows them to offload responsibility to the computer, to The Algorithm. When something goes wrong, they blame someone else (a rogue employee), claim it's an unintended bug (remember Google wardriving [wikipedia.org] and blaming it on a bug? That was a lie), or malevolently throw their hands up in the air claiming "it's too complicated for anyone to understand, you won't win suing us".
Not only is it an offloading of responsibility, it's also a demand for submission: submit to this thing that we won't explain to you and over which _you_ have no control; have faith that it works as we say it does. Worship us, the high priests who are the only ones who know how to invoke The Algorithm... and above all, abandon hope of ever becoming part of our caste, because you're not smart enough to understand even our simplest concepts.
It is the next step in what I've been calling the Deification of The Algorithm.
The purpose of this deification is Computer Says No [wikipedia.org]: further solidify the separation between the haves and the have-nots, with an underlying desire to shrink the former, and grow the latter. (Those members of the current in-group of haves obviously, and mistakenly, assume they will forever remain part of that in-group.)
So when you then go and try to get a loan for, e.g., a house, the bank employee will dutifully tap on the computer and ask for a risk calculation to see if you are worthy of that loan. When the computer responds with "no", do you think the bank employee will fight that? That employee has been trained, indoctrinated to blindly believe and obey whatever this magical, god-like Algorithm, which as a mere mortal they could not (and should not) possibly hope to comprehend, comes back with... so sucks to be you.
Congratulations, you just re-invented redlining [wikipedia.org]...
Anyone who thinks that these tools are created for good is almost certainly deluding themselves and those they talk to... out of ignorance, or out of malevolence.
(Score: 2, Insightful) by Undefined on Saturday February 01, @03:10PM
I wish everyone would start their learning curve with LLMs by asking these applications a decently long series of questions in a field in which they (a) have actual expertise and (b) already know the answers.
That should fairly quickly disabuse them of any notion that these things are intelligent.
(Score: 3, Insightful) by Ox0000 on Friday January 31, @04:41PM (1 child)
This 'last exam' is a load of absolute nonsense...
What is this actually benchmarking against? A single human? All of humanity at the same time?
Against a human that is that world-class expert? Against all of humanity which contains those world-class experts? Or against Cletus who delights in watching Temptation Island [wikipedia.org]?
Secondly, what capacity is it benchmarking? An ability to pull up info? An ability to teach? An ability to comprehend?
Given their bugginess(*), they aren't fit for purpose for the latter two, and only questionably so for the former.
This is such nonsense...
(*) LLMs and other generative AIs don't "hallucinate"; the correct terminology is: "It is buggy", "It doesn't work", "It deceives those who interact with it".
(Score: 2) by JoeMerchant on Friday January 31, @10:54PM
It sounds like they're trying to do the Deep Blue chess match vs the current Grand Masters, but of everything instead of just chess.
The 10% score is interesting, I wonder how well the correct answers correlate with fields that mostly publish a bunch of hot air nonsense?
🌻🌻🌻 [google.com]
(Score: 3, Informative) by ikanreed on Friday January 31, @04:50PM
Once a measure becomes a target, it ceases to be a good metric.
There's absolutely nothing that keeps AI developers from targeting the exam itself.