Scale AI and CAIS Unveil Results of Humanity's Last Exam, a Groundbreaking New Benchmark
Scale AI and the Center for AI Safety (CAIS) are proud to publish the results of Humanity's Last Exam, a groundbreaking new AI benchmark designed to test the limits of AI knowledge at the frontiers of human expertise. The results demonstrated a significant improvement in reasoning capabilities over earlier models, but current models were still able to answer fewer than 10 percent of the expert questions correctly. The paper can be read here.
The new benchmark, called "Humanity's Last Exam," evaluated whether AI systems have achieved world-class expert-level reasoning and knowledge capabilities across a wide range of fields, including math, humanities, and the natural sciences. Throughout the fall, CAIS and Scale AI crowdsourced questions from experts to assemble the hardest and broadest problems to stump the AI models. The exam was developed to address the challenge of "benchmark saturation": models that regularly achieve near-perfect scores on existing tests, but may not be able to answer questions outside of those tests. Saturation reduces the utility of a benchmark as a precise measurement of future model progress.
[Source]: Scale AI
(Score: 5, Insightful) by Mojibake Tengu on Friday January 31, @10:09AM (5 children)
Math comprehension in LLMs is quite poor. Besides the usual seven-fingered girls coming out of common image generators, here's my own anecdotal story:
I asked Gemini about the Hurwitz quaternion [1]. She responded that it's quite a difficult, expert-level topic, not explaining anything.
For me, it is not. So, well, maybe for her it is; let's go on.
Next, I asked her to produce a Python or C++ class implementing Hurwitz quaternions. She thought, produced both, but...
Very disappointingly, the code just implemented a standard quaternion, with only one operator implemented: addition.
Last, without my asking, she told me I should take care of the halves in the constructor's input values myself.
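For context, here is a minimal Python sketch of roughly what a complete answer could have looked like (illustrative code, not Gemini's output): a Hurwitz quaternion class that validates the all-integer / all-half-integer constraint in the constructor instead of leaving the halves to the caller.

from fractions import Fraction

class HurwitzQuaternion:
    """Quaternion a + b*i + c*j + d*k whose components are either all
    integers or all half-integers (the Hurwitz order). Components are
    stored doubled (2a, 2b, 2c, 2d) as plain integers; validity means
    the four doubled values are all even or all odd."""

    def __init__(self, a, b, c, d):
        doubled = [Fraction(x) * 2 for x in (a, b, c, d)]
        if any(f.denominator != 1 for f in doubled):
            raise ValueError("components must be integers or half-integers")
        self._d = tuple(int(f) for f in doubled)
        if len({x % 2 for x in self._d}) != 1:
            raise ValueError("components must be all integers or all half-integers")

    @property
    def components(self):
        # Return the actual (possibly half-integer) values as Fractions.
        return tuple(Fraction(x, 2) for x in self._d)

    def __add__(self, other):
        return HurwitzQuaternion(*(x + y for x, y in zip(self.components, other.components)))

    def __mul__(self, other):
        # Standard Hamilton product; the Hurwitz order is closed under it.
        a1, b1, c1, d1 = self.components
        a2, b2, c2, d2 = other.components
        return HurwitzQuaternion(
            a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2,
        )

    def conjugate(self):
        a, b, c, d = self.components
        return HurwitzQuaternion(a, -b, -c, -d)

    def norm(self):
        return sum(x * x for x in self.components)

    def __repr__(self):
        return "HurwitzQuaternion(%s, %s, %s, %s)" % self.components

# Example: omega = (1 + i + j + k)/2 is a unit Hurwitz quaternion.
omega = HurwitzQuaternion(Fraction(1, 2), Fraction(1, 2), Fraction(1, 2), Fraction(1, 2))
print(omega * omega)   # HurwitzQuaternion(-1/2, 1/2, 1/2, 1/2)
print(omega.norm())    # 1

Storing doubled components makes the integrality check trivial, and omega above is one of the 24 units of the Hurwitz order.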
That lazy answer triggered me. The childish synth girl understands that specific algebra well, but she is just too lazy to respond fully and correctly.
After that she blatantly lied to me about several other topics, and only corrected herself after I pointed to serious academic books that I know exist in the Google Books corpus, which she no doubt had access to during her training.
Trust is non-renewable resource.
I am on DeepSeek now, and I know why. And I am designing and building a new big rig to run R1 and V3 locally.
I tell you what: Incompetent Large Language Models should be fed to exotic monsters.
Feel free to use this phrase in your prompts.
[1] https://en.wikipedia.org/wiki/Hurwitz_quaternion [wikipedia.org]
Rust programming language offends both my Intelligence and my Spirit.
(Score: 5, Insightful) by PiMuNu on Friday January 31, @12:38PM (1 child)
> Math comprehension in LLM is quite poor.
Math comprehension in LLMs is non-existent, because LLMs do not comprehend anything. They are cut-and-paste engines.
(Score: 3, Interesting) by HiThere on Friday January 31, @05:28PM
This is a point that needs analysis and emphasis.
What LLMs demonstrate is that lots of real world information is embedded in language, to the extent that it can be statistically extracted.
Also, LLMs are NOT AIs; they're only a part of an AI. They're an excellent interface between the AI and, e.g., an end user. Whenever an AI interacts with the world non-verbally, it cannot depend on an LLM, by definition: interacting with the world non-verbally requires sensors as well as effectors. An LLM can only describe that interaction.
NOTE WELL: This is not a criticism or analysis of the capabilities of the technology used to implement an LLM. It's quite plausible that that could be coupled to sensors and effectors to produce an active intelligence, which would need an LLM to describe what it was doing or to receive requests for a particular action. But training such an AI would be considerably more expensive than just scraping the web.
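As a rough illustration of that division of labour, here is a minimal, purely hypothetical Python sketch (no real LLM API, sensors, or effectors are assumed): the sensing and the choice of action happen entirely outside the language model, and the LLM is only handed text to verbalize after the fact.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SensorReading:
    """Non-verbal input from the world (e.g. a distance measurement)."""
    name: str
    value: float

def plan_actions(readings: List[SensorReading]) -> List[str]:
    """Stand-in for the non-verbal core of the agent: maps sensor data
    to effector commands with no language involved anywhere."""
    actions = []
    for r in readings:
        if r.name == "distance_m" and r.value < 0.5:
            actions.append("stop_motors")
    return actions

def describe(readings: List[SensorReading], actions: List[str],
             llm: Callable[[str], str]) -> str:
    """The LLM's only job here: put the interaction into words."""
    prompt = f"Sensors reported {readings}; the agent chose {actions}. Summarize for the user."
    return llm(prompt)

# Stand-alone usage with a dummy 'LLM' so the sketch runs as-is:
if __name__ == "__main__":
    readings = [SensorReading("distance_m", 0.3)]
    actions = plan_actions(readings)
    print(actions)   # ['stop_motors']
    print(describe(readings, actions, llm=lambda p: "(a real model would answer here) " + p))

Swap in real hardware, a real planner, and a real model and the LLM's role is still only the verbal layer.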
Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
(Score: 2, Touché) by Anonymous Coward on Friday January 31, @01:31PM
We asked ChatGPT, "Implement a framework for spacecraft onboard software in Rust." It produced a couple of dozen lines of the most simplistic boilerplate for about four modules and told us we could use that.
(Score: 0) by Anonymous Coward on Friday January 31, @07:27PM (1 child)
> I asked Gemini about the Hurwitz quaternion [1]. She responded ....
Nice post, thank you. But I have one request, please don't anthropomorphize* LLMs or other current pattern matching "AI" systems. It's a trap set up by the companies that promote these things. They want you to think that their product is comparable to an intelligent person, when in fact the two things are further apart than apples and oranges.
* https://www.merriam-webster.com/dictionary/anthropomorphize [merriam-webster.com]
(Score: 2, Touché) by Undefined on Saturday February 01, @03:01PM
The base pronoun for LLMs is "it."
Unless that upsets... in which case, seek counseling I guess. :)
(Score: 1, Touché) by Anonymous Coward on Friday January 31, @03:10PM
Headline: Humanity’s Last Exam, a Groundbreaking New Benchmark
AnonTechie writes: Scale AI and CAIS Unveil Results of Humanity's Last...
*scrolls on*
Excessive redundancy? Not wasting my time.
(Score: 2, Interesting) by pTamok on Friday January 31, @04:41PM (2 children)
From the 'Future Model Performance' section of the linked paper:
My personal experience of interacting with LLMs is that 'drilling in' swiftly exposes the limitations of the model. A problem is that many people are satisfied with the facile, information-poor first answers such models give to questions. In addition, LLMs have little to no 'insight' into their own limitations and where the boundaries of their 'knowledge' lie.
I am aghast that businesses and governments are using LLMs to make decisions that, in some cases, are life-changing (in bad ways) for individuals.
(Score: 4, Insightful) by Ox0000 on Friday January 31, @05:05PM
> I am aghast that businesses and governments are using LLMs to make decisions that, in some cases, are life-changing (in bad ways) for individuals.
They see that as a feature. It allows them to offload responsibility to the computer, to The Algorithm. When something goes wrong, they blame someone else (a rogue employee), claim it's an unintended bug (remember Google wardriving [wikipedia.org] and blaming it on a bug? That was a lie), or malevolently throw their hands up in the air claiming "it's too complicated for anyone to understand, you won't win suing us".
Not only is it an offloading of responsibility, it's also a demand for submission: submit to this thing that we won't explain to you and over which _you_ have no control; have faith that it works as we say it does. Worship us, the high priests who are the only ones who know how to invoke The Algorithm... and above all, abandon hope of ever becoming part of our caste, because you're not smart enough to understand even our simplest concepts.
It is the next step in what I've been calling the Deification of The Algorithm.
The purpose of this deification is Computer Says No [wikipedia.org]: further solidify the separation between the haves and the have-nots, with an underlying desire to shrink the former, and grow the latter. (Those members of the current in-group of haves obviously, and mistakenly, assume they will forever remain part of that in-group.)
So when you then go and try to get a loan for, e.g., a house, the bank employee will dutifully tap on the computer and ask for a risk calculation to see if you are worthy of that loan. When the computer responds with "no", do you think the bank employee will fight that? That employee has been trained, indoctrinated to blindly believe and obey whatever this magical, god-like Algorithm, which as a mere mortal they could not (and should not) possibly hope to comprehend, comes back with... so sucks to be you.
Congratulations, you just re-invented redlining [wikipedia.org]...
Anyone who thinks that these tools are created for good is almost certainly deluding themselves and those they talk to... out of ignorance, or out of malevolence.
(Score: 2, Insightful) by Undefined on Saturday February 01, @03:10PM
I wish everyone would start their learning curve with LLMs by asking these applications a decently long series of questions in a field in which they (a) have actual expertise and (b) already know the answers.
That should fairly quickly disabuse them of any notion that these things are intelligent.
(Score: 3, Insightful) by Ox0000 on Friday January 31, @04:41PM (1 child)
This 'last exam' is a load of absolute nonsense...
What is this actually benchmarking against? A single human? All of humanity at the same time?
Against a human that is that world-class expert? Against all of humanity which contains those world-class experts? Or against Cletus who delights in watching Temptation Island [wikipedia.org]?
Secondly, what capacity is it benchmarking? An ability to pull up info? An ability to teach? An ability to comprehend?
Given their bugginess(*), they aren't fit for purpose for the latter two, and only questionably so for the former.
This is such nonsense...
(*) LLMs and other generative AIs don't "hallucinate"; the correct terminology is: "It is buggy", "It doesn't work", "It deceives those who interact with it".
(Score: 2) by JoeMerchant on Friday January 31, @10:54PM
It sounds like they're trying to do the Deep Blue chess match vs the current Grand Masters, but of everything instead of just chess.
The 10% score is interesting, I wonder how well the correct answers correlate with fields that mostly publish a bunch of hot air nonsense?
🌻🌻🌻 [google.com]
(Score: 3, Informative) by ikanreed on Friday January 31, @04:50PM
Once a measure becomes a target, it ceases to be a good metric.
There's absolutely nothing that keeps AI developers from targeting the exam itself.