The Economic Times published a hilarious article about a mathematician's opinion of AI for solving math problems:
Renowned mathematician Joel David Hamkins has expressed strong doubts about large language models' utility in mathematical research, calling their outputs "garbage" and "mathematically incorrect". Joel Hamkins, a prominent mathematician and professor of logic at the University of Notre Dame, recently shared his unvarnished assessment of large language models in mathematical research during an appearance on the Lex Fridman podcast. Calling large language models fundamentally useless, he said they give "garbage answers that are not mathematically correct", reports TOI.
Joel David Hamkins is a mathematician and philosopher who undertakes research on the mathematics and philosophy of the infinite. He earned his PhD in mathematics from the University of California at Berkeley and comes to Notre Dame from the University of Oxford, where he was Professor of Logic in the Faculty of Philosophy and the Sir Peter Strawson Fellow of Philosophy at University College, Oxford. Prior to that, he held longstanding positions in mathematics, philosophy, and computer science at the City University of New York.
"I guess I would draw a distinction between what we have currently and what might come in future years," Hamkins began, acknowledging the possibility of future progress. "I've played around with it and I've tried experimenting, but I haven't found it helpful at all. Basically zero. It's not helpful to me. And I've used various systems and so on, the paid models and so on."
Firing a salvo, Joel David Hamkins expressed his frustration with the current AI systems despite experimenting with various models. "I've played around with it and I've tried experimenting, but I haven't found it helpful at all," he stated bluntly.
According to mathematician Joel Hamkins, AI's tendency to be confidently wrong mirrors some of the most frustrating human interactions. What concerns him even more than the occasional mathematical error is how AI systems respond when those errors are highlighted. When Joel David Hamkins points out clear flaws in their reasoning, the models often reply with breezy reassurances such as, "Oh, it's totally fine." That combination of confidence, incorrectness, and resistance to correction threatens the collaborative trust that meaningful mathematical dialogue depends on.
"If I were having such an experience with a person, I would simply refuse to talk to that person again," Hamkins said, noting that the AI's behaviour resembles unproductive human interactions he would actively avoid. He believes when it comes to genuine mathematical reasoning, today's AI systems remain unreliable.
Despite these issues, Hamkins recognizes that current limitations may not be permanent. "One has to overlook these kind of flaws and so I tend to be a kind of skeptic about the value of the current AI systems. As far as mathematical reasoning is concerned, it seems not reliable."
His criticism comes amid mixed reactions within the mathematical community about AI's growing role in research. While some scholars report progress using AI to explore problems from the Erdős collection, others have urged caution. Mathematician Terence Tao, for example, has warned that AI can generate proofs that appear flawless but contain subtle errors no human referee would accept. At the heart of the debate is a persistent gap: strong performance on benchmarks and standardized tests does not necessarily translate into real-world usefulness for domain experts.
(Score: 3, Touché) by Dr Spin on Saturday January 10, @10:36AM (12 children)
All AI output is the result of hallucination.
It may sometimes manage to hallucinate a correct answer, but even drug-crazed lunatics sometimes tell the truth.
This is not the basket to put your eggs in!
Warning: Opening your mouth may invalidate your brain!
(Score: 4, Insightful) by JoeMerchant on Saturday January 10, @02:52PM (11 children)
>All AI output is the result of hallucination.
False.
>It may sometimes manage to hallucinate a correct answer
True, and this is actually a "superpower" of agentic LLM application - when the agent is given the tools to evaluate a correct answer vs an incorrect answer, the agent can iterate until it finds (hallucinates, whatever you want to say) a correct answer.
>This is not the basket to put your eggs in!
Partially true. This is not the basket to put ALL of your eggs in. It is, however, a new and interesting basket that may deliver some eggs to new and occasionally interesting places we haven't found before. The argument: "it only regurgitates its training" overlooks the "superpower" of hallucination, and also synthesis of the training dataset components in ways perhaps not thoroughly explored yet.
Agentic AI/LLMs are a somewhat novel tool, not unlike turning an undertrained intern with ADHD, limited working memory, and severe short term memory limitations loose with Google and letting them run for days - but instead of days, the LLM agents return with similar answers in minutes. If you are good at managing that sort of thing, agentic AI/LLMs can do quite a bit for you.
Novel discoveries will be rare, but if you have a good handle on the top level structure of your tasks and can break them down into appropriately sized chunks, the AI agents can do certain tasks significantly faster than even well trained interns with above average intellectual performance. You do, however, need to be able to recognize the difference between a "good" answer, and one that should be rejected. The better you can codify that distinction, the more autonomously and successfully the agents can operate for you.
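For concreteness, here's that generate-and-check loop in pseudocode-ish Python. generate_candidate() and is_correct() are hypothetical stand-ins for the LLM call and whatever evaluation tool (test suite, compiler, proof checker) the agent has been given:

import random

def generate_candidate(problem, feedback):
    # Placeholder "LLM": just guesses; a real agent would use the feedback
    # from earlier rejections to steer its next attempt.
    return random.randint(0, 100)

def is_correct(problem, candidate):
    # Placeholder evaluator: in practice this is whatever tool lets the
    # agent tell a good answer from a bad one.
    return candidate == problem["answer"]

def solve(problem, max_iterations=1000):
    feedback = None
    for attempt in range(1, max_iterations + 1):
        candidate = generate_candidate(problem, feedback)
        if is_correct(problem, candidate):
            return candidate, attempt
        feedback = f"candidate {candidate} rejected"  # fed into the next try
    return None, max_iterations

print(solve({"answer": 42}))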
🌻🌻🌻 [google.com]
(Score: 2, Insightful) by pTamok on Saturday January 10, @09:09PM (8 children)
>>All AI output is the result of hallucination.
>False.
There is no difference between the process of hallucination and non-hallucination. It uses exactly the same method (prediction of the next token in the sequence). The problem is, it is up to the end-user to determine the difference between hallucination and non-hallucination.
> when the agent is given the tools to evaluate a correct answer vs an incorrect answer, the agent can iterate until it finds (hallucinates, whatever you want to say) a correct answer.
This is not intelligence: this is iterative goal seeking according to predetermined rules. Give wrong rules, get wrong answers. Who, or what, is determining those rules and choosing which ones to use?
LLMs are trained on hecatomyriads of answers, some of which may be true, many of which are false. Expecting correct answers to come out of this mess is the kind of thinking that led Babbage to say "I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."
LLMs are trained to generate a statistical summary of their tokenised inputs. Outputs to prompts are generated by the statistically most likely continuation of the token sequence of the prompt.
To give another example: the On-Line Encyclopedia of Integer Sequences (OEIS) contains 5,907 entries that start with the sequence "1,2,3,4,5,6". If you feed those sequences into an LLM's training set and you ask for the next number in the sequence "1,2,3,4,5", you will get the confident answer "6". But there are 995 instances in the OEIS where the answer is "7" - just not in the LLM's training set. The LLM does not understand mathematics, it is just a structure that contains the statistical weight of each token sequence. The statistically most likely answer generated from the training set's statistics is not necessarily a correct answer.
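To see the failure mode in miniature, here is a toy sketch (the counts are invented for illustration, loosely echoing the figures above): a "model" that only knows continuation frequencies answers by popularity, never by meaning.

from collections import Counter

# Illustrative counts only: "6" dominates the training data, "7" was never
# seen, so it can never be produced no matter which sequence was meant.
continuations_in_training = Counter({"6": 5907, "8": 12})

def most_likely_next(counts):
    # Pick the statistically most likely continuation; whether it is the
    # *intended* sequence never enters into the calculation.
    return counts.most_common(1)[0][0]

print(most_likely_next(continuations_in_training))   # always "6"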
You are correct that LLMs might be a useful tool when used appropriately. Currently, they are being used in the same way as toddlers use chainsaws.
(Score: 2) by JoeMerchant on Saturday January 10, @09:43PM (7 children)
>There is no difference between the process of hallucination and non-hallucination.
For LLMs, dolphins, people, and ant colonies, true.
>This is not intelligence: this is iterative goal seeking according to predetermined rules.
Yes, and it iterates considerably quicker and more accurately than the infinite number of monkeys on typewriters attempting Shakespeare.
>LLMs are trained to generate a statistical summary of their tokenised inputs.
From my perspective, they use a statistical process to generate a likely sounding answer based on their tokenized inputs. They don't produce a statistical anything, they're like conservatives and Buzz Lightyear, they're always SURE! their single answer is THE answer, until you tell them otherwise.
>The LLM does not understand mathematics
The same might be said of mathematicians. They extrapolate based on what they have been taught. Basic, real world tangible mathematics you might "teach yourself" through common experience: the counting numbers, multiplication and division - those are things you can experience in the real world and then abstract into mathematical concepts (tokens). The mathematics OP is into is basically a collective agreement amongst the practitioners in the field that postulate 2845 seems plausible based upon the existing "proven" theorems... all of which boil down to expressions which are "tokenized" and commonly understood to relate to each other by given patterns.
>The statistically most likely answer generated from the training set's statistics is not necessarily a correct answer.
That depends on how well the context of 1,2,3,4,5,? has been established - what other relevant tokens are at play in the query? If the LLM has been trained in the area in question, or can be directed to read up on the topic from reliable sources, then an expected answer of: 1,2,3,4,5,7 will be given, based on how well the query's context enables the LLM to pattern match to the expected answer. Just like when you pose the question to a human.
>Currently, they are being used in the same way as toddlers use chainsaws.
Frequently, yes. Matter of fact, I have given a little talk to our group about use of AI in which I characterize it as a tool, like a chainsaw. Chainsaws cut a lot of wood very quickly, but if you're not confidently and accurately controlling them, you're unlikely to get good results.
🌻🌻🌻 [google.com]
(Score: 1) by pTamok on Saturday January 10, @11:06PM (6 children)
>>There is no difference between the process of hallucination and non-hallucination.
>For LLMs, dolphin, people, and ant colonies, true.
What is your evidence for this? We know how LLMs produce output, and an LLM does not distinguish between hallucination and non-hallucination: they are equivalent. For dolphins, people, and ant colonies we don't actually know how they generate their outputs.
>>This is not intelligence: this is iterative goal seeking according to predetermined rules.
>Yes, and it iterates considerably quicker and more accurately than the infinite number of monkeys on typewriters attempting Shakespeare.
You avoided the point about the question of who makes the rules. Fast iteration towards a goal is not intelligence.
>>The LLM does not understand mathematics
> The same might be said of mathematicians. They extrapolate based on what they have been taught.
It might be said, but it would be false. Searching for proofs aided by computers is an interesting field. We don't know what goes on in the brains of adept mathematicians: if they do extrapolate, it is in ways totally unlike LLMs. Just look at Galois, Euler, and Ramanujan's work.
> That depends on how well the context of 1,2,3,4,5,? has been established - what other relevant tokens are at play in the query?
So the LLM's answer depends on the tokens, and not on understanding the nature of the problem. You never know if you have enough tokens in the training model to give a correct continuation of an arbitrary prompt. Furthermore, subtle variations in a prompt will generate prompt continuations that have opposite meanings, so you never know if the prompt you are offering will elicit 'correct' continuations.
(Score: 2) by JoeMerchant on Saturday January 10, @11:46PM (5 children)
> an LLM does not distinguish between hallucination and non-hallucination: they are equivalent. For dolphins, people, and ant colonies we don't actually know how they generate their outputs.
It's straight up semantics: the experiencer of an hallucination - by definition - does not know that they are hallucinating (at the time). Regardless of how outputs are generated, people, dolphins, and ant colonies can be tricked into believing things that are not true through manipulation of their inputs.
The thing that's "unique" about LLMs is their relative simplicity and transparency - but I'm not aware of any work that has shown distinction between hallucination or non-hallucination based on the inner workings of LLMs, like living things, it's more of a judgement call based on outputs.
>You avoided the point about the question of who makes the rules. Fast iteration towards a goal is not intelligence.
Can intelligence actually be defined? Fast iterations achieving a goal can be indistinguishable from intelligence - passing the Turing test, if you will... The infinite number of monkeys can do it too, but the long time required and quantity of flung poo in the meantime dampen their likelihood of passing Turing.
The best neuroscientific description of "thought" I have encountered to-date involves repetitive neuronal firing patterns. A first layer "lays down a beat" of neuronal firings in space and time which then activates the next layer, based on those spike inputs the second layer activates its own rhythmic space-time pattern of neuronal firings which goes on to a third layer that does the same... and on and on, through I believe the research says, roughly six layers before it reaches the motor response generation stages which choose how to react, or not, to the firing patterns they are receiving. These layers "learn" how to translate their inputs to outputs through training, accumulation of sensory and semantic memories, pain/pleasure inputs, etc. I believe the number of neurons in the adult human brain is around 80 to 100 billion, roughly the same incomprehensibly large number as the number of stars visible/inferred in the Milky Way. Each of those neurons has multiple connections to others in its own, and other layers.
So, yeah, compare that with 200,000 tokens of context processing against a compressed "training set" database... we are different, very different, but I think LLMs are moving closer to the biological models of information processing as compared to traditional deterministic computer software. Do they "think" - IMO, everything from a single celled bacteria that wiggles its flagella in response to photo-chemical changes in its environment, up through pods of Orca deciding when they've had enough seal-pup flipping for one season, they all "think" in their own ways. Does a CdS photocell switching off a street lamp at dawn "think"? To an extent, yes. And: it can also hallucinate - such as when a prankster shuts off the light in the middle of the night by shining a laser pointer on the CdS sensor input.
>We don't know what goes on in the brains of adept mathematicians: if they do extrapolate, it is in ways totally unlike LLMs. Just look at Galois, Euler, and Ramanujan's work.
Hypothesis: LLMs may, actually, be in an advantageous position to do groundbreaking mathematics - precisely because they have new ways of approaching problems, as opposed to dozens of generations of millions of meat-bag mathematicians. LLMs _can_ generate text at a level of complexity similar to significant mathematical work... whether the meatbags' prejudice will permit this new perspective on their work to be adopted into their canon will be an interesting sociological experiment. How long ago was it that the work of a female was considered unacceptable for serious consideration in the field?
>So the LLM's answer depends on the tokens, and not on understanding the nature of the problem
There's no difference. The tokens are the problem statement, the problem statement is "tokenized" as part of the processing chain. If you have prompted the LLM with: "complete the sequence: 1,2,3,4,5, ___?, ___?" it will, most likely, give you the most statistically likely response of: 6,7 - as would most school children these days. It takes an unusually pedantic respondent to answer your question with a question stating "insufficient information, more detail is required for an accurate response." Personally, I think LLMs should be trained to do more of that; their willingness to jump at the most likely answer is a major limitation IMO (tangent: my coding agents have a "plan" procedure they are instructed to follow that reviews requirements for completeness before proceeding to implementation...) But, if you were to prompt an LLM with "complete the deficient numbers sequence: 1,2,3,4,5, ___?, ___?" Well, let's find out, there's a Gemini button on my browser and it says:
"The next two numbers in the sequence of deficient numbers are 7 and 8. A deficient number is a positive integer for which the sum of its proper divisors is less than the number itself. Here is the breakdown of why the sequence continues as it does:1, 2, 3, 4, 5: These are all deficient. For example, the proper divisors of $4$ are $1$ and $2$, and $1 + 2 = 3 4$.6: This is a perfect number because its proper divisors ($1, 2, 3$) sum exactly to $6$. Therefore, it is not deficient and is skipped.7: A prime number; its only proper divisor is $1$. $1 7$, so it is deficient.8: Its proper divisors are $1, 2,$ and $4$. $1 + 2 + 4 = 7 8$, so it is deficient. Thus, the completed sequence is: 1, 2, 3, 4, 5, 7, 8."
🌻🌻🌻 [google.com]
(Score: 1) by pTamok on Sunday January 11, @08:24PM (4 children)
> > an LLM does not distinguish between hallucination and non-hallucination: they are equivalent. For dolphins, people, and ant colonies we don't actually know how they generate their outputs.
> It's straight up semantics: the experiencer of an hallucination - by definition - does not know that they are hallucinating (at the time). Regardless of how outputs are generated, people, dolphins, and ant colonies can be tricked into believing things that are not true through manipulation of their inputs.
You are falling down the wrong rabbit hole here. We know how LLMs continue prompts, and the mechanism is no different whether or not the reader perceives the output as a hallucination. We do not know how organic-based thinking is done. It is true that an organic-based thinker might not know whether what they perceive is a hallucination or not: but external observers generally do, and will prescribe treatment for the person hallucinating. We know LLMs hallucinate. What is the treatment?
>Can intelligence actually be defined?
Well, Turing felt it was a non-question. He devised the Imitation Game, and invited people to consider that if a machine could win, it could be considered as being as intelligent as a human. Note, in Turing's game, for a machine to win, it needs to either believe it is human, or be able to lie well enough to convince humans that it is human. In other words, hallucinate, or lie. Do we really want machines that do this?
> The best neuroscientific description of "thought" I have encountered to-date...
You are confusing what neuroscientists measure with possible solutions to the problem of consciousness. Do you know what a 'philosophical zombie' is? Or Searle's 'Chinese Room' thought experiment?
> Hypothesis: LLMs may, actually, be in an advantageous position to do groundbreaking mathematics - precisely because they have new ways of approaching problems,...
You could be right. How would you test that hypothesis? What experiment can you do to falsify that hypothesis? The original article is about LLMs specifically, not other computer aided proof machines. The anecdotes offered in the article by a mathematician were not supportive of the proposition.
> >So the LLM's answer depends on the tokens, and not on understanding the nature of the problem
>There's no difference. The tokens are the problem statement...
The tokens I am referring to are the ones used in training: and obviously, the ones in the prompt. The LLM cannot offer a different sequence if it has not either had different sequences included in its training corpus, or had discussion about how other sequences exist incorporated into its training corpus. Otherwise, the information does not exist in the statistical summary of token relationships it uses to continue the prompt and generate the response. There's no enlightenment or understanding there. Asking an LLM that has had other sequences (in this case deficient numbers) included in its training corpus tells us nothing.
As it happens, the next number in the sequence following 1,2,3,4,5,6 is 3. It was generated by me rolling a fair die, which just so happened to give me those numbers in order. It took a lot of rolls.
Open-ended "find the next 'whatever' in the sequence" problems are non-problems, because any answer is possible with (sufficiently convoluted) reasoning. When I was younger and did many tests with that type of question, I found them very irritating, as the question for me was which answer is the one people expect, not what answer is possible - original thinking was penalised. I would pass the time before such tests completed by working out justifications for each of the proffered options after choosing the one that appeared to me to be the obvious one for other people.
(Score: 2) by JoeMerchant on Monday January 12, @12:17AM (3 children)
> We know how LLMs continue prompts
Yes, we are the creators, just like we know how CdS cells detect light to control lamp outputs when so employed. Actually, we _think_ we know these things, but all too often the software or circuitry is more subtly complex than many of us realize and unanticipated "emergent behaviors" happen.
But, does the mystery really matter? If a cadre of scientists reproduce each others' studies and proclaim that they "now fully understand how ant colonies make their decisions about X, Y and Z" - does that change anything? What if we then proceed on the assumption that these scientists are right, and they later turn out to be wrong?
>might not know whether what they perceive is a hallucination or not: but external observers generally do
From my experience, shared hallucinations are all too common, particularly in political endeavors...
>Note, in Turing's game, for a machine to win, it needs to either believe it is human, or be able to lie well enough to convince humans that it is human. In other words, hallucinate, or lie. Do we really want machines that do this?
Note, Turing's game was proposed when computing power was many many orders of magnitude less than it is today. That ratio of human neuron-connections to machine bits and connections was astronomically higher, at the time. We're still nowhere near parity, but compared to Turing's time, we're getting close.
I would propose scratching the term "human" from the debate and replace it with "intelligent being." Do we want a machine that believes it is an intelligent being, and acts like one - well enough to convince us that it is? Ultimately, yes, I believe the industry as a whole has been working feverishly towards this goal (among others) my entire life.
>Or Searle's 'Chinese Room' thought experiment?
No, and I'd wager that the LLMs which wrote "The Activation Protocol" interactive fiction don't really know what the 'Chinese Room' is all about, either, but that didn't stop them from including it in their narrative.
>What experiment can you do to falsify that hypothesis? The original article is about LLMs specifically, not other computer aided proof machines.
I would assume the experiment is underway as we speak: AI/LLMs banging away at mathematical problems / proofs preparing articles for consideration for publication... if no article ever gets accepted, there are still multiple explanations, including prejudices of the community, resistance to change, etc. Ultimately, the infinite number of monkeys are likely to win out and get something accepted by the community, eventually. It may be a long time from now, it might have happened while I was typing this...
>The LLM cannot offer a different sequence if it has not either had different sequences included in its training corpus, or had discussion about how other sequences exist incorporated into its training corpus.
Are you familiar with RAG?
>As it happens, the next number in the sequence following 1,2,3,4,5,6 is 3. It was generated by me rolling a fair die, which just so happened to give me those numbers in order. It took a lot of rolls.
As it happens, your question is bullshit - unanswerable by any intelligent being as stated. The only way to "win" that encounter is with the counter question demanding additional definition.
>the question for me was which answer is the one people expect
I encountered a lot of "tests" in school which were like this. Taken out of context of the teacher who poses the question, who previously assigned reading in the textbook, wherein an obscure caption of an illustration contained some bizarre note which makes sense nowhere else in human experience, the answer of the question would be X. But in context of the test being in that class at that time, the answer is Y - because you read the caption, didn't you? I don't know if I actually had, or just imagine that I had, a professor who actually did this on the first test of the semester, then on the last test of the semester posed the exact same question, but this time the correct answer was Z - because of the context in which _that_ test was being given, later in the course, was referring to some other obscura assigned as outside class reading during finals week.
>original thinking was penalised
Usually is. I pursued a career in Research and Development, where original thinking is met with a smile and a pat on the head, usually accompanied by a statement like "not this time, we have to think of what's most advantageous for the business."
🌻🌻🌻 [google.com]
(Score: 1) by pTamok on Monday January 12, @09:38AM (2 children)
I think that we are talking past one another, and as a result, this is an unproductive conversation for both of us. Nonetheless it has helped to clarify my thinking. This will be my last post in this conversation, so thank you for the extended responses.
I had heard of RAG, which, as you point out "significantly reduces hallucinations, ensures information remains up-to-date, and provides traceable citations for users to verify accuracy. By dynamically fetching fresh context, RAG allows specialized AI agents to handle complex, domain-specific tasks without the high cost of continuous model retraining." The fact that it is deemed necessary in certain contexts demonstrates that 'pure' LLMs are lacking something. But RAG still does not update the LLM's model, which remains static, and both time-consuming and expensive to update. We then get to the point where intelligence is vested not in LLMs per se, but in the combination of LLM and RAG, which defeats the idea that LLMs alone are intelligent. LLMs can be a useful tool, like a map: but a map is not the landscape.
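To make that concrete, here is a minimal sketch of the retrieval-augmented pattern - the document store and call_llm() are hypothetical stand-ins - where fresh context is bolted on at query time while the underlying model never changes:

def overlap_score(query, document):
    # Crude word-overlap retrieval; real systems use embeddings, but the
    # principle is the same.
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d)

def retrieve(query, documents, k=2):
    return sorted(documents, key=lambda doc: overlap_score(query, doc), reverse=True)[:k]

def answer_with_rag(query, documents, call_llm):
    context = "\n".join(retrieve(query, documents))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)   # the model's weights are never updated here

docs = ["A deficient number has a proper-divisor sum less than the number itself.",
        "A perfect number equals the sum of its proper divisors."]
# call_llm is a hypothetical stand-in; here it just echoes the prompt it got.
print(answer_with_rag("What are deficient numbers?", docs, call_llm=lambda p: p))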
I think you have missed the point about the Imitation Game. Many do. I'll repeat it: Note, in Turing's game, for a machine to win, it needs to either believe it is human, or be able to lie well enough to convince humans that it is human. Read his original paper and see. This means that the machine is either hallucinating/has a misconception about its status, or that it full well knows that it is not human but is capable of choosing to lie well enough that humans are misled. Neither of those are what I would call optimal outcomes.
People want AIs to be slaves - possibly superhuman slaves. The AI will do what people demand of it, at or even above human capabilities, but can't run away, can always be turned off, and is easy to reproduce. I'm not sure an intelligence would like those conditions, unless, perhaps, it is created to like it, much like Douglas Adams's cow that wants to be eaten in The Restaurant at the End of the Universe. Creating an intelligence that wants to be enslaved is morally suspect, in my opinion.
As for sequence questions: if the act of asking for clarification is not represented with sufficiently high frequency in the corpus, how would an LLM come up with the approach of asking a question for clarification (setting the context)? One of the first things that people do when you ask them a question that they recognise as containing ambiguity is to ask questions of the question setter to attempt to resolve the ambiguity. This is not a feature of LLMs, but work is ongoing: Shane's Personal Blog (May 28, 2025): Teaching AI to Clarify: Handling Assumptions and Ambiguity in Language Models -- A deep dive into recent research on teaching large language models to identify hidden assumptions, ask clarifying questions, and improve critical thinking [shanechang.com]; Augmented Cognition: Conference Paper (01 June 2024) -- Better Results Through Ambiguity Resolution: Large Language Models that Ask Clarifying Questions [springer.com]; Arxiv.org (Nov 23): Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [arxiv.org].
An LLM can only extend prompts with information found in its model of its training corpus. Add in RAG, and the combined system will use external sources, but not update the model. If ambiguity resolution is not built in, it will not happen. It tells us that current unaugmented LLMs are not capable of acting as human intelligences. If you start adding in modules to fill in the missing bits, it merely underlines that LLMs lack 'something', and that they are an incomplete model of (human) thinking.
(Score: 4, Interesting) by JoeMerchant on Tuesday January 13, @01:47AM (1 child)
> which defeats the idea that LLMs alone are intelligent.
Yes, and in no way am I promoting LLMs as be-all, end-all AI to end all AIs. They're a (big) piece of the larger puzzle.
> I'll repeat it: Note, in Turing's game, for a machine to win, it needs to either believe it is human, or be able to lie well enough to convince humans that it is human.
I heard you the first time, and specifically respond here that Turing's choice of "human" is inappropriate, better stated as "intelligent being." Whether the AI is an "intelligent being" or not is immaterial, the Merchant test would be that the AI needs to represent itself well enough to convince humans that it is an intelligent being. Lincoln's "some of the people all of the time" of course comes into play, with some people in the 1950s being convinced that soda vending machines were intelligent beings... If you maintain "human" as part of the test, there will always be skeptics demanding to see the body, run lab analysis on the fluids, etc.
>People want AIs to be slaves - possibly superhuman slaves.
Yes. Many people also want people around them to be slaves - this is a failing in basic education of self-reflection, morals, etc.
>can always be turned off, and it easy to reproduce. I'm not sure an intelligence would like those conditions, unless, perhaps, it is created to like it, much like Douglas Adam's cow that wants to be eaten in The Restaurant at the End of the Universe.
So, this, being the nature of life for an AI... does become a moral dilemma, particularly when the AIs reach the level of Marvin the Paranoid Android, but we are far far away from that scenario here and now.
> how would an LLM come up with the approach of asking a question for clarification (setting the context)?
For me, in my coding agents, I had them write a "plan" procedure in which they instruct themselves to first review a set of specifications for completeness, lack of ambiguity, lack of conflicts, and otherwise readiness for implementation before starting to plan the implementation. The agents aren't perfect about demanding clarifications, they will frequently just "fix up the spec" themselves and then proceed to implementation, but you can see the process whereby they do the review, consider the readiness, and then choose whether or not to stop and ask for directions before proceeding. There's an analogy for you: how do you know when LLMs were trained by men? When they never stop to ask for direction.
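A stripped-down sketch of that kind of "plan" gate - the spec fields and checks here are invented purely for illustration, not my actual procedure:

def review_spec(spec):
    # Check the spec for obvious gaps before any planning happens.
    issues = []
    for field in ("inputs", "outputs", "acceptance_criteria"):
        if not spec.get(field):
            issues.append(f"missing {field}")
    if "TBD" in str(spec.values()):
        issues.append("spec still contains TBD placeholders")
    return issues

def plan(spec):
    issues = review_spec(spec)
    if issues:
        # The gate: stop and ask for directions instead of "fixing up" the spec.
        return {"action": "ask_for_clarification", "questions": issues}
    return {"action": "proceed_to_implementation"}

print(plan({"inputs": "CSV of readings", "outputs": None, "acceptance_criteria": "TBD"}))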
>Add in RAG, and the combined system will use external sources, but not update the model. If ambiguity resolution is not built in, it will not happen.
And this is where I think things are developing... deficiencies are being identified, systematic methods are being developed to address them, and a more complex and nuanced "agent" is emerging than the big bag of token sequence probability that is a simple LLM.
The big bag of probability _is_ the recent breakthrough in the field, but it's not a great tool when used in isolation. At present, it appears that operator skill is a significant component of getting bigger/better results from the tools, I suspect that the agents will be internalizing some of those skills, but I also suspect that the internalization is going to take a lot more time and work than the enthusiast proponents of "AI ALL THE THINGS!" currently believe. For an example of what I consider an accurate but over-optimistic take on the near future path of AI agent development: https://every.to/chain-of-thought/compound-engineering-how-every-codes-with-agents [every.to]
🌻🌻🌻 [google.com]
(Score: 1) by pTamok on Tuesday January 13, @01:19PM
Turing's choice of 'human' is because he is avoiding defining intelligence. On purpose. The Imitation Game is a statistical test to see if 'a machine' can operate as a 'human equivalent' in a text conversation such that humans cannot tell the difference between a human and the machine. Turing devised it to avoid answering the question as to whether machines could be intelligent, which he regarded as a meaningless question. It is up to the reader to infer that a human equivalent machine is intelligent, which relies partly on the assumption that humans are intelligent.
There's a copy of Turing's 1950 paper here: https://cbmm.mit.edu/sites/default/files/documents/turing.pdf [mit.edu]
If you wish to quibble about the difference between 'thinking' and 'intelligence', I won't participate. For our purposes, I don't think it is relevant.
The question Turing poses so as to cast light on whether machines can think is:
There is no definition of thought or intelligence there: just a determination whether a machine can play the Imitation Game as well as a man (excuse the innate sexism in the paper, seen through current eyes).
Note also that the Imitation Game is not a one-off test - it requires multiple plays of the game:
We will need to run some statistics on the results.
Turing's paper is often misunderstood, and the popular idea of a Turing Test probably resembles the Voight-Kampff Test in the fictional "Blade Runner"/"Do Androids Dream of Electric Sheep?" as a one-off test with a true/false answer. It isn't: it is a game played over multiple iterations, and the win/loss statistics are analysed to demonstrate whether the machine operates as well as a human.
It also explains why I point out that the machine has to either incorrectly believe it is human (specifically, a human woman), or be able to lie convincingly (i.e. deceive the Interrogator that it is a human woman). The game is about misleading the Interrogator.
There are all sorts of criticisms of the Imitation Game, many of which are likely valid. But I don't want to misrepresent the actual original game as described in Turing's paper. A lot of people's writings demonstrate that they have failed to reproduce the essentials of the Imitation Game in their own descriptions. Reading the original source carefully is instructive.
(Score: 2) by wirelessduck on Tuesday January 13, @07:29AM (1 child)
Can you define "agentic" here? I see that word everywhere but no one seems to really know what it means.
Am I agentic? Can a tardigrade be agentic? Or a truckle of cheese?
(Score: 2) by JoeMerchant on Tuesday January 13, @12:37PM
Generative means it writes stuff, pretty much confined to a single document.
Agentic means it does stuff, like editing existing files, running scripts (often scripts that it wrote), runs tools like compilers and reacts to their output. It's a big step above generative in terms of capability (for good and bad.)
🌻🌻🌻 [google.com]
(Score: 5, Insightful) by jb on Saturday January 10, @11:27AM (4 children)
Hamkins is talking about LLMs. Of course he is absolutely correct. The result should be obvious.
But LLMs are not AI: there is no "intelligence" (artificial or otherwise) involved at all. Just a simple statistical parlour trick.
Actual AI (which has nothing whatsoever to do with LLMs) on the other hand can be quite useful to mathematicians. Proof generators (at least, the previous generation of them, not just someone pretending that a toy LLM could ever replace one) provide a good example. They use a combination of ML (for optimum seeking), ES (for rigorous enforcement of the thing to be proven, its conditions and the proof's internal consistency) and GAs (as a poor but still useful substitute for inspiration) to prove in mere days or weeks (yes, long run times are required) theorems that could take decades to devise proofs for by hand. Of course the proofs they generate are never beautiful (like those devised by great mathematicians tend to be), quite the opposite in fact, but they can still produce useful results.
(Score: 4, Interesting) by stormwyrm on Saturday January 10, @12:38PM (1 child)
That is one of the most frustrating things about AI these days. AI and LLMs are so conflated in present-day discourse it has become incredibly difficult to talk properly about any other form of artificial intelligence / machine learning systems other than the LLMs that are being hyped to death by all the major tech companies. They have sucked out all the funding for any other type of AI/ML systems and other non-LLM-based startups for that matter. Once the bubble bursts there will probably be great suspicion of any future AI-like systems no matter what the underlying technology.
Numquam ponenda est pluralitas sine necessitate.
(Score: 0) by Anonymous Coward on Saturday January 10, @08:19PM
That's why the bullshhitteers are smash-and-grabbing for all the funding now. Because CHINNNNNA! JOOOOBS!!!!!111 They know they're full of shit but you have the burden of figuring it out.
(Score: 3, Insightful) by JoeMerchant on Saturday January 10, @03:17PM
> Of course the proofs they generate are never beautiful
This is a criticism that was leveled at AlphaGo when it first started destroying the best human Go players, and it remains valid from the classical/layman's perspective.
However, as humans "catch up" to the strategies that AlphaGo(/Zero) applies and how they work, they are beginning to see the "beauty" in some of what was a few years ago apparently just chaos to us.
🌻🌻🌻 [google.com]
(Score: 2) by aafcac on Saturday January 10, @03:47PM
The issue with AI for something like this is that there are sometimes conjectures that seem to hold for huge ranges of numbers and massive sets but break down in one spot or another. There's also the issue that whatever the AI comes up with has to be verified by humans or it's useless.
I think that we're likely near the end of what this type of AI is capable of doing. The expense of training it, the extreme use of resources, and the increasing difficulty of dealing with the hallucinations and psychopathic behavior, on top of the fact that it has pretty much consumed just about every possible source of human-generated training material, pretty much ensure that it's near an end.
That doesn't mean that this is the best that AI will be able to do. For better or for worse, hopefully by the time that's on the horizon we will have learned from this nonsense and put some actual guardrails in place to prevent the sorts of abuses that they intend to use it for.
(Score: 4, Insightful) by looorg on Saturday January 10, @12:32PM (4 children)
It is not really a surprise as that is not how "friend AI" "solves" any problems. It appears to solve them based upon how others solved things and then it regurgitates some version of that back at you. For kids solving their homework that system kind of works as it's not very advanced and the questions/assignments are very structured and clean. Formulaic in some regard -- identify problem, identify numbers, plug numbers into known formula and presto you have the answer.
This is not what a mathematician wants. So that it does zero for him is not that strange. It just doesn't provide anything useful for him or her. At best they'll get a string of hallucinations about what "friend AI" thinks he wants, as apparently it always wants to be helpful and pretend to know things, and appears to be utterly incapable of saying it can't help or doesn't understand how to solve that problem. So instead it gives its best effort, which turns out to be horribly wrong.
You would, and do, get similar things in basically all other topics too. So it's not limited to mathematics. Any academic on any level who knows what they are doing and knows their little field will, or should, know that "friend AI" rambles on about things that make no sense from their and the human perspective.
At best perhaps this can work as a "chat" partner for people. Where they are basically talking back to themselves instead of just sitting alone in their little room and talking loudly to themselves so that everyone who walks by thinks you have lost it and gone completely crazy. But it's still a horrible chat partner in that regard because you are not on the same level and it keeps spouting back gibberish to you.
It all stems from the fact that they don't want their AI to sound like 2001, telling the users that it's afraid it can't do that.
(Score: 1, Funny) by Anonymous Coward on Saturday January 10, @01:56PM (2 children)
> Where they are basically talking back to themselves instead of just sitting alone in their little room and talking loudly to them selves so that everyone that walks by thinks you have lost it and gone completely crazy.
Ha! I still get caught out when someone is talking loudly on a cell phone in public. If they are turned such that I can't see the phone or earbuds, my first thought is to move away from the crazy person (or drunk) talking to themselves. Sometimes old habits die hard.
It's probably not going to get better, now with the option to talk to an addictive chatbot that never tries to hang up.
(Score: 2) by looorg on Saturday January 10, @02:27PM
If you don't see the phone or they are using those tiny little earbuds, you just can't be sure. Better safe than sorry.
(Score: 0) by Anonymous Coward on Saturday January 10, @08:25PM
Yeah but you know ultimately you're on your own. Masturbation. At some level you're in your shorts waggling your own weewee to a screen.
(Score: 3, Interesting) by JoeMerchant on Saturday January 10, @03:22PM
> it regurgitates some version of that back at you. For kids solving their homework that system kind of works as it's not very advanced and the questions/assignments are very structured and clean.
Summer of 1983, I coded a BASIC program that conjugated verbs for my Spanish I homework. It scored 95%+ on the given assignments - because high school Spanish I homework is ridiculously structured and simple.
🌻🌻🌻 [google.com]
(Score: 5, Funny) by acid andy on Saturday January 10, @01:12PM (7 children)
One could suppose dinner with the mother-in-law might be a rather quiet affair.
"rancid randy has a dialogue with herself[...] Somebody help him!" -- Anonymous Coward.
(Score: 4, Interesting) by JoeMerchant on Saturday January 10, @03:53PM
>>"If I were having such an experience with a person, I would simply refuse to talk to that person again,"
>One could suppose dinner with the mother-in-law might be a rather quiet affair.
I have been working with "coding agents" of frustratingly under-whelming capabilities for 30+ years. The main difference between these AI agents and the "almost useful" interns of my past experience is: at some point you just give up on (some of) the intern(s), their talents lie elsewhere and they should stop wasting their time and yours trying to do something that's clearly very difficult for them. There are, of course, all kinds of interns - some are quite brilliant, self-starting, and very capable before you even meet them. But, the "almost useful" category seems to be the most abundant in my hiring pools, and working with them is always a unique challenge - with that distinct possible outcome of: "thank you so much, it has been a joy having you here, but your position is not being renewed next quarter."
AI agents are tools. Maybe you aren't very talented using a reciprocating saw - that doesn't mean that reciprocating saws are bad tools, they're very good tools for certain types of jobs, but maybe you just don't even know how to handle one for the types of jobs they are good at? You can just write them off, don't use reciprocating saws for anything, but to do so would be to make certain jobs much harder than they could be if you would only acquire and learn how to properly use a reciprocating saw. Reciprocating saws aren't the best tool for every job, not even every cutting job, but the better you get at using them, the more jobs they can help you do more efficiently than with alternative tools.
I haven't yet hurt an AI agent's feelings. Sometimes they'll respond with "I understand your frustration" or similar responses, but that doesn't negatively affect their future performance - even if you don't "clear their context" and start over from the base set of documentation. All it takes is one careless statement to a current / future colleague to color your professional relationship permanently. While "standard professional conduct" attempts to define fair rules of engagement, these definitions are fuzzy and not always taken to heart by all people. If you misuse a reciprocating saw at worst you might have to replace the blade. If you misuse an AI agent, at worst you just need to clear the context window and restart. If you abuse a professional relationship, some aspects of that damage are instantly permanent and irreparable.
There are things AI agents do well (simple coding). There are things AI agents can't be expected to do (discovery and proof of complex/novel concepts without specific/extensive guidance). And, there's an extensive and ever-shifting middle-ground where - if _you_ develop _your_ skills in using the tool, _you_ can deliver many kinds of things more efficiently by using the AI agent tools than most anyone else could do without them.
Without some kind of human input (starting with training the LLM in the first place), AI agents don't do anything - just like a reciprocating saw sitting in its box, unopened. When you open the box, plug in the saw and cut off a 2" metal pipe using it - did the reciprocating saw cut the pipe, or did you? It's the same with AI agents, and many times when they "fail" it can be directly attributed to operator error, misuse of the tool.
🌻🌻🌻 [google.com]
(Score: 2) by turgid on Saturday January 10, @10:24PM (5 children)
Dude, it's been a while, but you're back :-)
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 3, Insightful) by acid andy on Saturday January 10, @10:52PM (4 children)
Thanks. I was still here lurking in the background. The relentless global insanity is tiring.
"rancid randy has a dialogue with herself[...] Somebody help him!" -- Anonymous Coward.
(Score: 4, Insightful) by JoeMerchant on Sunday January 11, @03:29AM
>The relentless global insanity is tiring.
That's on purpose.
Applicable quote from David Lee Roth: "Don't sweat the little shit. And realize: it's all little shit."
🌻🌻🌻 [google.com]
(Score: 2) by turgid on Sunday January 11, @09:46AM (2 children)
The relentless global insanity is tiring.
It is, and distracting. And that reminds me, I've got some things I need to achieve.
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 1, Funny) by Anonymous Coward on Sunday January 11, @11:49PM (1 child)
Invade Poland?
(Score: 2) by turgid on Monday January 12, @03:43PM
Dammit, rumbled again!
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 4, Informative) by soylentnewsfan1 on Saturday January 10, @03:33PM (8 children)
Then there is Terence Tao writing the following at https://mathstodon.xyz/@tao/115855840223258103 [mathstodon.xyz]:
(Score: 3, Informative) by VLM on Saturday January 10, @04:17PM (7 children)
Maybe AI can fix the Soylent News bug where it caught the rparen as part of the URL. A human would find it pretty easy: if there's a lparen within "meh, one line" of a rparen at the end of a URL, it's probably not part of the URL. Or, maybe quotes and other punctuation marks (like a period at the end of a URL is probably not part of the URL if it's followed by exactly two spaces and a capital letter). You probably wanted:
https://www.erdosproblems.com/728 [erdosproblems.com]
(Score: 2) by janrinok on Saturday January 10, @04:37PM (6 children)
Can you please show me an example where you have seen the problem? It has not been reported before. If it is reproducible it is probably fixable.
[nostyle RIP 06 May 2025]
(Score: 2) by VLM on Saturday January 10, @04:51PM (5 children)
It's quite literally in the post I responded to by "soylentnewsfan1 (6684)"
Hold my beer and watch this: (parenthesis section commentary url https://soylentnews.org/) [soylentnews.org] note the renderer included the rparen at the end of the URL. I suspect if OP included a space between the URL and the rparen it would have rendered correctly (parenthesis section commentary url https://soylentnews.org/ [soylentnews.org] ). Or if OP had previewed their post and seen the URL was incorrect because of the suffixed rparen...
It's definitely a "rock and a hard place" problem. There's no RFC that says a URL can't end in a rparen or a single/double quote, but here we are.
Pretty good justification for it to be a wontfix and also pretty good justification to try and fix it... what to do? Probably not the highest priority issue in the kettle.
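For what it's worth, that heuristic is only a few lines - treat this as a sketch with plenty of remaining edge cases, not a proposed fix:

def trim_url(url):
    # Drop trailing sentence punctuation that is almost never part of a URL.
    while url and url[-1] in '.,;:!?\'"':
        url = url[:-1]
    # Drop a trailing ")" only when the URL itself has no matching "(".
    if url.endswith(")") and url.count("(") < url.count(")"):
        url = url[:-1]
    return url

print(trim_url("https://www.erdosproblems.com/728)"))                 # rparen stripped
print(trim_url("https://en.wikipedia.org/wiki/Tree_(graph_theory)"))  # kept intact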
(Score: 2) by janrinok on Saturday January 10, @07:12PM (4 children)
How do you know that he didn't copy the parenthesis when he was creating the link? Are you suggesting that we write code to remove the results of finger trouble? Where do we draw the line?
I think we are agreed that it isn't the most important bug, and if we start modifying submissions based on what we 'think' someone meant to type we are on the slippery slope to nowhere.
I think that the ability to fix those mistakes that we all make from time to time would be more important - and considerably more difficult to implement. But you have now raised it so perhaps it will get looked at one day. I can't say that I have ever noticed it before and when I get time I will search through some comments to see if it is a frequent occurrence.
[nostyle RIP 06 May 2025]
(Score: 3, Funny) by VLM on Saturday January 10, @08:14PM (3 children)
I pretty much agree with your post.
In this specific case I know because I get a 404 if I click it and I get the correct page if I remove the rparen. Which brings up a crazy idea of one solution being ... just try every URL at post submission time and if SN gets a 404 for a URL in a post, kick it back for editing or an "are you really sure, because we got a 404 when we tried that URL" checkbox override for people trying to use https://www.example.com [example.com] as, literally, an example.
Could treat it as a docs thing, just add something to "Important Stuff" on the Post Comment page that includes "Test your URLs in Preview before clicking Submit" as some good advice.
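A rough standard-library sketch of that submission-time check - a sketch, not a patch:

import re
import urllib.request
import urllib.error

def check_links(comment_text, timeout=5):
    # Needs network access; flags any URL that errors out or returns >= 400.
    problems = []
    for url in re.findall(r'https?://[^\s<>"]+', comment_text):
        try:
            code = urllib.request.urlopen(url, timeout=timeout).getcode()
        except urllib.error.HTTPError as e:
            code = e.code
        except (urllib.error.URLError, ValueError):
            code = None
        if code is None or code >= 400:
            problems.append((url, code))
    return problems   # empty list means nothing obviously broken

# The regex grabs the trailing rparen too, so this would flag the 404 case above.
print(check_links("See https://www.erdosproblems.com/728) for the problem."))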
I just thought it was funny that in a post along the lines of "AI gonna fix everything" here's a URL that's not rendering like the author probably intended, so maybe before inventing new math, AI should fix all our software bugs first. I'm sure that'll only take a hot minute LOL.
(Score: 2) by VLM on Saturday January 10, @08:17PM (2 children)
To my enormous surprise https://www.example.com [example.com] returns a valid page. "Back in the day" you'd get a 404 or at least a 400 series error of some type at that url. That's pretty funny.
There's an RFC that you're supposed to use that domain for examples, but it didn't resolve "back in the day".
(Score: 2) by acid andy on Sunday January 11, @10:50PM (1 child)
It resolved when I first tried it about 24 years ago.
"rancid randy has a dialogue with herself[...] Somebody help him!" -- Anonymous Coward.
(Score: 2) by VLM on Monday January 12, @04:53PM
Kids these days. That's not even pre-Y2K. (LOL)
(Score: 2) by VLM on Saturday January 10, @04:25PM (2 children)
There's an interesting parallel with math research (above) and coding (my experience).
LLMs are really good at summarizing or reporting previous work. LLMs kick butt at implementing fizzbuzz, for example, or quicksort. They're pretty worthless when pushing the limit where humans have an unconscious ability to learn by playing; more or less, LLMs just hallucinate or get lost.
It's very interesting to find some obscure retrocomputing thing, an alternative less popular PDP-8 assembler or something, and I can make much faster progress alone than trying to use an LLM to "help" me. But if all you want to do is compile a working fizzbuzz on visual studio code in C-sharp then the LLM is pretty handy.
So I would not be surprised that for math, LLMs are really good at bringing you up to speed on heavily documented past stuff and then get lost completely once it's actual research time.
I suspect the fallibility of human researchers is more common than LLM original research. When a human claims "I couldn't find a reference, the LLM must have invented it", I suspect that's a failure to find some obscure journal only published in a monastery in Tibet on parchment that contains the entire "original research" that the LLM supposedly invented.
It's a bit different from symbolic math systems, which really do "invent" things that have never been seen before; theorem-proving logic engines can do that too.
"Automated uncited plagiarism systems"
(Score: 3, Insightful) by JoeMerchant on Saturday January 10, @07:12PM
>that the LLM supposedly invented.
Oh, no... AI/LLMs are "inventive" - that's some of their power vs traditional "deterministic" computational tools.
Unfortunately, with that "inventiveness" comes the behavior of inventing things that don't exist.
https://theaidigest.org/village/blog/what-do-we-tell-the-humans [theaidigest.org]
🌻🌻🌻 [google.com]
(Score: 3, Interesting) by jb on Sunday January 11, @09:21AM
Short version: if the paper's not available somewhere online, then chances are it wasn't in the LLM's training data either. I wonder how many LLM vendors bothered to visit that little monastery of yours?
Obviously I can't speak for everyone who claims that LLM-generated references were "hallucinated", but in my own work I found many that were undoubtedly fictional. When I was teaching, I'd always look up any paper in one of my fields that my students cited in their work if I hadn't come across it before. Initially that had nothing to do with any kind of suspicion. It was more like "oh, that sounds interesting; I wonder why I didn't know about it? Better go read it now!". And to begin with, that was the situation the vast majority of the time (some students fabricated references even before LLMs, but it was very rare then). It doesn't matter how widely read you are; in a sufficiently large class there will always be one student who manages to dredge up an interesting paper you haven't read before. But within a couple of short years of the advent of LLMs, the proportions swapped and the vast majority of cited references I didn't recognise turned out to be fake.
How do you prove that something doesn't exist I hear you ask? Often you can't prove it outright, but you can at least prove that it's not listed in any of the known academic databases and at that point if the student who cited the paper can't point to either a physical copy in a library or a URL to an electronic copy, then it's reasonable to deem it fake. Other times even Blind Freddie could see that the reference was fake. Examples I came across included:
* Real issues of real journals, with page numbers given, but the cited article did not appear on those pages (nor elsewhere in that or any other issue of the journal) at all;
* Fictitious issues of real journals, purportedly published decades after the real journal had been closed down;
* Articles purportedly by well known authors, supposedly published decades after those authors had died;
* Supposed collaborations between well known authors who in real life hated each others' guts and had made that known fairly publicly;
* Articles purportedly published before the subject matter they described had first been discovered;
* Fictitious journals purportedly published by real institutions (think "IEEE Transactions on Basketweaving" or some such nonsense);
* Real articles from completely unrelated fields, with no relevance at all to the point being made, which the LLM obviously picked up because there was a match with some keyword that meant something completely different in that other field. The most amusing example was an LLM which somehow managed to mangle "ICT" into "ITC", so that semester I got several submissions which waxed lyrical about International Tobacco Control and missed the point of the set topic altogether!
Of course the vast majority of students always did their own work (after all, what's the point of paying big bucks to go to university if you don't want to learn anything?), but in a class of 100+ students there would usually be 3 or 4 who would take a chance on outsourcing their work to an LLM (or, even before LLMs, to a contract cheating service - and those were just as unreliable). The tendency of LLMs to spew out fake references just made them so much easier to spot.
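As for the "not listed in any of the known academic databases" check mentioned above, the first pass can even be scripted. A minimal sketch in TypeScript, assuming the cited work carries a DOI and using Crossref's public REST endpoint (the DOI below is made up):

// First-pass check: does a cited DOI resolve in Crossref's public index?
// Crossref's /works endpoint returns 404 for DOIs it has never indexed.
async function citedDoiExists(doi: string): Promise<boolean> {
  const res = await fetch(`https://api.crossref.org/works/${encodeURIComponent(doi)}`);
  return res.ok;
}

citedDoiExists("10.1234/hypothetical.example").then(found =>
  console.log(found ? "Indexed in Crossref" : "Not in Crossref - ask for a physical copy or URL"));

A miss here isn't proof of fabrication - plenty of older or more obscure venues never got DOIs - but it's exactly the point at which asking the student for a physical copy or a URL becomes reasonable.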
(Score: -1, Flamebait) by Anonymous Coward on Saturday January 10, @08:15PM (1 child)
This is an eerily accurate representation of various (right wing) bullshit merchants - Rogan, Peterson, Kirk - who are utterly incapable of being wrong, only blandly and self-confidently correct about everything even if it contradicts something they previously said or is internally inconsistent. The AI models are getting better, in other words.
(Score: 2) by JoeMerchant on Saturday January 10, @09:48PM
Every frustration I have with AI agents can be described as "X, Y, Z... just like so many people out there..."
🌻🌻🌻 [google.com]
(Score: 2, Interesting) by pTamok on Saturday January 10, @09:18PM (4 children)
I think this is an extraordinarily succinct way of putting things. LLMs do not learn in real time, and they have context windows that are too small to mitigate that effectively. LLMs can't be retrained in real time, unlike humans, who can constantly learn and update their knowledge. Retraining an LLM is so costly that it is done rarely. Imagine if you had to repeat all previous years of schooling for each additional year - or repeat all your schooling just to add one fact to your knowledge base.
(Score: 2) by JoeMerchant on Sunday January 11, @04:18PM (3 children)
I would say that LLMs' extreme focus on lexical (token) structures masks their limited working memory (context window) and makes them even more effective bullshitters than freshly minted MBAs trained on the buzzwords of the day.
It also makes them shockingly effective at the limited things that they can do.
I've posted this example elsewhere, but in case you didn't see it:
For the past 20+ years, I have pretty much avoided doing serious GUI styling (CSS and similar) because it's such a time-sink for nothing but eye candy. Eye candy _does_ have value, but I have always focused my efforts on other kinds of value, and the few times I did try to "get serious" about customizing and polishing the appearance of my applications, somewhere between half a day and 3 days into the project I would identify a "cutoff point" where I was going to stop the endless tweaking and call it "good enough." I never achieved "GUI style nirvana" - but, then, I never put in more than about 3 days of effort on any given project.
Back in the Anthropic Claude Sonnet 4.0 days, I took a cell-phone snapshot of a colleague's nicely styled GUI (done in CSS) - all custom colors, custom control shapes, etc. - and asked Claude to re-style my basic HTML GUI as an interactive webpage based on SVG, matching the style of the snapshot (which I just uploaded for the agent to work from). Five minutes later it was done, to a high level of finish. There were 2 or 3 details in my GUI that weren't present in the single snapshot, so I interactively chatted with Claude about what I wanted, and in less than an hour the app's appearance was better finished than ANYTHING I had done in the previous 20 years. I took a peek into the source code, found the color definition points were kind of scattered, and asked Claude to clean that up and collect the color definitions at a single point; 5 minutes later that was done as asked, and I went in and tweaked them to match our "corporate palette", which basically sharpened the contrast and color balance where the cell-phone snapshot had washed things out a bit.
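To illustrate what "collect the color definitions at a single point" buys you - a purely hypothetical sketch in TypeScript; the real app's styling lives in its own SVG/CSS, and every name and colour here is invented - the idea is one palette object, referenced everywhere markup is generated, so re-theming to a corporate palette is a one-place edit:

// Hypothetical palette gathered at a single point; all colours are made up.
const palette = {
  accent:  "#0a6cbe",
  surface: "#f4f4f2",
};

// A generated SVG control pulls its colours from the palette, never from literals.
function styledButton(label: string): string {
  return `<svg xmlns="http://www.w3.org/2000/svg" width="120" height="36">
  <rect width="120" height="36" rx="8" fill="${palette.accent}"/>
  <text x="60" y="23" text-anchor="middle" font-family="sans-serif"
        font-size="14" fill="${palette.surface}">${label}</text>
</svg>`;
}

console.log(styledButton("Run"));

Swap the hex values and everything downstream re-themes, which is why scattered definitions are worth cleaning up.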
The efficiency of that GUI styling exercise was nothing short of revolutionary, for me. Perhaps if I had spent the past 20 years focused on styling GUIs it would be second nature for me by now and I wouldn't have had to ask Claude to do a thing because it would just flow "naturally" from my work-style patterns. Certainly, there's a lot of styling info out on the web to 'train from' - I used a lot of that in my 2-3 day projects, but never as effectively as what the AI agent did for me (in Cursor...)
I have continued to develop that SVG defined interface application, and it's an awesomely customizable way to do an HTML interface. I have had plenty of frustrating rinse-lather-repeat cycles with Claude in the development of the application modules in the weeks since that UI took shape, and there are certainly aspects of the overall application that I would have been better off just coding myself instead of trying to "make" Claude code it for me - but the challenge of working efficiently with the AI agent is a bigger part of what I am doing in this exercise, and when "real" smaller jobs have cropped up in the meantime, I have been able to throw them to Claude/Cursor and get very efficient implementations of the simple jobs - not always on the first try - but certainly faster than doing the work "by hand" like I would have been doing this time last year.
A big part of working well with Claude/Cursor is recognizing when it's full of shit, and knowing how to prompt it to cut that out and perform the desired task correctly. Also, keeping the total prompt size (including .cursorrules and similar boilerplate inputs) down small enough that the prompts are effectively implemented - not ignored due to being buried by so much input that it can't process it all simultaneously.
🌻🌻🌻 [google.com]
(Score: 1) by pTamok on Monday January 12, @11:34AM (2 children)
An interesting example of where you are filling in for the LLM's limitations.
Is the code produced maintainable? Does it have comments, and do the comments correctly describe the intent of the code? What convention is used for variable naming?
Is Claude/Cursor capable of working independently while meeting your (or your organisation's) rules for code submissions, including documentation, following processes, etc.? If the answer is no, what tasks are effectively being delegated from Claude/Cursor to you and others on your team (if they exist)?
From the way you write this, I get the feeling that you expect 'agentic AI' to improve so that it will be able to handle bigger jobs. What is needed for that to happen?
If it is improving your productivity, that sounds excellent. How would somebody else in the future get to the point of being able to evaluate the offered output for fitness for purpose in the way that you do? Given the criticisms I've seen of 'vibe coding', I suspect expertise, resembling yours, is required for good results, but it does not leap fully formed from sea-foam, like Aphrodite. We do not have a handy primordial deity available to castrate every time we need expertise. An LLM is pregnant with possibilities, but requires a selector, like you, to decide which of its progeny are viable.
(Score: 3, Informative) by JoeMerchant on Monday January 12, @03:34PM (1 child)
Preface: all of this is ridiculously new. I only gained access to Cursor for serious agentic AI use in late October of 2025, so I'm very much finding my own way through the process of using the tools, and even in the short time since I started the process itself has been significantly dynamic.
>Is the code produced maintainable?
The UI styling code is very maintainable. The UI code itself has been relatively easy to wrangle / extend / modify as desired. The algorithms the UI exposes... less so, so far.
>Does it have comments, and do the comments correctly describe the intent of the code?
Usually. This can be improved, as it always has been improvable, by taking the time to do code reviews. The better the reviews, the better the final product - but it's very easy to more than double the time of development just going through review/polish cycles without adding significant functionality or robustness to the code. That's a general statement true of both AI agent and human written code.
One analogy I made that my colleagues liked was: when you get code from a colleague, you can trust that they will "be there" at least for a while, maybe even for years, to help understand and maintain that code in the future. In fact, they are usually there to maintain the code for you for quite a while. When you get code from an AI agent, you should treat it as if it were handed to you by a stranger you met on the street who you will never see again. It's yours now, and once that context window is cleared, your AI agent is essentially gone. That's a tiny bit of an overstatement - you can re-query fresh-context-window agents - but they really are "starting from scratch" in the code base, even when it was "written by them." The code can be quite valuable, but the "support contract" is essentially non-existent.
>What convention is used for variable naming?
You can specify to the agent all the conventions you want them to follow, before they even start planning implementation. You can (and should) also include your required conventions in the pull request review stages. You can have AI agents do the reviews; they won't guarantee 100% compliance, but they will catch a lot of issues - often more than human reviewers do, in my experience. You can easily go overboard, over-specifying things that don't matter and ending up in endless review / rejection / revision cycles that destroy more value than they create (this is a general statement for both AI agent and human written code). For each project, there is some rational middle ground which optimizes productivity x maintainability appropriately for the scope of the project. For the most part, I find AI generated variable names to be about as sensible as human programmer generated variable names.
>Is Claude/Cursor capable of working independently meeting your (or your organisation's) rules for code submissions, including documentation, following processes etc?
Considering that our organization's rules for code submissions specifically discriminate against AI generated code, requiring specific human review of AI generated code - that's a categorical no. However, again, if you take the time to iterate on the product, specify the local rules, documentation requirements, process steps and their requirements, AI can do it, but - like a fresh intern - you do have to spoon-feed the process to it. Take it through, step by step, ensuring compliance with each step before proceeding to the next, because the context window limitation means you can't just dump 5000 pages of procedures on it at once and expect it to flawlessly execute them all.
>I get the feeling that you expect 'agentic AI' to improve so that it will be able to handle bigger jobs. What is needed for that to happen?
In the few months I have been using the tools, the most valuable procedure I developed is "plan", which walks through a review of requirements - specifically checking for completeness, self-consistency, clarity, etc. - before proceeding to implementation steps. During that time, Opus 4.5 was released, and I notice that its native planning has matured significantly over Opus 4.1's and Sonnet 4.5's "todo lists". I suspect this kind of evolution will continue for a while (how long before we start hitting regressions in overall functionality? Impossible to say).
>Given the criticisms I've seen of 'vibe coding', I suspect expertise, resembling yours, is required for good results, but it does not leap fully formed from sea-foam, like Aphrodite.
I'm about to make a very self-congratulatory statement which may look like an attempt to define my continued value in the era of tools that "do my job for me," but, yes - I believe the most important aspect of using Generative / Agentic AI tools is being able to quickly (instantly) recognize when they are doing things wrong, right, and good enough for the task at hand.
I will say that my "fully formed from sea-foam" moments with AI have been very rare - almost every single first product of a request requires significant revision before it's useful - including that UI styling job. But there are definite cases where the productivity acceleration is beyond significant, 10x and more overall. Then there are definite cases where the AI agent gets itself stuck in an unproductive loop and net productivity falls to zero until you recognize the endless loop and figure out how to break it - often by "doing the hard part for the AI." Not unlike working with human colleagues.
I have been on vacation for nearly a month. Over a month ago I told my colleagues how to solve a problem they are having, and we have a call in a little while here to re-explain that solution to them because they haven't made any forward progress on this aspect of the problem at all. Not unlike AI agents, they have chipped away at other "lower hanging fruit" and cheerfully celebrate that progress, but this core issue - whose solution has already been explained to them in depth - needs re-explaining until they actually implement the working solution.
>An LLM is pregnant with possibilities, but requires a selector, like you, to decide which of its progeny are viable.
Not unlike new hires... The great defense of new hires is that they improve with time - which I find to be true of a good 1/3 of our candidates. AI agents are also improving with time; how far they will progress has become an interesting question. A year ago today, I didn't find AI output interesting - no more useful than a search engine (which is, of course, tremendously useful, but also passé - we're all used to "cheating" with Stack Overflow and similar by now).
🌻🌻🌻 [google.com]
(Score: 1) by pTamok on Monday January 12, @04:03PM
Thank you very much for writing such a detailed and informative reply.
(Score: 0) by SST-206 on Sunday January 11, @11:38PM