Arthur T Knackerbracket has processed the following story [theregister.com]:
Enterprise CIOs have been mesmerized by GenAI claims of autonomous agents and systems that can figure anything out. But the complexity that such large models deliver is also fueling errors, hallucinations, and spiraling bills.
All of the major model makers – OpenAI, Microsoft, Google, Amazon, Anthropic, Perplexity, etc. – are singing from the same hymnal, the one that says the bigger the model, the more magical it is.
But much smaller models might do a better job with controllability and reliability.
Utkarsh Kanwat is an AI engineer at ANZ, a financial institution headquartered in Australia. He broke down the numbers showing that large GenAI models become mathematically unsustainable at scale.
"Here's the uncomfortable truth that every AI agent company is dancing around: error compounding makes autonomous multi-step workflows mathematically impossible at production scale," Kanwat wrote in a blog post [utkarshkanwat.com] over the weekend. "Let's do the math. If each step in an agent workflow has 95 percent reliability, which is optimistic for current LLMs," then five steps equal a 77 percent success rate, ten steps is a 59 percent success rate, and 20 steps is a 36 percent success rate.
What does that all mean? "Production systems need 99.9%+ reliability. Even if you magically achieve 99% per-step reliability (which no one has), you still only get 82% success over 20 steps. This isn't a prompt engineering problem. This isn't a model capability problem. This is mathematical reality."
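The arithmetic is easy to check. A few lines of Python – assuming, as Kanwat's figures implicitly do, that each step fails independently – reproduce his numbers:

    # End-to-end success of an n-step workflow when every step must succeed.
    for p in (0.95, 0.99):
        for steps in (5, 10, 20):
            print(f"per-step reliability {p:.0%}, {steps:2d} steps -> "
                  f"{p ** steps:.1%} end-to-end success")

With 99 percent per-step reliability, 20 steps still comes out at about 82 percent – the figure Kanwat cites.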
Several analysts and GenAI specialists back Kanwat's view.
Jason Andersen, a principal analyst for Moor Insights & Strategy, said that enterprises often opt for the path of least resistance. If the large model maker is promising to solve all of their problems, they want to believe that. But it is often the much smaller and more-focused strategies that deliver better results.
"This points out that the real value of an agent in an enterprise sense is to put boundaries around the model so you can get a certain degree of purpose out of it," Andersen said. "When you have a well-crafted and well-scoped (GenAI) strategy, you are likely to have more success."
The larger the model, "the further away you get the accuracy line, further away from reliability," Andersen said. "Small, tight and well-scoped is good. Loosey goosey is bad. There is a lot of wisdom in going small."
Andersen said that he asks CIOs whether they want the AI model "to be the pilot or the navigator?"
A good example of this, he said, is GenAI-powered vibe coding. Should AI be helping the coder or replacing the coder?
"Both have humans in the loop but what role is the human providing? Is the human running the show or is GenAI running the show?" Andersen asked.
Justin St-Maurice, technical counselor at Info-Tech Research Group, agreed that many enterprises are not doing themselves any favors by focusing overwhelmingly on the largest models.
"We are putting agents into complex sociotechnical systems. Agent systems run the risk of causing feedback loops and going off the rails, and the inherent nature of LLMs is randomness," St-Maurice said. "There is a real balance between taking advantage of the generative nature of GenAI and putting rules around it to make it behave deterministically."
Andersen offered an analogy: a company hires a new employee and, instead of training that worker on how the team does things, an executive tells the new hire to figure it out on their own. And when that new employee's work is not what the executive wanted, the company blames the employee rather than the executive who didn't want to spend the time or money on training new talent.
Kanwat also argued that the smaller models – even when deployed in massive numbers – can be far more cost-effective, often at an outright lower price.
"Context windows create quadratic cost scaling that makes conversational agents economically impossible," Kanwat said, and then he offered what he said was his own financial experience.
"Each new interaction requires processing all previous context. Token costs scale quadratically with conversation length. A 100-turn conversation costs $50-100 in tokens alone," Kanwat said. "Multiply by thousands of users and you're looking at unsustainable economics. I learned this the hard way when prototyping a conversational database agent. The first few interactions were cheap. By the 50th query in a session, each response was costing multiple dollars more than the value it provided. The economics simply don't work for most scenarios."
Kanwat said that many autonomous agent companies are going to have severe economic issues.
"Venture-funded fully autonomous agent startups will hit the economics wall first. Their demos work great with 5-step workflows, but customers will demand 20+ step processes that break down mathematically," Kanwat said. "Burn rates will spike as they try to solve unsolvable reliability problems."
Andersen agreed with the pricing concerns.
"The more context you have to give every step, the more the price goes up. It is a logarithmic pricing model," Andersen said, stressing that the model makers are going to soon be forced to sharply increase what they charge enterprises.
A chorus of AI insiders chimed in. Himanshu Tyagi, co-founder of AI vendor Sentient, argued that "there's a trade-off between deep reasoning and streamlined reliability. Both should coexist, not compete. Big Tech isn't going to build this. They'll optimize for lock-in." Robin Brattel, CEO of AI vendor Lab 1, agreed that many enterprises are not sufficiently focusing on the benefits of smaller models.
"AI agents that focus on specific, small-scale applications will have reduced error rates and be far more successful in production," Brattel said. "Multi-step AI agents in production will find data inconsistency and integrations incredibly challenging to resolve, causing costs and error rates to spiral."
Brattel had specific suggestions for what IT should look for when assessing various model and agent options.
Consider the "Low precision requirement. Can the solution be approximately right? Illustrations are easier than code because the illustration can be 20 percent off the ideal and still work," Brattel said. Another factor is "low risk. Generating a poem for a custom birthday card is low risk compared to a self-driving car."
One security executive who also agreed that small can often be better is Chester Wisniewski, director of global field CISO at security vendor Sophos. When Wisniewski read Kanwat's post, he said his first reaction was "Hallelujah!"
"This general LLM experiment that Meta and Google and OpenAI are pushing is all just showoff (that they are offering this) Godlike presence in our lives," Wisniewski said. "If you hypertrain a neural network to do one thing, it will do it better, faster and cheaper. If you train a very small model, it is far more efficient."
The problem, he said, is that creating a large number of smaller models requires more work from IT and it's simply easier to accept a large model that claims to do it all.
Creating those small models "requires a lot of data scientists that know how to do that training," Wisniewski said.
Even Microsoft conceded that small models can often work far better than large models. But one of its AI execs said small only works well for enterprises if the CIO's team has put in the time and thinking to map out a precise AI strategy. For those IT leaders who have yet to figure out exactly what they want AI to do, there is still a reason to embrace the largest of models.
"Large models are still the fastest way to turn an ambiguous business problem into working software. Once you know the shape of the task, smaller custom models can be cheaper and faster," said Asha Sharma, the corporate VP for AI at Microsoft. "Smart companies don't pick a side. They standardize on a common safety and observability stack, then mix and match models to meet quality, cost, and latency goals." (Note: Microsoft declined an interview request from The Register. We reached out to just about every major model maker and they either declined or ignored our request. The Microsoft comment above came from an emailed statement sent after publication.)
Not all enterprises have focused solely on large models. Capital One, for example, has focused on GenAI efforts [cio.com] that are limited to its internal data, and it severely restricts what can be queried to what the database knows.
Kanwat said most enterprises are not the ideal clean environments for GenAI experiments.
"Enterprise systems aren't clean APIs waiting for AI agents to orchestrate them. They're legacy systems with quirks, partial failure modes, authentication flows that change without notice, rate limits that vary by time of day, and compliance requirements that don't fit neatly into prompt templates," Kanwat said. "Enterprise software companies that bolted AI agents onto existing products will see adoption stagnate. Their agents can't integrate deeply enough to handle real workflows."
The better enterprise approach, Kanwat said, "is not a 'chat with your code' experience. It's a focused tool that solves a specific problem efficiently." ®