How the von Neumann bottleneck is impeding AI computing [ibm.com]:
Most computers are based on the von Neumann architecture, which separates compute and memory. This arrangement has been perfect for conventional computing, but it creates a data traffic jam in AI computing.
AI computing has a reputation for consuming epic quantities of energy. This is partly because of the sheer volume of data being handled. Training often requires billions or trillions of pieces of information to create a model with billions of parameters. But that's not the whole reason — it also comes down to how most computer chips are built.
Modern computer processors are quite efficient at the discrete computations they're usually tasked with. Their efficiency nosedives when they must wait for data to move back and forth between memory and compute, so they're designed to switch quickly to some unrelated task in the meantime. But in AI computing, almost all the tasks are interrelated, so there often isn't much other work the processor can do while it sits waiting, said IBM Research scientist Geoffrey Burr.
In that scenario, processors hit what is called the von Neumann bottleneck, the lag that happens when data moves slower than computation. It's the result of von Neumann architecture, found in almost every processor over the last six decades, wherein a processor's memory and computing units are separate, connected by a bus. This setup has advantages, including flexibility, adaptability to varying workloads, and the ability to easily scale systems and upgrade components. That makes this architecture great for conventional computing, and it won't be going away any time soon.
But for AI computing, whose operations are simple, numerous, and highly predictable, a conventional processor ends up working below its full capacity while it waits for model weights to be shuttled back and forth from memory. Scientists and engineers at IBM Research are working on new processors, like the AIU family, which use various strategies to break down the von Neumann bottleneck and supercharge AI computing.
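A rough way to see why those weight transfers dominate is to count how much useful arithmetic each weight supports before it must be replaced. The sketch below is illustrative, not IBM's analysis; the layer sizes and the two-byte (fp16) weight width are assumptions chosen for the example.

```python
# Illustrative sketch: why a matrix-vector product y = W @ x is memory-bound.
# Each weight is read from memory once and used for exactly one multiply-add,
# so arithmetic intensity (FLOPs per byte moved) is a small constant.

def matvec_intensity(rows: int, cols: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weight traffic for one matrix-vector product."""
    flops = 2 * rows * cols                        # one multiply + one add per weight
    weight_bytes = rows * cols * bytes_per_weight  # each weight fetched once
    return flops / weight_bytes

# Intensity is ~1 FLOP/byte for fp16 weights, regardless of layer size:
print(matvec_intensity(4096, 4096))  # 1.0
```

Accelerators can sustain far more FLOPs per byte of memory bandwidth than this, which is exactly the idle-while-waiting behavior the article describes.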
The von Neumann bottleneck is named for mathematician and physicist John von Neumann, who first circulated a draft of his idea [archive.org] for a stored-program computer in 1945. In that paper, he described a computer with a processing unit, a control unit, memory that stored data and instructions, external storage, and input/output mechanisms. His description didn't name any specific hardware, likely to avoid security clearance issues with the US Army, for whom he was consulting. Almost no scientific discovery is made by one individual, though, and von Neumann architecture is no exception. Von Neumann's work built on that of J. Presper Eckert and John Mauchly, who invented the Electronic Numerical Integrator and Computer (ENIAC), generally regarded as the first programmable, general-purpose electronic digital computer. In the time since that paper was written, von Neumann architecture has become the norm.
"The von Neumann architecture is quite flexible, that's the main benefit," said IBM Research scientist Manuel Le Gallo-Bourdeau. "That's why it was first adopted, and that's why it's still the prominent architecture today."
[...] For AI computing, the von Neumann bottleneck creates a twofold efficiency problem: the number of model parameters (or weights) to move, and how far they need to move. More model weights mean larger storage, which usually means more distant storage, said IBM Research scientist Hsinyu (Sidney) Tsai. "Because the quantity of model weights is very large, you can't afford to hold them for very long, so you need to keep discarding and reloading," she said.
Most of the energy expended during AI runtime goes to data transfers: bringing model weights back and forth from memory to compute. By comparison, the energy spent on computation is low. In deep learning models, for example, the operations are almost all relatively simple matrix-vector multiplications. Compute energy still accounts for around 10% of modern AI workloads, so it isn't negligible, said Tsai. "It is just found to be no longer dominating energy consumption and latency, unlike in conventional workloads," she added.
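The transfer-versus-compute imbalance can be put in rough numbers. The figures below are commonly cited 45 nm estimates (Horowitz, ISSCC 2014) and should be treated as illustrative assumptions; exact values vary widely by process node and memory system.

```python
# Back-of-envelope energy comparison for one weight in a matrix-vector product:
# fetching it from off-chip DRAM versus actually using it in a multiply-add.
# Numbers are ~45 nm estimates (Horowitz, ISSCC 2014), used here as assumptions.
DRAM_READ_PJ_32B = 640.0  # ~pJ to read 32 bits from off-chip DRAM
FP32_MULT_PJ = 3.7        # ~pJ for a 32-bit floating-point multiply
FP32_ADD_PJ = 0.9         # ~pJ for a 32-bit floating-point add

transfer_pj = DRAM_READ_PJ_32B          # one fetch per weight
compute_pj = FP32_MULT_PJ + FP32_ADD_PJ  # one multiply-add per weight

# Moving the weight costs roughly two orders of magnitude more energy
# than computing with it:
print(round(transfer_pj / compute_pj))  # 139
```

Under these assumptions, shuttling a weight from DRAM costs roughly 100x more energy than the multiply-add it feeds, which is consistent with Tsai's point that compute is no longer the dominant term.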
About a decade ago, the von Neumann bottleneck wasn't a significant issue because processors and memory weren't so efficient, at least compared to the energy that was spent to transfer data, said Le Gallo-Bourdeau. But data transfer efficiency hasn't improved as much as processing and memory have over the years, so now processors can complete their computations much more quickly, leaving them sitting idle while data moves across the von Neumann bottleneck.
[...] Short of eliminating the von Neumann bottleneck entirely, one solution is to close that distance. "The entire industry is working to try to improve data localization," Tsai said. IBM Research scientists recently announced such an approach: a polymer optical waveguide for co-packaged optics [ibm.com]. This module brings the speed and bandwidth density of fiber optics to the edge of chips, supercharging their connectivity and sharply reducing model training time and energy costs.
With currently available hardware, though, the result of all these data transfers is that training an LLM can easily take months, consuming more energy than a typical US home does in that time. And AI doesn't stop needing energy after model training. Inferencing has similar computational requirements, meaning that the von Neumann bottleneck slows it down in a similar fashion.
[...] While von Neumann architecture creates a bottleneck for AI computing, it's well suited to other applications. It causes issues in model training and inference, but it excels at processing computer graphics and other compute-heavy workloads. And when 32- or 64-bit floating-point precision is called for, the low precision [ibm.com] of in-memory computing isn't up to the task.
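To make the precision point concrete, here is a minimal sketch of symmetric int8 quantization, used as a stand-in for the low-precision arithmetic of in-memory computing. The step size and example value are assumptions for illustration, not figures from the article.

```python
# Hedged sketch: low-precision (int8) representation is acceptable for AI
# weights but far too coarse for work that demands fp32/fp64 accuracy.

def quantize_int8(x: float, scale: float) -> int:
    """Map a real value onto the int8 grid defined by `scale`."""
    q = round(x / scale)
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize(q: int, scale: float) -> float:
    return q * scale

scale = 0.05  # assumed quantization step
w = 0.123456789
w_q = dequantize(quantize_int8(w, scale), scale)
print(w_q)  # 0.1 -- tolerable for a model weight, useless for fp64 numerics
```

The round trip keeps only a few significant bits, which is why in-memory approaches target AI inference rather than general-purpose high-precision computing.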
"For general purpose computing, there's really nothing more powerful than the von Neumann architecture," said Burr. Under these circumstances, bytes are either operations or operands that are moving on a bus from a memory to a processor. "Just like an all-purpose deli where somebody might order some salami or pepperoni or this or that, but you're able to switch between them because you have the right ingredients on hand, and you can easily make six sandwiches in a row." Special-purpose computing, on the other hand, may involve 5,000 tuna sandwiches for one order — like AI computing as it shuttles static model weights.