date 2023-05-25
How the von Neumann bottleneck is impeding AI computing:
Most computers are based on the von Neumann architecture, which separates compute and memory. This arrangement has been perfect for conventional computing, but it creates a data traffic jam in AI computing.
AI computing has a reputation for consuming epic quantities of energy. This is partly because of the sheer volume of data being handled. Training often requires billions or trillions of pieces of information to create a model with billions of parameters. But that's not the whole reason — it also comes down to how most computer chips are built.
Modern computer processors are quite efficient at performing the discrete computations they're usually tasked with. Though their efficiency nosedives when they must wait for data to move back and forth between memory and compute, they're designed to quickly switch over to work on some unrelated task. But for AI computing, almost all the tasks are interrelated, so there often isn't much other work that can be done when the processor gets stuck waiting, said IBM Research scientist Geoffrey Burr.
In that scenario, processors hit what is called the von Neumann bottleneck, the lag that happens when data moves slower than computation. It's the result of von Neumann architecture, found in almost every processor over the last six decades, wherein a processor's memory and computing units are separate, connected by a bus. This setup has advantages, including flexibility, adaptability to varying workloads, and the ability to easily scale systems and upgrade components. That makes this architecture great for conventional computing, and it won't be going away any time soon.
But for AI computing, whose operations are simple, numerous, and highly predictable, a conventional processor ends up working below its full capacity while it waits for model weights to be shuttled back and forth from memory. Scientists and engineers at IBM Research are working on new processors, like the AIU family, which use various strategies to break down the von Neumann bottleneck and supercharge AI computing.
The von Neumann bottleneck is named for mathematician and physicist John von Neumann, who first circulated a draft of his idea for a stored-program computer in 1945. In that paper, he described a computer with a processing unit, a control unit, memory that stored data and instructions, external storage, and input/output mechanisms. His description didn't name any specific hardware, likely to avoid security clearance issues with the US Army, for whom he was consulting. Almost no scientific discovery is made by one individual, though, and von Neumann architecture is no exception. Von Neumann's work was based on the work of J. Presper Eckert and John Mauchly, who invented the Electronic Numerical Integrator and Computer (ENIAC), widely regarded as the first general-purpose electronic digital computer. In the time since that paper was written, von Neumann architecture has become the norm.
"The von Neumann architecture is quite flexible, that's the main benefit," said IBM Research scientist Manuel Le Gallo-Bourdeau. "That's why it was first adopted, and that's why it's still the prominent architecture today."
[...] For AI computing, the von Neumann bottleneck creates a twofold efficiency problem: the number of model parameters (or weights) to move, and how far they need to move. More model weights mean larger storage, which usually means more distant storage, said IBM Research scientist Hsinyu (Sidney) Tsai. "Because the quantity of model weights is very large, you can't afford to hold them for very long, so you need to keep discarding and reloading," she said.
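As a rough illustration of the "discarding and reloading" Tsai describes, the sketch below counts how many on-chip buffers' worth of weights must be streamed in for a single pass over a model. The function names, the 7-billion-parameter model size, the 16-bit weight width, and the 64 MiB on-chip capacity are all hypothetical values chosen for illustration; none of them come from the article.

```python
def weight_footprint_bytes(n_params: int, bytes_per_param: int) -> int:
    """Total bytes needed to store every model weight once."""
    return n_params * bytes_per_param


def reloads_per_pass(n_params: int, bytes_per_param: int, on_chip_bytes: int) -> int:
    """Number of on-chip buffer fills needed to stream all weights once."""
    total = weight_footprint_bytes(n_params, bytes_per_param)
    # Ceiling division: a partially filled buffer still costs a transfer.
    return -(-total // on_chip_bytes)


# Hypothetical example: 7 billion 16-bit weights, 64 MiB of on-chip memory.
n_chunks = reloads_per_pass(7_000_000_000, 2, 64 * 1024 * 1024)
print(n_chunks)
```

Under these assumed numbers the weights exceed on-chip capacity by more than two orders of magnitude, so every pass has to pull them from more distant memory again.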
The main energy expenditure during AI runtime is spent on data transfers — bringing model weights back and forth from memory to compute. By comparison, the energy spent doing computations is low. In deep learning models, for example, the operations are almost all relatively simple matrix vector multiplication problems. Compute energy is still around 10% of modern AI workloads, so it isn't negligible, said Tsai. "It is just found to be no longer dominating energy consumption and latency, unlike in conventional workloads," she added.
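One way to see why data movement dominates is to count, for a single matrix-vector multiplication, how many arithmetic operations are performed per byte that has to be moved. The sketch below does that count under a simplifying assumption that is not from the article: every weight, input value, and output value crosses the memory bus exactly once. The function name is hypothetical.

```python
def matvec_intensity(rows: int, cols: int, bytes_per_value: int) -> float:
    """Arithmetic operations per byte moved for y = W @ x, with W of shape (rows, cols).

    Assumes each weight, input element, and output element is transferred once.
    """
    # One multiply and one add per weight.
    flops = 2 * rows * cols
    # Weights dominate the traffic; the two vectors are comparatively tiny.
    bytes_moved = bytes_per_value * (rows * cols + cols + rows)
    return flops / bytes_moved


# Hypothetical layer: 4096 x 4096 weights stored as 16-bit values.
print(matvec_intensity(4096, 4096, 2))
```

For large matrices this ratio approaches 2 divided by the bytes per weight, so with 16-bit weights the processor gets roughly one operation of work for each byte it fetches, no matter how big the layer is.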
About a decade ago, the von Neumann bottleneck wasn't a significant issue because processors and memory weren't so efficient, at least compared to the energy that was spent to transfer data, said Le Gallo-Bourdeau. But data transfer efficiency hasn't improved as much as processing and memory have over the years, so now processors can complete their computations much more quickly, leaving them sitting idle while data moves across the von Neumann bottleneck.
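The idle time described here can be estimated with the standard roofline bound, which is not mentioned in the article but is a common way to reason about this trade-off: attainable throughput is the smaller of the processor's peak rate and memory bandwidth multiplied by operations per byte. The peak rate, bandwidth, and intensity values below are hypothetical placeholders, not measurements of any real chip.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bandwidth * intensity)


def idle_fraction(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Share of peak compute left unused while waiting on memory."""
    return 1.0 - attainable_flops(peak_flops, mem_bandwidth, intensity) / peak_flops


# Hypothetical chip: 100 TFLOP/s peak, 1 TB/s memory bandwidth,
# running a workload that does about 1 operation per byte moved.
print(idle_fraction(100e12, 1e12, 1.0))
```

With these assumed figures the compute units could sit unused about 99% of the time, and raising peak compute alone would not change the attainable throughput at all.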
[...] Aside from eliminating the von Neumann bottleneck, one solution includes closing that distance. "The entire industry is working to try to improve data localization," Tsai said. IBM Research scientists recently announced such an approach: a polymer optical waveguide for co-packaged optics. This module brings the speed and bandwidth density of fiber optics to the edge of chips, supercharging their connectivity and hugely reducing model training time and energy costs.
With currently available hardware, though, the result of all these data transfers is that training an LLM can easily take months, consuming more energy than a typical US home does in that time. And AI doesn't stop needing energy after model training. Inferencing has similar computational requirements, meaning that the von Neumann bottleneck slows it down in a similar fashion.
[...] While von Neumann architecture creates a bottleneck for AI computing, for other applications, it's perfectly suited. Sure, it causes issues in model training and inference, but von Neumann architecture is perfect for processing computer graphics or other compute-heavy processes. And when 32- or 64-bit floating point precision is called for, the low precision of in-memory computing isn't up to the task.
"For general purpose computing, there's really nothing more powerful than the von Neumann architecture," said Burr. Under these circumstances, bytes are either operations or operands that are moving on a bus from a memory to a processor. "Just like an all-purpose deli where somebody might order some salami or pepperoni or this or that, but you're able to switch between them because you have the right ingredients on hand, and you can easily make six sandwiches in a row." Special-purpose computing, on the other hand, may involve 5,000 tuna sandwiches for one order — like AI computing as it shuttles static model weights.
(Score: 4, Funny) by krishnoid on Thursday October 02, @12:07AM (8 children)
So a big catering order? I always suspected AI was mostly for rich people.
(Score: 3, Interesting) by c0lo on Thursday October 02, @12:25AM (6 children)
To put the scale in evidence, we're speaking about 2-3nm size tuna sandwiches. Joe Average - the AI-center building worker - will need far more than that to just stay alive.
https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
(Score: 4, Interesting) by istartedi on Thursday October 02, @02:05AM (1 child)
Grok [x.com] says that you will need to supply 2.8×10^22 to 9.4×10^22 nano-sandwiches to equal one ordinary tuna sandwich, which it believes to be 15 by 10 by 5 cm. It says nothing about the proportions of tuna, bread and other ingredients but I assume that it assumed they are equally proportioned. It ignored the fact that at nano-scale, tuna is no longer tuna since cells are larger than nano-scale.
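As a rough check on that arithmetic, the sketch below divides the volume of a 15 by 10 by 5 cm sandwich by the volume of a nano-sandwich. It assumes, and this is only an assumption, that a "2-3nm size" sandwich is a cube with a 2 nm or 3 nm edge; the function name is hypothetical.

```python
def nano_sandwiches_per_sandwich(edge_nm: float) -> float:
    """How many cubic nano-sandwiches of the given edge fill a 15 x 10 x 5 cm sandwich."""
    # 15 cm x 10 cm x 5 cm, converted to metres.
    sandwich_m3 = 0.15 * 0.10 * 0.05
    # Assumed shape: a cube whose edge is edge_nm nanometres.
    nano_m3 = (edge_nm * 1e-9) ** 3
    return sandwich_m3 / nano_m3


for edge in (3, 2):
    print(f"{edge} nm: {nano_sandwiches_per_sandwich(edge):.2e}")
```

Under the cube assumption, the 3 nm and 2 nm cases reproduce the quoted range of 2.8×10^22 to 9.4×10^22.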
When queried on this matter it replies:
(Score: 3, Funny) by jb on Thursday October 02, @07:03AM
That might actually improve the flavour!
(Score: 3, Insightful) by anubi on Thursday October 02, @03:50AM (3 children)
The ultra-wealthy have to get their funds from somewhere, and they can't get it the way the low-income hoi-polloi get it, as the artificial monopolies of someone actually working to earn a buck doesn't have near the rate of return. They need something they can maintain a monopoly on.
So what they do is dream up stuff like this, and get government grants. You only have to sell a few legislators ( who don't finance this personally! ) on their concept and they will award contracts out of the public purse, which can be increased by simply "extending the debt ceiling".
This will make a few people very wealthy, provide jobs ( whose proceeds are taxed back sans "shelters" ) to "liquidate the economy" so landlords, merchants, and farmers can compete for whatever is left to stay in business.
No one pays the bill in money. It's paid for by the time and labor of the working stiff, who exchanges the hours of his life for currency which is just printed by governments and awarded to employers to keep a payroll running.
It keeps us all busy.
Money. Governments create it and give it to certain parties via grants, those parties employ workers, then the government eventually takes it all back via tax after tax after tax, so they can guide you on how you spend the money you earned. It's not really money anyway. It's currency. It's something they can give you to motivate you to do something for one of their grant recipients that employed you. Every time it changes hands, they take a portion back to fund both themselves and local governments. They have to keep enough cash in circulation to keep us on the rat-wheel so we don't organize and upset the apple cart trying to obtain something to trade. Silver and gold are in limited supply, so they pay us in bank notes, which can be simply printed by governments, with severe consequences for violation of their monopoly ( counterfeiting ). We are now entering a time where bank notes are being replaced by ledger entries and readily accessible status on your ability to pay ( credit score ).
This whole situation is made possible by something known as economies of scale, and that is where the super wealthy operate...in things of planetary scale. It's why I can buy a cell phone for less than $100. And LED light bulbs for a buck retail. The ultra wealthy engineer the manufacturing of these things, much of which is often thrown away, without ever being used, as these mass produced things become so plentiful, with machines just rolling goods out.
And, for those who have both figured this out and have the connections into Congress, immense wealth beyond your wildest imagination.
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
(Score: 3, Informative) by c0lo on Thursday October 02, @10:30PM (2 children)
How money fails to work: https://www.youtube.com/watch?v=CzlCQ0ClbD4 [youtube.com]
https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
(Score: 1) by anubi on Friday October 03, @01:16AM (1 child)
c0lo, I would have given you all my mod points for that gem!
Thanks!
"Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
(Score: 3, Insightful) by c0lo on Friday October 03, @04:59AM
NP, mate, I thought you were gonna like it.
In any case, USD is gonna release its global reserve currency status, it's no longer tenable. May take 1-2 years, see what you can do in between.
https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
(Score: 0) by Anonymous Coward on Thursday October 02, @06:57AM
It gives a whole new meaning to the phrase :
" To err is human...
To really foul things up requires a computer!"
I can't wait for McDonalds to put one of these at the drive-thru. I could probably earn a fortune making YouTube videos. Even though at times it's hilarious watching old folks try to use the store kiosk. One got so pissed off he muttered something about a local diner across the street and left part of his order in the machine. I thought the expression on his face was priceless. I think he got his wires crossed between McDonalds and Burger King. You don't get it your way at McDonalds.
Oh, it happens. I remember trying to order a Jumbo Jack at Carl's Jr.
(Score: 5, Insightful) by hopdevil on Thursday October 02, @12:28AM
AI training has for years been using custom processors with huge data caches as near to the custom logic as possible, designed with addressing this exact issue. If you're training on a general purpose CPU, you are doing it wrong
(Score: 5, Insightful) by SomeGuy on Thursday October 02, @12:56AM
If you want to save even more power, just turn off your computer or smell phone and stick your head in an unflushed toilet. That provides the same result as AI.
(Score: 1, Offtopic) by Mojibake Tengu on Thursday October 02, @02:49AM (1 child)
This stupid speculation again.
I insist the von Neumann architecture is the best computation model ever invented for artificial machines.
It's completely Natural.
Just look at the Universe itself: do you see some funny separation of code and data in the Natural Universe? Physics? Chemistry? Biology? Quantum Mechanics?
No. Everything executes happily as a mixed model. Code that manipulates itself as data.
I strongly suggest those foolish haters of von Neumann architecture should cease to use their computers immediately. We'll see how that goes...
"DATA IS CODE AND CODE IS DATA" is a fundamental dogma in Cybernetics Theory. No wonder those under-educated "IT specialists" cannot comprehend.
Rust programming language offends both my Intelligence and my Spirit.
(Score: 2, Disagree) by theluggage on Thursday October 02, @11:45AM
Not even wrong - this isn't about "code" vs. "data" or the possibility of code manipulating itself. It's about the part of the computer that executes algorithms (control/ALU) being separated from the part that stores the data (whether it's code, input or results) and connected by a bus which creates a bottleneck. That split - and the resulting bottleneck - is also explicitly part of the Von Neumann architecture [wikipedia.org].
The situation in reality is that the - let's call it "the data describing the algorithm" - is often a fraction of the size of the "data the algorithm needs for input & output". The bit of the "algorithm" that is currently running will typically fit comfortably in the CPU's fastest internal cache whereas the input/output frequently won't - especially problematic if it is accessed non-sequentially. The CPU's caches may well be split into instructions and data [wikipedia.org] and multiple/special-purpose caches that aren't really classic Von Neumann are pretty ubiquitous.
As for "natural" - please find one example of something that has been found to use anything like a Von Neumann architecture, anything like computer memory or even been shown to be Turing equivalent. Identifying parts of the brain that appear to be associated with memories doesn't make them equivalent to computer RAM.
Neural networks obviously don't have separate code and data - but they don't have identifiable algorithms or storage either - the "control/ALU" functionality and what appears to be "information recall" is all part and parcel of the connections and their weight - which is definitely not "von Neumann". They don't really "compute" in the Turing/Von Neumann sense of "computable numbers" - which is handy because they can 'guess' probable solutions to non-computable problems. Last I looked, artificial neural networks still rely on "traditional" computing to train them on large datasets - which is the problem here.
(Score: 2, Insightful) by Anonymous Coward on Thursday October 02, @03:29AM
What's old is new.
(Score: 5, Insightful) by Rosco P. Coltrane on Thursday October 02, @03:42AM
when I read the title - as in, the crumbling decades-old, under-maintained and under-funded power infrastructure in this country. Because if you want a bottleneck, here's one.
(Score: 4, Insightful) by Unixnut on Thursday October 02, @09:27AM (2 children)
Um... no, the von Neumann architecture [wikipedia.org] does not separate code and data (i.e. "compute and memory"), you're thinking of the Harvard Architecture [wikipedia.org], which does split data and code into separate subsystems, which never took off in general computation.
The bottleneck between "compute" and "memory" is nothing to do with the von Neumann architecture, but a side effect of CPUs getting much faster than RAM chips. Computers originally had a single system bus [wikipedia.org], where everything was intermingled, but with time the increase of CPU performance and clock speed resulted in the creation of separate buses, clocked at different speeds (so that you did not have to limit CPU performance growth by restricting clock speed to the slowest thing on the single bus). For the CPU/Memory bus it became known as the front side bus [wikipedia.org], and with time this was replaced with the serial packet based systems like HyperTransport [wikipedia.org] which is used nowadays in some form or another.
To remedy this needs work to be put into developing faster memory and interconnects, or completely new types of memory that combine the speed of RAM with the persistence of disk (like ReRAM [wikipedia.org]), allowing for your entire storage to be mapped to the CPU, with no need to shuttle data back/forth between the slow storage and the processing unit (at least as long as you don't need to fetch via the network, but then you are dealing with a distributed system which is outside of scope).
Therefore the first line of the article (and by extension its entire premise) is based on an incorrect assumption, and then the analogy with tuna sandwiches just confused me. A really poor effort IMO, and a first for me to consider an article as such either on SN or the green site of old. That this came from "IBM Research" just makes me think how much IBM must have fallen over the years if this is the output quality of their researchers.
(Score: 2) by theluggage on Thursday October 02, @12:20PM
"Compute and memory" is not the same thing as "code and data". Both the VN architecture and the Harvard architecture describe separate units for control/ALU (compute) and memory. They would suffer the same bottleneck problem when processing huge, random-access data sets and the bandwidth needed for getting "code" to the compute unit is usually negligible compared to the bandwidth for other data.
Most modern processors are only pure VN "on the outside" anyway - look inside and the caches will likely separate code and data anyway. Modern "systems on a chip" (like Apple Silicon and NVIDIA Grace/Hopper) are now using on-package, multi-channel RAM shared by integrated GPUs and neural engines anyhow (which is partly what TFA is about).
(Score: -1, Troll) by Anonymous Coward on Thursday October 02, @01:33PM
Wow, for the love of all that is good, take a GD comp. sci. course before you run your mouth.