The Trillion-Transistor Chip That Just Left a Supercomputer in the Dust:
So, in a recent trial, researchers pitted the chip—which is housed in an all-in-one system called the CS-1, about the size of a dorm-room mini-fridge—against a supercomputer in a fluid dynamics simulation. Simulating the movement of fluids is a common supercomputer application, useful for solving complex problems like weather forecasting and airplane wing design.
The trial was described in a preprint paper written by a team led by Cerebras's Michael James and NETL's Dirk Van Essendelft and presented at the supercomputing conference SC20 this week. The team said the CS-1 completed a simulation of combustion in a power plant roughly 200 times faster than the Joule 2.0 supercomputer completed a similar task.
The CS-1 was actually faster-than-real-time. As Cerebrus wrote in a blog post, "It can tell you what is going to happen in the future faster than the laws of physics produce the same result."
The researchers said the CS-1's performance couldn't be matched by any number of CPUs and GPUs. And CEO and cofounder Andrew Feldman told VentureBeat that would be true "no matter how large the supercomputer is." Past a certain point, scaling a supercomputer like Joule no longer produces better results on this kind of problem. That's why Joule's simulation speed peaked at 16,384 cores, a fraction of its total 86,400 cores.
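The scaling wall described above can be illustrated with a toy strong-scaling model (all constants below are made-up illustrative values, not measurements from Joule or the CS-1): per-core compute time shrinks as cores are added, but coordination cost grows with core count, so simulation speed peaks at a finite number of cores.

```python
# Toy strong-scaling model. The work and communication constants are
# hypothetical illustrative values, not measurements from any real machine.
def sim_time(n_cores, work=1.0e6, comm_per_core=1.0e-3):
    compute = work / n_cores        # parallel work: shrinks with more cores
    comm = comm_per_core * n_cores  # coordination cost: grows with more cores
    return compute + comm           # total wall-clock time per step

# Adding cores helps only until communication dominates.
times = {n: sim_time(n) for n in (1024, 4096, 16384, 65536)}
fastest = min(times, key=times.get)  # 16384 wins in this toy model
```

With these arbitrary constants the optimum lands near sqrt(work / comm_per_core) ≈ 31,600 cores, so among the sampled counts 16,384 is fastest and 65,536 is already slower — the same qualitative behavior reported for Joule.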
Previously:
Cerebras More than Doubles Core and Transistor Count with 2nd-Generation Wafer Scale Engine
Cerebras Systems' Wafer Scale Engine Deployed at Argonne National Labs
Cerebras "Wafer Scale Engine" Has 1.2 Trillion Transistors, 400,000 Cores
Related Stories
The five technical challenges Cerebras overcame in building the first trillion transistor chip
Superlatives abound at Cerebras, the until-today stealthy next-generation silicon chip company looking to make training a deep learning model as quick as buying toothpaste from Amazon. Launching after almost three years of quiet development, Cerebras introduced its new chip today — and it is a doozy. The "Wafer Scale Engine" is 1.2 trillion transistors (the most ever), 46,225 square millimeters (the largest ever), and includes 18 gigabytes of on-chip memory (the most of any chip on the market today) and 400,000 processing cores (guess the superlative).
It's made a big splash here at Stanford University at the Hot Chips conference, one of the silicon industry's big confabs for product introductions and roadmaps, with various levels of oohs and aahs among attendees. You can read more about the chip from Tiernan Ray at Fortune and read the white paper from Cerebras itself.
Also at BBC, VentureBeat, and PCWorld.
Cerebras Unveils First Installation of Its AI Supercomputer at Argonne National Labs
At Supercomputing 2019 in Denver, Colo., Cerebras Systems unveiled the computer powered by the world's biggest chip. Cerebras says the computer, the CS-1, has the equivalent machine learning capabilities of hundreds of racks worth of GPU-based computers consuming hundreds of kilowatts, but it takes up only one-third of a standard rack and consumes about 17 kW. Argonne National Labs, future home of what's expected to be the United States' first exascale supercomputer, says it has already deployed a CS-1. Argonne is one of two announced U.S. National Laboratories customers for Cerebras, the other being Lawrence Livermore National Laboratory.
The system "is the fastest AI computer," says CEO and cofounder Andrew Feldman. He compared it with Google's TPU clusters (the 2nd of three generations of that company's AI computers), noting that one of those "takes 10 racks and over 100 kilowatts to deliver a third of the performance of a single [CS-1] box."
The CS-1 is designed to speed the training of novel and large neural networks, a process that can take weeks or longer. Powered by a 400,000-core, 1-trillion-transistor wafer-scale processor chip, the CS-1 should collapse that task to minutes or even seconds. However, Cerebras did not provide data showing this performance in terms of standard AI benchmarks such as the new MLPerf standards. Instead it has been wooing potential customers by having them train their own neural network models on machines at Cerebras.
[...] The CS-1's first application is in predicting cancer drug response as part of a U.S. Department of Energy and National Cancer Institute collaboration. It is also being used to help understand the behavior of colliding black holes and the gravitational waves they produce. A previous instance of that problem required 1024 out of 4392 nodes of the Theta supercomputer.
Also at TechCrunch, VentureBeat, and Wccftech.
Previously: Cerebras "Wafer Scale Engine" Has 1.2 Trillion Transistors, 400,000 Cores
342 Transistors for Every Person In the World: Cerebras 2nd Gen Wafer Scale Engine Teased
One of the highlights of Hot Chips from 2019 was the startup Cerebras showcasing its product – a large 'wafer-scale' AI chip that was literally the size of a wafer. The chip itself was rectangular, but it was cut from a single wafer, and contained 400,000 cores, 1.2 trillion transistors, 46,225 mm² of silicon, and was built on TSMC's 16 nm process.
[...] Obviously when doing wafer scale, you can't just add more die area, so the only way is to optimize die area per core and take advantage of smaller process nodes. That means for TSMC 7nm, there are now 850,000 cores and 2.6 trillion transistors. Cerebras has had to develop new technologies to deal with multi-reticle designs, but they succeeded with the first gen, and transferred the learnings to the new chip. We're expecting more details about this new product later this year.
Previously: Cerebras "Wafer Scale Engine" Has 1.2 Trillion Transistors, 400,000 Cores
Cerebras Systems' Wafer Scale Engine Deployed at Argonne National Labs
Hungry for AI? New supercomputer contains 16 dinner-plate-size chips
On Monday, Cerebras Systems unveiled its 13.5 million core Andromeda AI supercomputer for deep learning, reports Reuters. According to Cerebras, Andromeda delivers over 1 exaflop (1 quintillion operations per second) of AI computational power at 16-bit half precision.
Andromeda is itself a cluster of 16 Cerebras CS-2 computers linked together. Each CS-2 contains one Wafer Scale Engine chip (often called "WSE-2"), which is currently the largest silicon chip ever made, at about 8.5 inches square and packed with 2.6 trillion transistors organized into 850,000 cores.
Cerebras built Andromeda at a data center in Santa Clara, California, for $35 million. It's tuned for applications like large language models and has already been in use for academic and commercial work. "Andromeda delivers near-perfect scaling via simple data parallelism across GPT-class large language models, including GPT-3, GPT-J and GPT-NeoX," writes Cerebras in a press release.
Previously: Cerebras "Wafer Scale Engine" Has 1.2 Trillion Transistors, 400,000 Cores
Cerebras Systems' Wafer Scale Engine Deployed at Argonne National Labs
Cerebras More than Doubles Core and Transistor Count with 2nd-Generation Wafer Scale Engine
The Trillion-Transistor Chip That Just Left a Supercomputer in the Dust
(Score: 4, Insightful) by Hartree on Monday November 23 2020, @06:19PM (13 children)
So, you're saying that a purpose built machine can beat out a general computer on a given problem.
This is hardly news. Google "Gravity Pipe" for a far older example.
(Score: 5, Insightful) by HiThere on Monday November 23 2020, @06:32PM (10 children)
IIUC, that's not what's happening. It's that really large scale integration allows faster intercommunication, and the problem has a limit as to how "embarrassingly parallel" it is. Neither is all that surprising, but manufacturing defects have limited the scale of integration. They still do, but the limit is higher.
OTOH, there's nothing that says this kind of chip will be profitable to produce.
Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
(Score: 4, Insightful) by EvilSS on Monday November 23 2020, @07:20PM (6 children)
But, of course, we need to see more independent verification of their claims, and, as you suggest, yield is the big "if" here. If they are tossing hundreds of wafers to get one working one, it would be a problem.
(Score: 3, Informative) by takyon on Monday November 23 2020, @08:13PM (2 children)
IIRC, it's built to be tolerant of defective cores. Maybe there's a controller or some other small part that must be in perfect shape for it to work, but it could mean that almost every wafer is usable, the complete opposite of tossing out hundreds to get one good one.
Another thing is that TSMC's "7nm" yield is very good in the first place. And it costs about $9,346 [techpowerup.com].
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 2) by HiThere on Monday November 23 2020, @10:43PM
Sounds promising, but that ordinary yield is based on assuming that only a small area of the surface needs to be free of defects. If they need too much error correction (or longer inter-processor routing) that, in and of itself, could slow things down a lot. There may well be only a few "grade A" chips, and a much larger number of grades B and C, which are slower, or have fewer working processors.
(Score: 2) by TheRaven on Tuesday November 24 2020, @11:20AM
Most modern CPUs are designed to be tolerant of defects to a degree. It's pretty easy if the defect is in the cache: you just disable part of the cache and sell the chip as a cheaper variant. Intel started doing this aggressively around the 486: if a chip had a defect in the FPU, it was sold as a 486SX; if it had a defect in the CPU, it was sold as a 487; if both passed tests, it was a 486DX. Around the Pentium 3 era, yields got high enough that they (and AMD) ended up selling higher-rated parts with lower model numbers, because that made more money than lowering the price of the high-end parts.
This kind of thing is *much* easier with a regular layout. If you design your network on chip correctly, you can just route around any units that didn't work. IBM and Sony did this with the Cell: most of the chips made had a defect in one of the SPUs, so these were put in PlayStations with 7 SPUs. The ones with no defects were put in IBM server parts with 8 SPUs. The ones with a defect in the CPU were put on accelerator boards. If your 'chip' is a wafer full of cores in a regular layout with a NOC routing between them, you can power the whole thing up, test each core, and then configure your NOC switches to route around areas that don't work (including entire parts of the network if there's a fault in part of the network itself). The main difficulty is that each system you produce will have a subtly different topology, which will affect inter-core latency and may impact overall performance. Oh, and powering / cooling a chip that big is also nontrivial...
sudo mod me up
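The route-around-defects idea in the comment above can be sketched with a toy mesh model. Everything here is invented for illustration — grid size, defect pattern, and BFS routing; Cerebras's actual NOC and redundancy scheme are not public at this level of detail.

```python
from collections import deque

def route(mesh, start, goal):
    """BFS shortest path on a 2D core mesh, stepping only on working cores."""
    rows, cols = len(mesh), len(mesh[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and mesh[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None  # unreachable: defects have severed the mesh

# 3x4 mesh; False marks a core that failed its power-on test.
mesh = [
    [True, False, True, True],
    [True, False, True, True],
    [True, True,  True, True],
]
path = route(mesh, (0, 0), (0, 3))  # detours through row 2 around the bad column
```

As the comment notes, the detour works but changes the topology: the path found here is 7 hops instead of the 3 a defect-free mesh would need, which is exactly the kind of per-device latency variation that can affect overall performance.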
(Score: 5, Interesting) by driverless on Monday November 23 2020, @10:34PM (2 children)
I was at the conference where this was introduced. The consensus among the attendees, all of whom were experts in the field, was that it was yet another attempt at WSI, was an impressive proof-of-concept, and like every other time this has been tried would sink without a trace after a year or two. No-one could see where this was going or who would buy it apart from one or two national labs to play with it for a while.
(Score: 0) by Anonymous Coward on Tuesday November 24 2020, @04:38PM (1 child)
Sorry... what is a "WSI," and why is this apparently not-a-big-deal? (I'm not as familiar with this field.)
(Score: 2) by takyon on Tuesday November 24 2020, @11:11PM
https://en.wikipedia.org/wiki/Wafer-scale_integration [wikipedia.org]
(Score: 3, Interesting) by sjames on Monday November 23 2020, @08:05PM (2 children)
Exactly. Communications latency is the big killer of performance in a supercomputer. In anything but the most embarrassingly parallel computation, the communications latency will set an upper limit on the number of cores that can be usefully used in the computation.
The CS-1's approach is obvious from a theoretical standpoint; it should surprise nobody that it is faster and more efficient. The problem has always been practicality.
As for cost, this approach will be specialty for quite a while and likely expensive due to terrible yields. It's simply hard to make a chip that large with no defects.
If this makes it to production, it's going to require an approach like the Celeron or AMD's 3-core processors but at a larger scale. That is, each chip is likely to be a little different, with disabled modules. That will add complexity. It's relatively easy to have 2 banks of cache and allow one to be disabled; it's another matter to have 400,000 cores where some arbitrary number of them may be disabled. Programs intended for the platform will likely need to be configured for the particular chip they'll be run on. In spite of that, for some jobs it may be worth it.
(Score: 5, Informative) by takyon on Monday November 23 2020, @08:16PM (1 child)
https://www.anandtech.com/show/15838/cerebras-wafer-scale-engine-scores-a-sale-5m-buys-two-for-the-pittsburgh-supercomputing-center [anandtech.com]
It looks like the yield is nearly 100%. The defective cores on each chip are simply disabled.
(Score: 2) by sjames on Monday November 23 2020, @10:28PM
Good info, thanks!
Looks like they did go with something like the Celeron Strategy here.
As pricy as it is, it may actually be cheaper than the alternative.
(Score: 3, Interesting) by BsAtHome on Monday November 23 2020, @06:35PM (1 child)
Only partially purpose-built. It's better described as a generic system optimized for a subset of problems. It will do very well with many simulations, perhaps primarily fluid-dynamics problems, but that's not much of a limitation across the sciences. You can also build a nice ray-tracer ;-)
(Score: 2) by Hartree on Tuesday November 24 2020, @05:15AM
Yes, I should have read it more deeply before answering. The big plus is the higher rate of communication between cores. That should help with more tightly coupled physical systems that don't break up as well into largely independent elements. I'd be interested to see how well it works on something viciously coupled and nonlinear, like general relativity simulations.
On the other hand, the proof is in the profits. Gene Amdahl and Trilogy Systems crashed very hard in the early 80s when they tried wafer scale integration.
(Score: 2) by krishnoid on Monday November 23 2020, @07:49PM (2 children)
Cerebras - superchip
Cerberus - dog that guards hades
Cerebus - aardvark
Cerebrus -- blog poster? I think it's a typo.
(Score: 3, Informative) by PiMuNu on Monday November 23 2020, @08:20PM
Cerebrum - the largest part of the brain containing the cerebral cortex
So yes, Cerebrus is a typo, in that cerebrum is second declension neuter whereas cerebrus would be second declension masculine were it to exist.
(Score: 0) by Anonymous Coward on Tuesday November 24 2020, @02:41AM
Isn't Cerebras what holds up Cereboobs?
(Score: 1) by soylentnewsfan1 on Monday November 23 2020, @07:56PM (2 children)
Could this "faster-than-real-time" property be used in fusion reactors, to control and contain the reaction and its fields better than current methods can?
(Score: 2) by captain normal on Monday November 23 2020, @08:26PM
Or maybe use it for something profitable, like...I don't know...gaming the stock market? Or maybe something really useful like cleaning up the thin layer of atmosphere on this rock ball.
"It is easier to fool someone than it is to convince them that they have been fooled" Mark Twain
(Score: 2) by TheRaven on Tuesday November 24 2020, @11:22AM
(Score: 0) by Anonymous Coward on Monday November 23 2020, @09:57PM (3 children)
They will stack two of these with solder dots (or some other connector technology). Chiller capacity & power supply will have to roughly double, but otherwise it's a plug for plug replacement for the single wafer version.
(Score: 2) by takyon on Monday November 23 2020, @10:14PM (2 children)
TSMC Will Manufacture 3D Stacked WoW Chips In 2021 Claims Executive [wccftech.com]
Whole wafer on whole wafer (WWoWW)?
(Score: 2) by hendrikboom on Monday November 23 2020, @11:32PM (1 child)
Hmmm. How *would* one design a Word of Warfare chip? What would it do?
(Score: 2) by takyon on Monday November 23 2020, @11:43PM
In a world: kill.
WoW (wafer on wafer) could be a crude way to double performance by stacking two wafers. Heat issues could limit the applications. We should know more within a year.
(Score: 0) by Anonymous Coward on Tuesday November 24 2020, @03:26AM
vote fraud generation and detection. sell both sides! ka-ching!