The Trillion-Transistor Chip That Just Left a Supercomputer in the Dust
So, in a recent trial, researchers pitted the chip—which is housed in an all-in-one system about the size of a dorm room mini-fridge called the CS-1—against a supercomputer in a fluid dynamics simulation. Simulating the movement of fluids is a common supercomputer application useful for solving complex problems like weather forecasting and airplane wing design.
The trial was described in a preprint paper written by a team led by Cerebras's Michael James and NETL's Dirk Van Essendelft and presented at the supercomputing conference SC20 this week. The team said the CS-1 completed a simulation of combustion in a power plant roughly 200 times faster than it took the Joule 2.0 supercomputer to do a similar task.
The CS-1 was actually faster than real time. As Cerebras wrote in a blog post, "It can tell you what is going to happen in the future faster than the laws of physics produce the same result."
The researchers said the CS-1's performance couldn't be matched by any number of CPUs and GPUs. And CEO and cofounder Andrew Feldman told VentureBeat that would be true "no matter how large the supercomputer is." Past a certain point, scaling a supercomputer like Joule no longer improves results on this kind of problem. That's why Joule's simulation speed peaked at 16,384 cores, a fraction of its total 86,400 cores.
Previously:
Cerebras More than Doubles Core and Transistor Count with 2nd-Generation Wafer Scale Engine
Cerebras Systems' Wafer Scale Engine Deployed at Argonne National Labs
Cerebras "Wafer Scale Engine" Has 1.2 Trillion Transistors, 400,000 Cores
(Score: 3, Interesting) by sjames on Monday November 23 2020, @08:05PM (2 children)
Exactly. Communications latency is the big killer of performance in a supercomputer. In anything but the most embarrassingly parallel computation, the communications latency will set an upper limit on the number of cores that can be usefully used in the computation.
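That ceiling can be sketched with a toy strong-scaling model (Python; the constants are illustrative assumptions, not the Joule or CS-1 measurements): per-step time is compute, which shrinks as cores are added, plus communication, which grows with core count, so total time bottoms out and then rises.

```python
import math

# Toy model, illustrative numbers only: compute shrinks with cores,
# a latency-bound exchange (assumed ~log2 of core count) grows with them.
def step_time(cores, work=1e6, t_flop=1e-9, t_comm=1e-6):
    compute = (work / cores) * t_flop      # perfectly divisible work
    comm = t_comm * math.log2(cores)       # assumed communication cost
    return compute + comm

# The fastest core count is finite: beyond it, more cores run slower.
best = min((2**p for p in range(1, 21)), key=step_time)
```

With these (made-up) constants the minimum lands at 512 cores; with real machine constants the same shape is why Joule peaked at 16,384 of its 86,400 cores.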
CS-1's approach is obvious from a theoretical standpoint, and it should surprise nobody that, in theory, it is faster and more efficient. The problem has always been practicality.
As for cost, this approach will be specialty for quite a while and likely expensive due to terrible yields. It's simply hard to make a chip that large with no defects.
If this makes it to production, it's going to require an approach like Intel's Celeron or AMD's triple-core processors, but at a larger scale. That is, each chip is likely to be a little different, with different modules disabled, and that will add complexity. It's relatively easy to have two banks of cache and allow one to be disabled; it's quite another thing to have 400,000 cores where some arbitrary number of them may be disabled. Programs intended for the platform will likely need to be configured for the particular chip they'll run on. In spite of that, for some jobs it may be worth it.
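That per-chip configuration step might look something like the following sketch (the function and layout are hypothetical illustrations, not Cerebras's actual tooling): the runtime reads the chip's list of fused-off cores and maps the program's logical cores onto the surviving physical ones.

```python
# Hypothetical sketch: remap logical cores around a chip's disabled cores.
def build_core_map(total_cores, disabled):
    """Return a logical->physical mapping that skips disabled physical cores."""
    disabled = set(disabled)
    physical = [c for c in range(total_cores) if c not in disabled]
    return {logical: phys for logical, phys in enumerate(physical)}

# Example: a 16-core chip with physical cores 3 and 7 fused off.
core_map = build_core_map(16, disabled={3, 7})
```

The program sees a clean, contiguous set of logical cores; only this mapping layer differs from chip to chip.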
(Score: 5, Informative) by takyon on Monday November 23 2020, @08:16PM (1 child)
https://www.anandtech.com/show/15838/cerebras-wafer-scale-engine-scores-a-sale-5m-buys-two-for-the-pittsburgh-supercomputing-center [anandtech.com]
It looks like the yield is nearly 100%: the percentage of cores that come out defective is simply disabled on each chip.
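A toy Poisson defect model (illustrative numbers, not Cerebras's actual defect density or spare count) shows why disabling a few bad cores rescues yield where an all-or-nothing wafer-scale chip would be hopeless:

```python
import math

# Assumed, illustrative parameters.
cores = 400_000
defect_rate = 1e-5        # assumed chance any given core is defective
spare_fraction = 0.01     # assumed share of cores that can be sacrificed

expected_bad = cores * defect_rate       # expected defective cores per chip
perfect_yield = math.exp(-expected_bad)  # Poisson P(zero defects): all-or-nothing
spares = int(cores * spare_fraction)     # budget of disable-able cores
```

Under these assumptions a chip that must be defect-free yields under 2%, while a chip that merely needs its handful of expected bad cores to fit within thousands of spares essentially always ships.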
(Score: 2) by sjames on Monday November 23 2020, @10:28PM
Good info, thanks!
Looks like they did go with something like the Celeron Strategy here.
As pricy as it is, it may actually be cheaper than the alternative.