AMD released Threadripper CPUs in 2017, built on the same 14nm Zen architecture as Ryzen, but with up to 16 cores and 32 threads. Threadripper was widely believed to have pushed Intel to respond with the release of enthusiast-class Skylake-X chips with up to 18 cores. AMD also released Epyc-branded server chips with up to 32 cores.
This week at Computex 2018, Intel showed off a 28-core CPU intended for enthusiasts and high-end desktop users. While the part was overclocked to 5 GHz, it required a one-horsepower water chiller to get there. The demonstration seemed timed to steal the thunder from AMD's own news.
Now, AMD has announced two Threadripper 2 CPUs: one with 24 cores, and another with 32 cores. They use GlobalFoundries' "12nm LP" process instead of "14nm", which could improve performance, but they are currently clocked lower than previous Threadripper parts. The TDP has been pushed up to 250 W from the 180 W of the Threadripper 1950X. Although these new chips match the core counts of top Epyc CPUs, there are some differences:
At the AMD press event at Computex, it was revealed that these new processors would have up to 32 cores in total, mirroring the 32-core versions of EPYC. Those EPYC processors have four active dies, with eight active cores on each die (four per CCX). EPYC, however, has eight memory channels, while AMD's X399 platform only supports four. For first-generation Threadripper, this meant that each of the two active dies had two memory channels attached. In second-generation Threadripper this is still the case: the two newly active dies have no memory channels of their own, so they have no direct memory access and must reach memory through the dies that do.
This also means that the number of PCIe lanes remains at 64 for Threadripper 2, rather than the 128 of Epyc.
Threadripper 1 had a "game mode" that disabled one of the two active dies, so it will be interesting to see if users of the new chips will be forced to disable even more cores in some scenarios.
It's taken longer than I had hoped, but there are now plenty of languages with the right features, and various frameworks that make it much easier to take advantage of any number of cores to handle "embarrassingly parallel" problems.
And there are plenty of embarrassingly parallel problems. Some problems can be transformed into parallel problems. Just look for long iterations over items where the processing of each item is independent from other items.
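The kind of loop to look for can be sketched in a few lines of Java. This is a minimal illustration, not code from the original project; the per-item work (`expensiveTransform`) is a hypothetical stand-in for whatever independent processing each item needs.

```java
import java.util.stream.IntStream;

public class ParallelLoop {
    // Hypothetical per-item work; each item is independent of all the others.
    static double expensiveTransform(int i) {
        return Math.sqrt(i) * Math.sin(i);
    }

    public static void main(String[] args) {
        // A long iteration over independent items parallelizes with a one-word change:
        double[] results = IntStream.range(0, 1_000_000)
                .parallel()   // remove this call and the loop runs sequentially
                .mapToDouble(ParallelLoop::expensiveTransform)
                .toArray();
        System.out.println(results.length); // 1000000
    }
}
```

Because no item reads or writes another item's state, the runtime is free to split the range across however many cores are available.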
You can also re-think algorithms.
I was plotting millions, then tens of millions, of data points. It was slow. I was doing the obvious but naive thing: drawing a dot for each data point. That meant a long loop, invoking a graphics subsystem operation for every point, even though the plot was being drawn off screen.
Then I observed a phenomenon. It's like having a square-tiled wall (the pixels) and throwing paint-filled balloons at the wall (each plot point). After many plot points, older plot points are obscured by newer ones.
So let's re-think. Imagine each plot point is now a dart with an infinitely small tip. The square tiles on the wall are now "pixel buckets". Each pixel bucket keeps a counter and an accumulated sum, and therefore an average: the average of the original data values (not the colors) of the darts that hit the wall in that tile. Now we're throwing darts (data points) at the wall instead of paint-filled balloons.
At the end, compute the color (along a gradient) for the accumulated average in each pixel. Now the graphical operations amount to setting one pixel per pixel bucket. The number of graphical operations is tied to the number of pixels, and is unrelated to the number of input data points. The entire result:
1. is faster
2. draws a much more finely detailed view (and don't say it is because the original dot size plotted was too big)
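The pixel-bucket idea above can be sketched as a small Java class. This is an illustrative reconstruction, not the original code; the names (`PixelBuckets`, `toColor`) and the gray-ramp gradient are assumptions.

```java
// Sketch of the "pixel bucket" scheme: accumulate a running sum and count per
// pixel, then perform one graphics operation per pixel at the end.
public class PixelBuckets {
    final int width, height;
    final double[] sum;   // accumulated data values per pixel bucket
    final long[] count;   // how many data points landed in each bucket

    PixelBuckets(int width, int height) {
        this.width = width;
        this.height = height;
        this.sum = new double[width * height];
        this.count = new long[width * height];
    }

    /** Throw one "dart": fold a data point into the bucket it lands in. */
    void add(int px, int py, double value) {
        int i = py * width + px;
        sum[i] += value;
        count[i]++;
    }

    /** One pixel set per bucket, no matter how many points were added. */
    int[] render() {
        int[] argb = new int[width * height];
        for (int i = 0; i < argb.length; i++) {
            if (count[i] == 0) continue;       // empty bucket: leave as 0
            double avg = sum[i] / count[i];
            argb[i] = toColor(avg);            // map the average onto a gradient
        }
        return argb;
    }

    // Illustrative gradient: clamp the average into [0,1] and map to gray.
    static int toColor(double avg) {
        int v = (int) Math.max(0, Math.min(255, avg * 255));
        return 0xFF000000 | (v << 16) | (v << 8) | v;
    }
}
```

Note that `add` is O(1) per data point and touches no graphics API at all; the cost of `render` depends only on the image size.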
Now I can (and did) take this further and make it parallel. Divide the original data points into groups of "work units". When a CPU core is free, it consumes the next work unit in the queue: it creates a 2D array of pixel buckets, iterates over the subset of data points in that work unit, and averages each point into whichever pixel it would land in.
At the end, pairs of these arrays of pixel buckets are merged. (Simply add the counters and sums together in corresponding pixel buckets.) Then, on the final array, once again determine colors and plot.
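The work-unit scheme can be sketched with Java's parallel streams, which handle the "free core pulls the next unit" scheduling. Everything here is illustrative (the names, the tiny four-bucket "wall", and the stand-in rule for which bucket a point lands in); the key parts are that each unit fills its own private arrays and that partial results merge by adding counters and sums.

```java
import java.util.stream.IntStream;

public class WorkUnits {
    static final int PIXELS = 4; // tiny stand-in "wall" for the sketch

    static class Partial {
        final double[] sum = new double[PIXELS];
        final long[] count = new long[PIXELS];

        void add(int bucket, double value) { sum[bucket] += value; count[bucket]++; }

        /** Merge another partial result into this one, bucket by bucket. */
        Partial merge(Partial other) {
            for (int i = 0; i < PIXELS; i++) {
                sum[i] += other.sum[i];
                count[i] += other.count[i];
            }
            return this;
        }
    }

    /** One work unit: fold a slice of the data into a fresh bucket array. */
    static Partial processUnit(double[] data, int from, int to) {
        Partial p = new Partial();
        for (int i = from; i < to; i++) {
            int bucket = i % PIXELS; // stand-in for "which pixel this point lands in"
            p.add(bucket, data[i]);
        }
        return p;
    }

    public static void main(String[] args) {
        double[] data = new double[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int unitSize = 1_000; // each unit must be worth far more than its overhead
        int units = data.length / unitSize;

        // Free cores pull the next unit; partial arrays are merged pairwise.
        Partial total = IntStream.range(0, units).parallel()
                .mapToObj(u -> processUnit(data, u * unitSize, (u + 1) * unitSize))
                .reduce(Partial::merge)
                .orElse(new Partial());

        long n = 0;
        for (long c : total.count) n += c;
        System.out.println(n); // 10000: every point counted exactly once
    }
}
```

Because each `Partial` is private to the unit that built it, there is no locking in the hot loop; synchronization cost is confined to the pairwise merges at the end.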
The result is identical, but now much faster. Not a full n-cores-times speedup, but close; there is overhead. But using 8 cores is well worth it.
My point: if you think about it, you can find opportunities to use multiple cores. Just put your mind to it. Remember there is overhead, so each work unit must be worth far more than the overhead of organizing and processing it under this model.
Why not use OpenCL to run it on a GPU?
So many things to do, so little time. I'm sure you know the story.
The project is written in Java, and there are two projects (that I know of) that support OpenCL from Java. I have looked into it. It is a higher bar to jump over, so I might try it with a small project first. It's a matter of time and energy, but I'm interested in trying it.
I have to write a C kernel (there are examples) and have that code available as a string (e.g., baked into the code, retrieved from a configuration file, a database, etc.). I have to think about the problem very differently to organize it for OpenCL; it is a very different programming model than conventional CPUs. Basically, OpenCL is parallelism at a far finer grain than the "work units" I described. The work done by my work units, and thus the code, can be arbitrarily complex, as long as the work units are all independent of one another. The very same code that does the work runs on a single CPU core, or on multiple cores if you have them. With OpenCL, I would have two sets of code to maintain: the OpenCL version, and at least a single-core version for when OpenCL is not available on a given runtime. (Remember, my Java program, the binary, runs on any machine, even ones not invented yet.)
Thus, there is a philosophical issue. What I would rather see is More Cores Please: conventional cores, conventional architecture programming. It seems that if you had several hundred cores that were more general purpose, rather than specialized for graphics, this would STILL benefit graphics, but in a much more general way.
Let me give an example of a problem that would require serious thinking for OpenCL: a Mandelbrot set explorer. My current Mandelbrot set explorer (in Java) uses arbitrary-precision math, so it does not "peter out" once you dive deep enough to exhaust the precision of a double (a 64-bit float). By allowing arbitrary-precision math, you can dive deeper and deeper. A Mandelbrot explorer is another embarrassingly parallel problem. Work units could even be distributed out to other computers on a network. You just need to launch a JAR file on each node. (And those nodes don't even have to be the same CPU architecture or OS.) In a single OpenCL kernel, I would need to iterate X number of times on a pixel, using arbitrary-precision math, within the bounds of how kernels work: multiple parameters, each parameter being a buffer (an array), where different concurrent kernel instances operate on different elements of the parameter buffers.
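For reference, the per-pixel work a Mandelbrot work unit repeats can be sketched with `java.math.BigDecimal`, which is what makes it awkward to express as an OpenCL kernel. This is an illustrative sketch, not the author's explorer; the class name, precision, and iteration cap are all assumptions.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// One arbitrary-precision Mandelbrot escape-time computation: iterate
// z = z^2 + c and count iterations until |z|^2 exceeds 4 (or we give up).
public class DeepZoom {
    static int iterations(BigDecimal cr, BigDecimal ci, int maxIter, MathContext mc) {
        BigDecimal zr = BigDecimal.ZERO, zi = BigDecimal.ZERO;
        BigDecimal four = BigDecimal.valueOf(4);
        BigDecimal two = BigDecimal.valueOf(2);
        for (int n = 0; n < maxIter; n++) {
            BigDecimal zr2 = zr.multiply(zr, mc);
            BigDecimal zi2 = zi.multiply(zi, mc);
            if (zr2.add(zi2).compareTo(four) > 0) return n; // escaped
            BigDecimal newZr = zr2.subtract(zi2, mc).add(cr, mc);
            zi = zr.multiply(zi, mc).multiply(two, mc).add(ci, mc);
            zr = newZr;
        }
        return maxIter; // presumed inside the set at this precision
    }

    public static void main(String[] args) {
        // 50 significant digits: far beyond what a 64-bit double can represent.
        MathContext mc = new MathContext(50);
        System.out.println(iterations(BigDecimal.ZERO, BigDecimal.ZERO, 100, mc)); // 100
        System.out.println(iterations(BigDecimal.ONE, BigDecimal.ZERO, 100, mc));  // 3
    }
}
```

Each pixel's computation is completely independent, so this slots directly into the work-unit model on a CPU; in OpenCL, by contrast, the variable-length `BigDecimal` arithmetic would have to be reimplemented as fixed-size buffers inside the kernel.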
It seems that with all the silicon we have now, maybe it's time to start building larger numbers of general-purpose cores. This would much more rapidly produce benefits in far more everyday applications than OpenCL would. IMO.