Imagine if we could use vector processing on something other than just floating point problems. Today, GPUs and CPUs work tirelessly to accelerate algorithms based on floating point (FP) numbers. Algorithms can definitely benefit from basing their mathematics on bits and integers (bytes, words) if we could just accelerate them too. FPGAs can do this, but the hardware and software costs remain very high. GPUs aren't designed to operate on non-FP data. Intel AVX introduced some support, and now Intel AVX-512 is bringing a great deal of flexibility to processors. I will share why I'm convinced that the "AVX512VL" capability in particular is a hidden gem that will let AVX-512 be much more useful for compilers and developers alike.
Fortunately for software developers, Intel has done a poor job keeping the "secret" that AVX-512 is coming to Intel's recently announced Xeon Scalable processor line very soon. Amazon Web Services has publically touted AVX-512 on Skylake as coming soon!
It is timely to examine the new AVX-512 capabilities and their ability to impact beyond the more regular HPC needs for floating point only workloads. The hidden gem in all this, which enables shifting to AVX-512 more easily, is the "VL" (vector length) extensions which allow AVX-512 instructions to behave like SSE or AVX/AVX2 instructions when that suits us. This is a clever and powerful addition to enable its adoption in a wider assortment of software more quickly. The VL extensions mean that programmers (and compilers) do not need to shift immediately from 256-bits (AVX/AVX2) to 512-bits to use the new bit/byte/word manipulations. This transitional benefit is useful not only for an interim, but also for applications which find 256-bits more natural (perhaps a small, but important, subset of problems).
Will it be enough to stave off "Epyc"?
A new update to the Intel document for software developers indicates that the company will begin to introduce various AVX-512 instruction set extensions to its consumer CPUs soon. This will start from the codenamed Cannon Lake (CNL) and Ice Lake (ICL) processors, made using 10 nm process technologies. The new extensions will enable future chips to improve performance in certain applications. One of the main questions on AVX-512 is which consumer programs will actually support the AVX-512 when these CNL and ICL processors hit the market. In addition to the AVX-512, the upcoming processors will introduce a host of other new non-AVX-512 instructions.
According to the Intel Architecture Instruction Set Extensions and Future Features Programming Reference document, Intel's Cannon Lake CPUs will support AVX512F, AVX512CD, AVX512DQ, AVX512BW, and AVX512VL. This will bring the feature set of these CPUs to the current level of the Skylake-SP based processors. In addition, the Cannon Lake microarchitecture will support the AVX512_IFMA and AVX512_VBMI commands, but at this point, it is unclear whether the support will be limited to servers, or will also be featured in the consumer processors (the latter scenario is likely based on the document wording, but remains unclear).
Intel originally promised to release Cannon Lake processors in 2016 – 2017 timeframe, but delayed introduction of its 10 nm process technology to 2018, thus postponing the CPU launch as well. Initially it was expected that the Cannon Lake CPUs would generally resemble the Kaby Lake and Coffee Lake chips with some refinements, but the addition of the AVX-512 support means a rather tangible architecture improvement. For AVX-512, large the[sic] chunks of data require massive memory bandwidth, which the Skylake-SP cores get due to large caches and more memory controllers. Keeping in mind memory bandwidth and power consumption factors, the AVX-512 might not be supported by all Cannon Lake client CPUs, but only by those aimed at higher-performance machines (i.e., no AVX-512 for ULP mobile parts as well as entry-level desktop SKUs, but this is [speculation] at this point). Meanwhile, [the] good news is that by the time AVX-512-supporting Cannon Lake processors arrive, programs for client PCs that take advantage of the latest extensions will likely be available.
Intel has announced the next family of Xeon processors that it plans to ship in the first half of next year. The new parts represent a substantial upgrade over current Xeon chips, with up to 48 cores and 12 DDR4 memory channels per socket, supporting up to two sockets.
These processors will likely be the top-end Cascade Lake processors; Intel is labelling them "Cascade Lake Advanced Performance," with a higher level of performance than the Xeon Scalable Processors (SP) below them. The current Xeon SP chips use a monolithic die, with up to 28 cores and 56 threads. Cascade Lake AP will instead be a multi-chip processor with multiple dies contained with in a single package. AMD is using a similar approach for its comparable products; the Epyc processors use four dies in each package, with each die having 8 cores.
The switch to a multi-chip design is likely driven by necessity: as the dies become bigger and bigger it becomes more and more likely that they'll contain a defect. Using several smaller dies helps avoid these defects. Because Intel's 10nm manufacturing process isn't yet good enough for mass market production, the new Xeons will continue to use a version of the company's 14nm process. Intel hasn't yet revealed what the topology within each package will be, so the exact distribution of those cores and memory channels between chips is as yet unknown. The enormous number of memory channels will demand an enormous socket, currently believed to be a 5903 pin connector.
Intel also announced tinier 4-6 core E-2100 Xeons with ECC memory support.
Meanwhile, AMD is holding a New Horizon event on Nov. 6, where it is expected to announce 64-core Epyc processors.
Related: AMD Epyc 7000-Series Launched With Up to 32 Cores
AVX-512: A "Hidden Gem"?
Intel's Skylake-SP vs AMD's Epyc
Intel Teases 28 Core Chip, AMD Announces Threadripper 2 With Up to 32 Cores
TSMC Will Make AMD's "7nm" Epyc Server CPUs
Intel Announces 9th Generation Desktop Processors, Including a Mainstream 8-Core CPU
In a blunt video posted late Thursday evening, outspoken former Intel principal engineer Francois Pidnoel offered his advice on how to "fix" Intel CPUs, criticized current leadership for not being engineers, said AVX512 was a misadventure, and declared that it's only luck AMD hasn't grabbed more market share.
"First, Intel is really out of focus," Piednoel said in the nearly hour-long video presentation. "The leaders of Intel today are not engineers, they are not people who understand what to design to the market."
[...] Pidnoel flat-out dismissed including AVX512 in consumer chips as a mistake. "You had Skylake and Skylake X for a reason," Piednoel said. "AVX512 is designed for a race of throughput that is lost to the GPU already. There's two ways to get throughput. One is to get the throughput is by having larger vectors to your core, and the other way is to have more cores."
[...] "Intel is very lucky AMD cannot get the volume, to be able to compete," Piednoel. "If they were getting volume, the price difference would definitely cost Intel market share a lot more than what they are losing right now."
Related: AVX-512: A "Hidden Gem"?
Intel CEO Blames "10nm" Delays on Aggressive Density Target, Promises "7nm" for 2021
Intel's Process Nodes Will Trail Behind Competitors Until at Least Late 2021
Linus Torvalds: Don't Hide Rust in Linux Kernel; Death to AVX-512
Intel Engineering Chief Out After 7nm Product Delays
Intel Faces Class-Action Lawsuit Over "7nm" Delays
Kernel developers appear to be eager to debate the merits of potentially allowing Rust code within the Linux kernel. Linus Torvalds himself has made some initial remarks on the topic ahead of the Linux Plumbers 2020 conference where the matter will be discussed at length.
[...] Linus Torvalds chimed in though with his own opinion on the matter. Linus commented that he would like it to be effectively enabled by default to ensure there is widespread testing and not any isolated usage where developers then may do "crazy" things. He isn't calling for Rust to be a requirement for the kernel but rather if the Rust compiler is detected on the system, Kconfig would enable the Rust support and go ahead in building any hypothetical Rust kernel code in order to see it's properly built at least.
According to a mailing list post spotted by Phoronix, Linux creator Linus Torvalds has shared his strong views on the AVX-512 instruction set. The discussion arose as a result of recent news that Intel's upcoming Alder Lake processors reportedly lack support for AVX-512.
Torvalds' advice to Intel is to focus on things that matter instead of wasting resources on new instruction sets, like AVX-512, that he feels aren't beneficial outside the HPC market.
We can continue to talk about Intel's excellent mesh topology and AMD strong new Zen architecture, but at the end of the day, the "how" will not matter to infrastructure professionals. Depending on your situation, performance, performance-per-watt, and/or performance-per-dollar are what matters.
The current Intel pricing draws the first line. If performance-per-dollar matters to you, AMD's EPYC pricing is very competitive for a wide range of software applications. With the exception of database software and vectorizable HPC code, AMD's EPYC 7601 ($4200) offers slightly less or slightly better performance than Intel's Xeon 8176 ($8000+). However the real competitor is probably the Xeon 8160, which has 4 (-14%) fewer cores and slightly lower turbo clocks (-100 or -200 MHz). We expect that this CPU will likely offer 15% lower performance, and yet it still costs about $500 more ($4700) than the best EPYC. Of course, everything will depend on the final server system price, but it looks like AMD's new EPYC will put some serious performance-per-dollar pressure on the Intel line.
The Intel chip is indeed able to scale up in 8 sockets systems, but frankly that market is shrinking fast, and dual socket buyers could not care less.
Meanwhile, although we have yet to test it, AMD's single socket offering looks even more attractive. We estimate that a single EPYC 7551P would indeed outperform many of the dual Silver Xeon solutions. Overall the single-socket EPYC gives you about 8 cores more at similar clockspeeds than the 2P Intel, and AMD doesn't require explicit cross socket communication - the server board gets simpler and thus cheaper. For price conscious server buyers, this is an excellent option.
However, if your software is expensive, everything changes. In that case, you care less about the heavy price tags of the Platinum Xeons. For those scenarios, Intel's Skylake-EP Xeons deliver the highest single threaded performance (courtesy of the 3.8 GHz turbo clock), high throughput without much (hardware) tuning, and server managers get the reassurance of Intel's reliable track record. And if you use expensive HPC software, you will probably get the benefits of Intel's beefy AVX 2.0 and/or AVX-512 implementations.
AMD's flagship Epyc CPU has 32 cores, while the largest Skylake-EP Xeon CPU has 28 cores.
Quoted text is from page 23, "Closing Thoughts".
[Ed. note: Article is multiple pages with no single page version in sight.]
Previously: Google Gets its Hands on Skylake-Based Intel Xeons
Intel Announces 4 to 18-Core Skylake-X CPUs
AMD Epyc 7000-Series Launched With Up to 32 Cores
Intel's Skylake and Kaby Lake CPUs Have Nasty Microcode Bug
AVX-512: A "Hidden Gem"?
Desktop "Alder Lake-S" processors will include up to 8 "Golden Cove" performance cores (P-cores), 8 "Gracemont" (Atom) efficiency cores (E-cores), and 32 graphics execution units (Gen 12.2 EUs). A smaller die will include only up to 6 P-cores and no E-cores, to be used in lower-end products such as a 6-core Intel Core i5-12400 or a quad-core i3.
Mobile "Alder Lake-P" processors will include up to 6 P-cores, 8 E-cores, and 96 graphics EUs. A smaller "ultra mobile" die will include up to 2 P-cores and 8 E-cores.
AVX-512 is physically present on Golden Cove cores, but disabled in Alder Lake.
The guide mainly focuses on software implementations for hybrid CPUs. It provides various optimization strategies for Alder Lake, including lack of optimization, a "Good Scenario", and the "Best Scenario". According to the document, lack of optimization will not mean that the CPU will be unable to distribute workloads for hybrid CPUs, which should be handled by ThreadDirector anyway, but some may be distributed to the wrong types of cores, should the scheduling algorithm not recognize the task.
In the "Good Scenario," Intel assumes that the application will be aware of the hybrid architecture. The primary tasks should target Performance cores, whereas non-essential and background threads with lower priority should target Effcieent cores.