from the I-think-I-can...-I-think.-Therefore-I-am. dept.
Intel to Add AI Engine to All 14th-gen Meteor Lake SoCs
Intel to add AI engine to all 14th-gen Meteor Lake SoCs:
Computex Intel will bring the "VPU" tech it acquired along with Movidius in 2016 to all models of its forthcoming Meteor Lake client CPUs.
[...] Curiously, Intel didn't spell out the acronym, but has previously said it stands for Vision Processing Unit. Chipzilla is, however, clear about what it does and why it's needed – and it's more than vision.
Intel Veep and general manager of Client AI John Rayfield said dedicated AI silicon is needed because AI is now present in many PC workloads. Video conferences, he said, feature lots of AI enhancing video and making participants sound great – and users now just expect that PCs do brilliantly when Zooming or WebExing or Teamising. Games use lots of AI. And GPT-like models, and tools like Stable Diffusion, are already popular on the PC and available as local executables.
CPUs and GPUs do the heavy lifting today, but Rayfield said they'll be overwhelmed by the demands of AI workloads.
Shifting that work to the cloud is pricey, and also impractical because buyers want PCs to perform.
Meteor Lake therefore gets VPUs and emerges as an SoC that uses Intel's Foveros packaging tech to combine the CPU, GPU, and VPU.
The VPU gets to handle "sustained AI and AI offload." CPUs will still be asked to do simple inference jobs with low latency, usually when the cost of doing so is less than the overhead of working with a driver to shunt the workload elsewhere. GPUs will get to do jobs involving performance parallelism and throughput. Other AI-related work will be offloaded to VPUs.
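The division of labor described above amounts to a dispatch policy: route by latency sensitivity, parallelism, and duration. A minimal sketch of that policy — purely illustrative, with hypothetical workload attributes and not Intel's actual scheduler or API:

```python
# Illustrative sketch of the CPU/GPU/VPU routing the article describes.
# Attributes and thresholds are hypothetical, not Intel's real interface.
from dataclasses import dataclass

@dataclass
class Workload:
    sustained: bool         # long-running background task (e.g. video effects)
    parallel: bool          # benefits from wide parallelism (e.g. batch inference)
    latency_critical: bool  # must return fast; driver overhead not worth paying

def route(w: Workload) -> str:
    if w.latency_critical and not w.sustained:
        return "CPU"   # cheaper to run in place than to shunt via a driver
    if w.parallel and not w.sustained:
        return "GPU"   # performance parallelism and throughput
    return "VPU"       # sustained AI and AI offload

# Examples mirroring the article's three categories:
print(route(Workload(sustained=False, parallel=False, latency_critical=True)))   # CPU
print(route(Workload(sustained=False, parallel=True, latency_critical=False)))   # GPU
print(route(Workload(sustained=True, parallel=False, latency_critical=False)))   # VPU
```

In practice the decision would also weigh driver overhead against job size, as the article notes for the CPU case.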
Intel Demos Meteor Lake's AI Acceleration for PCs, Details VPU Unit
Intel Demos Meteor Lake's AI Acceleration for PCs, Details VPU Unit:
[...] Intel will still include the Gaussian Neural Acceleration low-power AI acceleration block that already exists on its chips, marked as 'GNA 3.5' on the SoC tile in the diagram (more on this below). You can also spot the 'VPU 2.7' block that comprises the new Movidius-based VPU block.
Like Intel's stylized render, the patent image is also just a graphical rendering with no real correlation to the actual physical size of the dies. It's easy to see that with so many external interfaces, like the memory controllers, PCIe, USB, and SATA, not to mention the media and display engines and power management, the VPU cores simply can't consume much of the die area on the SoC tile. For now, the amount of die area that Intel has dedicated to this engine is unknown.
The VPU is designed for sustained AI workloads, but Meteor Lake also includes a CPU, GPU, and GNA engine that can run various AI workloads. Intel says the VPU is primarily for background tasks, while the GPU steps in for heavier parallelized work. Meanwhile, the CPU addresses light low-latency inference work. Some AI workloads can also run on both the VPU and GPU simultaneously, and Intel has enabled mechanisms that allow developers to target the different compute layers based on the needs of the application at hand. This will ultimately result in higher performance at lower power -- a key goal of using the AI acceleration VPU.
Intel's chips currently use the GNA block for low-power AI inference for audio and video processing functions, and the GNA unit will remain on Meteor Lake. However, Intel says it is already running some of the GNA-focused code on the VPU and achieving better results, with a heavy implication that Intel will transition to the VPU entirely with future chips and remove the GNA engine.
Intel also disclosed that Meteor Lake has a coherent fabric that enables a unified memory subsystem, meaning it can easily share data among the compute elements. This is a key functionality that is similar in concept to other contenders in the CPU AI space, like Apple with its M-series and AMD's Ryzen 7040 chips.
(Score: 3, Funny) by Rosco P. Coltrane on Tuesday May 30 2023, @07:04AM
I only buy fad-du-jour SoCs if they include blockchain technology.
(Score: 2, Interesting) by pTamok on Tuesday May 30 2023, @07:28AM (3 children)
What makes AI computation so compelling that it needs space on die? On-die GPUs tend to be low-end, so it strikes me that having a separate AI processing card (like Graphics) would be a reasonable way forward. Is this simply an attempt to corner the market?
It could be that there is a fundamental technical reason why AI and Graphics processing are different in this regard, so if someone can educate me to dispel my ignorance, please do so.
If AI is so fundamental, perhaps we will end up with centralized AI processors that have general-purpose CPUs as peripherals.
(Score: 5, Informative) by takyon on Tuesday May 30 2023, @08:01AM (2 children)
The ~5-30 TOPS AI accelerators found in smartphone SoCs, and now x86 CPUs can be good enough to be useful (for inference). AMD added one to Phoenix, now Intel is adding one to Meteor Lake. They must be ubiquitous to get adoption, and they will eventually spread to all desktop CPUs and even some non-AI-focused server products:
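For a rough sense of where figures like "~5-30 TOPS" come from: TOPS is just MAC count × 2 ops per MAC × clock rate. A back-of-the-envelope sketch — the MAC counts and clocks below are made-up but plausible numbers, not any vendor's published specs:

```python
# Back-of-the-envelope TOPS estimate: each MAC unit does a multiply
# and an add (2 ops) per cycle. MAC counts and clocks are hypothetical,
# chosen only to land in the ~5-30 TOPS range mentioned above.
def tops(mac_units: int, clock_ghz: float, ops_per_mac: int = 2) -> float:
    return mac_units * ops_per_mac * clock_ghz * 1e9 / 1e12

print(tops(4096, 1.5))   # 12.288 — mid-range smartphone-class NPU
print(tops(8192, 1.8))   # 29.4912 — upper end of the range
```

Real accelerators quote such peaks at a specific precision (usually INT8), and sustained throughput is lower.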
https://www.notebookcheck.net/AMD-outlines-plans-to-integrate-AI-XDNA-IPUs-across-its-entire-processor-portfolio.717919.0.html [notebookcheck.net]
https://www.notebookcheck.net/AMD-and-Microsoft-present-AI-Developer-Tools-for-Ryzen-7040-processors.719863.0.html [notebookcheck.net]
For mobile, power efficiency is key, so it should definitely be on the same package and not another chip. Rumor has it that Meteor Lake will have a big enough boost in iGPU performance to displace some low-end laptop dGPUs from the market.
Although iGPUs could probably be used for inference, if you want to use them for graphics at the same time, that's not ideal. Eventually, games might be using AI accelerators for real-time voice synthesis or something.
I don't think the amount of die space being used up here is very much. We're probably talking about less than 20 mm² inside the 95 mm² "SoC tile":
https://www.semianalysis.com/p/meteor-lake-die-shot-and-architecture [semianalysis.com]
Already done:
https://www.tomshardware.com/news/new-amd-instinct-mi300-details-emerge-debuts-in-2-exaflop-el-capitan-supercomputer [tomshardware.com]
https://www.nextplatform.com/2023/05/29/nvidias-grace-hopper-hybrid-systems-bring-huge-memory-to-bear/ [nextplatform.com]
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 0) by Anonymous Coward on Tuesday May 30 2023, @10:06AM (1 child)
In most cases for "PC" usage the AIs will be running on Microsoft's/Google's/your own servers, while AI developers will probably use dedicated hardware.
I can somewhat see the argument for smartphones since you might want your phone to have some AI stuff and use less power and still work in scenarios where you have zero/flaky data connectivity.
(Score: 4, Insightful) by takyon on Tuesday May 30 2023, @10:34AM
It could be seen as a chicken and egg problem. They have to add the accelerator before anyone will use it. Intel showed off a Stable Diffusion plugin they wrote for GIMP, and there are Photoshop tools that could probably use it relatively soon, instead of using the GPU.
The power argument is the same for laptops. People want closer to 24 hours of battery life than 3.
You can see some of the other partners in these slides:
https://images.anandtech.com/doci/18878/MTL%20AI%20Deck%20for%20May%202023%20press%20brief_13.png [anandtech.com]
https://images.anandtech.com/doci/18878/MTL%20AI%20Deck%20for%20May%202023%20press%20brief_15.png [anandtech.com]
https://images.anandtech.com/doci/18878/MTL%20AI%20Deck%20for%20May%202023%20press%20brief_12.png [anandtech.com]
https://www.anandtech.com/show/18878/intel-discloses-new-details-on-meteor-lake-vpu-block-lays-out-vision-for-client-ai [anandtech.com]
I think there is less desire to use remote servers for some of this stuff than you might think. Ignoring the privacy angle, it could just be more responsive to run inference locally. Real-time video effects for webcams is one example in the slides.
(Score: 2, Touché) by Anonymous Coward on Tuesday May 30 2023, @02:00PM (1 child)
They waste time and effort trying to remove booting of 16- and 32-bit OSs. What is the point?
(Score: 2) by takyon on Tuesday May 30 2023, @07:19PM
Removing legacy cruft could make x86 more efficient, increasing performance while lowering power usage and required die area. But they haven't quantified the benefits of x86S yet. It's just a proposal. It also won't happen for years, so it's irrelevant to Meteor Lake.
(Score: 3, Interesting) by Rich on Tuesday May 30 2023, @10:04PM (1 child)
Cutting through all the marketing bullshit:
https://www.youtube.com/watch?v=WZzqbL30a8w [youtube.com] (small fraction around 2:10, the remainder is blah)
Intel's SD demo with the pipeline spread across all units (Encoder, VAE: CPU; Unet: GPU; UnetNeg(?): VPU), on a proprietary model format converted from base SD 1.5: 1.05 iterations/sec. Not too shabby for a laptop, but having to "buy into" Intel's own model scheme is a major downer.
Just checked against a local 2021MY M1 with PyTorch 2.0.1 on Ventura 13.3, a1111 with a modified launcher to force MPS, stock SD pruned-emaonly model: 2.65 iterations/sec. Ouch, Intel. I wonder where you got the confidence to let attendees see the raw performance data. For comparison, a mid-range discrete Nvidia card runs at about 7 and the benchmark RTX 3090 delivers about 11.
However, the power of a huge GPU is probably limited by memory bandwidth. Apple people report relatively small differences between CPU and GPU calculations (I'm not entirely briefed, but I heard that the CPU itself has access to some matrix math accelerator), and the impression was that the GPU is held back by their unified memory architecture. IDK if Apple has or plans some crossbar scheme for multiple memory channels. The M1 has 128-bit-wide memory, for a total of 200 GB/s; the RTX 3090, for comparison, is 384 bits wide with a total GDDR bandwidth of ~940 GB/s, which would explain the numbers above.
As much as I hate the non-upgradeability of the memory, I have to concede that tacking the RAM onto the SoC makes wide and fast access easier, and the Intel world's current separate memory layout is at a disadvantage here. (N.b. prior to the M1 the soldered-in memory was an offense against the customer, and the soldered-in SSD still is, even more so.)
But it's interesting that the M1/M2 can run models larger than the 24GB that the pro-sumer RTX 3090/4090 can handle, albeit at only a quarter or so of the speed. And once you look at Nvidia's datacenter GPUs, the insultingly high price of the MacBooks suddenly looks like a bargain.
(Score: 2) by Rich on Wednesday May 31 2023, @11:35AM
Update: I got the M1 specs mixed up, sorry. The tests were done on an M1 Max, which, according to Wiki, has 512-bit-wide memory with 409.6 GB/s. The conclusions I drew relative to the 3090 therefore have to be re-evaluated. It seems that Nvidia gets more bandwidth out of a smaller bus (thanks to GDDR), and more inference performance per bandwidth, too. One could benchmark the different M1 models to get an idea of what the bottleneck is.
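The bandwidth figures in this subthread follow directly from bus width × per-pin transfer rate. A quick sketch — the per-pin rates are assumed typical values for LPDDR4X, LPDDR5, and GDDR6X, not measured numbers:

```python
# Peak memory bandwidth (GB/s) = bus width in bits / 8 * per-pin rate (GT/s).
# Per-pin rates below are assumed typical values for each memory type.
def bandwidth_gbs(bus_bits: int, gt_per_s: float) -> float:
    return bus_bits / 8 * gt_per_s

print(bandwidth_gbs(128, 4.266))  # 68.256 — base M1, LPDDR4X-4266
print(bandwidth_gbs(512, 6.4))    # 409.6  — M1 Max, LPDDR5-6400
print(bandwidth_gbs(384, 19.5))   # 936.0  — RTX 3090, GDDR6X
```

The 384-bit GDDR6X figure matches the ~940 GB/s cited above; GDDR simply clocks each pin several times faster than LPDDR, which is how Nvidia gets more bandwidth from a narrower bus.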