
posted by janrinok on Tuesday May 30 2023, @05:07AM   Printer-friendly
from the I-think-I-can...-I-think.-Therefore-I-am. dept.

Intel to Add AI Engine to All 14th-gen Meteor Lake SoCs

Intel to add AI engine to all 14th-gen Meteor Lake SoCs:

Computex Intel will bring the "VPU" tech it acquired along with Movidius in 2016 to all models of its forthcoming Meteor Lake client CPUs.

[...] Curiously, Intel didn't elucidate the acronym, but has previously said it stands for Vision Processing Unit. Chipzilla is, however, clear about what it does and why it's needed – and it's more than vision.

Intel Veep and general manager of Client AI John Rayfield said dedicated AI silicon is needed because AI is now present in many PC workloads. Video conferences, he said, feature lots of AI enhancing video and making participants sound great – and users now just expect that PCs do brilliantly when Zooming or WebExing or Teamising. Games use lots of AI. And GPT-like models, and tools like Stable Diffusion, are already popular on the PC and available as local executables.

CPUs and GPUs do the heavy lifting today, but Rayfield said they'll be overwhelmed by the demands of AI workloads.

Shifting that work to the cloud is pricey, and also impractical because buyers want PCs to perform.

Meteor Lake therefore gets VPUs and emerges as an SoC that uses Intel's Foveros packaging tech to combine the CPU, GPU, and VPU.

The VPU gets to handle "sustained AI and AI offload." CPUs will still be asked to do simple inference jobs with low latency, usually when the cost of doing so is less than the overhead of working with a driver to shunt the workload elsewhere. GPUs will get to do jobs requiring parallelism and throughput. Other AI-related work will be offloaded to VPUs.
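For a sense of how that division of labor looks from the software side, here is a minimal, hedged sketch using Intel's OpenVINO runtime, where an application compiles the same model for whichever engine suits the job. The model filename is a placeholder, and the exact device name the Meteor Lake VPU will be exposed under ("NPU" below) is an assumption:

    # Minimal OpenVINO sketch: compile one model for different engines.
    from openvino.runtime import Core

    core = Core()
    model = core.read_model("model.xml")                   # placeholder model file

    latency_net    = core.compile_model(model, "CPU")      # simple, low-latency inference
    throughput_net = core.compile_model(model, "GPU")      # parallel, throughput-heavy work
    offload_net    = core.compile_model(model, "NPU")      # sustained/background AI offload (name assumed)

    print(core.available_devices)                          # see what a given machine actually exposes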

Intel Demos Meteor Lake's AI Acceleration for PCs, Details VPU Unit

Intel Demos Meteor Lake's AI Acceleration for PCs, Details VPU Unit:

[...] Intel will still include the Gaussian Neural Acceleration low-power AI acceleration block that already exists on its chips, marked as 'GNA 3.5' on the SoC tile in the diagram (more on this below). You can also spot the 'VPU 2.7' block that comprises the new Movidius-based VPU block.

Like Intel's stylized render, the patent image is just a graphical rendering with no real correlation to the actual physical size of the dies. It's easy to see that, with so many external interfaces (memory controllers, PCIe, USB, and SATA, not to mention the media and display engines and power management), the VPU cores simply can't consume much of the die area on the SoC tile. For now, the amount of die area that Intel has dedicated to this engine is unknown.

The VPU is designed for sustained AI workloads, but Meteor Lake also includes a CPU, GPU, and GNA engine that can run various AI workloads. Intel says the VPU is primarily for background tasks, while the GPU steps in for heavier parallelized work. Meanwhile, the CPU addresses light low-latency inference work. Some AI workloads can also run on both the VPU and GPU simultaneously, and Intel has enabled mechanisms that allow developers to target the different compute layers based on the needs of the application at hand. This will ultimately result in higher performance at lower power -- a key goal of using the AI acceleration VPU.
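The "target the different compute layers" part maps onto device selection in a runtime such as OpenVINO, which can also place or spread work across several engines automatically. A hedged sketch building on the one above (AUTO and MULTI are existing OpenVINO device plugins; whether the VPU appears to them, and under what name, is an assumption):

    # Let the runtime choose, or fan requests out across several engines.
    from openvino.runtime import Core

    core = Core()
    model = core.read_model("model.xml")

    auto_net  = core.compile_model(model, "AUTO")           # runtime picks a device
    multi_net = core.compile_model(model, "MULTI:GPU,CPU")  # requests spread across devices;
                                                            # add the VPU/NPU name here if exposed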

Intel's chips currently use the GNA block for low-power AI inference for audio and video processing functions, and the GNA unit will remain on Meteor Lake. However, Intel says it is already running some of the GNA-focused code on the VPU and achieving better results, with a heavy implication that Intel will transition to the VPU entirely with future chips and remove the GNA engine.

Intel also disclosed that Meteor Lake has a coherent fabric that enables a unified memory subsystem, meaning it can easily share data among the compute elements. This is a key capability, similar in concept to what other contenders in the CPU AI space offer, such as Apple with its M-series and AMD with its Ryzen 7040 chips.


Original Submission #1 | Original Submission #2

  • (Score: 3, Funny) by Rosco P. Coltrane on Tuesday May 30 2023, @07:04AM

    by Rosco P. Coltrane (4757) on Tuesday May 30 2023, @07:04AM (#1308850)

    I only buy fad-du-jour SoCs if they include blockchain technology.

  • (Score: 2, Interesting) by pTamok on Tuesday May 30 2023, @07:28AM (3 children)

    by pTamok (3042) on Tuesday May 30 2023, @07:28AM (#1308854)

    What makes AI computation so compelling that it needs space on die? On-die GPUs tend to be low-end, so it strikes me that having a separate AI processing card (like Graphics) would be a reasonable way forward. Is this simply an attempt to corner the market?
    It could be that there is a fundamental technical reason why AI and Graphics processing are different in this regard, so if someone can educate me to dispel my ignorance, please do so.

    If AI is so fundamental, perhaps we will end up with centralized AI processors that have general-purpose CPUs as peripherals.

  • (Score: 2, Touché) by Anonymous Coward on Tuesday May 30 2023, @02:00PM (1 child)

    by Anonymous Coward on Tuesday May 30 2023, @02:00PM (#1308881)

    They waste time and effort trying to remove booting of 16- and 32-bit OSes. What is the point?

    • (Score: 2) by takyon on Tuesday May 30 2023, @07:19PM

      by takyon (881) <takyonNO@SPAMsoylentnews.org> on Tuesday May 30 2023, @07:19PM (#1308933) Journal

      Removing legacy cruft could make x86 more efficient, increasing performance while lowering power usage and required die area. But they haven't quantified the benefits of x86S yet. It's just a proposal. It also won't happen for years, so it's irrelevant to Meteor Lake.

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
  • (Score: 3, Interesting) by Rich on Tuesday May 30 2023, @10:04PM (1 child)

    by Rich (945) on Tuesday May 30 2023, @10:04PM (#1308955) Journal

    Cutting through all the marketing bullshit:

    https://www.youtube.com/watch?v=WZzqbL30a8w [youtube.com] (small fraction around 2:10, the remainder is blah)

    Intel's SD demo with pipeline spread across all units (Encoder, VAE: CPU, Unet: GPU, UnetNeg(?): VPU), on a proprietary model format converted from base SD 1.5: 1.05 iterations/sec. Not too shabby for a laptop, but having to "buy into" Intel's own model scheme is a major downer.

    Just checked against a local 2021MY M1 with pytorch 2.0.1 on Ventura 13.3, a1111 with modified launcher to force MPS, stock SD pruned-emaonly model: 2.65 iterations/sec. Ouch, Intel. I wonder where you found the confidence to let the attendees see the raw performance data. For comparison, a mid-range discrete Nvidia card runs at about 7 and the benchmark RTX 3090 delivers about 11.
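    (For anyone wanting to try something comparable: a rough equivalent of that check with the Hugging Face diffusers library rather than a1111 would look like the sketch below. Model id, prompt, and step count are placeholders, and the printed rate includes pipeline overhead, so it isn't directly comparable to a1111's per-step it/s.)

        # Rough Stable Diffusion timing sketch on Apple Silicon via PyTorch MPS.
        import time
        import torch
        from diffusers import StableDiffusionPipeline

        device = "mps" if torch.backends.mps.is_available() else "cpu"
        pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

        steps = 20
        start = time.time()
        pipe("a photo of an astronaut riding a horse", num_inference_steps=steps)
        print(f"{steps / (time.time() - start):.2f} iterations/sec (including overhead)")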

    However, the power of a huge GPU is probably bandwidth limited by memory. Apple people report relatively small differences between CPU and GPU calculations (I'm not entirely briefed, but I heard that the CPU itself has access to some matrix math accelerator) and the impression was that the GPU is held back by their unified memory architecture. IDK if Apple has or plans some crossbar scheme for multiple memory channels. The M1 has 128 bit wide memory, for a total of 200 GB/s, the RTX 3090 for comparison has 384 bits width and a total GDDR bandwidth of ~940 GB/s, which would explain the numbers above.

    As much as I hate the non-upgradeability of the memory, I have to concede that tacking the RAM onto the SoC makes wide and fast access easier, and the Intel world's current separate memory layout is at a disadvantage here. (N.b. prior to the M1 the soldered-in memory was an offense against the customer, and the soldered-in SSD still is, even more so.)

    But it's interesting that the M1/M2 can run models larger than the 24 GB the pro-sumer RTX 3090/4090 can handle, albeit at only a quarter or so of the speed. And once you look at Nvidia's datacenter GPUs, the insultingly high price of the MacBooks suddenly looks like a bargain.

    • (Score: 2) by Rich on Wednesday May 31 2023, @11:35AM

      by Rich (945) on Wednesday May 31 2023, @11:35AM (#1309028) Journal

      Update: I got the M1 specs mixed up, sorry. The tests were done on a M1 Max, which, according to Wiki, has 512 bit wide memory with 409.6 GB/s. The conclusions I made relative to the 3090 therefore have to be re-evaluated. It seems that Nvidia gets more bandwidth out of a smaller bus (thanks to GDDR), and more inference performance per bandwidth, too. One could benchmark the different M1 models to get an idea what the bottleneck is.
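      The figures are just bus width times transfer rate. A quick sanity check, with the transfer rates as assumptions (LPDDR5-6400 on the M1 Max, ~19.5 GT/s GDDR6X on the 3090):

          # Peak bandwidth in GB/s = bus width (bits) / 8 * transfer rate (GT/s)
          def bandwidth_gbs(bus_width_bits, gigatransfers_per_s):
              return bus_width_bits / 8 * gigatransfers_per_s

          print(bandwidth_gbs(512, 6.4))    # M1 Max:   409.6 GB/s
          print(bandwidth_gbs(384, 19.5))   # RTX 3090: 936.0 GB/s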
