Advanced Micro Devices (AMD) has shared more details about the High Bandwidth Memory (HBM) in its upcoming GPUs.
HBM in a nutshell takes the wide-and-slow paradigm to its fullest. Rather than building an array of high-speed chips around an ASIC to deliver 7 Gbps+ per pin over a 256/384/512-bit memory bus, HBM at its most basic level involves turning memory clock speeds way down – to just 1 Gbps per pin – but in exchange making the memory bus much wider. How wide? That depends on the implementation and generation of the specification, but the examples AMD has been showcasing so far involve 4 HBM devices (stacks), each featuring a 1024-bit wide memory bus, combining for a massive 4096-bit memory bus. It may not be clocked high, but when it's that wide, it doesn't need to be.
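As a back-of-the-envelope check on the wide-and-slow tradeoff, peak bandwidth is just bus width times per-pin data rate. The figures below are the ones quoted in this summary (a 512-bit GDDR5 bus at 5 Gbps per pin is the R9 290X's configuration); the function itself is plain arithmetic, not anything from AMD:

```python
def peak_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s: (bus width in bits / 8) * per-pin rate."""
    return bus_width_bits / 8 * gbps_per_pin

# R9 290X-style GDDR5: narrow-ish bus, fast pins.
gddr5 = peak_bandwidth_gbs(512, 5.0)    # 320.0 GB/s

# AMD's 4-stack HBM example: 4 x 1024-bit at only 1 Gbps per pin.
hbm = peak_bandwidth_gbs(4 * 1024, 1.0)  # 512.0 GB/s

print(gddr5, hbm)
```

Despite a fifth of the per-pin speed, the eight-times-wider bus comes out ahead.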
AMD will be the only manufacturer using the first generation of HBM, and will be joined by NVIDIA in using the second generation in 2016. HBM2 will double memory bandwidth over HBM1. The benefits of HBM include increased total bandwidth (from 320 GB/s on the R9 290X to 512 GB/s in AMD's "theoretical" 4-stack example) and reduced power consumption. Although HBM1 roughly triples memory bandwidth per watt compared to GDDR5, the memory in AMD's example draws a little less than half the power (down from 30 W on the R9 290X to 14.6 W) rather than a third of it, because the total bandwidth also rises. HBM stacks will also take up only 5-10% of the area that GDDR5 would need to provide the same amount of memory. That could potentially halve the size of the GPU package:
By AMD's own estimate, a single HBM-equipped GPU package would be less than 70mm × 70mm (4900mm2), versus 110mm × 90mm (9900mm2) for R9 290X.
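The "tripled bandwidth per watt" and "less than half the power" claims are consistent with each other, which a quick division shows (all numbers are the ones quoted above; the ~3.3x figure is just their ratio):

```python
def gb_per_watt(bandwidth_gbs, power_w):
    """Memory bandwidth efficiency in GB/s per watt."""
    return bandwidth_gbs / power_w

gddr5_eff = gb_per_watt(320, 30.0)   # R9 290X GDDR5: ~10.7 GB/s per watt
hbm_eff = gb_per_watt(512, 14.6)     # 4-stack HBM example: ~35.1 GB/s per watt

print(hbm_eff / gddr5_eff)           # ~3.3x, consistent with "tripled"
```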
HBM will likely be featured in high-performance computing GPUs as well as accelerated processing units (APUs). HotHardware reckons that Radeon 300-series GPUs featuring HBM will be released in June.
(Score: 2) by takyon on Thursday May 21 2015, @12:21PM
The way I see it, HBM, which is TSV-stacked, will replace GDDR5, and there will never be a GDDR6.
Everything that can be stacked will be stacked. V-NAND will solve/delay NAND endurance issues for years. Eventually processors will be stacked.
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 4, Interesting) by bzipitidoo on Thursday May 21 2015, @12:57PM
The big problem with stacking is heat. It would have happened years ago if not for that. But then, heat is a big problem everywhere in circuit design.
Parallelism, going wide, is the way forward for now. I doubt we'll move up from 64-bit any time soon. There was a compelling reason to move from 32-bit, which is that it can address at most 4 GB of RAM. We're nowhere close to bumping up against the 64-bit limit of nearly 2x10^19. Instead, we've been seeing multi-core CPUs. Parallel programming as originally envisioned at the source code level hasn't really happened; people aren't using programming languages explicitly designed for parallelism. Instead we're seeing it at arm's length, in libraries such as OpenCL. Parallelism is the reason Google gained such a competitive advantage: they did it better, with their MapReduce framework. The hunt is still on for other places to apply more width.
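For anyone unfamiliar with the pattern the parent mentions, here's a toy word-count in the map/reduce style (this is a single-process sketch of the idea, nothing like Google's actual distributed implementation): map each document to partial counts independently, then fold the partials together.

```python
from functools import reduce
from collections import Counter

def map_phase(doc):
    """Map: turn one document into partial word counts (parallelizable)."""
    return Counter(doc.split())

def reduce_phase(acc, partial):
    """Reduce: merge two partial counts into one."""
    acc.update(partial)
    return acc

docs = ["wide and slow", "wide bus", "slow clock"]
counts = reduce(reduce_phase, map(map_phase, docs), Counter())
print(counts["wide"])  # 2
```

The map phase has no cross-document dependencies, which is exactly the "going wide" that lets the work spread over many machines.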
(Score: 3, Informative) by takyon on Thursday May 21 2015, @02:20PM
http://gtresearchnews.gatech.edu/newsrelease/half-terahertz.htm [gatech.edu]
Just get SiGe transistors and clock them way down.
Well, it's not that simple, but it's a start.
(Score: 2) by Katastic on Thursday May 21 2015, @06:09PM
I said the same thing on Slashdot a week or two ago and got zero up mods. Snarky bastards.
The other issue is:
>It may not be clocked high, but when it's that wide, it doesn't need to be.
No, no, no, no and no. Latency does not scale with bus width. You can't get 9 women pregnant and expect a baby every month.
(Score: 0) by Anonymous Coward on Thursday May 21 2015, @08:53PM
No, no, no, no and no. Latency does not scale with bus width. You can't get 9 women pregnant and expect a baby every month.
Have you tried pipelining?
(Score: 2) by Katastic on Friday May 22 2015, @12:25AM
Pipelining, by definition, at best does not change latency, and at worst significantly increases it. It cannot reduce latency.
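A toy model makes the parent's point concrete: splitting a fixed amount of work into pipeline stages leaves per-item latency unchanged (at best), while throughput approaches one result per stage time. The stage counts and times below are made-up numbers purely for illustration:

```python
def pipeline_stats(stage_time_ns, stages, items):
    """Latency of the first result and steady-state throughput of a pipeline."""
    latency = stage_time_ns * stages                 # time until the first result
    total = latency + stage_time_ns * (items - 1)    # one more result per stage time after fill
    throughput = items / total                       # results per ns
    return latency, throughput

lat1, thr1 = pipeline_stats(4.0, 1, 1000)  # unpipelined: one 4 ns stage
lat2, thr2 = pipeline_stats(1.0, 4, 1000)  # same work split into 4 x 1 ns stages

print(lat1, lat2)    # latency is 4 ns either way
print(thr2 / thr1)   # throughput is nearly 4x
```

Which is why a wide, pipelined memory bus can multiply bandwidth without touching access latency.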
Fun history: it's 50% of the reason the Pentium 4 NetBurst architecture was a complete failure and slower than the Pentium IIIs. They added a huge pipeline, with huge chances for stalls, but they thought they could hit 10 GHz with the P4 architecture, so "it wouldn't matter."
And then the 3-4 GHz barrier happened...
Pentium 4s were heating up faster than any of their models predicted, so the primary advantage of the new architecture couldn't be exploited. As they manufactured things smaller and smaller, "problems" that could be disregarded before suddenly became extremely important, and heat density exploded.
(Score: 0) by Anonymous Coward on Thursday May 21 2015, @05:04PM
> Eventually processors will be stacked.
I think we can expect to see gigabytes of ram stacked on the cpus for high-bandwidth, low-latency access. Like a sort of L4 cache.
(Score: 2) by takyon on Thursday May 21 2015, @11:02PM
Yup. This will be seen on Intel's Xeon Phi Knights Landing, with 8-16 GB of "on-package memory":
http://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing [wikipedia.org]
http://www.anandtech.com/show/8217/intels-knights-landing-coprocessor-detailed [anandtech.com]