posted by n1 on Thursday April 10 2014, @07:39PM
from the will-it-play-crysis-though dept.

A $1,499 supercomputer on a card? That's what I thought when reading El Reg's report of AMD's Radeon R9 295X2 graphics card, which is rated at 11.5 TFlop/s(*). It is water-cooled, contains 5,632 stream processors, has 8 GB of GDDR5 RAM, and runs at 1018 MHz.

AMD's announcement claims it's "the world's fastest, period". The $1,499 MSRP compares favorably to the $2,999 Nvidia GTX Titan Z, which is rated at 8 TFlop/s.

From a quick skim of the reviews (at HardOCP, Hot Hardware, and Tom's Hardware), it appears AMD has some work to do on its drivers to get the most out of this hardware. In many cases the twice-as-expensive Nvidia Titan outperformed it (especially at lower resolutions), but at higher resolutions (3840x2160 and 5760x1200) the R9 295X2 really started to shine.

For comparison, consider that this 500-watt, $1,499 card is rated higher than the world's fastest supercomputer on the TOP500 list of June 2001.

(*) Trillion FLoating-point OPerations per Second.

 
  • (Score: 1) by dbe (1422) on Thursday April 10 2014, @09:17PM (#29721)

    Kind of related to this: for someone familiar with standard 'linear' programming, what is the best way to approach these monsters?
    If you want to do signal/image processing or other embarrassingly parallel tasks, what would you recommend learning? OpenMP?
    Also, after dealing with hand-optimization on modern SIMD processors (NEON/ARM), is it realistic to understand these cards' pipeline and cache structure well enough to really get the best performance when writing a computation kernel?

    -dbe

  • (Score: 2, Insightful) by No.Limit (1965) on Thursday April 10 2014, @10:38PM (#29744)

    I think for GPU computing you have a choice between OpenCL, Nvidia's CUDA, and OpenACC (there may be more that I don't know of). OpenMP is still CPU-only as far as I know, though I believe I've read that OpenMP wants to support GPUs too at some point.

    I don't know much about OpenCL, but it's an open standard and supported on many platforms. I believe it's quite similar to CUDA.

    Nvidia's CUDA works only on Nvidia GPUs (so certainly not on this AMD one). It has lots of good tools (profilers, debuggers, etc.), documentation, examples, and video tutorials. It works very well and gives you a lot of control over the GPU.
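
    As a rough sketch of what the CUDA programming model looks like, here is a minimal vector-add kernel (the names and sizes are just illustrative, and error checking is omitted):

        #include <cuda_runtime.h>
        #include <stdio.h>

        // Each thread handles one element; blockIdx/threadIdx locate it in the grid.
        __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)                  // the grid may be larger than n
                c[i] = a[i] + b[i];
        }

        int main(void) {
            const int n = 1 << 20;
            size_t bytes = n * sizeof(float);

            float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
                  *hc = (float *)malloc(bytes);
            for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

            float *da, *db, *dc;
            cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
            cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

            int threads = 256;                        // threads per block
            int blocks = (n + threads - 1) / threads; // enough blocks to cover n
            vecAdd<<<blocks, threads>>>(da, db, dc, n);

            cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
            printf("c[0] = %f\n", hc[0]);             // expect 3.0

            cudaFree(da); cudaFree(db); cudaFree(dc);
            free(ha); free(hb); free(hc);
            return 0;
        }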

    OpenACC is a younger standard for GPU coding. It's at a much higher level than both OpenCL and CUDA, but you still get a pretty good amount of control by specifying additional information for the compiler. There are some proprietary compilers that support it (e.g. from Cray or PGI). GCC wants to support OpenACC as well, but I don't think they're very far along at the moment.
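
    For comparison, an OpenACC version of the same vector add is just plain C with directives; the copyin/copyout clauses describe the data movement, and an OpenACC-capable compiler (e.g. PGI's) handles the offloading. Again, only a sketch:

        // Ordinary C; the pragma tells the compiler to parallelize the loop
        // on the accelerator and to copy the arrays to and from GPU memory.
        void vecadd(const float *restrict a, const float *restrict b,
                    float *restrict c, int n) {
            #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }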

    Now for SIMD instructions, pipelining, and cache structures: GPUs are fundamentally different from CPUs.
    A GPU core is much, much simpler than a CPU core. To improve sequential execution, CPUs have added a lot of complexity (branch prediction, caching, out-of-order execution, pipelining, etc.).
    GPUs, however, have focused mostly on parallel performance for a long time. So instead they kept the cores simple and concentrated on adding more cores and making sure that adding more cores scales well.

    So because GPUs are already so well optimized for parallel computing, you don't have to do a lot yourself when it comes to the details. You may not even be able to code in assembly, only in C.
    You mainly want to make sure that the overall structure is optimized well.

    That means using caches efficiently (the usual struct-of-arrays instead of array-of-structs, cache-friendly access patterns, etc.). In CUDA the cores are divided into blocks that have a shared, faster memory (like a cache) over which you have control, meaning you can load data into it manually.
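
    To make that concrete, here is a minimal sketch of a kernel that stages a tile of data (plus its halo) in per-block shared memory before computing on it; a simple 1D stencil, with illustrative names and sizes:

        #define RADIUS 3
        #define BLOCK  256

        // Each block copies its tile into fast on-chip shared memory once,
        // then every thread reads its neighbours from there instead of
        // hitting slow global memory 2*RADIUS+1 times.
        __global__ void stencil1d(const float *in, float *out, int n) {
            __shared__ float tile[BLOCK + 2 * RADIUS];

            int g = blockIdx.x * blockDim.x + threadIdx.x; // global index
            int l = threadIdx.x + RADIUS;                  // index in tile

            if (g < n) tile[l] = in[g];
            if (threadIdx.x < RADIUS) {                    // load the halo cells
                tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
                tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK] : 0.0f;
            }
            __syncthreads();                               // wait until the tile is full

            if (g < n) {
                float sum = 0.0f;
                for (int k = -RADIUS; k <= RADIUS; k++)
                    sum += tile[l + k];
                out[g] = sum;
            }
        }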
    You want to make sure that you divide the work well over the blocks and cores. And if you have to transfer a lot of data to or from GPU memory (over the slow PCIe bus), then you want to make sure that the transfer doesn't block computation (you can transfer data and compute at the same time).
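
    A rough sketch of that overlap using CUDA streams; assume h_in/h_out are pinned host buffers (allocated with cudaMallocHost, which asynchronous copies require), d_in/d_out are device buffers, and process is some kernel, all hypothetical names:

        // Split the work into chunks; while one chunk is computing, the
        // next chunk's data is already copying over PCIe in another stream.
        const int chunks = 4;
        int chunkN = n / chunks;                 // assume n divides evenly
        size_t chunkBytes = chunkN * sizeof(float);

        cudaStream_t streams[chunks];
        for (int s = 0; s < chunks; s++) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < chunks; s++) {
            int off = s * chunkN;
            cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                            cudaMemcpyHostToDevice, streams[s]);
            process<<<chunkN / 256, 256, 0, streams[s]>>>(d_in + off,
                                                          d_out + off, chunkN);
            cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();                 // wait for all streams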