A $1,499 supercomputer on a card? That's what I thought when reading El Reg's report on AMD's Radeon R9 295X2 graphics card, which is rated at 11.5 TFlop/s(*). It is water-cooled, contains 5,632 stream processors, has 8 GB of GDDR5 RAM, and runs at 1018 MHz.
AMD's announcement claims it's "the world's fastest, period".
The $1,499 MSRP compares favorably to the $2,999 Nvidia GTX Titan Z, which is rated at 8 TFlop/s.
From a quick skim of the reviews (at HardOCP, Hot Hardware, and Tom's Hardware), it appears AMD has some work to do on its drivers to get the most out of this hardware. The twice-as-expensive Nvidia Titan outperformed it in many cases (especially at lower resolutions). At higher resolutions (3840x2160 and 5760x1200), the R9 295X2 really started to shine.
For comparison, consider that this 500-watt, $1,499 card is rated faster than the world's fastest supercomputer on the TOP500 list of June 2001.
(*) Trillion FLoating-point OPerations per Second.
(Score: 2, Insightful) by No.Limit on Thursday April 10 2014, @10:38PM
I think for GPU computing you have the choice between OpenCL, Nvidia's CUDA, and OpenACC (there may be more that I don't know of). OpenMP is still CPU-only as far as I know, though I believe I've read that OpenMP intends to support GPUs at some point too.
I don't know much about OpenCL, but it's an open standard and supported on many platforms. I believe it's quite similar to CUDA.
Nvidia's CUDA works only on Nvidia GPUs (so certainly not on this AMD one). It has lots of good tools (profilers, debuggers, etc.), documentation, examples, and video tutorials. It works very well and gives you a lot of control over the GPU.
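To give a taste of what that looks like, here's a minimal CUDA vector-add sketch (my own illustration, not from any official example; the kernel name and sizes are arbitrary). It's mostly plain C plus a __global__ qualifier, built-in thread coordinates, and the <<<...>>> launch syntax:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Each thread adds one element; its index comes from the built-in
       block/thread coordinates. */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *ha = (float *)malloc(bytes);
        float *hb = (float *)malloc(bytes);
        float *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        /* Launch enough 256-thread blocks to cover all n elements. */
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(da, db, dc, n);

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", hc[0]); /* expect 3.0 */

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }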
OpenACC is a younger standard for GPU coding. It sits at a much higher level than both OpenCL and CUDA, but you still get a good amount of control by specifying additional information for the compiler. Some proprietary compilers support it (e.g. from Cray or PGI). GCC wants to support OpenACC as well, but I don't think they're very far along at the moment.
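To show how much higher-level OpenACC is, here's roughly the same vector add as a directive-based sketch (assuming an OpenACC-capable compiler such as PGI's; the loop stays plain C and the compiler generates the GPU code). The copyin/copyout clauses are the "additional information" mentioned above:

    #include <stdio.h>

    int main(void) {
        enum { N = 1 << 20 };
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        /* One directive: the compiler handles the data movement and the
           kernel launch. copyin/copyout tell it exactly what to transfer. */
        #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + b[i];

        printf("c[0] = %f\n", c[0]); /* expect 3.0 */
        return 0;
    }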
Now for SIMD instructions, pipelining, and cache structures: GPUs are fundamentally different from CPUs.
A GPU core is much, much simpler than a CPU core. To improve sequential execution, CPUs have added a lot of complexity (branch prediction, caching, out-of-order execution, pipelining, etc.).
GPUs, however, have focused mostly on parallel performance for a long time. Instead of that complexity, they kept the cores simple, added more of them, and made sure that adding more cores scales well.
So because GPUs are already so well optimized for parallel computing, you don't have to do a lot yourself when it comes to the details. You may not even be able to code in assembly, only in C.
You mainly want to make sure that the overall structure is optimized well.
So that means using caches efficiently (the usual struct-of-arrays instead of array-of-structs, cache-friendly access patterns, etc.). In CUDA, the threads are organized into blocks that share a faster memory (like a cache) over which you have explicit control, meaning you can load data into it manually.
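As a concrete sketch of that manual control (my own hypothetical kernel, not from the post), here's a per-block sum reduction in CUDA: each block first stages its slice of slow global memory into the fast __shared__ scratchpad, then reduces entirely within it (host setup as in the earlier example):

    /* Sum-reduce 'in' into one partial sum per block. Assumes blockDim.x
       is a power of two (e.g. 256). */
    __global__ void blockSum(const float *in, float *partial, int n) {
        extern __shared__ float tile[];   /* per-block scratchpad */
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        /* Manual load from slow global memory into the fast shared tile. */
        tile[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                  /* wait until the tile is full */

        /* Tree reduction entirely inside shared memory. */
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();
        }
        if (tid == 0) partial[blockIdx.x] = tile[0];
    }

    /* Launch, sizing the dynamic shared memory to one float per thread:
       blockSum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n); */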
You also want to make sure that you divide the work well over the blocks and cores. And if you have to transfer a lot of data to or from GPU memory (over the slow PCIe bus), you want to make sure the transfers don't block computation (you can transfer data and compute at the same time).
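Here's a hedged sketch of that overlap using two CUDA streams (the kernel and chunk sizes are placeholders of my own): while one stream copies a chunk across PCIe, the other can still be computing on the previous chunk. Note that async copies require pinned host memory, hence cudaHostAlloc:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Stand-in for real work: double every element in place. */
    __global__ void scale(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main(void) {
        const int chunk = 1 << 20, chunks = 8;
        float *h;
        /* Pinned (page-locked) host memory, required for async copies. */
        cudaHostAlloc((void **)&h, (size_t)chunks * chunk * sizeof(float),
                      cudaHostAllocDefault);
        for (int i = 0; i < chunks * chunk; ++i) h[i] = 1.0f;

        float *d[2];
        cudaMalloc((void **)&d[0], chunk * sizeof(float));
        cudaMalloc((void **)&d[1], chunk * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        int threads = 256, blocks = (chunk + threads - 1) / threads;
        for (int c = 0; c < chunks; ++c) {
            int b = c % 2; /* ping-pong between streams and buffers */
            /* While this stream copies chunk c in, the other stream can
               still be computing on chunk c-1. */
            cudaMemcpyAsync(d[b], h + (size_t)c * chunk,
                            chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            scale<<<blocks, threads, 0, s[b]>>>(d[b], chunk);
            cudaMemcpyAsync(h + (size_t)c * chunk, d[b],
                            chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        cudaDeviceSynchronize(); /* wait for both streams to drain */

        printf("h[0] = %f\n", h[0]); /* expect 2.0 */
        cudaFreeHost(h);
        cudaFree(d[0]); cudaFree(d[1]);
        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
        return 0;
    }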