
SoylentNews is people

posted by n1 on Thursday April 10 2014, @07:39PM
from the will-it-play-crysis-though dept.

A $1,499 supercomputer on a card? That's what I thought when reading El Reg's report of AMD's Radeon R9 295X2 graphics card, which is rated at 11.5 TFlop/s(*). It is water-cooled, contains 5632 stream processors, has 8 GB of GDDR5 RAM, and runs at 1018MHz.

AMD's announcement claims it's "the world's fastest, period". The $1,499 MSRP compares favorably to the $2,999 NVidia GTX Titan Z which is rated at 8 TFlop/s.

From a quick skim of the reviews (at: Hard OCP, Hot Hardware, and Tom's Hardware), it appears AMD has some work to do on its drivers to get the most out of this hardware. The twice-as-expensive NVidia Titan in many cases outperformed it (especially at lower resolutions). At higher resolutions (3840x2160 and 5760x1200) the R9 295X2 really started to shine.

For comparison, consider that this 500 watt, $1,499 card is rated better than the world's fastest supercomputer listed in the top 500 list of June 2001.

(*) Trillion FLoating-point OPerations per Second.

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1) by opinionated_science on Thursday April 10 2014, @08:03PM

    by opinionated_science (4031) on Thursday April 10 2014, @08:03PM (#29677)

    I read this review too, and it looks very exciting to have ever increasing computational density!!

    I understand, though, that there is a discrepancy between single-precision and double-precision performance that may be not so much a physical limitation as a software one. I believe the 11 Tflop/s is SP, and I imagine the DP performance might be 2.3 or so?

    Anyone know how they got that 11Tf number? Was it a real code?

    In addition, although I understand OpenCL has improved, the AMD drivers are not considered as stable as Nvidia's, and of course Nvidia has the proprietary CUDA tools.

    My interest is in molecular biophysics (MD simulation), and I would really like to see a supercomputer on the desktop, or at least a fraction of Anton....

    • (Score: 2, Informative) by Bytram on Thursday April 10 2014, @08:10PM

      by Bytram (4043) on Thursday April 10 2014, @08:10PM (#29681) Journal

      I have not seen what specific code AMD (or NVidia) ran to get their numbers, but here's a link to the TOP 500 list's Linpack Benchmark Page [top500.org] and to the Linpack FAQ [netlib.org].

      • (Score: 1) by opinionated_science on Thursday April 10 2014, @08:20PM

        by opinionated_science (4031) on Thursday April 10 2014, @08:20PM (#29687)

        Well, I was looking for LINPACK numbers when I found this:

        http://devgurus.amd.com/message/1285375#1285375 [amd.com] (OpenCL 8 GPU DGEMM (5.1 TFlop/s double precision). Heterogeneous HPL (High Performance Linpack from Top500).)

        They got > 5Tflops DP using 3 older Radeon cards, and it was posted Mar 2014, so an update will be interesting.

        One thing LINPACK helps with is that it gives a measure of *some* useful work to relate practical performance characteristics. OK, not ideal, but it stops the marketing fluff getting in the way ;-)

  • (Score: 2) by n1 on Thursday April 10 2014, @08:05PM

    by n1 (993) on Thursday April 10 2014, @08:05PM (#29678) Journal

    Never ever thought I'd say this, but it seems like a good deal for a $1500 graphics card.

    • (Score: 3, Insightful) by Lazarus on Thursday April 10 2014, @08:14PM

      by Lazarus (2769) on Thursday April 10 2014, @08:14PM (#29684)

      It's a pretty bad deal for a graphics card, but an excellent one for a high-speed computing platform.

      • (Score: 1) by Bytram on Friday April 11 2014, @02:34AM

        by Bytram (4043) on Friday April 11 2014, @02:34AM (#29819) Journal

        It's a pretty bad deal for a graphics card, but an excellent one for a high-speed computing platform.

        Excellent value, indeed! The June 2001 machine was ASCI White [wikipedia.org], made by IBM and installed at the Lawrence Livermore National Laboratory. (ASCI = Accelerated Strategic Computing Initiative.) LLNL has a great write-up [llnl.gov] about it, including this picture [llnl.gov].

        The ASCI White system contained 8,192 375MHz processors; had 6 TB of memory and 160TB of disk storage in about 7,000 disk drives. It weighed 106 tons, needed 3 MW of electricity to run, and needed another 3 MW for cooling. The system cost $110 million and was installed in a 20,000 sq ft computer room.

        By comparison, the Radeon R9 295X2 card comes up rather short on memory and storage, but compares quite favorably when looking at weight, power consumption, price, and size. =)
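To put rough numbers on that comparison, here is a minimal sketch that uses only figures quoted on this page (500 watts and $1,499 for the card; 3 MW plus 3 MW and $110 million for ASCI White). The constant names are made up for illustration. The ratios say nothing about relative speed, since a marketing rating and a Linpack result are not necessarily measured the same way.

```python
# Figures quoted in this thread; nothing here is measured.
ASCI_WHITE_POWER_W = 3e6 + 3e6   # 3 MW to run plus 3 MW for cooling
ASCI_WHITE_COST_USD = 110e6      # $110 million
CARD_POWER_W = 500               # from the story summary
CARD_COST_USD = 1499             # MSRP from the story summary

# How many times more power and money the 2001 machine needed.
power_ratio = ASCI_WHITE_POWER_W / CARD_POWER_W
cost_ratio = ASCI_WHITE_COST_USD / CARD_COST_USD

print(f"Power: ASCI White drew about {power_ratio:,.0f}x as much")
print(f"Price: ASCI White cost about {cost_ratio:,.0f}x as much")
```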

    • (Score: 1) by jasassin on Friday April 11 2014, @07:17AM

      by jasassin (3566) <jasassin@gmail.com> on Friday April 11 2014, @07:17AM (#29899) Homepage Journal

      At least your 5450 still works with fglrx and the new xorgs (I think HD5450 is the oldest to still work). My 3450 is doomed to Windows.

      --
      jasassin@gmail.com GPG Key ID: 0xE6462C68A9A3DB5A
  • (Score: 1) by GlennC on Thursday April 10 2014, @08:06PM

    by GlennC (3656) on Thursday April 10 2014, @08:06PM (#29680)

    All that power, and it's on a graphics card?

    I'm sure there's a market for it. I'm just as sure I'm not part of that market.

    --
    Sorry folks...the world is bigger and more varied than you want it to be. Deal with it.
    • (Score: 4, Informative) by maxwell demon on Thursday April 10 2014, @09:13PM

      by maxwell demon (1608) on Thursday April 10 2014, @09:13PM (#29717) Journal

      The market would be scientific computing. That is, using the graphics card as a parallel computing coprocessor. The fact that it also has video out (if it has one; there are actually cards that don't) is irrelevant for that purpose.

      --
      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 2, Interesting) by opinionated_science on Thursday April 10 2014, @09:24PM

        by opinionated_science (4031) on Thursday April 10 2014, @09:24PM (#29723)

        Not entirely irrelevant. There is a use in molecular simulation for visualizing the system in question, and even for "steering" it while running.

      • (Score: 2, Interesting) by Kymation on Thursday April 10 2014, @09:32PM

        by Kymation (1047) Subscriber Badge on Thursday April 10 2014, @09:32PM (#29728)

        The manufacturer's page doesn't list video output in the specifications. Odd. The picture of the card shows five output connectors, so I suspect that it does actually have video out though.

    • (Score: 1) by _NSAKEY on Thursday April 10 2014, @09:55PM

      by _NSAKEY (16) on Thursday April 10 2014, @09:55PM (#29735)

      The more hardcore guys who post on hashcat.net's forum will probably have 4 of these running in one box within a week of the card's launch. Granted, it will take more than one power supply, but the kind of person who would bulk order these cards is also the same kind of person who has used multiple PSUs in the same rig before.

      • (Score: 1) by opinionated_science on Thursday April 10 2014, @10:19PM

        by opinionated_science (4031) on Thursday April 10 2014, @10:19PM (#29739)

        It would be useful for those of us who want to do scientific calculations if they would run computational benchmarks on their rigs!! I would wager a few scientists would have a crack at replicating the best performing designs.

        We might get the vendors to start optimizing for reproducible calculation, rather than marketing numbers...

    • (Score: 2) by zim on Friday April 11 2014, @05:20AM

      by zim (1251) on Friday April 11 2014, @05:20AM (#29877)
      I AM part of that market! But the money is not part of me... So :(

      But hey. I can buy one in a few years when they're the new $150 card.
  • (Score: 3, Informative) by takyon on Thursday April 10 2014, @08:23PM

    by takyon (881) <takyonNO@SPAMsoylentnews.org> on Thursday April 10 2014, @08:23PM (#29690) Journal

    AMD: [anandtech.com]

          AMD Radeon R9 295X2        AMD Radeon R9 290X      AMD Radeon HD 7990      AMD Radeon HD 7970 GHz Edition
    FP64  1/8 FP32                   1/8 FP32                1/4 FP32                1/4 FP32

    NVIDIA: [anandtech.com]

          GTX Titan Black            GTX 780 Ti              GTX Titan               GTX 780
    FP64  1/3 FP32                   1/24 FP32               1/3 FP32                1/24 FP32

    "Today NVIDIA is letting its compute-at-home customers have their cake and eat it too with the GeForce GTX Titan Black. The Titan Black is a full GK110 implementation, just like the GTX 780 Ti, with all of the compute focused-ness of the old GTX Titan. That means you get FP64 performance that's only 1/3 of the card's FP32 performance (compared to 1/24 with the 780 Ti)."

    Double-precision floating-point format [wikipedia.org]
    FLOPS - Floating-point operation and integer operation [wikipedia.org]
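As a rough illustration of what those ratios imply, the sketch below scales a single-precision rating by a card's FP64:FP32 ratio. It assumes the 11.5 TFlop/s figure in the story is a single-precision number, which the thread suspects but does not confirm. The dictionary and function names are hypothetical, and the result is a theoretical peak, not a benchmark.

```python
from fractions import Fraction

# FP64:FP32 ratios copied from the tables above.
FP64_RATIO = {
    "R9 295X2": Fraction(1, 8),
    "GTX Titan Black": Fraction(1, 3),
    "GTX 780 Ti": Fraction(1, 24),
}

def estimated_fp64_tflops(fp32_tflops, card):
    """Scale a single-precision rating by the card's FP64:FP32 ratio."""
    return fp32_tflops * float(FP64_RATIO[card])

# Assumption: the 11.5 TFlop/s in the story is a single-precision figure.
r9_fp64 = estimated_fp64_tflops(11.5, "R9 295X2")
print(f"R9 295X2 estimated FP64 peak: {r9_fp64:.2f} TFlop/s")
```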

    --
    [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 1) by opinionated_science on Thursday April 10 2014, @08:39PM

      by opinionated_science (4031) on Thursday April 10 2014, @08:39PM (#29700)

      Ahh, thank you for the summary of the crippled cards ;-)

      From what I understand DP=1/2 SP from a memory bandwidth point of view.

      Are there any technical reasons that the best FP64 performance is 1/3 of FP32, other than marketing?

  • (Score: 3, Insightful) by Dunbal on Thursday April 10 2014, @08:35PM

    by Dunbal (3515) on Thursday April 10 2014, @08:35PM (#29697)

    AMD has some work to do on its drivers? Seriously? OK. Hold your breath. Any decade now. AMD's shortcoming has ALWAYS been its drivers. I wouldn't expect much more because crappy drivers are just business as usual for them. Hey but maybe they'll release their code so then it's not their fault anymore, is it? It's yours and mine.

  • (Score: 1) by dbe on Thursday April 10 2014, @09:17PM

    by dbe (1422) on Thursday April 10 2014, @09:17PM (#29721)

    Kind of related to this, for someone familiar with standard 'linear' programming, what is the best way to approach these monsters?
    If you want to do signal/image processing or other embarrassingly parallel tasks, what would you recommend learning, OpenMP?
    Also after dealing with hand-optimization on modern SIMD processors (NEON/ARM), is it realistic to understand these cards' pipeline and cache structure to really get the best performance when writing a computation kernel?

    -dbe

    • (Score: 2, Insightful) by No.Limit on Thursday April 10 2014, @10:38PM

      by No.Limit (1965) on Thursday April 10 2014, @10:38PM (#29744)

      I think for GPU computing you have the option between OpenCL, Nvidia's CUDA and OpenACC (there may be more that I don't know of). OpenMP is still CPU only as far as I know, though I believe I have read that OpenMP wants to support GPUs too at some point.

      I don't know much about OpenCL, but it's an open standard and supported on many platforms. I believe it's quite similar to CUDA.

      Nvidia's CUDA works only on Nvidia GPUs (so certainly not on this AMD one). It has lots of good tools (profilers, debuggers etc), documentation, examples, video-tutorials. It works very well and it gives you a lot of control over the GPU.

      OpenACC is a younger standard for GPU coding. It's on a much higher level than both OpenCL and CUDA, but you still get a pretty good amount of control by specifying additional information for the compiler. There are some proprietary compilers that support it (e.g. from Cray or PGI). GCC wants to support OpenACC as well, but I don't think they're very far at the moment.

      Now for SIMD instructions, pipelining and cache structures: GPUs are fundamentally different from CPUs.
      A GPU core is much much simpler than a CPU core. To improve sequential execution CPUs have added a lot of complexity (branch prediction, caching, out of order execution, pipelining etc).
      However, GPUs have mostly focused on parallel performance for a long time. So instead they kept the cores simple and made sure to add more cores and made sure that adding more cores scales well.

      So because GPUs are already so well optimized for parallel computing you don't have to do a lot yourself when it comes to the details. You may not even be able to code in assembly, but only in C.
      You mainly want to make sure that the overall structure is optimized well.

      So that means using caches efficiently (the usual struct of arrays instead of array of structs, cache friendly access patterns etc). In CUDA the cores are divided into blocks that have a shared faster memory (like a cache) over which you have control meaning you can load data manually.
      You want to make sure that you divide the work well over the blocks and cores. And if you have to transfer a lot of data from or to GPU memory (over the slow PCIe), then you want to make sure that you don't block computation with the transfer (you can transfer data and compute things at the same time).
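Dividing work over blocks and cores can be sketched in plain Python. The snippet below runs sequentially and only mimics the CUDA-style global index `blockIdx.x * blockDim.x + threadIdx.x`. The function names and the "double each element" kernel are hypothetical, chosen for illustration; a real kernel would be written in CUDA or OpenCL and the threads would run concurrently.

```python
def blocks_needed(n_items, block_size):
    """Ceiling division: enough blocks to cover every item."""
    return (n_items + block_size - 1) // block_size

def simulate_kernel(data, block_size):
    """Run a 'double each element' kernel the way a grid of blocks would."""
    out = [None] * len(data)
    for block in range(blocks_needed(len(data), block_size)):
        for thread in range(block_size):
            i = block * block_size + thread  # global index of this thread
            if i < len(data):                # last block may be partly idle
                out[i] = 2 * data[i]
    return out

n_blocks = blocks_needed(10, 4)
result = simulate_kernel(list(range(10)), block_size=4)
print(n_blocks, result)
```

The bounds check matters because the item count is rarely an exact multiple of the block size, so some threads in the last block have nothing to do.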

  • (Score: 2, Insightful) by VanessaE on Thursday April 10 2014, @11:03PM

    by VanessaE (3396) <vanessa.e.dannenberg@gmail.com> on Thursday April 10 2014, @11:03PM (#29758) Journal

    Ok, fifteen hundred bucks if you BUY IT NAOW! Sure, we all know it'll come down to a more reasonable price eventually, but what about on the back end, *after* you buy it? 500 watts for just a GPU, assuming that's when it's maxed out? I'm sorry but last I knew, hardcore gamers who would buy such a card tend to play for hours on end, let alone folks who would buy them for mining coins, and that kind of power usage is just insane.

    My three computers, four decent DFP monitors, and all their ancillary gadgetry all combined use between 790 and 815 watts (according to my Kill-a-Watt) when they're all running at full blast, and they are *not* low-end hardware at all.
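For a feel of what 500 watts means on the electricity bill, here is a small sketch of the arithmetic. The hours per day and the price per kWh are hypothetical placeholders, not figures from this thread, and the card is assumed to draw its full rated power the whole time.

```python
def energy_kwh(watts, hours):
    """Energy used by a constant load, in kilowatt-hours."""
    return watts / 1000 * hours

def running_cost(watts, hours, usd_per_kwh):
    """Cost of running a constant load for the given number of hours."""
    return energy_kwh(watts, hours) * usd_per_kwh

HOURS_PER_DAY = 4    # hypothetical gaming time
USD_PER_KWH = 0.12   # hypothetical electricity price

daily_kwh = energy_kwh(500, HOURS_PER_DAY)
yearly_cost = running_cost(500, HOURS_PER_DAY * 365, USD_PER_KWH)
print(f"{daily_kwh:.1f} kWh per day, about ${yearly_cost:.2f} per year")
```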

  • (Score: 1) by monster on Friday April 11 2014, @10:45AM

    by monster (1260) on Friday April 11 2014, @10:45AM (#29967) Journal

    Acronym nitpicking: It's not 'Trillion FLoating-point OPerations per Second', it's 'Tera FLoating-point OPerations per Second'. The fact that Tera uses the same initial letter as the English 'trillion' is just a coincidence. You can see the difference with other units like GFLOP (Giga), not BFLOP (Billion).

    • (Score: 1) by Bytram on Friday April 11 2014, @01:08PM

      by Bytram (4043) on Friday April 11 2014, @01:08PM (#30014) Journal

      If you go to The Linpack Benchmark [top500.org] on the TOP 500 site, there's a link to Frequently Asked Questions on the Linpack Benchmark and Top500 [netlib.org]. At the entry for "What is a Mflop/s?" which you can reach directly [netlib.org], it states:

      What is a Mflop/s?

      Mflop/s is a rate of execution, millions of floating point operations per second. Whenever this term is used it will refer to 64 bit floating point operations and the operations will be either addition or multiplication. Gflop/s refers to billions of floating point operations per second and Tflop/s refers to trillions of floating point operations per second.

      As these are the folks who came up with the list, I defer to their historical and continued use of this definition.
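Whichever name one prefers, the numbers work out the same. A minimal sketch follows, assuming the short-scale meanings of million, billion and trillion; the table and function names are hypothetical.

```python
# SI prefix letter -> (prefix name, factor, short-scale English word)
PREFIXES = {
    "M": ("mega", 10**6, "million"),
    "G": ("giga", 10**9, "billion"),
    "T": ("tera", 10**12, "trillion"),
}

def flops_from_rating(value, unit):
    """Convert e.g. (11.5, 'Tflop/s') to operations per second."""
    _, factor, _ = PREFIXES[unit[0].upper()]
    return value * factor

card_flops = flops_from_rating(11.5, "Tflop/s")
name, factor, word = PREFIXES["T"]
print(f"11.5 {name}flop/s = {card_flops:,.0f} flop/s = 11.5 {word} per second")
```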

      • (Score: 1) by monster on Friday April 11 2014, @01:18PM

        by monster (1260) on Friday April 11 2014, @01:18PM (#30022) Journal

        Thanks for the clarification, but unless they are being incoherent in their naming, their comment validates my point: It's Mflops (Mega), Gflops (Giga) and Tflops (Tera), even if they put them side by side with their numerical values.

        Anyway, enough nitpicking for now.