
posted by martyb on Saturday September 10 2016, @01:13PM   Printer-friendly
from the some-assembly-required dept.

Dan Luu demonstrates that, even with optimizations enabled, compilers often produce code far slower than the straightforward hand-written version that any assembly programmer could produce: Hand coded assembly beats intrinsics in speed and simplicity:

Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high-performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross-platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade-old claims that intrinsics can make your life easier, it never seems to work out.
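
To make that concrete: here is a minimal sketch (mine, not from the article) of what calling an intrinsic looks like in C. _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps are real SSE intrinsics; the wrapper function add4 is a made-up name for illustration:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add two 4-float arrays; _mm_add_ps maps to the single
       addps instruction on x86. */
    void add4(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);             /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* addps, then store */
    }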

The last time I tried intrinsics was around 2007; for more on why they were hopeless then, see this exploration by the author of VirtualDub. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.

For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 of the top 3 Bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.
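
A hedged sketch (my code, not Dan Luu's benchmark) of the intrinsics version of such a loop; buffer_popcnt is a hypothetical name, and GCC/Clang need a flag such as -mpopcnt or -msse4.2 to accept it:

    #include <stddef.h>
    #include <stdint.h>
    #include <nmmintrin.h>  /* SSE4.2: _mm_popcnt_u64 */

    /* Count the set bits in a buffer with the hardware instruction. */
    uint64_t buffer_popcnt(const uint64_t *buf, size_t n)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += (uint64_t)_mm_popcnt_u64(buf[i]);
        return total;
    }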

In my own experience, I have yet to find an optimizing compiler that generates code as fast or as compact as I am able to write by hand.
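
One well-documented reason a hand-written version can win a benchmark like this: on several Intel microarchitectures, popcnt carries a false dependency on its destination register, and compiler-generated intrinsic code does not always break it. A sketch (my illustration, not code from TFA) of the hand-coded workaround using GCC/Clang inline assembly:

    #include <stdint.h>

    /* popcnt via inline asm; the leading xor zeroes the destination
       register, breaking the false output dependency. */
    static inline uint64_t popcnt64(uint64_t x)
    {
        uint64_t r;
        __asm__("xorq %0, %0\n\t"
                "popcntq %1, %0"
                : "=&r"(r)   /* early clobber: keep r and x in different registers */
                : "r"(x));
        return r;
    }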

Dan Luu's entire website is a treasure trove of education for experienced and novice coders alike. I look forward to studying the whole thing. His refreshingly simple, HTML 1.0-style design is clearly meant to put the content first, and bears out my assertion that the true experts all have austere websites.


Original Submission

 
  • (Score: 5, Insightful) by TheRaven on Saturday September 10 2016, @02:40PM

    I was talking to someone on the tools team at a major games studio recently. They compile with compiler optimisations disabled right up until betas, because they've found that fast compile cycles yield a faster product. Algorithmic improvements are where the big speedups come from, and being able to test 2-3 algorithmic improvements in the time it used to take to test 1 is a big win. Compiler optimisations (and the kind of micro-optimisation TFA is describing) give a much smaller speedup. You might get a factor of 2-3 from those, but you get factors of 10-20 from picking better algorithms.
    --
    sudo mod me up
  • (Score: 1, Interesting) by Anonymous Coward on Saturday September 10 2016, @04:26PM

    You also get developers running unoptimized code, which forces them to care more about not wasting performance.

    OTOH, the optimizer can uncover bugs. Finding these late can be painful.

  • (Score: 2) by PocketSizeSUn on Saturday September 10 2016, @05:05PM

    Completely agree.

    For game developers compiler speed is a critical factor, for the reasons you stated. I do the same in my personal projects ... even when compiler speed is a non-issue. After all, no optimizer can make a poor choice of algorithm fast.

    However, the optimization pass can bite you ... if the optimizer bleeps up, it can take a while to figure out what went wrong. So you have to strike a balance.

    And as a special surprise ... in some cases optimizing for size can shrink your code segment enough to pin your hot path into CPU cache, giving a massive boost in performance.

    In the game arena I think this is more pronounced because the targets are generally well known and have relatively long product cycles.

    YMMV

    • (Score: 2) by TheRaven on Monday September 12 2016, @07:33AM

      And as a special surprise ... in some cases optimizing for size can shrink your code segment enough to pin your hot path into CPU cache, giving a massive boost in performance.

      I'm not sure if they still do, but Apple used to compile all of OS X with -Os. They found that a lot of benchmarks ran faster with -O2, but overall system performance was worse. When you're expecting your users to run a dozen apps at the same time, compiling the apps and the libraries and frameworks they use to take up less i-cache space turned out to be quite a big win. Being able to keep the hot paths of the window server and one or two active applications in L2 made a big difference. (One way to get that trade-off per function is sketched below.)

      --
      sudo mod me up
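
      Following up on the -Os point above: a hedged sketch (my example, not Apple's build setup) of mixing the two levels per function with GCC, so an -Os build can still optimize a measured hot spot for speed. The attribute is GCC-specific (Clang ignores it), and sum_of_squares is a made-up example function:

          /* GCC: optimize this one function for speed in an -Os build. */
          __attribute__((optimize("O3")))
          int sum_of_squares(const int *data, int n)
          {
              int sum = 0;
              for (int i = 0; i < n; i++)
                  sum += data[i] * data[i];
              return sum;
          }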
    • (Score: 2) by Wootery on Monday September 12 2016, @09:11AM

      And as a special surprise ... in some cases optimizing for size can shrink your code segment enough to pin your hot path into CPU cache, giving a massive boost in performance.

      This is what profile-guided optimisation is for, no? Optimise the hotspots for speed, and the rest for space (to improve cache behavior, as you say).

      I can't imagine any serious game development outfit not using PGO.
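
      For reference, a typical GCC PGO cycle looks roughly like this (a sketch; the program name and workload flag are made up, but -fprofile-generate and -fprofile-use are the real flags):

          gcc -O2 -fprofile-generate game.c -o game   # instrumented build
          ./game --replay training.dem                # run a representative workload
          gcc -O2 -fprofile-use game.c -o game        # rebuild guided by the profile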

  • (Score: 2) by RamiK on Saturday September 10 2016, @09:03PM

    Sure, algorithms come first. But mind you, x86 game developers are saying this while their engine developers are coding in C++ and spending most of their time crafting the minutiae of the GPU streams with bit-level precision.

    Modern out-of-order superscalars make life really easy. Switch to GPUs & DSPs, and all those high-and-mighty "algorithms come first" statements turn to whispers as people spend their days analyzing EBBs and preventing cache saturation.

    But yeah, algorithms are a given.

    --
    compiling...