SoylentNews is people

posted by martyb on Saturday September 10 2016, @01:13PM
from the some-assembly-required dept.

Dan Luu demonstrates that even when optimizing, compilers often produce very slow code as compared to very basic source that is easily accessible to every assembly code programmer: Hand coded assembly beats intrinsics in speed and simplicity:

Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade old claims that intrinsics can make your life easier, it never seems to work out.

The last time I tried intrinsics was around 2007; for more on why they were hopeless then, see this exploration by the author of VirtualDub. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.

For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 out of the top 3 bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.
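
To make the two approaches in that comparison concrete, here is a minimal sketch in C. It is not the code from the article or from the benchmarks it mentions. The function names are made up for illustration, the table version is scalar (a pshufb version performs the same nibble lookups on 16 bytes at once), and `__builtin_popcountll` is a GCC/Clang builtin, so this assumes one of those compilers. Whether the builtin becomes a single popcnt instruction depends on the target flags (for example `-mpopcnt`).

```c
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in each 4-bit value 0..15. */
static const uint8_t NIBBLE_BITS[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Table-lookup count (hypothetical name): two lookups per byte, one for
   each nibble. This is the scalar form of the idea behind the pshufb
   approach, which does these lookups for 16 bytes in one instruction. */
uint64_t popcount_table(const uint8_t *buf, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += NIBBLE_BITS[buf[i] & 0x0F] + NIBBLE_BITS[buf[i] >> 4];
    }
    return total;
}

/* Builtin count (hypothetical name): assumes GCC or Clang. The compiler
   may lower the builtin to the hardware popcnt instruction when the
   target allows it; otherwise it falls back to a software routine. */
uint64_t popcount_builtin(const uint64_t *buf, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += (uint64_t)__builtin_popcountll(buf[i]);
    }
    return total;
}
```

Both functions return the same count for the same buffer; which one is faster on a given machine is exactly the kind of thing that has to be measured and checked in the generated assembly rather than assumed.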

In my own experience, I have yet to find an optimizing compiler that generates code as fast or as compact as I am able to with hand-optimized code.

Dan Luu's entire website is a treasure trove of education for experienced and novice coders alike. I look forward to studying the whole thing. His refreshingly simple HTML 1.0 design is obviously intended to educate, and is an example of my assertion that the true experts all have austere websites.


  • (Score: 2, Insightful) by Anonymous Coward on Saturday September 10 2016, @01:32PM

    by Anonymous Coward on Saturday September 10 2016, @01:32PM (#399976)

    The article is not very interesting, but the summary title is.

    Article: Programs which use special instructions (intrinsics) instead of combinations of generic instructions run faster.
    The article's example is population count: counting the number of ones in a register.

    Summary: Hand coded assembly runs faster than C.
    I think this may be true for very small examples, but on bigger programs is generally false.
    Modern compiler optimizers can do nearly as good as an expert doing careful hand packing to keep a bunch of functional units doing useful things.
    But the compiler can do it over the whole program which an expert can do only in a very few special cases.

    These days, a good strategy is to write C that needs to run fast a bit like assembly, and to give the optimizer many opportunities to set up a parallel operation pipeline.
    This gets most of the way to hand-packed speed and keeps the code portable.

    You still won't get to the intrinsic speed, but that's not interesting, it's obvious.
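
One common reading of "write C a bit like assembly" is to split a reduction across several independent accumulators so the additions do not form one long dependency chain. The sketch below assumes that reading; the comment itself gives no code, and the function name and the choice of four lanes are illustrative only. Whether it actually helps depends on the compiler and CPU, so it needs measuring.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an array using four independent accumulators (hypothetical helper).
   Each accumulator depends only on its own previous value, which leaves
   the compiler and the CPU free to overlap the four additions. */
uint64_t sum_four_lanes(const uint32_t *data, size_t len)
{
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= len; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }

    /* Tail: the remaining zero to three elements. */
    for (; i < len; i++) {
        a0 += data[i];
    }

    return a0 + a1 + a2 + a3;
}
```

The code stays plain, portable C; the only "assembly-like" part is that the data flow is laid out explicitly instead of being left for the optimizer to discover.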

  • (Score: 1, Insightful) by Anonymous Coward on Saturday September 10 2016, @01:54PM

    by Anonymous Coward on Saturday September 10 2016, @01:54PM (#399984)

    TFA is a good article because it's readable and presents actual code and performance numbers. And it tries several approaches that various readers would've asked about.

    It's easy to sit back and say "hand coded assembly is obviously faster, but it's not worth the dev time on AMD 64" w/o providing code or data.

    • (Score: 3, Interesting) by Ethanol-fueled on Saturday September 10 2016, @10:11PM

      by Ethanol-fueled (2792) on Saturday September 10 2016, @10:11PM (#400122) Homepage

      This is exactly the kind of article that SN needs more of, in my opinion; along with NCommander's recent journals about assembler. Computing discussions such as these are a lot more relatable than obscure new species of bacteria discovered in the broiled anuses of Polynesian feral hogs. Computing discussions like these are a lot more enriching than "new gadget released" articles.

      They are understood by advanced computer nerds and not so far out of reach for people like me who've received some formal education in computing but lack the autodidactic zeal of staff and more advanced members alike.

      The only problem is, since the submitter is MDC, it may be a form of advanced troll. I still haven't figured out whether or not MDC is an advanced troll or genuinely nuts. He walks a fine line.

      • (Score: 3, Interesting) by NCommander on Sunday September 11 2016, @06:53AM

        by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Sunday September 11 2016, @06:53AM (#400226) Homepage Journal

        I need to get the next article written up. I've got most of the code done (which proved to be much more difficult than expected). Life went unexpectedly pearshaped last week but I'm getting back into it now.

        --
        Still always moving
    • (Score: 2) by davester666 on Monday September 12 2016, @07:50AM

      by davester666 (155) on Monday September 12 2016, @07:50AM (#400565)

      I think this is the most important phrase in TFS:

      that is easily accessible to every assembly code programmer

      Assembly code programmers are a pretty small group out of the entire set of people who could be classified as "programmers".

  • (Score: 0) by Anonymous Coward on Saturday September 10 2016, @04:21PM

    by Anonymous Coward on Saturday September 10 2016, @04:21PM (#400014)

    Compiler intrinsics are supposed to be a friendly C-like substitute for assembly. They let you do things that aren't easily expressed in standard C, like vector operations.

    The article shows that intrinsics actually don't work very well. The compiler does a pretty bad job of scheduling instructions around them. BTW, I've seen it. Visual Studio tends to make things go via memory when it shouldn't need to. The article used clang and gcc. In other words, all the popular compilers suck.
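
For readers who have not used intrinsics, here is a minimal sketch of what "a C-like substitute for assembly" looks like for a vector operation. It is not taken from the article. The function name `add4_i32` is made up, the SSE2 intrinsics from `<emmintrin.h>` are used only when the compiler defines `__SSE2__` (as GCC and Clang do on x86-64), and a plain loop is used otherwise.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Add four 32-bit integers element-wise: out[i] = a[i] + b[i].
   Hypothetical helper for illustration. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
#if defined(__SSE2__)
    /* Each intrinsic corresponds to one SSE2 instruction:
       unaligned load, packed 32-bit add, unaligned store. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Portable fallback when SSE2 is not available. */
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

The programmer picks the instructions, but register allocation and scheduling around them are still left to the compiler, which is where the problems described above come from.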

  • (Score: 0) by Anonymous Coward on Sunday September 11 2016, @02:37AM

    by Anonymous Coward on Sunday September 11 2016, @02:37AM (#400172)

    >Summary: Hand coded assembly runs faster than C.
    >I think this may be true for very small examples, but on bigger programs is generally false.

    You think wrong, or not at all. Take your pick.
    What a program does 99% of its time when really working, is repeating some simple thing a multitude of times. Some *very small* simple thing. Everything outside of that thing (or a few such) is near irrelevant to performance.

    >Modern compiler optimizers can do nearly as good as an expert doing careful hand packing to keep a bunch of functional units doing useful things.

    Which does not do jack when those things are superfluous.
    You can be running twice as fast and still arrive later, if you take a route five times longer.

    >But the compiler can do it over the whole program which an expert can do only in a very few special cases.

    And nobody ever cares about the "whole program" taking a millisecond less to run its once-per-session code, when the critical calculation loop is taking two minutes/hours/days instead of one.