
posted by martyb on Saturday September 10 2016, @01:13PM   Printer-friendly
from the some-assembly-required dept.

Dan Luu demonstrates that even when optimizing, compilers often produce code much slower than fairly basic hand-written source that is within reach of any assembly programmer: Hand coded assembly beats intrinsics in speed and simplicity:

Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross-platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade-old claims that intrinsics can make your life easier, it never seems to work out.

The last time I tried intrinsics was around 2007; for more on why they were hopeless then, see this exploration by the author of VirtualDub. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.

For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 out of the top 3 Bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.

In my own experience, I have yet to find an optimizing compiler that generates code as fast or as compact as I am able to with hand-optimized code.

Dan Luu's entire website is a treasure trove of education for experienced and novice coders alike. I look forward to studying the whole thing. His refreshingly simple HTML 1.0 design is obviously intended to educate, and bears out my assertion that the true experts all have austere websites.


Original Submission

 
  • (Score: 4, Insightful) by bradley13 on Saturday September 10 2016, @02:36PM

    by bradley13 (3053) on Saturday September 10 2016, @02:36PM (#399993) Homepage Journal

    Granted, I haven't programmed assembly in any sort of serious fashion for a very long time. However, I really don't think this article is news. Hand-crafted assembly has always been, and probably always will be, faster than compiled code.

    The problem is simply complexity. A human can craft 10 lines of assembly, keeping all important aspects of CPU architecture in mind, really easily. 100 lines isn't too hard. 1000 lines of truly complex code, and your brain starts to melt.

    The entire purpose of higher level languages is to reduce complexity. The price we pay is a loss of efficiency.

    To gain even more leverage, we add libraries and build frameworks on top of the higher level languages. This costs even more efficiency. By the time you've developed something really big and complex, you may well have lost a factor of (WAG = wild-assed guess) 100x in efficiency. But since the processor runs at 3+ GHz, we mostly don't care.

    --
    Everyone is somebody else's weirdo.
  • (Score: 1, Insightful) by Anonymous Coward on Saturday September 10 2016, @04:39PM

    by Anonymous Coward on Saturday September 10 2016, @04:39PM (#400024)

    Yes, and the hand-crafted assembly usually has to be CPU-specific to get the best performance (pipelines and all that).

    About a million years ago (in 1999) I bought myself an AMD K6-2/400 because it had floating-point SIMD (3DNow) and I figured it would be cool to write some code for it, and I was keen to improve my very feeble coding skills, and I needed a hobby to keep me out of the pub of an evening. (There were loads of posers over on the green site raving about the AltiVec on their Power Macs in those days.)

    So I had this "great idea" of a library of code, cross-platform, with C and SIMD implementations of various simple bits and pieces so that you could get a bit of a performance boost if you had the hardware. Being young and stupid, I had no idea how much work it would be, or how bad my programming skills really were. I was also completely underwhelmed by the amount of interest there was in that sort of thing in the world in general. And anyway, I eventually found myself in situations where I had no time to work on it, instead having to focus on evil things like C++ and Perl.

    It rots away on sourceforge [sourceforge.net]. Every so often I decide to do some more to it, but I'm always thwarted by life. This time it was Java :-( The last thing I put into it was a rudimentary C unit testing framework of my own quixotic devising. That's probably more use than the rest of it, which doesn't actually do very much at all.

    • (Score: 0) by Anonymous Coward on Saturday September 10 2016, @07:03PM

      by Anonymous Coward on Saturday September 10 2016, @07:03PM (#400067)
      • (Score: 0) by Anonymous Coward on Saturday September 10 2016, @07:35PM

        by Anonymous Coward on Saturday September 10 2016, @07:35PM (#400074)

        Cool.

  • (Score: 0) by Anonymous Coward on Saturday September 10 2016, @06:51PM

    by Anonymous Coward on Saturday September 10 2016, @06:51PM (#400065)

    Wait. I thought the point of libraries was to write specialized, optimized code for well-specified problems.
    Are you saying I can do better in my own code than linking to FFTW?
    Literally, if I copy their code into my source tree (because I know that they *do* optimize things by hand), how is that different from linking to their library?

  • (Score: 0) by Anonymous Coward on Sunday September 11 2016, @12:48AM

    by Anonymous Coward on Sunday September 11 2016, @12:48AM (#400153)
    These days, machine cycles are cheap and are still getting cheaper as technology progresses. Programmer cycles though are more or less as expensive today as they were fifty years ago. We have the silicon today that can tolerate us being slackers. If you have a problem that really needs you to hit the bare metal, you'll know it, and it's usually only a tiny inner loop that could do with optimisation, or a section that could do with an algorithm that is not easy to specify in a high-level language (e.g. a threaded bytecode interpreter).
  • (Score: 3, Interesting) by TheRaven on Sunday September 11 2016, @09:42AM

    by TheRaven (270) on Sunday September 11 2016, @09:42AM (#400245) Journal

    A human can craft 10 lines of assembly, keeping all important aspects of CPU architecture in mind, really easily

    Architecture? Maybe. Microarchitecture? No chance. A modern CPU can have around a hundred instructions in flight at a time, typically has at least half a dozen independent pipelines, and has complex dependencies between them. It also has a very complex register rename unit and some horrible interactions between that and the pipeline (for example, on a number of recent Intel microarchitectures, xor %rax, %rax provides a hint that a register is dead and so can result in a factor of 2 speedup in a tight loop).

    Even if you can keep all of this in your head, these details change significantly between microarchitectures. A little while ago, a colleague of mine wrote hand-crafted assembly routines for the C standard library memcpy, strcpy, and friends using a mixture of different AVX and SSE instructions. He found that a different one gave the best performance on each of the four most recent Intel microarchitectures and in some cases different ones gave the best performance with different microcode revisions within the same microarchitecture.

    It's also worth noting that a big part of the reason a human can ever beat a compiler is that a compiler almost never tries to generate optimal code, because doing so takes too long. Using a superoptimiser and an SMT solver for scheduling will beat any assembly programmer, but no one does this in conventional compilers because most people don't want to burn 15 minutes of CPU time optimising a single function, for every function in their code. If that function happens to be the hottest code path in their program, however, it's probably a lot cheaper to burn even a few hours of CPU time than a couple of days of an expert assembly programmer's time.

    --
    sudo mod me up
  • (Score: 0) by Anonymous Coward on Sunday September 11 2016, @01:53PM

    by Anonymous Coward on Sunday September 11 2016, @01:53PM (#400271)

    I think much of it also depends on how many people are using the code, how much they're using it, how long it will be in use before being replaced or needing serious updating, and how fast their computers are.

    If you are the only one using the code, and it takes you 10 more hours to code something in assembly to save 1 minute of processing a day, then you may be losing time by hand-coding it. Heck, even if in the long run you technically save a little bit of time, it may still not be worth it: while the computer is busy doing one thing, I can do something else off the computer or on another computer. While coding, you don't have that luxury as much, since the limiting factor is your own activity.

    Now if you are writing a piece of code for 1000 people and it saves each person an average of 1 minute a day, that's 1000 minutes a day saved. Then it may be prudent for an organization to pay the right people to hand-code it.

    But, again, it's not always that simple. There are issues with hand-coding it: bugs, adding features, making updates and changes and general maintenance, finding the right people to code and maintain it in case one of your coders leaves (since only so many people can hand-code, or are willing to), and compatibility issues (as others have pointed out).