
posted by martyb on Saturday September 10 2016, @01:13PM
from the some-assembly-required dept.

Dan Luu demonstrates that, even when optimizing, compilers often produce much slower code than straightforward hand-written source that is within reach of any assembly programmer: Hand coded assembly beats intrinsics in speed and simplicity:

Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade old claims that intrinsics can make your life easier, it never seems to work out.

The last time I tried intrinsics was around 2007; for more on why they were hopeless then, see this exploration by the author of VirtualDub. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.

For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 out of the top 3 Bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.

In my own experience, I have yet to find an optimizing compiler that generates code as fast or as compact as I am able to with hand-optimized code.
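
To make the promise concrete, here is a minimal sketch (not Dan Luu's actual benchmark; the function name and build flags are illustrative) of the intrinsics approach described above: summing the set bits of a buffer with the popcnt intrinsic, assuming a compiler that provides <nmmintrin.h> and is built with popcnt support enabled (e.g. -mpopcnt or -msse4.2):

    /* Sketch only: count the set bits in a buffer using the hardware
     * popcnt instruction via its intrinsic. */
    #include <stddef.h>
    #include <stdint.h>
    #include <nmmintrin.h>

    uint64_t buffer_popcount(const uint64_t *buf, size_t words)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < words; i++)
            total += (uint64_t)_mm_popcnt_u64(buf[i]); /* one popcnt per 64-bit word */
        return total;
    }

Whether the compiler turns that loop into the tight sequence a hand-coder would write is exactly the question the article examines.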

Dan Luu's entire website is a treasure trove of education for experienced and novice coders alike. I look forward to studying the whole thing. His refreshingly simple, HTML-1.0-style design is obviously intended to educate, and it supports my assertion that the true experts all have austere websites.


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 0) by Anonymous Coward on Saturday September 10 2016, @01:43PM

    by Anonymous Coward on Saturday September 10 2016, @01:43PM (#399980)

    I did tests over the years and got the same results, but only a few ever listened. It is not just assembly-level coding; the same thing shows up in higher-level languages...

    Example: In C
                a(i++) += b(j++)
    vs.
                a(i) += b(j)
                i++
                j++

    which is faster? which is clearer to read?

    In assembler (shown in a high-level view):
              mult 80 into R1
    vs.
              copy R1 to R2
              shift left R1 by 2 bits
              add R2 to R1
              shift left R1 by 4 bits

              why?: (R1 x4 + R1) x16 = R1 x80

    Yes, the first case is clearer (though that mostly depends on how the registers are used and how many are available). The second is WAY faster.
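
    As a sketch, the same shift-and-add sequence written out in plain C (illustrative function name; register allocation is left to the compiler, and whether it actually beats a plain multiply depends on the target, as noted further down):

        unsigned mul80_shifts(unsigned x)
        {
            unsigned t = x;  /* copy R1 to R2             */
            x <<= 2;         /* shift left by 2 bits: 4x  */
            x += t;          /* add R2 to R1:         5x  */
            x <<= 4;         /* shift left by 4 bits: 80x */
            return x;
        }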

  • (Score: 0) by Anonymous Coward on Saturday September 10 2016, @01:50PM

    by Anonymous Coward on Saturday September 10 2016, @01:50PM (#399982)

    Experienced C++ programmers know that dereferencing an array and using a postfix increment or decrement at the same time is costly in terms of performance. So this isn't a great example.

    • (Score: 0) by Anonymous Coward on Saturday September 10 2016, @09:10PM

      by Anonymous Coward on Saturday September 10 2016, @09:10PM (#400105)

      x++ vs ++x

      Well, that depends (like most things in computer science). If you are using C++ on a class type, then operator++ is a time bomb of a performance sink: since it is almost never a built-in, it is a method call, with an extra copy (and another method call) on top depending on which way you do it.

      If it is an integer, it usually compiles the same way either way, as almost all compilers recognize the pattern and optimize away the unneeded copy.
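
      A tiny sketch of that point in plain C (illustrative names; when the value of the expression is discarded, a compiler will normally emit identical code for both forms):

          /* Post- and pre-increment of a plain int with the result unused --
           * these generally compile to the same code. */
          void bump_post(int *p) { (*p)++; }
          void bump_pre(int *p)  { ++(*p); }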

  • (Score: 2) by TheRaven on Saturday September 10 2016, @02:37PM

    by TheRaven (270) on Saturday September 10 2016, @02:37PM (#399994) Journal
    This is a really bad example, because after SSA construction the compiler IR for the two will be identical. Then it's down to pipelining. Exactly what you want the reassociation to do in this case also depends hugely on the addressing modes of your target.
    --
    sudo mod me up
  • (Score: 3, Informative) by wonkey_monkey on Saturday September 10 2016, @03:56PM

    by wonkey_monkey (279) on Saturday September 10 2016, @03:56PM (#400006) Homepage

    Example: In C
                            a(i++) += b(j++)
    vs.
                            a(i) += b(j)
                            i++
                            j++

    That doesn't look like C to me...

    --
    systemd is Roko's Basilisk
  • (Score: 0) by Anonymous Coward on Saturday September 10 2016, @09:37PM

    by Anonymous Coward on Saturday September 10 2016, @09:37PM (#400112)

    The second is WAY faster.
    Maybe. That depends on the CPU. For an old x86 CPU that was probably very true, as the mul instruction was very slow. These days it often retires in less than one tick, and if you do it right you can pair up 2-4 of them (again, depending on the CPU). As always, it depends on the architecture you are targeting.

    // your asm code in C
    int r1, r2;
    r1 = 50;   // your starting number
    r2 = r1;
    r1 <<= 2;
    r1 += r2;
    r1 <<= 4;

    That *should* force the compiler to spit out asm that looks like what you wrote, and it is fairly portable. I would not count 100% on it being terribly fast on a newer CPU, though, or on it avoiding stalls while instructions wait on register dependencies. It could be fine, and it could even end up faster because the CPU may be able to hoist later instructions, but you would want to test it. Like I said, test it; I wouldn't count on it.

    Remember, C was designed to sit close to the 'bare' metal. Over the years it has picked up lots of little things that make life easier for programmers, but it can still do this kind of low-level work; the resulting code usually just looks pretty terrible and reads poorly.
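
    One quick way to test it (a sketch; the file name and compiler invocation are just examples) is to compile both forms and read the generated assembly, e.g. with cc -O2 -S mul80_check.c:

        /* mul80_check.c -- compare what the compiler emits for the two forms */
        unsigned mul80_plain(unsigned x)
        {
            return x * 80;              /* the "clear" version  */
        }

        unsigned mul80_shifted(unsigned x)
        {
            return ((x << 2) + x) << 4; /* (4x + x) * 16 == 80x */
        }

    Either way, measuring and reading the output beats guessing.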