
SoylentNews is people

posted by martyb on Saturday September 10 2016, @01:13PM
from the some-assembly-required dept.

Dan Luu demonstrates that even optimizing compilers often produce code far slower than the simple assembly any assembly programmer could write by hand, in Hand coded assembly beats intrinsics in speed and simplicity:

Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade old claims that intrinsics can make your life easier, it never seems to work out.

The last time I tried intrinsics was around 2007; for more on why they were hopeless then, see this exploration by the author of VirtualDub. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.

For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 out of the top 3 bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.
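As a rough illustration of the two approaches being compared, here is a minimal Python sketch of counting the set bits in a buffer. It models only the logic, not the performance: `popcount_table` mirrors the idea behind the pshufb method (a 16-entry lookup table indexed by each 4-bit nibble), while `popcount_words` walks the buffer one 64-bit word at a time, which is the granularity a hardware popcnt works at. The function names are hypothetical, and neither function uses the actual instructions.

```python
# 16-entry table: number of set bits in each 4-bit value (0..15).
# The pshufb approach keeps a table like this in a vector register.
NIBBLE_POPCOUNT = bytes(bin(i).count("1") for i in range(16))


def popcount_table(buf: bytes) -> int:
    """Count set bits by looking up the low and high nibble of each byte."""
    total = 0
    for b in buf:
        total += NIBBLE_POPCOUNT[b & 0x0F] + NIBBLE_POPCOUNT[b >> 4]
    return total


def popcount_words(buf: bytes) -> int:
    """Count set bits one 64-bit word at a time (a stand-in for popcnt)."""
    total = 0
    for i in range(0, len(buf), 8):
        # A short final chunk is simply treated as a smaller integer.
        word = int.from_bytes(buf[i:i + 8], "little")
        total += bin(word).count("1")
    return total
```

Both functions must agree on any input; which one is faster on real hardware is exactly the question the article investigates, and this sketch says nothing about that.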

In my own experience, I have yet to find an optimizing compiler that generates code as fast or as compact as what I can write by hand.

Dan Luu's entire website is a treasure trove of education for experienced and novice coders alike. I look forward to studying the whole thing. His refreshingly simple HTML 1.0 design is obviously intended to educate, and is an example of my assertion that the true experts all have austere websites.


  • (Score: 2) by turgid on Sunday September 11 2016, @11:10AM

    by turgid (4318) on Sunday September 11 2016, @11:10AM (#400252)

    Today, it does not matter so much just how fast programs run. If it did, scripting languages would not have the popularity they have now.

    Scripting languages are there to fulfil the niche "computer time cheap, person time expensive" by making it easy for the human to get a lot of a certain class of repetitive work done with orders of magnitude less person-effort than would be required with a language like FORTRAN, C++ or similar.

    I stumbled into a job working on projects where you'd think that correct, simple, efficient code would be the highest priority. Instead, we have (in most cases) thousands of lines of very badly-written C++ (just plain broken in many cases) wasting clock cycles like they're going out of fashion. (Linus was right, the best argument against using C++ is C++ programmers).

    But you're right, just throw bigger hardware at it...

    The funny thing is, you can open a random source file and, after reading it for 10 minutes, see the abuse of the type system, the memory leaks, the unnecessary repetition of operations, and the crazy use of memory, which will probably thrash the TLBs and cache hierarchy, slowing the code down by a factor of 100 to 1000. But what do I know, I'm from the countryside.

    The cool thing is, I get to play with machines with 128GB of RAM, large amounts of nVidia GPU and 24 virtual cores :-) It's amazing what you can do with a bash script on such a box...

  • (Score: 3, Interesting) by mendax on Monday September 12 2016, @01:26AM

    by mendax (2840) on Monday September 12 2016, @01:26AM (#400446)

    Scripting languages are there to fulfil the niche "computer time cheap, person time expensive" by making it easy for the human to get a lot of a certain class of repetitive work done with orders of magnitude less person-effort than would be required with a language like FORTRAN, C++ or similar.

    Agreed.

    (Linus was right, the best argument against using C++ is C++ programmers).

    I've not heard this, but it's probably correct. C++ is an awful bastard of a language. I am glad I have long forgotten how to write programs in it. C# or Java are generally much more pleasant to use.

    But you're right, just throw bigger hardware at it...

    This statement reminds me of an article [righto.com] that was posted here a year or so ago, describing the generation of a Mandelbrot fractal on a 55-year-old IBM 1401 at the Computer History Museum. The program this fellow wrote in assembly language took 12 minutes to generate it. I wrote a similar program in C on my 8-year-old iMac and ran it. The Unix "time" utility reported that it took 0.001 seconds to complete. And then there is a talk on YouTube that commemorated the 40th anniversary of the IBM System/360 series of mainframes in 2004. In that talk, an IBM executive stated that a program that took seven days running 24/7 on the IBM 360/30, the first of that venerable line to be introduced, would complete in under a second on the top-of-the-line z-series mainframe available at that time, unmodified and without being recompiled. We have the bigger hardware now and can afford to write some really shitty code. However, I try to write pretty clean code myself, but I don't worry about efficiency so much.

    The cool thing is, I get to play with machines with 128GB of RAM, large amounts of nVidia GPU and 24 virtual cores :-) It's amazing what you can do with a bash script on such a box...

    Why don't you tell us? It must be interesting.

    --
    It's really quite a simple choice: Life, Death, or Los Angeles.