Dan Luu demonstrates that, even when optimizing, compilers often produce very slow code compared to very basic hand-written source that is easily accessible to any assembly programmer: hand-coded assembly beats intrinsics in speed and simplicity:
Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade old claims that intrinsics can make your life easier, it never seems to work out.
The last time I tried intrinsics was around 2007; for more on why they were hopeless then, see this exploration by the author of VirtualDub. I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.
For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 out of the top 3 bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.
In my own experience, I have yet to find an optimizing compiler that generates code as fast or as compact as I am able to with hand-optimized code.
Dan Luu's entire website is a treasure trove of education for experienced and novice coders alike. I look forward to studying the whole thing. His refreshingly simple HTML 1.0 design is obviously intended to educate, and is an example of my assertion that the true experts all have austere websites.
(Score: 2, Insightful) by Anonymous Coward on Saturday September 10 2016, @01:32PM
The article is not very interesting, but the summary title is.
Article: Programs which use special instructions (intrinsics) instead of combinations of generic instructions run faster.
The article example is population count to count the numbers of ones in a register.
Summary: Hand coded assembly runs faster than C.
I think this may be true for very small examples, but on bigger programs it is generally false.
Modern compiler optimizers can do nearly as well as an expert doing careful hand-packing to keep a bunch of functional units doing useful things.
But the compiler can do it over the whole program, which an expert can do in only a very few special cases.
These days, a good strategy is to write the C that needs to run fast a bit like assembly, and give the optimizer plenty of opportunities to set up a parallel operation pipeline.
This gets most of the way to hand-packed speed, and the code stays portable.
You still won't get to the intrinsic speed, but that's not interesting; it's obvious.
(Score: 1, Insightful) by Anonymous Coward on Saturday September 10 2016, @01:54PM
TFA is a good article because it's readable and presents actual code and performance numbers. And it tries several approaches that various readers would've asked about.
It's easy to sit back and say "hand coded assembly is obviously faster, but it's not worth the dev time on AMD64" w/o providing code or data.
(Score: 3, Interesting) by Ethanol-fueled on Saturday September 10 2016, @10:11PM
This is exactly the kind of article that SN needs more of, in my opinion; along with NCommander's recent journals about assembler. Computing discussions such as these are a lot more relatable than obscure new species of bacteria discovered in the broiled anuses of Polynesian feral hogs. Computing discussions like these are a lot more enriching than "new gadget released" articles.
They are understood by advanced computer nerds and not so far out of reach for people like me who've received some formal education in computing but lack the autodidactic zeal of staff and more advanced members alike.
The only problem is, since the submitter is MDC, it may be a form of advanced troll. I still haven't figured out whether or not MDC is an advanced troll or genuinely nuts. He walks a fine line.
(Score: 3, Interesting) by NCommander on Sunday September 11 2016, @06:53AM
I need to get the next article written up. I've got most of the code done (which proved to be much more difficult than expected). Life went unexpectedly pear-shaped last week, but I'm getting back into it now.
Still always moving
(Score: 2) by davester666 on Monday September 12 2016, @07:50AM
I think this is the most important phrase in TFS:
The number of assembly code programmers is a pretty small group out of the entire set of people who could be classified as "programmers".
(Score: 0) by Anonymous Coward on Saturday September 10 2016, @04:21PM
Compiler intrinsics are supposed to be a friendly C-like substitute for assembly. They let you do things that aren't easily expressed in standard C, like vector operations.
The article shows that intrinsics actually don't work very well: the compiler does a pretty bad job of scheduling instructions around them. BTW, I've seen it myself; Visual Studio tends to route things through memory when it shouldn't need to. The article used clang and gcc. In other words, all the popular compilers suck.
(Score: 0) by Anonymous Coward on Sunday September 11 2016, @02:37AM
>Summary: Hand coded assembly runs faster that C.
>I think this may be true for very small examples, but on bigger programs is generally false.
You think wrong, or not at all. Take your pick.
What a program does 99% of its time when really working, is repeating some simple thing a multitude of times. Some *very small* simple thing. Everything outside of that thing (or a few such) is near irrelevant to performance.
>Modern compiler optimizers can do nearly as good as an expert doing careful hand packing to keep a bunch of functional units doing useful things.
Which does not do jack when those things are superfluous.
You can be running twice as fast and still arrive later, if you take a route five times longer.
>But the compiler can do it over the whole program which an expert can do only in a very few special cases.
And nobody ever cares about the "whole program" taking a millisecond less to run its once-per-session code, when the critical calculation loop is taking two minutes/hours/days instead of one.
(Score: -1, Offtopic) by Anonymous Coward on Saturday September 10 2016, @01:36PM
>the true experts all have austere websites
.. and kissing spying ass
<script async src='//www.google-analytics.com/analytics.js'></script>
(Score: 1, Offtopic) by number6 on Saturday September 10 2016, @02:27PM
<https://addons.mozilla.org/en-US/firefox/addon/requestpolicy/>
## RequestPolicy
## by Justin Samuel
Works with Firefox 4.0 - 31.0
Be in control of which cross-site requests are allowed. Improve the privacy
of your browsing by not letting other sites know your browsing habits.
Secure yourself from Cross-Site Request Forgery (CSRF) and other attacks.
About this Add-on:
RequestPolicy's new version is under development as RequestPolicy Continued:
<https://addons.mozilla.org/en-US/firefox/addon/requestpolicy-continued/>
Ultimately, I'd like RequestPolicy Continued to replace RequestPolicy.
RequestPolicy development has been slow because I am working on ServerPilot,
a cPanel alternative for DigitalOcean servers. Thankfully, some great
community members are taking over development of RequestPolicy.
Versions:
* Version 0.5.28.1-signed.1-signed
Released July 30, 2013
Works with Firefox 4.0 and later, SeaMonkey 2.1 and later
* [...]
* Version 0.5.8
Released June 5, 2009
Works with Firefox 3.0 - 3.6b1pre, Mobile 0.1 - 1.0.*,
SeaMonkey 2.0a - 2.1a1pre
-----------------------------------------------------------------------------
<https://addons.mozilla.org/en-US/firefox/addon/requestpolicy-continued/>
## RequestPolicy Continued
## by Martin Kimmerle
Works with Firefox 24.0 - 50.0a1
Be in control of which cross-site requests are allowed. Improve the privacy
of your browsing by not letting other sites know your browsing habits.
Secure yourself from Cross-Site Request Forgery (CSRF) and other attacks.
About this Add-on:
This add-on is the continuation of RequestPolicy by Justin Samuel.
If you've got problems, please visit our GitHub repository:
<https://github.com/RequestPolicyContinued/requestpolicy>
RequestPolicy is not a replacement for "NoScript" (addon); each focuses on
different important issues. For the best security, we recommend using
both RequestPolicy and NoScript. More information on the difference
between the two is available here..
<https://requestpolicycontinued.github.io/#faq-noscript>
Versions:
* Version 1.0.beta12.3.1531.r1d2cd54.pre
Released August 15, 2016
Works with Firefox 24.0 and later, SeaMonkey 2.21 and later
* [...]
* Version 1.0.beta11
Released February 2, 2016
Works with Firefox 24.0 and later, SeaMonkey 2.21 and later
-----------------------------------------------------------------------------
(Score: 2) by Kilo110 on Saturday September 10 2016, @08:58PM
I don't think there's anything nefarious about analytics by itself. I think of it as a glorified site hit counter. He probably likes to see where his users are coming from, which pages generate the most activity, etc. It's all great info to have.
(Score: 0) by Anonymous Coward on Saturday September 10 2016, @01:43PM
I did tests over the years and got the same results, but only a few ever listened. It is not just assembly-level coding; it applies even to higher-level languages...
Example: In C
a(i++) += b(j++)
vs.
a(i) += b(j)
i++
j++
which is faster? which is clearer to read?
In assembler: shown in high level view.
mult 80 into R1
vs.
copy R1 to R2
shift left R1 by 2 bits
add R2 to R1
shift left R1 by 4 bits
why?: (R1 x4 + R1) x16 = R1 x80
Yes, the first case is clearer (though it mostly depends on how, and how many, registers are actually used). The second is WAY faster.
(Score: 0) by Anonymous Coward on Saturday September 10 2016, @01:50PM
Experienced C++ programmers know that dereferencing an array and using a postfix increment or decrement at the same time is costly in terms of performance. So this isn't a great example.
(Score: 0) by Anonymous Coward on Saturday September 10 2016, @09:10PM
x++ vs ++x
Well, that depends (like most things in computer science). If you are using C++, then operator++ is a time bomb of a performance sink: it is almost never an intrinsic, so it is a method call, possibly with an extra copy (another method call) depending on which way you do it.
If it is an integer it usually compiles the same way either way as almost all compilers recognize it and optimize it away.
(Score: 2) by TheRaven on Saturday September 10 2016, @02:37PM
sudo mod me up
(Score: 3, Informative) by wonkey_monkey on Saturday September 10 2016, @03:56PM
Example: In C
a(i++) += b(j++)
vs.
a(i) += b(j)
i++
j++
That doesn't look like C to me...
systemd is Roko's Basilisk
(Score: 0) by Anonymous Coward on Saturday September 10 2016, @09:37PM
The second is WAY faster.
Maybe. That depends on the CPU. For an old x86 CPU that was probably very true as the mul instruction was very slow. Many times these days it retires in less than 1 tick. If you do it right you can pair up to 2-4 (again depending on the CPU). Again it depends on the architecture you are using.
// your asm code in C
int r1, r2;
r1 = 50; // your starting number
r2 = r1;
r1 <<= 2;
r1 += r2;
r1 <<= 4;
That *should* force the compiler to spit out asm code that looks like what you wrote, and it's fairly portable. I would not 100% count on that being terribly fast on a newer CPU, though: register dependencies can still produce CPU stalls as things wait. It could be fine. But you would want to test it. It could also be faster, due to the fact that the CPU may be able to hoist later instructions. Like I said, test it; I wouldn't count on it.
Remember, C was designed to be 'bare metal'. It has, however, picked up lots of little things over the years that make life easier for programmers. It can still do it, but the code usually looks pretty terrible and has poor readability.
(Score: 2) by RamiK on Saturday September 10 2016, @02:12PM
That Intel popcnt false-dependency bug significantly hurts the compiler's ability to optimize the loops. Barring the manual unrolling, most of this can usually be done in the compiler.
Still, now that process nodes don't really shrink all that frequently, there's a real case for doing things manually. After all, if a couple of years ago a 50-100% performance increase would have been met with "wait a couple more months for better hardware", now it's almost a year's worth...
Regardless, a great read and a great site. Thanks :)
compiling...
(Score: 5, Insightful) by TheRaven on Saturday September 10 2016, @02:40PM
sudo mod me up
(Score: 1, Interesting) by Anonymous Coward on Saturday September 10 2016, @04:26PM
You also get developers running unoptimized code, which forces them to care more about not wasting performance.
OTOH, the optimizer can uncover bugs. Finding these late can be painful.
(Score: 2) by PocketSizeSUn on Saturday September 10 2016, @05:05PM
Completely agree.
For game developers compiler speed is a critical factor for the reasons you stated.
I do the same in my personal projects ... even when compiler speed is a non issue.
After all, no optimizer can make a poor choice in algorithm fast.
However, the optimization pass can bite you: if it bleeps up, it can take a while to figure out what went wrong. So you have to strike a balance.
And as a special surprise, in some cases optimizing for size can shrink your code segment enough to pin your hot path into CPU cache, giving a massive boost in performance.
In the game arena I think this is more pronounced, because the targets are generally well known and have relatively long product cycles.
YMMV
(Score: 2) by TheRaven on Monday September 12 2016, @07:33AM
And as a special surprise .. in some cases optimizing for size can shrink your code segment enough to pin your hot path into cpu cache giving a massive boost in performance.
I'm not sure if they still do, but Apple used to compile all of OS X with -Os. They found that a lot of benchmarks ran faster with -O2, but overall system performance was worse. When you're expecting your users to run a dozen apps at the same time, compiling those and the libraries and frameworks that they use to take up less i-cache space turned out to be quite a big win. Being able to keep the hot paths of the window server and one or two active applications in L2 made a big difference.
(Score: 2) by Wootery on Monday September 12 2016, @09:11AM
And as a special surprise .. in some cases optimizing for size can shrink your code segment enough to pin your hot path into cpu cache giving a massive boost in performance.
This is what profile-guided optimisation is for, no? Optimise the hotspots for speed, and the rest for space (to improve cache behavior, as you say).
I can't imagine any serious game development outfit not using PGO.
(Score: 2) by RamiK on Saturday September 10 2016, @09:03PM
Sure, algorithms come first. But mind you, x86 game developers are saying this while their engine developers are coding in C++ and spending most of their time crafting the minutiae of the GPU streams with bit precision.
Modern out-of-order super-scalars make life really easy. Switch to GPUs & DSPs, and all those high and mighty "algorithms come first" statements turn to whispers as people spend their days analyzing EBBs and preventing cache saturation.
But yeah, algorithms are a given.
(Score: 4, Insightful) by bradley13 on Saturday September 10 2016, @02:36PM
Granted, I haven't programmed assembly in any sort of serious fashion for a very long time. However, I really don't think this article is news. Hand-crafted assembly has always been, and probably always will be, faster than compiled code.
The problem is simply complexity. A human can craft 10 lines of assembly, keeping all important aspects of CPU architecture in mind, really easily. 100 lines isn't too hard. 1000 lines of truly complex code, and your brain starts to melt.
The entire purpose of higher level languages is to reduce complexity. The price we pay is a loss of efficiency.
To gain even more leverage, we add libraries and build frameworks on top of the higher-level languages. This costs even more efficiency. By the time you've developed something really big and complex, you may well have lost a factor of (WAG = wild-assed guess) 100x in efficiency. But the processor runs at 3+ GHz, so we mostly don't care.
Everyone is somebody else's weirdo.
(Score: 1, Insightful) by Anonymous Coward on Saturday September 10 2016, @04:39PM
Yes, and the hand-crafted assembly usually has to be CPU-specific to get the best performance (pipelines and all that).
About a million years ago (in 1999) I bought myself an AMD K6-2/400 because it had floating-point SIMD (3DNow) and I figured it would be cool to write some code for it, and I was keen to improve my very feeble coding skills, and I needed a hobby to keep me out of the pub of an evening. (There were loads of posers over on the green site raving about the AltiVec on their Power Macs in those days.)
So I had this "great idea" of a library of code, cross-platform, with C and SIMD implementations of various simple bits and pieces so that you could get a bit of a performance boost if you had the hardware. Being young and stupid, I had no idea how much work it would be, or how bad my programming skills really were. I was also completely underwhelmed by the amount of interest there was in that sort of thing in the world in general. And anyway, I eventually found myself in situations where I had no time to work on it, instead having to focus on evil things like C++ and Perl.
It rots away on sourceforge [sourceforge.net]. Every so often I decide to do some more to it, but I'm always thwarted by life. This time it was Java :-( The last thing I put into it was a rudimentary C unit testing framework of my own Quixotic devising. That's probably more use than the rest of it, which doesn't actually do very much at all.
(Score: 0) by Anonymous Coward on Saturday September 10 2016, @07:03PM
My project: http://sourceforge.net/projects/chaosesqueanthology/ [sourceforge.net]
(Score: 0) by Anonymous Coward on Saturday September 10 2016, @07:35PM
Cool.
(Score: 0) by Anonymous Coward on Saturday September 10 2016, @06:51PM
wait. I thought the point of libraries was to write specialized, optimized code for well specified problems.
are you saying I can do better in my own code than linking to FFTW?
literally, if I copy their code into my source tree (because I know that they *do* optimize things by hand), how is that different from linking to their library?
(Score: 0) by Anonymous Coward on Sunday September 11 2016, @12:48AM
(Score: 3, Interesting) by TheRaven on Sunday September 11 2016, @09:42AM
A human can craft 10 lines of assembly, keeping all important aspects of CPU architecture in mind, really easily
Architecture? Maybe. Microarchitecture? No chance. A modern CPU can have around a hundred instructions in flight at a time, typically has at least half a dozen independent pipelines, and has complex dependencies between them. It also has a very complex register rename unit and some horrible interactions between that and the pipeline (for example, on a number of recent Intel microarchitectures, xor %rax, %rax provides a hint that a register is dead and so can result in a factor of 2 speedup in a tight loop).
Even if you can keep all of this in your head, these details change significantly between microarchitectures. A little while ago, a colleague of mine wrote hand-crafted assembly routines for the C standard library memcpy, strcpy, and friends using a mixture of different AVX and SSE instructions. He found that a different one gave the best performance on each of the four most recent Intel microarchitectures and in some cases different ones gave the best performance with different microcode revisions within the same microarchitecture.
It's also worth noting that a big part of the reason a human can ever beat a compiler is that a compiler almost never tries to generate optimal code, because it takes too long. Using a superoptimiser and an SMT solver for scheduling will beat any assembly programmer, but no one does this in conventional compilers because most people don't want to burn 15 minutes of CPU time optimising a single function, for every function in their code. If that function happens to be the hottest code path in their program, however, it's probably a lot cheaper to burn even a few hours of CPU time than a couple of days of an expert assembly programmer's time.
(Score: 0) by Anonymous Coward on Sunday September 11 2016, @01:53PM
I think much of it also depends on how many people are using the code, how much they're using it, how long will it be in use before being replaced or need serious updating and changing, and how fast are their computers.
If you are the only one using the code and it takes you 10 more hours to code something in assembly to save 1 minute of processing a day then you may be losing time by hand coding it. Heck, even if in the long run you may technically save a little bit of time it may still not be worth it, at least while my computer is busy doing one thing I can do something else off the computer or on another computer. While coding you don't have that luxury as much since the limiting factor is your activities.
Now if you are writing a piece of code for 1000 people and it saves each person an average of 1 minute a day, that's 1000 minutes a day saved. Then it may be prudent for an organization to pay the right people to hand code it.
But, again, it's not always that simple. There are issues with hand coding it. Bugs, adding features, making updates and changes and general maintenance, finding the right people to code and maintain it in case one of your coders leaves since only so many people can hand code or will be willing to do so, compatibility issues (as others have pointed out).
(Score: 3, Insightful) by Gravis on Saturday September 10 2016, @02:57PM
One thing must be remembered above all else: compilers are dumb. I'm not saying using one is a bad idea; I'm saying they literally lack intelligence. For this reason, they will not outperform human-devised assembly language. The difference is that humans can understand problems as a whole, while a compiler looks at program construction in chunks. A clever human could reserve a specific register for functionality that is used at global scope, while a compiler could never even fathom such an optimization, because it cannot fathom at all! When machines can think up optimizations of their own, they'll cease to be just machines.
(Score: 2) by Scruffy Beard 2 on Saturday September 10 2016, @04:54PM
Wouldn't the 'register' keyword (or a global register variable) hint that you want to do something like that?
(Score: 2) by maxwell demon on Saturday September 10 2016, @05:37PM
That depends on how you define "outperform". If the project is sufficiently large, the compiler will outperform the assembly programmer simply from the fact that even for unoptimized compilation, by the time the high level code has been written, compiled, tested and run, the assembly programmer is still in the implementation stage.
The Tao of math: The numbers you can count are not the real numbers.
(Score: 2) by bob_super on Sunday September 11 2016, @08:14AM
Good, fast, cheap: pick at most two.
(Score: 2) by dingus on Saturday September 10 2016, @10:01PM
A human can barely write assembly though, so it's no great loss.
(Score: 2) by TheRaven on Sunday September 11 2016, @09:49AM
A clever human could reserve a specific register for functionality that is used on a global scope while a compiler could never even fathom such an optimization because it cannot fathom at all!
A human that writes a compiler can though, and that's one of the optimisations that happens when you do interprocedural register allocation. You generally don't, because it's very processor (and RAM) intensive, but there are lots of things that compilers do now that were invented 20 or more years ago and considered infeasible. It's also worth noting that a register-register move is as cheap as a NOP on any processor that does register renaming, so the wins from this are a lot smaller than you'd think. A few years ago, I got a big speedup in some code by undoing this optimisation: depriving the register allocator of a register globally cost a lot more than occasionally having to move a value between registers or spill it to the stack.
The kinds of optimisations that humans are good at are not these mechanical transformations that can be inferred from the code, they're things that can only be done by understanding the purpose of the code. Microoptimisations may give you a factor of 2-3 speedup, but they won't change the complexity class of the code. Going from an n^2 algorithm to an n log(n) algorithm in a hot path will do far more than either a clever compiler or clever assembly optimisations will ever do.
This is why high-level languages tend to do better in the real world than in microbenchmarks. A C or C++ programmer may be able to implement the same algorithm and have it run 5 times faster than someone using a higher-level language, but in the same time the high-level-language programmer can try half a dozen algorithms and pick the one that performs best on the data. An inefficient implementation of a good algorithm almost always beats a good implementation of a bad one.
(Score: 3, Interesting) by opinionated_science on Saturday September 10 2016, @06:03PM
for those of us that use the supers to solve problems too big to solve properly (in joke!) , the only constant in this is that the LINPACK performance is a high water mark.
In essence, if your problem looks like LINPACK (or the BLAS subroutines), you have a shot at maxing your CPU.
If your problem doesn't it might be worth rewriting it to *look* like LINPACK!
The whole "deep mind" BS that is flowing at the moment, is simply some better "small data" manipulations and exposing some LUT stuff.
If you want a modern example of assembler beating compiler - look at the Xeon phi /AVX512 instructions. Compilers can get from a=b+c*d to 512bits but often don't understand bit width paths...(16 bit INT is close to single precision FP, with no exponent)....and mathematical reworking is an unsolved problem for the "deep minds" :-)
(Score: 3, Interesting) by turgid on Saturday September 10 2016, @07:39PM
I moderated you interesting, because there isn't one for "sounds really cool but I have no idea what it's about." Care to elaborate?
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 3, Informative) by bzipitidoo on Sunday September 11 2016, @05:22AM
Those are old FORTRAN libraries. BLAS = Basic Linear Algebra Subprograms and LINPACK is a library that uses BLAS. They've been optimized to the max, carefully checked to maintain numerical stability, and thoroughly examined for bugs. If you're going to use Linear Algebra in a program, you really, really do not want to write your own routines for matrix multiplication, Gaussian elimination, computation of eigenvalues, and so on if you can use BLAS instead.
Not that most programmers need concern themselves with Linear Algebra. The main use is for graphics, and there are so many graphics libraries on top of graphics libraries on top of GPUs that do all the Linear Algebra for you that you need not bother.
(Score: 5, Interesting) by Anonymous Coward on Saturday September 10 2016, @07:13PM
He did not give the compiler as much information as he used in his hand optimized one, and he could have. Mainly he did not specify the micro-architecture he was targeting to the compiler, but his optimizations were based around a false dependency bug in some particular intel chips.
Anyway, I went and tested his code (after fixing a typo that made it not compile...) with clang 3.9. On my AMD CPU with -march=native, I found that the trivial code (using the intrinsic) was far better than all but his SSSE3 monstrosity, which it was about equal with. Note, the numbers here are throughput; higher is better.
Note second run uses "-march=native" and is much faster. Final run includes more versions (including my OpenMP one, the clear winner by adding one pragma)
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ cat README.md
Bytes / cycle. SSSE3 is an implementation that's allegedly faster than using the hardware popcnt instruction. Other lines are various ways of getting gcc/clang to emit popcnt.
This graph is from `clang -O3 -mpopcnt`
| Algorithm | 1k | 4k | 16k | 65k | 256k | 1M | 4M | 16M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intrinsic | 6.9 | 7.3 | 7.4 | 7.5 | 7.5 | 7.5 | 7.5 | 7.5 |
| PSHUFB | 11.5 | 13.0 | 13.3 | 13.4 | 13.1 | 13.4 | 13.0 | 12.6 |
| Unrolled Intrinsic | 12.5 | 14.4 | 15.0 | 15.1 | 15.2 | 15.2 | 15.2 | 15.2 |
| Unrolled Intrinsic 2 | 14.3 | 16.3 | 17.0 | 17.2 | 17.2 | 17.0 | 16.8 | 16.7 |
| Assembly | 17.5 | 23.7 | 25.3 | 25.3 | 26.3 | 26.3 | 25.3 | 24.3 |
Note to self: try running n copies of this at once to see if varying n changes the relative speed of the movdqa versions.
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ clang -O3 -mpopcnt popcnt.c
popcnt.c:11:9: warning: 'MIN_LEN' macro redefined [-Wmacro-redefined]
#define MIN_LEN 1048576*16
^
popcnt.c:10:9: note: previous definition is here
#define MIN_LEN 256/8
^
popcnt.c:19:17: error: use of undeclared identifier 'MAX_LEN'
uint64_t buffer[MAX_LEN] __attribute__((aligned(LINE_SIZE)));
^
popcnt.c:379:34: error: use of undeclared identifier 'MAX_LEN'
for (int len = MIN_LEN; len = MAX_LEN; len *= DELTA) {
^
1 warning and 2 errors generated.
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ clang -O3 -mpopcnt popcnt.c
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ ls
a.out bytes_per_cycle.png crc32_popcnt.cpp data-0.csv data-1.csv make-csv.jl plot-data.jl README.md
bytes_per_cycle-1.png bytes_per_cycle.svg data-0 data-1 data-clang-sandy mula_benchmarks.c popcnt.c
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ ./a.out
bytes: 256
builtin: 4.76
builtin unrolled: 4.81
builtin errata: 4.53
builtin manual: 4.63
SSSE3: 5.44
bytes: 1024
builtin: 7.27
builtin unrolled: 6.15
builtin errata: 5.93
builtin manual: 5.96
SSSE3: 9.75
bytes: 4096
builtin: 8.71
builtin unrolled: 6.50
builtin errata: 6.30
builtin manual: 6.35
SSSE3: 11.98
bytes: 16384
builtin: 7.70
builtin unrolled: 6.33
builtin errata: 6.41
builtin manual: 6.45
SSSE3: 11.00
bytes: 65536
builtin: 7.61
builtin unrolled: 6.18
builtin errata: 6.43
builtin manual: 6.39
SSSE3: 10.87
bytes: 262144
builtin: 7.57
builtin unrolled: 6.34
builtin errata: 6.44
builtin manual: 6.34
SSSE3: 10.75
bytes: 1048576
builtin: 7.56
builtin unrolled: 6.34
builtin errata: 6.45
builtin manual: 6.35
SSSE3: 10.75
bytes: 4194304
builtin: 8.06
builtin unrolled: 6.35
builtin errata: 6.45
builtin manual: 6.35
SSSE3: 11.37
bytes: 16777216
builtin: 8.58
builtin unrolled: 6.33
builtin errata: 6.45
builtin manual: 6.35
SSSE3: 12.56
bytes: 67108864
builtin: 8.70
^C
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ clang -O3 -march=native popcnt.c
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ ./a.out
bytes: 256
builtin: 4.49
builtin unrolled: 3.83
builtin errata: 3.63
builtin manual: 3.70
SSSE3: 4.31
bytes: 1024
builtin: 7.65
builtin unrolled: 4.91
builtin errata: 4.73
builtin manual: 4.78
SSSE3: 7.89
bytes: 4096
builtin: 11.23
builtin unrolled: 6.45
builtin errata: 6.30
builtin manual: 6.23
SSSE3: 11.97
bytes: 16384
builtin: 10.31
builtin unrolled: 5.91
builtin errata: 6.41
builtin manual: 6.32
SSSE3: 11.00
bytes: 65536
builtin: 10.02
builtin unrolled: 6.24
builtin errata: 6.43
builtin manual: 6.50
SSSE3: 10.87
bytes: 262144
builtin: 9.85
builtin unrolled: 6.29
builtin errata: 6.44
builtin manual: 6.34
SSSE3: 10.78
bytes: 1048576
builtin: 9.82
builtin unrolled: 6.31
builtin errata: 6.45
builtin manual: 6.35
SSSE3: 10.75
bytes: 4194304
builtin: 11.12
builtin unrolled: 6.34
builtin errata: 6.45
builtin manual: 6.36
SSSE3: 11.56
bytes: 16777216
builtin: 12.34
builtin unrolled: 6.35
builtin errata: 6.45
builtin manual: 6.37
SSSE3: 12.69
bytes: 67108864
^C
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $
And as long as we are doing silly benchmarks, how about OpenMP? Add one line and it's more than 4x faster for large data sets:
uint32_t builtin_popcnt_openMP(const uint64_t* buf, int len) {
int cnt = 0;
#pragma omp parallel for reduction(+:cnt)
for (int i = 0; i < len; ++i) {
cnt += __builtin_popcountll(buf[i]);
}
return cnt;
}
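For reference, here is a serial sketch (my own, not from the repo) of what the reduction(+:cnt) clause is doing under the hood: each thread accumulates a private partial count over its chunk of the buffer, and the partials are summed when the threads join.

```c
#include <stdint.h>

/* Illustrative only: NTHREADS and the chunking are hypothetical stand-ins
   for what the OpenMP runtime decides at run time. */
enum { NTHREADS = 4 };

uint32_t popcnt_by_chunks(const uint64_t* buf, int len) {
  uint32_t partial[NTHREADS] = {0};
  /* Each "thread" t counts bits over its own slice [lo, hi). */
  for (int t = 0; t < NTHREADS; ++t) {
    int lo = len * t / NTHREADS;
    int hi = len * (t + 1) / NTHREADS;
    for (int i = lo; i < hi; ++i) {
      partial[t] += (uint32_t)__builtin_popcountll(buf[i]);
    }
  }
  /* The implicit reduction at the end of the parallel region. */
  uint32_t cnt = 0;
  for (int t = 0; t < NTHREADS; ++t) {
    cnt += partial[t];
  }
  return cnt;
}
```

Because addition is associative, splitting the sum this way gives the same answer as the serial loop, which is exactly why reduction(+:cnt) is safe to add as a single pragma.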
I also enabled a few other implementations he included in the source:
craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ clang++ -O3 -march=native -fopenmp=libiomp5 popcnt.c && ./a.out
clang-3.9: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated
bytes: 256
builtin32: 2.79
builtin: 3.66
builtin_openMP: 0.08
builtin unrolled: 4.42
builtin unrolled32: 3.89
builtin errata: 4.25
builtin manual: 4.58
builtin movdq: 4.25
builtin movdq unrolled: 4.27
builtin movdq manual: 4.05
SSSE3: 4.46
bytes: 1024
builtin32: 5.20
builtin: 7.18
builtin_openMP: 0.30
builtin unrolled: 5.81
builtin unrolled32: 4.92
builtin errata: 5.64
builtin manual: 5.75
builtin movdq: 5.73
builtin movdq unrolled: 5.66
builtin movdq manual: 5.47
SSSE3: 6.01
bytes: 4096
builtin32: 7.41
builtin: 7.89
builtin_openMP: 1.22
builtin unrolled: 6.22
builtin unrolled32: 5.58
builtin errata: 6.08
builtin manual: 6.21
builtin movdq: 6.30
builtin movdq unrolled: 6.16
builtin movdq manual: 6.05
SSSE3: 7.39
bytes: 16384
builtin32: 6.04
builtin: 8.39
builtin_openMP: 4.53
builtin unrolled: 6.15
builtin unrolled32: 4.77
builtin errata: 6.17
builtin manual: 6.21
builtin movdq: 6.70
builtin movdq unrolled: 6.40
builtin movdq manual: 6.28
SSSE3: 10.67
bytes: 65536
builtin32: 6.84
builtin: 9.99
builtin_openMP: 14.10
builtin unrolled: 6.17
builtin unrolled32: 4.84
builtin errata: 6.43
builtin manual: 6.34
builtin movdq: 6.59
builtin movdq unrolled: 6.48
builtin movdq manual: 6.33
SSSE3: 10.81
bytes: 262144
builtin32: 6.78
builtin: 9.84
builtin_openMP: 30.46
builtin unrolled: 6.13
builtin unrolled32: 4.79
builtin errata: 6.44
builtin manual: 6.35
builtin movdq: 6.57
builtin movdq unrolled: 6.36
builtin movdq manual: 6.34
SSSE3: 10.74
bytes: 1048576
builtin32: 6.77
builtin: 9.82
builtin_openMP: 43.14
builtin unrolled: 6.34
builtin unrolled32: 4.82
builtin errata: 6.45
builtin manual: 6.35
builtin movdq: 6.56
builtin movdq unrolled: 6.36
builtin movdq manual: 6.35
SSSE3: 10.75
bytes: 4194304
builtin32: 7.50
builtin: 11.14
builtin_openMP: 48.01
builtin unrolled: 6.34
builtin unrolled32: 5.20
builtin errata: 6.45
builtin manual: 6.42
builtin movdq: 6.56
builtin movdq unrolled: 6.40
builtin movdq manual: 6.35
SSSE3: 11.57
bytes: 16777216
builtin32: 8.16
builtin: 12.41
builtin_openMP: 50.17
^C
Looks like manual unrolling was a loss in all three runs, and so was the custom assembly, except for the SSSE3 monster.
The lesson I'm taking from this: use intrinsics, and give the compiler all the information you can. That way your code is more portable, you benefit from CPU microarchitecture improvements automatically, and you can easily use fancy new features like OpenMP.
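That lesson boils down to something like this sketch (the function name is mine, not from the benchmark source): one portable builtin loop, where the choice between the hardware popcnt instruction and a software fallback is made entirely by the flags you pass (-mpopcnt, -march=native), not by the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable bit-count over a buffer. gcc and clang compile
   __builtin_popcountll to the popcnt instruction when the target
   supports it, and to a software routine otherwise. */
uint64_t count_ones(const uint64_t *buf, size_t len) {
  uint64_t total = 0;
  for (size_t i = 0; i < len; ++i) {
    total += (uint64_t)__builtin_popcountll(buf[i]);
  }
  return total;
}
```

The same source then tracks new hardware for free: recompile with -march=native on a newer CPU and the compiler picks the best instruction sequence it knows.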
(Score: 2) by mendax on Saturday September 10 2016, @07:41PM
Compiler optimization is an old, old topic, and believe it or not goes all the way back to the first FORTRAN compiler... in 1957.
The designers of the first FORTRAN compiler (John Backus, et al.) realized that they not only had to create a compiler that could generate assembly code from high-level FORTRAN; it had to generate EXCELLENT code, as good as what a human could produce. Why? They realized that for FORTRAN to be accepted by its potential users, it had to generate machine code that was both tight and fast. The reason is pretty simple: computers of that day were very expensive to use and had very little RAM. In 1957, 4K words was typical. They also didn't know it at the time, of course, but the computers then were very slow compared to what we use today. The generated program had to fit into that space and had to execute in about the same amount of time as a handcrafted assembly program.
Guess what? They did it. That compiler, while horribly slow to compile programs because of the limitations of the machines of the day, performed excellent instruction optimization.
Today, it does not matter so much just how fast programs run. If it did, scripting languages would not have the popularity they have now. Dynamic method calls, dynamically detected data types, aspect-oriented programming, and object mocking are powerful tools, but they are horribly slow compared to more traditional strongly typed languages.
It's really quite a simple choice: Life, Death, or Los Angeles.
(Score: 2) by turgid on Sunday September 11 2016, @11:10AM
Scripting languages are there to fulfil the niche of "computer time cheap, person time expensive" by making it easy for a human to get a certain class of repetitive work done with orders of magnitude less person-effort than would be required with a language like FORTRAN, C++ or similar.
I stumbled into a job working on projects where you'd think that correct, simple, efficient code would be the highest priority. Instead, we have (in most cases) thousands of lines of very badly-written C++ (just plain broken in many cases) wasting clock cycles like they're going out of fashion. (Linus was right, the best argument against using C++ is C++ programmers).
But you're right, just throw bigger hardware at it...
The funny thing is, you can open a random source file and after reading it for 10 minutes, you can see the abuse of the type system, the memory leaks, the unnecessary repetition of operations, and the crazy use of memory, which will probably thrash the TLBs and cache hierarchy, slowing the code down by a factor of 100 to 1000. But what do I know, I'm from the countryside.
The cool thing is, I get to play with machines with 128GB of RAM, large amounts of nVidia GPU and 24 virtual cores :-) It's amazing what you can do with a bash script on such a box...
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 3, Interesting) by mendax on Monday September 12 2016, @01:26AM
Agreed.
I've not heard this but it's probably correct. C++ is an awful bastard of a language. I'm glad I have long forgotten how to write programs in it. C# and Java are much more pleasant to use generally.
This statement reminds me of an article [righto.com] that was posted here a year or so ago that described the generation of a Mandelbrot fractal on a 55-year-old IBM 1401 at the Computer History Museum. The program this fellow wrote in assembly language took 12 minutes to generate it. I wrote a similar program in C on my 8-year-old iMac and ran it. The Unix "time" utility reported that it took 0.001 seconds to complete. And then there is a talk on YouTube that commemorated the 40th anniversary of the IBM System/360 series of mainframes in 2004. In that talk, an IBM executive stated that a program that took seven days of 24/7 running on the IBM 360/30, the first of that venerable line to be introduced, would complete in under a second on the top-of-the-line z-series mainframe available at that time, unmodified and not even recompiled. We have the bigger hardware now and can afford to write some really shitty code. I try to write pretty clean code myself, but I don't worry about efficiency so much.
Why don't you tell us. It must be interesting.
It's really quite a simple choice: Life, Death, or Los Angeles.
(Score: 2) by darkfeline on Monday September 12 2016, @05:16PM
The sad thing about this is that it's not even wrong.
I can spend a few minutes writing a hundred lines of Python that runs a couple of orders of magnitude slower than painstakingly optimized assembly that took a few months to write, but programmer time is so much more expensive than CPU time that there isn't even a choice to be made here.
Join the SDF Public Access UNIX System today!