Stories
Slash Boxes
Comments

SoylentNews is people

posted by martyb on Saturday September 10 2016, @01:13PM   Printer-friendly
from the some-assembly-required dept.

Dan Luu demonstrates that even when optimizing, compilers often produce very slow code as compared to very basic source that is easily accessible to every assembly code programmer: Hand coded assembly beats intrinsics in speed and simplicity:

Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade old claims that intrinsics can make your life easier, it never seems to work out.

The last time I tried intrinsics was around 2007; for more on why they were hopeless then (see this exploration by the author of VirtualDub). I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.

For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 out of the top 3 bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.

In my own experience, I have yet to find an optimizing compiler that generates code as fast or as compact as I am able to with hand-optimized code.

Dan Luu's entire website is a treasure trove of education for experienced and novice coders alike. I look forward to studying the whole thing. His refreshingly simple HTML 1.0 design is obviously intended to educate, and is an example of my assertion that the true experts all have austere websites.


Original Submission

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Interesting) by Anonymous Coward on Saturday September 10 2016, @07:13PM

    by Anonymous Coward on Saturday September 10 2016, @07:13PM (#400070)

    He did not give the compiler as much information as he used in his hand optimized one, and he could have. Mainly he did not specify the micro-architecture he was targeting to the compiler, but his optimizations were based around a false dependency bug in some particular intel chips.

    Anyway, I went and tested his code (after fixing a typo that made it not compile...) with clang3.9. On my AMD CPU with march=native, I found that the trivial code (using the intrinsic) was far better than all but his SSE3 monstrosity, which it was about equal with. Note, the numbers here are throughput, higher is better.

    Note second run uses "-march=native" and is much faster. Final run includes more versions (including my OpenMP one, the clear winner by adding one pragma)

    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ cat README.md
    Bytes / cycle. SSSE 3 is an implementation that's allegedly faster than using the hardware popcnt instruction. Other lines are various ways of gcc/clang to emit popcnt.

    This graph is from `clang -O3 -mpopcnt`

    | Algorithm | 1k | 4k | 16k | 65k | 256k | 1M | 4M | 16M |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Intrinsic | 6.9 | 7.3 | 7.4 | 7.5 | 7.5 | 7.5 | 7.5 | 7.5 |
    | PSHUFB | 11.5 | 13.0 | 13.3 | 13.4 | 13.1 | 13.4 | 13.0 | 12.6 |
    | Unrolled Intrinsic | 12.5 | 14.4 | 15.0 | 15.1 | 15.2 | 15.2 | 15.2 | 15.2 |
    | Unrolled Intrinsic 2 | 14.3 | 16.3 | 17.0 | 17.2 | 17.2 | 17.0 | 16.8 | 16.7 |
    | Assembly | 17.5 | 23.7 | 25.3 | 25.3 | 26.3 | 26.3 | 25.3 | 24.3 |

    Note to self: try running n copies of this at once to see if varying n changes the relative speed of the movdqa versions.
    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ clang -O3 -mpopcnt popcnt.c
    popcnt.c:11:9: warning: 'MIN_LEN' macro redefined [-Wmacro-redefined]
    #define MIN_LEN 1048576*16
                    ^
    popcnt.c:10:9: note: previous definition is here
    #define MIN_LEN 256/8
                    ^
    popcnt.c:19:17: error: use of undeclared identifier 'MAX_LEN'
    uint64_t buffer[MAX_LEN] __attribute__((aligned(LINE_SIZE)));
                                    ^
    popcnt.c:379:34: error: use of undeclared identifier 'MAX_LEN'
        for (int len = MIN_LEN; len = MAX_LEN; len *= DELTA) {
                                                                      ^
    1 warning and 2 errors generated.
    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ clang -O3 -mpopcnt popcnt.c
    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ ls
    a.out bytes_per_cycle.png crc32_popcnt.cpp data-0.csv data-1.csv make-csv.jl plot-data.jl README.md
    bytes_per_cycle-1.png bytes_per_cycle.svg data-0 data-1 data-clang-sandy mula_benchmarks.c popcnt.c
    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ ./a.out
    bytes: 256
    builtin: 4.76
    builtin unrolled: 4.81
    builtin errata: 4.53
    builtin manual: 4.63
    SSSE3: 5.44
    bytes: 1024
    builtin: 7.27
    builtin unrolled: 6.15
    builtin errata: 5.93
    builtin manual: 5.96
    SSSE3: 9.75
    bytes: 4096
    builtin: 8.71
    builtin unrolled: 6.50
    builtin errata: 6.30
    builtin manual: 6.35
    SSSE3: 11.98
    bytes: 16384
    builtin: 7.70
    builtin unrolled: 6.33
    builtin errata: 6.41
    builtin manual: 6.45
    SSSE3: 11.00
    bytes: 65536
    builtin: 7.61
    builtin unrolled: 6.18
    builtin errata: 6.43
    builtin manual: 6.39
    SSSE3: 10.87
    bytes: 262144
    builtin: 7.57
    builtin unrolled: 6.34
    builtin errata: 6.44
    builtin manual: 6.34
    SSSE3: 10.75
    bytes: 1048576
    builtin: 7.56
    builtin unrolled: 6.34
    builtin errata: 6.45
    builtin manual: 6.35
    SSSE3: 10.75
    bytes: 4194304
    builtin: 8.06
    builtin unrolled: 6.35
    builtin errata: 6.45
    builtin manual: 6.35
    SSSE3: 11.37
    bytes: 16777216
    builtin: 8.58
    builtin unrolled: 6.33
    builtin errata: 6.45
    builtin manual: 6.35
    SSSE3: 12.56
    bytes: 67108864
    builtin: 8.70
    ^C
    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ clang -O3 -march=native popcnt.c
    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ ./a.out
    bytes: 256
    builtin: 4.49
    builtin unrolled: 3.83
    builtin errata: 3.63
    builtin manual: 3.70
    SSSE3: 4.31
    bytes: 1024
    builtin: 7.65
    builtin unrolled: 4.91
    builtin errata: 4.73
    builtin manual: 4.78
    SSSE3: 7.89
    bytes: 4096
    builtin: 11.23
    builtin unrolled: 6.45
    builtin errata: 6.30
    builtin manual: 6.23
    SSSE3: 11.97
    bytes: 16384
    builtin: 10.31
    builtin unrolled: 5.91
    builtin errata: 6.41
    builtin manual: 6.32
    SSSE3: 11.00
    bytes: 65536
    builtin: 10.02
    builtin unrolled: 6.24
    builtin errata: 6.43
    builtin manual: 6.50
    SSSE3: 10.87
    bytes: 262144
    builtin: 9.85
    builtin unrolled: 6.29
    builtin errata: 6.44
    builtin manual: 6.34
    SSSE3: 10.78
    bytes: 1048576
    builtin: 9.82
    builtin unrolled: 6.31
    builtin errata: 6.45
    builtin manual: 6.35
    SSSE3: 10.75
    bytes: 4194304
    builtin: 11.12
    builtin unrolled: 6.34
    builtin errata: 6.45
    builtin manual: 6.36
    SSSE3: 11.56
    bytes: 16777216
    builtin: 12.34
    builtin unrolled: 6.35
    builtin errata: 6.45
    builtin manual: 6.37
    SSSE3: 12.69
    bytes: 67108864
    ^C
    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $

    And as long as we are doing silly benchmarks, how about openMP? Add one line and its more than 4x faster for large data sets:

    uint32_t builtin_popcnt_openMP(const uint64_t* buf, int len) {
        int cnt = 0;
        #pragma omp parallel for reduction(+:cnt)
        for (int i = 0; i len; ++i) {
            cnt += __builtin_popcountll(buf[i]);
        }
        return cnt;
    }

    I also enabled a few other implementations he included in the source:
    craig@large /mnt/DATA/Scratch/cscraps/dump/popcnt-speed-comparison $ clang++ -O3 -march=native -fopenmp=libiomp5 popcnt.c && ./a.out
    clang-3.9: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated
    bytes: 256
    builtin32: 2.79
    builtin: 3.66
    builtin_openMP: 0.08
    builtin unrolled: 4.42
    builtin unrolled32: 3.89
    builtin errata: 4.25
    builtin manual: 4.58
    builtin movdq: 4.25
    builtin movdq unrolled: 4.27
    builtin movdq manual: 4.05
    SSSE3: 4.46
    bytes: 1024
    builtin32: 5.20
    builtin: 7.18
    builtin_openMP: 0.30
    builtin unrolled: 5.81
    builtin unrolled32: 4.92
    builtin errata: 5.64
    builtin manual: 5.75
    builtin movdq: 5.73
    builtin movdq unrolled: 5.66
    builtin movdq manual: 5.47
    SSSE3: 6.01
    bytes: 4096
    builtin32: 7.41
    builtin: 7.89
    builtin_openMP: 1.22
    builtin unrolled: 6.22
    builtin unrolled32: 5.58
    builtin errata: 6.08
    builtin manual: 6.21
    builtin movdq: 6.30
    builtin movdq unrolled: 6.16
    builtin movdq manual: 6.05
    SSSE3: 7.39
    bytes: 16384
    builtin32: 6.04
    builtin: 8.39
    builtin_openMP: 4.53
    builtin unrolled: 6.15
    builtin unrolled32: 4.77
    builtin errata: 6.17
    builtin manual: 6.21
    builtin movdq: 6.70
    builtin movdq unrolled: 6.40
    builtin movdq manual: 6.28
    SSSE3: 10.67
    bytes: 65536
    builtin32: 6.84
    builtin: 9.99
    builtin_openMP: 14.10
    builtin unrolled: 6.17
    builtin unrolled32: 4.84
    builtin errata: 6.43
    builtin manual: 6.34
    builtin movdq: 6.59
    builtin movdq unrolled: 6.48
    builtin movdq manual: 6.33
    SSSE3: 10.81
    bytes: 262144
    builtin32: 6.78
    builtin: 9.84
    builtin_openMP: 30.46
    builtin unrolled: 6.13
    builtin unrolled32: 4.79
    builtin errata: 6.44
    builtin manual: 6.35
    builtin movdq: 6.57
    builtin movdq unrolled: 6.36
    builtin movdq manual: 6.34
    SSSE3: 10.74
    bytes: 1048576
    builtin32: 6.77
    builtin: 9.82
    builtin_openMP: 43.14
    builtin unrolled: 6.34
    builtin unrolled32: 4.82
    builtin errata: 6.45
    builtin manual: 6.35
    builtin movdq: 6.56
    builtin movdq unrolled: 6.36
    builtin movdq manual: 6.35
    SSSE3: 10.75
    bytes: 4194304
    builtin32: 7.50
    builtin: 11.14
    builtin_openMP: 48.01
    builtin unrolled: 6.34
    builtin unrolled32: 5.20
    builtin errata: 6.45
    builtin manual: 6.42
    builtin movdq: 6.56
    builtin movdq unrolled: 6.40
    builtin movdq manual: 6.35
    SSSE3: 11.57
    bytes: 16777216
    builtin32: 8.16
    builtin: 12.41
    builtin_openMP: 50.17
    ^C

    Looks like manual unrolling was a loss in all 3 cases. Looks like custom assembly was a loss too, except for the SSE3 monster.

    The lesson I'm taking from this is use intrinsics, and give all the info you can to the compiler. This way your code is more portable and you get to benefit from CPU microarchtecture improvements automatically, and can easily use fancy new features like openMP.

    Starting Score:    0  points
    Moderation   +5  
       Insightful=1, Interesting=4, Total=5
    Extra 'Interesting' Modifier   0  

    Total Score:   5