
posted by cmn32480 on Thursday February 16 2017, @03:36PM   Printer-friendly
from the for-all-you-code-writing-types-out-there dept.

John Regehr, Professor of Computer Science, University of Utah, writes:

Undefined behavior (UB) in C and C++ is a clear and present danger to developers, especially when they are writing code that will execute near a trust boundary. A less well-known kind of undefined behavior exists in the intermediate representation (IR) for most optimizing, ahead-of-time compilers. For example, LLVM IR has undef and poison in addition to true explodes-in-your-face C-style UB. When people become aware of this, a typical reaction is: "Ugh, why? LLVM IR is just as bad as C!" This piece explains why that is not the correct reaction.

Undefined behavior is the result of a design decision: the refusal to systematically trap program errors at one particular level of a system. The responsibility for avoiding these errors is delegated to a higher level of abstraction. For example, it is obvious that a safe programming language can be compiled to machine code, and it is also obvious that the unsafety of machine code in no way compromises the high-level guarantees made by the language implementation. Swift and Rust are compiled to LLVM IR; some of their safety guarantees are enforced by dynamic checks in the emitted code, other guarantees are made through type checking and have no representation at the LLVM level. Either way, UB at the LLVM level is not a problem for, and cannot be detected by, code in the safe subsets of Swift and Rust. Even C can be used safely if some tool in the development environment ensures that it will not execute UB. The L4.verified project does exactly this.
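A minimal sketch (not part of Regehr's article) of the sort of C-level UB at issue: signed overflow is undefined, so an optimizer may assume it never happens and fold away a check that tries to detect it after the fact.

#include <limits.h>

/* Because signed overflow is undefined behavior in C, the
   compiler may assume x + 1 never wraps, conclude that the
   comparison is always false, and compile this function to
   "return 0". */
int will_overflow(int x)
{
    return x + 1 < x;
}

/* A well-defined rewrite compares against the limit instead. */
int will_overflow_safe(int x)
{
    return x == INT_MAX;
}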


Original Submission

 
  • (Score: 2) by DannyB on Thursday February 16 2017, @04:22PM

    by DannyB (5839) Subscriber Badge on Thursday February 16 2017, @04:22PM (#467848) Journal

    In order to write a program, you need defined behavior. Every "Hello World" program ever written assumes that the language and underlying system provide certain defined-behavior guarantees that, under normal operating conditions, will result in the famous greeting.

    When the programmer writes even a simple assignment, such as x := y; it is assumed that the behavior is defined.

    Now, I can see a case for pushing potentially optimized operations up into the language so that they are additional tools in the hands of a programmer who knows how to use them. Most of the time I want an addition operation to spectacularly fail with an exception if it overflows. But there may be times where I don't care what happens in the event of overflow because I can guarantee before the addition is done that overflow simply cannot occur. The simple example is that the operands are already restricted to a smaller range, making overflow impossible in the data type that the addition will use (e.g., adding two bytes that are widened to ints). And depending on the purpose I may not even care about any overflow bits. Maybe I want "mod 256" arithmetic even though the operands are widened to ints when the addition is performed.
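    A sketch of that "widened bytes" case (illustrative; function names are ours): two uint8_t operands are promoted to int before the addition, so overflow is impossible by construction, and "mod 256" arithmetic can be made explicit rather than relying on overflow behavior.

    #include <stdint.h>

    /* Both operands are promoted to int, so the sum is at most
       255 + 255 = 510: the int addition cannot overflow. */
    int add_bytes(uint8_t a, uint8_t b)
    {
        return a + b;
    }

    /* If only the low 8 bits matter, make the "mod 256"
       truncation explicit; unsigned wrap-around is well defined. */
    uint8_t add_bytes_mod256(uint8_t a, uint8_t b)
    {
        return (uint8_t)(a + b);
    }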

    I agree with the title that undefined behavior does not mean that programming is unsafe. But most of the time you don't want undefined behavior. Therefore, if you're using operations that have weird, undefined or surprising behavior, those functions or operations ought to have unusual names. The well known functions or operators such as '+' should have no surprising or undefined behaviors.

    Another approach might be to have compiler switches or annotations that can be used locally on certain statements to indicate to the compiler that on the next line I simply don't care about what happens for integer overflow. If the compiler is able to use that information to do a more optimized addition operation on a certain instruction set, then great. If not, then fine. And even if the compiler ignores the annotation and simply compiles the addition with all of the checking and guard code around it, that is acceptable. It merely indicates the lower quality of the compiler. Yet the compiler still ensures correctness.
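    A per-call-site opt-in along these lines exists in a limited form today: GCC and Clang provide overflow-checking builtins, so each call site can decide whether overflow traps, is handled, or is ignored. A small sketch (the wrapper name is ours):

    #include <stdio.h>

    /* __builtin_add_overflow performs the addition with wrapping
       semantics, stores the result, and returns true if the
       mathematical result did not fit the destination type. */
    int checked_add(int a, int b, int *out)
    {
        if (__builtin_add_overflow(a, b, out))
            return -1;   /* the caller decides what overflow means */
        return 0;
    }

    int main(void)
    {
        int sum;
        if (checked_add(2000000000, 2000000000, &sum) != 0)
            puts("overflow detected");
        else
            printf("sum = %d\n", sum);
        return 0;
    }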

    As for making an ordinary common operator have undefined behavior, I think that is a stupid idea. It simply means that generations of programmers, for decades of time, will have to invent and re-invent their own defenses around what should be a simple common operation. Or, they will simply ignore the problem completely. And we end up with obscure bugs, even security vulnerabilities hidden in code that are due to the combination of the programmer, the particular machine instruction set, and how the compiler, or this version of the compiler (!) chose to emit code for that operation.

    --
    The lower I set my standards the more accomplishments I have.
  • (Score: 2) by Pino P on Thursday February 16 2017, @04:56PM

    by Pino P (4721) on Thursday February 16 2017, @04:56PM (#467865) Journal

    Most of the time I want an addition operation to spectacularly fail with an exception if it overflows. But there may be times where I don't care what happens in the event of overflow because I can guarantee before the addition is done that overflow simply cannot occur. The simple example is that the operands are already restricted to a smaller range, making overflow impossible in the data type that the addition will use (e.g., adding two bytes that are widened to ints). And depending on the purpose I may not even care about any overflow bits. Maybe I want "mod 256" arithmetic even though the operands are widened to ints when the addition is performed.

    I just searched for gcc trap add overflow on Google, and the second result [robertelder.org] states that -ftrapv in GCC is supposed to enable behavior similar to what you describe. But it was broken until 2014 when GCC 4.8.4 fixed a serious bug [gnu.org].

    Using -ftrapv in GCC 4.8.4 or later enables the following rules:

    • Results of arithmetic on unsigned integers are reduced modulo 2^N, where N is the number of bits in the type. The C standard requires this modulo behavior.
    • Arithmetic on signed integers is performed with overflow trapping. The C standard treats this as undefined behavior; the -ftrapv option turns it into an abort.
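    A small demonstration of both rules (illustrative; without -ftrapv the signed case is undefined, so whatever it prints is just what the implementation happens to do):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int u = UINT_MAX;
        int s = INT_MAX;

        /* Defined: unsigned arithmetic wraps modulo 2^N. */
        printf("unsigned wrap: %u\n", u + 1u);   /* prints 0 */

        /* Undefined by the C standard; with -ftrapv (GCC 4.8.4+)
           this addition aborts at runtime instead of silently
           producing a wrapped value. */
        printf("signed: %d\n", s + 1);
        return 0;
    }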
  • (Score: 0) by Anonymous Coward on Thursday February 16 2017, @05:03PM

    by Anonymous Coward on Thursday February 16 2017, @05:03PM (#467869)

    You are confusing implementation-specific behaviors with undefined ones. They are NOT the same. Printing "hello world" is not undefined.

    • (Score: 2) by DannyB on Thursday February 16 2017, @05:23PM

      by DannyB (5839) Subscriber Badge on Thursday February 16 2017, @05:23PM (#467885) Journal

      I used Hello World to point out that programmers have expectations of well defined behaviors being defined to achieve the desired result. Common operations should not have undefined behaviors. If it is useful to do so, then introduce an annotation or differently named operator which has undefined behaviors for a potential gain in performance.

      Implementation specific behaviors come in two flavors that I can think of.

      1. The specification says that standing on one foot, jumping three times while shouting Foo has an implementation defined behavior.

      2. The specification says that standing on one foot, jumping three times while shouting Bar has an undefined behavior.

      In case 1, the implementer typically documents the behavior. (or not! making it effectively undefined)

      In case 2, the implementer may or may not document it, but the programmer cannot depend on the behavior because it is undefined. The implementation could change in a subsequent release. Of course an implementation change could happen in case 1 too, but it is usually more public. The spec says it's implementation specific, and programmers ask: so what does my implementation do?

      I would say implementation-specific behavior is almost as bad as behavior that the specification leaves undefined.

      I am of the opinion that portability across compilers, let alone operating systems, is something that language specifications should strive for. Predictability. Repeatability. Programmers should be able to rely on the language and its compilers to always do one thing. Compiler vendors, or better the language specification, could include optional annotation directives that allow possible optimizations, some of which may rely on undefined edge-case behavior.

      --
      The lower I set my standards the more accomplishments I have.
      • (Score: 0) by Anonymous Coward on Thursday February 16 2017, @06:24PM

        by Anonymous Coward on Thursday February 16 2017, @06:24PM (#467905)

        I think you completely misunderstand the meaning of "undefined" and "implementation defined".
        Just because it is not documented does not turn "implementation defined" into "undefined".
        "implementation defined" means it has a specific, reproducible behaviour. So if you exhaustively test that your code behaves correctly with a certain implementation, you can know you are fine. "implementation defined" also usually is attached to a RESULT, which means the absolute worst case is that you cannot know what the result will be, but you do know there is a result and the surrounding code will work (e.g. if you clamp the result into 0 - 1 range you know it will be in that range afterwards).
        "Undefined" is a completely different thing. There are NO guarantees about undefined behaviour. Your program may crash, abort, start deleting random files, that's all perfectly valid behaviour.
        In particular, there is also NO guarantee that the the code BEFORE whatever triggers the undefined behaviour will be executed, do what it was meant to do or anything like that.
        C code like this:

        char c[10];
        int a = 12;
        int valid = a < sizeof(c);  /* false: 12 is not less than 10 */
        char *dummy = c + a;        /* forms a pointer beyond one-past-the-end */
        return valid;

        is undefined behaviour, and the compiler would be allowed to just replace it with "return true", for example. The fact that the out-of-bounds address is never used, and that it has nothing to do with the calculation of "valid", does not matter.
        If it was "implementation defined", anything might happen if you e.g. tried to dereference dummy, but merely calculating c + a would not matter if you never used the result (or, worst case, if allowed, it might crash right there. But it cannot result in everything working perfectly except that later in the code 1+2 evaluates to 5).

        • (Score: 2) by DannyB on Thursday February 16 2017, @07:19PM

          by DannyB (5839) Subscriber Badge on Thursday February 16 2017, @07:19PM (#467919) Journal

          I understand exactly what you describe as undefined and implementation defined behavior. I have understood it for decades, across different languages and compilers.

          I think a language specification that leaves anything undefined is a bad idea. That is an opinion.

          I think a language specification that leaves anything implementation defined is also a bad idea. Almost as bad as undefined.

          I hope that is sufficiently clear.

          --
          The lower I set my standards the more accomplishments I have.
          • (Score: 0) by Anonymous Coward on Thursday February 16 2017, @08:59PM

            by Anonymous Coward on Thursday February 16 2017, @08:59PM (#467959)

            Well, have fun with your toy languages. Any real language will have corners that are undefined, implementation-specific, or unspecified. It's just the nature of the beast.

            • (Score: 2) by DannyB on Friday February 17 2017, @02:05PM

              by DannyB (5839) Subscriber Badge on Friday February 17 2017, @02:05PM (#468205) Journal

              No language is perfect, or everyone would be using it. But some languages have sharp edges where they should not.

              --
              The lower I set my standards the more accomplishments I have.
          • (Score: 2) by TheRaven on Friday February 17 2017, @12:20PM

            by TheRaven (270) on Friday February 17 2017, @12:20PM (#468179) Journal

            I think a language specification that leaves anything undefined is a bad idea. That is an opinion.

            It's also very hard if you want good or deterministic performance. To give a simple example, using a pointer after it has been freed is undefined behaviour in C. If it were not, then the compiler would be required to do something specific in the case of a use-after-free. This would require a check before each dereference that the pointer is still valid; you basically need garbage collection.

            The same is true for out-of-bounds accesses. By making them undefined, the compiler is free to assume that all accesses are in bounds and so doesn't need to do any checking. Again, this gives much better code.
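            A sketch of the difference in emitted checks (illustrative; function names are ours): in C an index expression compiles to a bare load, while a bounds-checked language effectively has to compare and branch first.

            #include <stddef.h>

            /* C: a single indexed load. The compiler may assume i is
               in bounds, because an out-of-bounds access would be
               undefined behavior anyway. */
            int get_unchecked(const int *a, size_t i)
            {
                return a[i];
            }

            /* Roughly what a bounds-checked language must emit
               (real implementations hoist or eliminate many checks). */
            int get_checked(const int *a, size_t n, size_t i, int fallback)
            {
                if (i >= n)
                    return fallback;   /* or trap, depending on the language */
                return a[i];
            }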

            I think a language specification that leaves anything implementation defined is also a bad idea

            The same applies. For example, in C the size of long is implementation defined. When C was created, typically char was 1 byte, int and short were 2 bytes, and long was 4 bytes. Now, typically long is 8 bytes. If you want your language to work on different substrates, then you need some implementation-defined behaviour.
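            A quick way to see the implementation-defined sizes on any given platform (sizeof(char) is 1 by definition; the rest vary):

            #include <stdio.h>

            int main(void)
            {
                /* Typical 64-bit Linux (LP64) prints 1 2 4 8 8;
                   64-bit Windows (LLP64) prints 1 2 4 4 8. */
                printf("char:  %zu\n", sizeof(char));
                printf("short: %zu\n", sizeof(short));
                printf("int:   %zu\n", sizeof(int));
                printf("long:  %zu\n", sizeof(long));
                printf("void*: %zu\n", sizeof(void *));
                return 0;
            }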

            --
            sudo mod me up
            • (Score: 2) by fnj on Friday February 17 2017, @02:11PM

              by fnj (1654) on Friday February 17 2017, @02:11PM (#468209)

              If you want your language to work on different substrates, then you need some implementation-defined behaviour.

              It's not clear what you mean. The C specification (just to pick one example) chose to make sizeof char, short, int, and long loosely defined. They didn't have to. The Free Pascal specification says that sizeof Byte and ShortInt are exactly 1, SmallInt and Word are exactly 2, Integer is either 2 or 4 depending on mode, LongInt and LongWord are exactly 4.

              Even C99 formalized typedefs (in stdint.h) for int8_t (1), int16_t (2), int32_t (4), int64_t (8) and permutations of each for unsigned and other variations. Those are not implementation-defined. They are standard-defined. The programmer can choose to use them or not.
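              And those fixed widths are checkable at compile time (a C11 sketch):

              #include <stdint.h>

              /* On any implementation that provides the exact-width
                 types, these sizes are fixed by the standard, not by
                 the implementation. */
              _Static_assert(sizeof(int8_t)  == 1, "int8_t is 1 byte");
              _Static_assert(sizeof(int16_t) == 2, "int16_t is 2 bytes");
              _Static_assert(sizeof(int32_t) == 4, "int32_t is 4 bytes");
              _Static_assert(sizeof(int64_t) == 8, "int64_t is 8 bytes");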

              • (Score: 2) by TheRaven on Friday February 17 2017, @05:52PM

                by TheRaven (270) on Friday February 17 2017, @05:52PM (#468274) Journal
                How big is a pointer in Pascal? C has supported 16-bit, 32-bit, and 64-bit pointers that are represented purely as integers, as 36-bit values including a segment id, as fat pointers including a base and a range, and so on. In some languages, such as Java, these details are not exposed through the abstract machine and so the fact that it's implementation defined is hidden from programmers, but the more that you want to expose, the harder it is.
                --
                sudo mod me up
        • (Score: 2) by c0lo on Thursday February 16 2017, @11:00PM

          by c0lo (156) Subscriber Badge on Thursday February 16 2017, @11:00PM (#468002) Journal

          "Undefined" is a completely different thing. There are NO guarantees about undefined behaviour. Your program may crash, abort, start deleting random files, that's all perfectly valid behaviour.

          See also nasal demons [catb.org]

          --
          https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
        • (Score: 2) by fnj on Friday February 17 2017, @01:58PM

          by fnj (1654) on Friday February 17 2017, @01:58PM (#468200)

          You've got something missing between a and sizeof, perhaps a < or >

          Fix it.