
SoylentNews is people

posted by CoolHand on Thursday May 07 2015, @07:03AM
from the not-as-promoted-as-y2k-bug dept.

A surprisingly simple bug afflicts computers controlling planes, spacecraft and more – they get confused by big numbers. As Chris Baraniuk discovers, the glitch has led to explosions, missing space probes and more.

Tuesday, 4 June 1996 will forever be remembered as a dark day for the European Space Agency (Esa). The first flight of the crewless Ariane 5 rocket, carrying with it four very expensive scientific satellites, ended after 39 seconds in an unholy ball of smoke and fire. It's estimated that the explosion resulted in a loss of $370m (£240m).

What happened? It wasn't a mechanical failure or an act of sabotage. No, the launch ended in disaster thanks to a simple software bug. A computer getting its maths wrong – essentially getting overwhelmed by a number bigger than it expected.

How is it possible that computers get befuddled by numbers in this way? It turns out such errors are responsible for a series of disasters and mishaps in recent years, destroying rockets, making space probes go missing, and sending missiles off-target. So what are these bugs, and why do they happen?

Imagine trying to represent a value of, say, 105,350 miles on an odometer that has a maximum value of 99,999. The counter would "roll over" to 00,000 and then count up to 5,350, the remaining value. This is the same species of inaccuracy that doomed the 1996 Ariane 5 launch. More technically, it's called "integer overflow", essentially meaning that numbers are too big to be stored in a computer system, and sometimes this can cause malfunction.
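
The odometer analogy maps directly onto fixed-width unsigned integers. A minimal sketch in C, using made-up function names (odometer_reading and add_u16 are illustrations, not anything from the article):

```c
#include <stdint.h>

/* Five-digit odometer: only the value modulo 100,000 can be displayed. */
uint32_t odometer_reading(uint32_t miles_travelled)
{
    return miles_travelled % 100000u;
}

/* The same effect in a 16-bit unsigned counter, which wraps modulo 65,536.
   In C, unsigned wraparound is defined behaviour and raises no error. */
uint16_t add_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

Calling odometer_reading(105350) gives the 5,350 from the example above, and add_u16(65535, 1) silently comes back as 0.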

Such glitches emerge with surprising frequency. It's suspected that the reason why Nasa lost contact with the Deep Impact space probe in 2013 was an integer limit being reached.

And just last week it was reported that Boeing 787 aircraft may suffer from a similar issue. The control unit managing the delivery of power to the plane's engines will automatically enter a failsafe mode – and shut down the engines – if it has been left on for over 248 days.

This discussion has been archived. No new comments can be posted.
  • (Score: 3, Funny) by Geotti on Thursday May 07 2015, @07:15AM

    by Geotti (1146) on Thursday May 07 2015, @07:15AM (#179791) Journal

    Add an array of integers to count the rollovers and embed this functionality in the type itself so it's transparent to the coder. 65536^array.length "should be enough for everyone" (tm) ;)
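
A rough sketch of what the rollover-counting idea could look like in C, assuming 16-bit words and a fixed array length (wide_counter, LIMBS and wide_counter_increment are hypothetical names, not from any library):

```c
#include <stddef.h>
#include <stdint.h>

#define LIMBS 4  /* hypothetical width: 4 x 16 bits, i.e. 65536^4 states */

typedef struct {
    uint16_t limb[LIMBS];   /* limb[0] is the least significant word */
} wide_counter;

/* Add one to the counter, carrying into the next limb on each rollover.
   Returns 1 if even the top limb rolled over (the whole counter wrapped),
   0 otherwise. */
int wide_counter_increment(wide_counter *c)
{
    for (size_t i = 0; i < LIMBS; i++) {
        c->limb[i]++;
        if (c->limb[i] != 0) {
            return 0;       /* this limb did not roll over, so no carry */
        }
    }
    return 1;               /* all limbs rolled over: 65536^LIMBS reached */
}
```

The limit has only moved, which is the joke: with a fixed array length the counter still wraps eventually, and the caller still has to decide what a return value of 1 means.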

    • (Score: 5, Informative) by stormwyrm on Thursday May 07 2015, @07:56AM

      by stormwyrm (717) on Thursday May 07 2015, @07:56AM (#179800) Journal

      It's called GMP [gmplib.org]. The only trouble is there are a lot of applications, such as in embedded systems like those described in the article, where arbitrary precision arithmetic is unrealistic to add in. Not everything can run on the powerful processors with vast quantities of RAM that we take for granted these days. Many mass-market embedded systems use processors that are about as powerful as the 6502-variant inside my old Commodore 64, because microcontrollers like that can be had for almost literally a dime a dozen in this day and age. Some embedded systems in aerospace applications make use of weaker CPUs because radiation hardening a CPU like a modern i7 is impossible, and the large quantities of memory that are required for arbitrary precision arithmetic simply aren't feasible because of hardening/weight/size/power constraints in the package. Handling integer overflow properly is par for the course for embedded systems design.

      --
      Numquam ponenda est pluralitas sine necessitate.
      • (Score: 2) by FatPhil on Thursday May 07 2015, @08:35AM

        by FatPhil (863) <pc-soylentNO@SPAMasdf.fi> on Thursday May 07 2015, @08:35AM (#179808) Homepage
        It's called LISP. Which was first implemented on an IBM 704. Which used vacuum tubes, and could execute up to 12,000 floating-point additions per second, and had a Magnetic Core Storage Unit of 4096 36-bit words (so 18432 bytes) as RAM.

        So the 6510 wins that contest.

        Of course, the other alternative is to use trapping overflows, as is possible in the hopefully-forthcoming Mill architecture, so that overflows are flagged immediately as errors. You still need to address the question of what to do when the unintended happens. This is why majority-of-three redundancy is a good thing in critical environments - if one of the units says "I dunno, I gave up trying", then you go with the other two units' results. If they disagree, well, you're screwed. But you shouldn't have employed two sets of shitty software engineers in the first place. (Off the top of my head, there might be ways around such situations - you could add some noise to the inputs, try again, and favour the unit which has the better stability in its output. But what should have been a single simple calculation has turned into a lengthy procedure which itself may have bugs.)
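
A minimal sketch of the majority-of-three idea in C, assuming each unit reports either a value or "gave up" (unit_result and vote3 are hypothetical names; real voters usually compare within a tolerance rather than for exact equality):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    valid;   /* false means "I dunno, I gave up trying" */
    int32_t value;   /* meaningful only when valid is true */
} unit_result;

/* Vote over three redundant units. Returns true and stores the agreed
   value in *out when at least two valid units report the same value;
   returns false (leaving *out untouched) when no two valid units agree. */
bool vote3(unit_result a, unit_result b, unit_result c, int32_t *out)
{
    if (a.valid && b.valid && a.value == b.value) { *out = a.value; return true; }
    if (a.valid && c.valid && a.value == c.value) { *out = a.value; return true; }
    if (b.valid && c.valid && b.value == c.value) { *out = b.value; return true; }
    return false;
}
```

The false branch is the "you're screwed" case: the voter can detect disagreement but cannot say which unit is right.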
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
        • (Score: 3, Informative) by Anonymous Coward on Thursday May 07 2015, @09:01AM

          by Anonymous Coward on Thursday May 07 2015, @09:01AM (#179816)

          Except that the Ariane did test for overflow. However the correct flight path of Ariane 5 simply was guaranteed to overflow the modules designed for Ariane 4. The modules then gave error messages instead of flight data values. Since that was true for all the modules (which of course all used the same code), the redundancy didn't help.
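
For reference, the ESA inquiry report describes the failing operation as a conversion from a 64-bit floating-point value to a 16-bit signed integer. A hedged sketch of what a guarded narrowing conversion looks like in C follows; to_int16_checked is a made-up name, and the flight software was written in Ada, so this only illustrates the idea:

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a double to int16_t only if it fits. On failure, return false
   and leave *out untouched, so the caller has to decide what to do. */
bool to_int16_checked(double x, int16_t *out)
{
    /* x != x is true only for NaN, which can never be converted safely. */
    if (x != x || x < (double)INT16_MIN || x > (double)INT16_MAX) {
        return false;
    }
    *out = (int16_t)x;   /* in range, so the conversion is well defined */
    return true;
}
```

Detecting the out-of-range value is the easy part. What the system does next (substitute a limit value, keep the last good one, or shut the unit down) is the design decision that matters.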

          • (Score: 3, Interesting) by PiMuNu on Thursday May 07 2015, @11:57AM

            by PiMuNu (3823) on Thursday May 07 2015, @11:57AM (#179855)

            The mantra my colleague in the nuclear industry has is "diversity and redundancy". So the coolant water has to have a backup pump and you need a backup air coolant system as well. Multiple failures then require some common failure mode like a tsunami to break the system...

            • (Score: 0) by Anonymous Coward on Thursday May 07 2015, @02:27PM

              by Anonymous Coward on Thursday May 07 2015, @02:27PM (#179922)

              Redundancy does not help when the data really IS outside the allowed range because someone reused the code from an older generation rocket with less power.

      • (Score: 2) by c0lo on Thursday May 07 2015, @12:38PM

        by c0lo (156) Subscriber Badge on Thursday May 07 2015, @12:38PM (#179869) Journal

        > because microcontrollers like that can be had for almost literally a dime a dozen in this day and age.

        I can't believe a Boeing 787 can't afford a modern CPU (pricewise or energywise). What's it going to add to the price of the aircraft, a few hundred dollars?
        Is the ass of the free market fairy that tight? (even after the banksters abused it repeatedly?)

        > the large quantities of memory that are required for arbitrary precision arithmetic

        A 64bit uint=1.84e+19 - let's put down some numbers for comparison:

        Why would one need arbitrary precision arithmetic for the usual time/distances/speeds to fly in Earth's vicinity is beyond me.

        --
        https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
        • (Score: 2) by Reziac on Sunday May 10 2015, @02:16AM

          by Reziac (2489) on Sunday May 10 2015, @02:16AM (#180940) Homepage

          Regardless of what it runs on, it should be restarted before every flight as part of the pre-flight check. Why would it run for 248 days in the first place??

          --
          And there is no Alkibiades to come back and save us from ourselves.
      • (Score: 3, Informative) by TheB on Thursday May 07 2015, @03:20PM

        by TheB (1538) on Thursday May 07 2015, @03:20PM (#179946)

        Handling integer overflow properly SHOULD BE par for the course for embedded systems design.

        Too often code I review has potential overflow bugs in it. It's one of the first things I look for, and find it just about everywhere.
        Even the Arduino libraries have them.

        From Stepper.cpp

        while(steps_left > 0) {
          // move only if the appropriate delay has passed:
          if (millis() - this->last_step_time >= this->step_delay) {
            ...
          }
          ...
        }

        In this code, if "this->last_step_time" + "this->step_delay" is greater than 4,294,967,295 (theoretical max value of millis) it will loop forever.
        Values slightly less can become stuck in the loop, but not always. It's dependent on the return of millis().
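
Whether a check like this survives the counter wrapping depends on how the comparison is written. A sketch comparing two forms, assuming every quantity is an unsigned 32-bit tick count (the function names are made up, and this is not the Arduino library source):

```c
#include <stdbool.h>
#include <stdint.h>

/* Elapsed-time form: the unsigned subtraction yields the elapsed ticks
   modulo 2^32, which stays correct across a wrap as long as the real
   interval is shorter than 2^32 ticks. */
bool delay_passed_sub(uint32_t now, uint32_t last, uint32_t delay)
{
    return (uint32_t)(now - last) >= delay;
}

/* Deadline form: last + delay can wrap to a small number (the check then
   fires early), or now can wrap before a large deadline is observed (the
   check is then missed until the counter comes round again). */
bool delay_passed_add(uint32_t now, uint32_t last, uint32_t delay)
{
    return now >= (uint32_t)(last + delay);
}
```

The first form has the same shape as the quoted condition. Under the stated assumption it tolerates wraparound, but a signed or wider operand would change the picture, so the declared types of the variables are worth checking before drawing conclusions.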

        Since millis() returns unsigned long, which takes 4,294,967+ seconds (about 49.7 days) to wrap, this sounds reasonable for programs that don't run that long.
        However I was asked to debug a prototype that was malfunctioning. They needed 1/4 ms accuracy and modified the library to use micros() instead of millis(). The machine would randomly freeze in ~70 min intervals. New to Arduino I looked up micros() and saw "This number will overflow (go back to zero), after approximately 70 minutes." Problem found, and easily fixed.

        Later I found they were having similar troubles with other machines used in production. While not using Arduinos they still would malfunction in ~50 day intervals. Multimillion dollar machines with unfixed overrun bugs. Their solution was to reset the machine every month, and were not interested in fixing the code. "It needs to be cleaned anyway."
        Sad.
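
The ~70 minute and ~50 day intervals fall straight out of the counter width and tick rate. A small worked calculation in C (seconds_until_wrap is a hypothetical helper name):

```c
#include <stdint.h>

/* Seconds until an unsigned counter of `bits` bits (1..63) wraps, when it
   advances `ticks_per_second` times per second. */
double seconds_until_wrap(unsigned bits, double ticks_per_second)
{
    double states = (double)(UINT64_C(1) << bits);   /* 2^bits */
    return states / ticks_per_second;
}
```

With 32 bits this gives about 71.6 minutes at one tick per microsecond and about 49.7 days at one tick per millisecond. For comparison, 31 bits at 100 ticks per second comes to about 248.6 days; reading the 787 figure that way is an inference from the arithmetic, not something stated in the article.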

  • (Score: 4, Informative) by Anonymous Coward on Thursday May 07 2015, @07:38AM

    by Anonymous Coward on Thursday May 07 2015, @07:38AM (#179795)

    As far as I know, Ariane was not a bug. When coding for rockets, everything is checked and double checked. That includes inputs. For example, if the engines can accelerate the rocket by at most N m/s^2, the input from the speed-measuring equipment is checked against this value, and if a speed-measuring device returns a much higher value, this device is marked as defective. This is not a problem, because there are three or four of everything.

    The problem was (again, as far as I know) that the code in question was written for Ariane 4, not 5. The Ariane 5 was more powerful, and could just accelerate faster. The computer received an out of range value from one of the speed measuring devices, and marked it as defective. It then received an out of range value from the next device, and marked it as defective too. Another two readings, and all four devices were marked as defective, leaving the system with no way to measure the speed.

    In short: Not a bug, everything behaved as designed. Unfortunately, somebody decided to reuse the design without adjusting the valid range.

    • (Score: 5, Insightful) by maxwell demon on Thursday May 07 2015, @08:18AM

      by maxwell demon (1608) on Thursday May 07 2015, @08:18AM (#179802) Journal

      Design bugs are bugs, too. And the design bug here was to reuse the Ariane 4 module unchanged.

      --
      The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 0) by Anonymous Coward on Thursday May 07 2015, @09:41AM

      by Anonymous Coward on Thursday May 07 2015, @09:41AM (#179825)

      There is only a dual modular redundancy for the Inertial Reference System (SRI) in Ariane 5.
      See: http://esamultimedia.esa.int/docs/esa-x-1819eng.pdf [esa.int] section 2.1 page 3 for the report.

      As far as I know, dual modular redundancy (https://en.wikipedia.org/wiki/Dual_modular_redundant [wikipedia.org]) seems to be a standard practice in European space systems, except maybe where voting is needed/desired (e.g. thermal measurements...)

    • (Score: 3, Informative) by bootsy on Thursday May 07 2015, @12:15PM

      by bootsy (3440) on Thursday May 07 2015, @12:15PM (#179861)

      I had always been told this was due to the Ada language the software was written in throwing an exception, and this exception overwriting an area of memory that held the rocket direction variables. Since the whole thing was embedded there was only a small working area of memory. If you've ever programmed in Ada you will know it is very fussy (read: type safe), and I believe it was the first language to implement exceptions, although I'm sure a reply will turn up showing an example before this.

      • (Score: 2) by darkfeline on Thursday May 07 2015, @08:17PM

        by darkfeline (1030) on Thursday May 07 2015, @08:17PM (#180055) Homepage

        Like all revolutionary programming paradigms, exception handling was first implemented/invented in Lisp.

        https://en.wikipedia.org/wiki/Exception_handling#Exception_handling_in_software [wikipedia.org]

        --
        Join the SDF Public Access UNIX System today!
        • (Score: 2) by bootsy on Friday May 08 2015, @08:21AM

          by bootsy (3440) on Friday May 08 2015, @08:21AM (#180238)

          Thanks for the link. I love this quote about Ada from it, so relevant.

          "...a plethora of features and notational conventions, many of them unnecessary and some of them, like exception handling, even dangerous. [...] Do not allow this language in its present state to be used in applications where reliability is critical[...]. The next rocket to go astray as a result of a programming language error may not be an exploratory space rocket on a harmless trip to Venus: It may be a nuclear warhead exploding over one of our own cities."

  • (Score: 5, Interesting) by Anonymous Coward on Thursday May 07 2015, @07:43AM

    by Anonymous Coward on Thursday May 07 2015, @07:43AM (#179797)

    The Linux kernel has such a counter, and at some point somebody realized that drivers often used that counter in an incorrect way - when the counter reaches the highest possible number, it starts over, and if the driver only checks for higher values, it will not work after the rollover, when the new values are lower than the value before the rollover.

    It was decided to initialize the counter to five minutes before rollover instead of zero. That way, those drivers would stop working five minutes after boot, rather than several weeks after, forcing the people responsible for the drivers to deal with the problem.
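
The kernel provides comparison helpers such as time_after() for exactly this. The sketch below is a user-space illustration of the same idea with hypothetical names (tick_after, tick_after_naive) and a 32-bit counter, not the kernel's own code:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if tick count a is later than b, assuming the two are less than
   2^31 ticks apart. The difference is computed in unsigned arithmetic and
   then reinterpreted as signed, so the answer stays right across a
   rollover. (The unsigned-to-signed conversion is implementation-defined
   in C; mainstream compilers wrap to two's complement.) */
bool tick_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* The naive test a driver might use instead; it breaks once a has wrapped
   around to a small value while b is still large. */
bool tick_after_naive(uint32_t a, uint32_t b)
{
    return a > b;
}
```

With b taken just before the rollover and a just after it, tick_after reports a as later while tick_after_naive does not, which is the driver bug described above.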

  • (Score: 0) by Anonymous Coward on Thursday May 07 2015, @07:49AM

    by Anonymous Coward on Thursday May 07 2015, @07:49AM (#179799)

    ...they had used systemd

  • (Score: 3, Funny) by kaszz on Thursday May 07 2015, @09:44AM

    by kaszz (4211) on Thursday May 07 2015, @09:44AM (#179826) Journal

    This should not be a problem if the programmers on the project have talent [soylentnews.org] and did their homework [soylentnews.org]. :P
    We can't ever admit people have different gifts.

  • (Score: 4, Insightful) by bradley13 on Thursday May 07 2015, @11:53AM

    by bradley13 (3053) on Thursday May 07 2015, @11:53AM (#179853) Homepage Journal

    The workings of integers in CPUs have been unchanged since the earliest days; except in very specialized cases, they are not going to change.

    The main factor at work here is programmer competence (remember our discussion of a few days ago? [soylentnews.org]). I teach students about the behavior of integers no later than the second week of the very first programming course. This comes up again from time-to-time (along with other practical "gotchas").

    Any programmer who creates an integer variable and doesn't consider "what is the largest value that this integer will ever have" has made a fundamental mistake. If the programmer hasn't asked that question, then ideas like having the hardware throw an exception will not help, because the programmer won't think to build in exception handling. If the integer value is something that counts upwards forever, then the obvious question is "when is forever finished?". A counter that will overflow in 20 years - one may deliberately decide to accept that risk, as long as it is documented (lots of software will still be used in 20 years). A counter that overflows after 248 days? That's maybe not so smart.

    This is also an obvious area where a bit of whitebox testing would catch the problem: look inside the program and find weak spots, like counters that might overflow. Lots of places think blackbox testing is all that is required - in lots of cases, that's true enough. But for anything really important, you need both.

    tl;dr - You can't fix stupid.

    --
    Everyone is somebody else's weirdo.
    • (Score: 2) by c0lo on Thursday May 07 2015, @12:50PM

      by c0lo (156) Subscriber Badge on Thursday May 07 2015, @12:50PM (#179870) Journal

      > If the integer value is something that counts upwards forever, then the obvious question is "when is forever finished?"

      Huh! Elementary!

      for(uint i=maxVal-1; i>=0; i--) {
         // do something
      }

      -----
      (a bug that ate my soul for 2 days in my early years in software).
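
For anyone who hasn't been bitten yet: an unsigned i is always >= 0, so the loop above never ends; after reaching 0, i-- wraps to the maximum value. One common way to count down with an unsigned index, sketched in C (sum_backwards is a made-up example function):

```c
#include <stddef.h>

/* Sum an array from the last element to the first. The test `i-- > 0`
   checks i before decrementing it, so the body sees n-1, n-2, ..., 0 and
   the loop stops without relying on an unsigned value going negative. */
long sum_backwards(const int *values, size_t n)
{
    long total = 0;
    for (size_t i = n; i-- > 0; ) {
        total += values[i];
    }
    return total;
}
```

The same shape also handles n == 0 correctly, because the test fails before the body ever runs.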

      --
      https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
      • (Score: 3, Funny) by PiMuNu on Thursday May 07 2015, @01:25PM

        by PiMuNu (3823) on Thursday May 07 2015, @01:25PM (#179886)

        > (a bug that ate my soul for 2 days in my early years in software).

        if (condition);
                do_something();

        • (Score: 0) by Anonymous Coward on Thursday May 07 2015, @01:39PM

          by Anonymous Coward on Thursday May 07 2015, @01:39PM (#179894)
          retry
    • (Score: 2) by Jesus_666 on Thursday May 07 2015, @06:48PM

      by Jesus_666 (3044) on Thursday May 07 2015, @06:48PM (#180020)
      Then again, a counter that overflows after 248 days when you can assume that any sane person will shut down the system every ten days or so is probably not that terrible a design decision. An airplane that is kept active for 248 days non-stop without any kind of major maintenance is probably going to fall out of the sky long before that counter overflows. So in this case forever can be reasonably assumed to be finished long before overflow becomes an issue.

      Of course you still want the system to fail gracefully just in case you ever make that counter faster.
    • (Score: 2) by tangomargarine on Thursday May 07 2015, @07:33PM

      by tangomargarine (667) on Thursday May 07 2015, @07:33PM (#180041)

      BigInteger [oracle.com]

      So from the above Ariane story, if they lose all detectors the vehicle self-destructs? Hrm...what could possibly go wrong...

      --
      "Is that really true?" "I just spent the last hour telling you to think for yourself! Didn't you hear anything I said?"
  • (Score: 2) by sjames on Thursday May 07 2015, @07:55PM

    by sjames (2882) on Thursday May 07 2015, @07:55PM (#180050) Journal

    Many languages including C and FORTRAN rely on the hardware's natural integer size and the hardware math operations because those are much faster. At one time, there was simply no choice, software math was too slow to even consider.

    Other languages like Python have arbitrary sized integers. You will never roll a counter over in Python without an explicit use of modulus (though you could potentially run out of memory). You can use an arbitrary precision library in C as well, but at a similar cost in speed. These days, it's not as critical as it used to be because of advances in hardware, but there are still cases where it matters and since big ints aren't first class in C, they tend not to be used unless they MUST be rather than using them unless they cannot be.

    In some ways, I wish C would gain arbitrary ints and floats, but that would be problematic in the case of arrays or structs of ints.

    Of course, some of the old mainframes supported BCD in hardware and COBOL could specify an arbitrary (but fixed) variable size. Of course, that resulted in a less efficient encoding and tended to result in smaller rather than larger variables. That became really relevant for y2k.

  • (Score: 2) by istartedi on Thursday May 07 2015, @09:29PM

    by istartedi (123) on Thursday May 07 2015, @09:29PM (#180078) Journal

    I've seen people say, "just use arbitrary precision and be done". The counter-point
    to this is that it doesn't perform as well. You could also raise overflow exceptions; but I've
    never been a big fan of exceptions due to the unwinding problem (Linus Torvalds is in that camp, IIRC).

    The solution you hear less often is to enhance the type system so that every
    integer actually has something like "type Int | Fail". This actually seems like the
    best compromise to me. You do take some performance hit, but it's probably not
    as bad as a bignum library. You don't throw exceptions, so unwinding isn't a problem.
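
A rough sketch of the "Int | Fail" idea in plain C, assuming a struct with a validity flag stands in for the sum type (checked_int, checked_from and checked_add are hypothetical names; a language with real sum types would force the check at compile time, which C cannot do):

```c
#include <limits.h>
#include <stdbool.h>

typedef struct {
    bool ok;      /* false plays the role of "Fail" */
    int  value;   /* meaningful only when ok is true */
} checked_int;

/* Wrap a plain int as a successful result. */
checked_int checked_from(int v)
{
    checked_int r = { true, v };
    return r;
}

/* Addition that yields Fail on overflow and propagates an earlier Fail.
   The range tests run before the add, so signed overflow never occurs. */
checked_int checked_add(checked_int a, checked_int b)
{
    checked_int fail = { false, 0 };
    if (!a.ok || !b.ok) return fail;
    if (b.value > 0 && a.value > INT_MAX - b.value) return fail;
    if (b.value < 0 && a.value < INT_MIN - b.value) return fail;
    return checked_from(a.value + b.value);
}
```

Because a Fail propagates through later additions, a long calculation can be checked once at the end instead of after every step, and nothing needs to unwind.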

    As others have said though, this problem isn't going away any time soon.
    There's just too much code floating around that uses bare integers, and too many
    people thinking that "big enough now" == "big enough later".

    --
    Appended to the end of comments you post. Max: 120 chars.