SoylentNews Comments | The Number Glitch That Can Lead to Catastrophe

The Number Glitch That Can Lead to Catastrophe

posted by CoolHand on Thursday May 07 2015, @07:03AM

from the not-as-promoted-as-y2k-bug dept.

A surprisingly simple bug afflicts computers controlling planes, spacecraft and more – they get confused by big numbers. As Chris Baraniuk discovers, the glitch has led to explosions, missing space probes and more.
Tuesday, 4 June 1996 will forever be remembered as a dark day for the European Space Agency (Esa). The first flight of the crewless Ariane 5 rocket, carrying with it four very expensive scientific satellites, ended after 39 seconds in an unholy ball of smoke and fire. It's estimated that the explosion resulted in a loss of $370m (£240m).
What happened? It wasn't a mechanical failure or an act of sabotage. No, the launch ended in disaster thanks to a simple software bug. A computer getting its maths wrong – essentially getting overwhelmed by a number bigger than it expected.
How is it possible that computers get befuddled by numbers in this way? It turns out such errors are answerable for a series of disasters and mishaps in recent years, destroying rockets, making space probes go missing, and sending missiles off-target. So what are these bugs, and why do they happen?
Imagine trying to represent a value of, say, 105,350 miles on an odometer that has a maximum value of 99,999. The counter would "roll over" to 00,000 and then count up to 5,350, the remaining value. This is the same species of inaccuracy that doomed the 1996 Ariane 5 launch. More technically, it's called "integer overflow", essentially meaning that numbers are too big to be stored in a computer system, and sometimes this can cause malfunction.
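
The odometer analogy can be sketched in a few lines of C++. This is only an illustration: the names `odometer_reading` and `wrap_u16` are invented here, and the 16-bit width is an arbitrary choice, not a detail taken from any of the systems in the article.

```cpp
#include <cstdint>

// Five-digit odometer: only the value modulo 100,000 can be displayed.
constexpr std::uint32_t odometer_reading(std::uint32_t miles) {
    return miles % 100000u;
}

// The same effect in binary: a 16-bit unsigned counter keeps only the
// low 16 bits of a value, i.e. the value modulo 65,536.
constexpr std::uint16_t wrap_u16(std::uint32_t value) {
    return static_cast<std::uint16_t>(value);
}

static_assert(odometer_reading(105350u) == 5350u, "rolls over past 99,999");
static_assert(wrap_u16(65536u) == 0u, "65,536 does not fit in 16 bits");
static_assert(wrap_u16(70000u) == 4464u, "70,000 - 65,536 = 4,464");
```

Unsigned wraparound like this is well defined in C++; overflow of signed integers is undefined behaviour, so it cannot be relied on to wrap.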
Such glitches emerge with surprising frequency. It's suspected that the reason why Nasa lost contact with the Deep Impact space probe in 2013 was an integer limit being reached.
And just last week it was reported that Boeing 787 aircraft may suffer from a similar issue. The control unit managing the delivery of power to the plane's engines will automatically enter a failsafe mode – and shut down the engines – if it has been left on for over 248 days.
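
The summary does not say where the 248-day figure comes from. One commonly suggested explanation, which is an assumption here and not something the article confirms, is a signed 32-bit counter that ticks every hundredth of a second. The arithmetic is easy to check:

```cpp
#include <cstdint>
#include <limits>

// ASSUMPTION: a signed 32-bit tick counter incremented 100 times per second.
// Neither the counter width nor the tick rate is stated in the article.
constexpr std::int64_t kTicksPerSecond = 100;
constexpr std::int64_t kSecondsPerDay = 24 * 60 * 60;
constexpr std::int64_t kMaxTicks = std::numeric_limits<std::int32_t>::max();

// Whole days of uptime before the counter reaches its maximum value.
constexpr std::int64_t whole_days_until_overflow() {
    return kMaxTicks / (kTicksPerSecond * kSecondsPerDay);
}

static_assert(whole_days_until_overflow() == 248, "about 248.55 days");
```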


Simple Solution (Score: 3, Funny) by Geotti on Thursday May 07 2015, @07:15AM

by Geotti (1146) on Thursday May 07 2015, @07:15AM (#179791) Journal

Add an array of integers to count the rollovers and embed this functionality in the type itself so it's transparent to the coder. 65536^array.length "should be enough for everyone" (tm) ;)
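
Taken at face value, the suggestion looks roughly like the sketch below. `WideCounter` is a hypothetical name and the 16-bit element type is only chosen to match the 65536 in the comment; a real arbitrary-precision type needs far more than an increment operation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical counter made of N 16-bit "digits", least significant first.
// Each element counts the rollovers of the one before it, so the range is
// 65536^N values instead of 65536.
template <std::size_t N>
struct WideCounter {
    std::array<std::uint16_t, N> digits{};

    // Adds one, carrying into the next digit whenever a digit wraps to 0.
    // Returns false only if every digit wrapped, i.e. the whole counter
    // overflowed.
    bool increment() {
        for (auto& d : digits) {
            d = static_cast<std::uint16_t>(d + 1u);
            if (d != 0) return true;
        }
        return false;
    }
};
```

Note that the overflow has only been moved further away: a fixed-length array still has a largest value, which is why `increment()` keeps a failure return.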


Re:Simple Solution (Score: 5, Informative) by stormwyrm on Thursday May 07 2015, @07:56AM

by stormwyrm (717) on Thursday May 07 2015, @07:56AM (#179800) Journal

It's called GMP [gmplib.org]. The only trouble is there are a lot of applications, such as in embedded systems like those described in the article, where arbitrary precision arithmetic is unrealistic to add in. Not everything can run on the powerful processors with vast quantities of RAM that we take for granted these days. Many mass-market embedded systems use processors that are about as powerful as the 6502-variant inside my old Commodore 64, because microcontrollers like that can be had for almost literally a dime a dozen in this day and age. Some embedded systems in aerospace applications make use of weaker CPUs because radiation hardening a CPU like a modern i7 is impossible, and the large quantities of memory that are required for arbitrary precision arithmetic simply aren't feasible because of hardening/weight/size/power constraints in the package. Handling integer overflow properly is par for the course for embedded systems design.
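
On a small processor, handling overflow usually means detecting it with ordinary fixed-width arithmetic, with no arbitrary-precision library involved. A minimal sketch, assuming a 32-bit signed value and a C++17 compiler; `checked_add` is a name invented for this example, and what the caller does on failure is application-specific.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Returns a + b, or std::nullopt if the exact sum does not fit in 32 bits.
// The range test runs before the addition, because signed overflow itself
// is undefined behaviour in C++.
constexpr std::optional<std::int32_t> checked_add(std::int32_t a, std::int32_t b) {
    constexpr std::int32_t kMax = std::numeric_limits<std::int32_t>::max();
    constexpr std::int32_t kMin = std::numeric_limits<std::int32_t>::min();
    if (b > 0 && a > kMax - b) return std::nullopt;  // would exceed kMax
    if (b < 0 && a < kMin - b) return std::nullopt;  // would go below kMin
    return a + b;
}
```

The check costs two comparisons and no extra memory, which is the kind of budget the embedded targets described above can afford.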

--
Numquam ponenda est pluralitas sine necessitate.

- Re:Simple Solution (Score: 2) by FatPhil on Thursday May 07 2015, @08:35AM
  
  by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Thursday May 07 2015, @08:35AM (#179808) Homepage
  
  It's called LISP. Which was first implemented on an IBM 704. Which used vacuum tubes, and could execute up to 12,000 floating-point additions per second, and had a Magnetic Core Storage Unit of 4096 36-bit words (so 18432 bytes) as RAM.
  
  So the 6510 wins that contest.
  
  Of course, the other alternative is to use trapping overflows, as is possible in the hopefully-forthcoming Mill architecture, so that an overflow is flagged as an error immediately. You still need to decide what to do when the unintended happens, though. This is why majority-of-three redundancy is a good thing in critical environments: if one of the units says "I dunno, I gave up trying", then you go with the other two units' results. If they disagree, well, you're screwed. But you shouldn't have employed two sets of shitty software engineers in the first place. (Off the top of my head, there might be ways around such situations: you could add some noise to the inputs, try again, and favour the unit with the more stable output. But what should have been a single simple calculation has turned into a lengthy procedure which itself may have bugs.)
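  
  The voting rule described above can be sketched as follows. This is an illustration under assumptions, not a description of any real flight system: `Reading` and `vote` are invented names, and a unit that "gave up" is modelled as an empty `std::optional`.
  
  ```cpp
  #include <array>
  #include <cstddef>
  #include <optional>
  
  // Each redundant unit reports a value, or std::nullopt for "I gave up"
  // (for example after a trapped overflow).
  using Reading = std::optional<int>;
  
  // Returns the value reported by at least two of the three units, or
  // std::nullopt when no two units agree.
  inline Reading vote(const std::array<Reading, 3>& units) {
      for (std::size_t i = 0; i < 3; ++i) {
          for (std::size_t j = i + 1; j < 3; ++j) {
              if (units[i] && units[j] && *units[i] == *units[j]) {
                  return units[i];
              }
          }
      }
      return std::nullopt;
  }
  ```
  
  The empty result is the "you're screwed" case: the voter can report that there is no majority, but it cannot say which unit to trust.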
  
  --
  Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  
  - Re:Simple Solution (Score: 3, Informative) by Anonymous Coward on Thursday May 07 2015, @09:01AM
    
    by Anonymous Coward on Thursday May 07 2015, @09:01AM (#179816)
    
    Except that the Ariane did test for overflow. However, the correct flight path of Ariane 5 was simply guaranteed to overflow the modules designed for Ariane 4. The modules then gave error messages instead of flight data values. Since that was true for all the modules (which of course all used the same code), the redundancy didn't help.
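    
    A hedged sketch of the general situation: a wide floating-point value is narrowed to a 16-bit integer, and the conversion reports out-of-range input instead of producing a wrong number. The name `to_int16` and the use of `std::optional` are assumptions for illustration; this is not the Ariane code, which was written in Ada.
    
    ```cpp
    #include <cmath>
    #include <cstdint>
    #include <optional>
    
    // Converts a double to int16_t, truncating toward zero.
    // Returns std::nullopt for NaN or for values outside [-32768, 32767],
    // where a plain cast would be undefined behaviour.
    inline std::optional<std::int16_t> to_int16(double x) {
        if (std::isnan(x)) return std::nullopt;
        const double t = std::trunc(x);
        if (t < -32768.0 || t > 32767.0) return std::nullopt;
        return static_cast<std::int16_t>(t);
    }
    ```
    
    Detection is only half of the job: every caller still has to decide what a missing value means, and identical code on every redundant unit will report the same failure at the same moment.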
    
    - Re:Simple Solution (Score: 3, Interesting) by PiMuNu on Thursday May 07 2015, @11:57AM
      
      by PiMuNu (3823) on Thursday May 07 2015, @11:57AM (#179855)
      
      The mantra my colleague in the nuclear industry has is "diversity and redundancy". So the coolant water has to have a backup pump, and you need a backup air coolant system as well. Multiple failures then require some common failure mode, like a tsunami, to break the system...
      
      - Re:Simple Solution (Score: 0) by Anonymous Coward on Thursday May 07 2015, @02:27PM
        
        by Anonymous Coward on Thursday May 07 2015, @02:27PM (#179922)
        
        Redundancy does not help when the data really IS outside the allowed range because someone reused the code from an older-generation rocket with less power.
        
- Re:Simple Solution (Score: 2) by c0lo on Thursday May 07 2015, @12:38PM
  
  by c0lo (156) on Thursday May 07 2015, @12:38PM (#179869) Journal
  "because microcontrollers like that can be had for almost literally a dime a dozen in this day and age."
  I can't believe a Boeing 787 can't afford a modern CPU (pricewise or energywise). What would it add to the price of the aircraft, a few hundred dollars?
  Is the ass of the free market fairy that tight? (even after the banksters abused it repeatedly?)
  "the large quantities of memory that are required for arbitrary precision arithmetic"
  A 64-bit unsigned integer goes up to about 1.84e19. Let's put down some numbers for comparison:
  estimated age of the Universe [wikipedia.org] in seconds = 4e17
  1 Astronomical Unit in micrometers = 1.5e17
  speed of light in hydrogen atom diameters [wikipedia.org] per second = 1.14e19
  Why one would need arbitrary precision arithmetic for the usual times, distances and speeds of flight in Earth's vicinity is beyond me.
  --
  https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
  - Re:Simple Solution (Score: 2) by Reziac on Sunday May 10 2015, @02:16AM
    
    by Reziac (2489) on Sunday May 10 2015, @02:16AM (#180940) Homepage
    
    Regardless of what it runs on, it should be restarted before every flight as part of the pre-flight check. Why would it run for 248 days in the first place??
    
    --
    And there is no Alkibiades to come back and save us from ourselves.
    
- Re:Simple Solution (Score: 3, Informative) by TheB on Thursday May 07 2015, @03:20PM
  
  by TheB (1538) on Thursday May 07 2015, @03:20PM (#179946)
  
  Handling integer overflow properly SHOULD BE par for the course for embedded systems design.
  Too often, code I review has potential overflow bugs in it. It's one of the first things I look for, and I find them just about everywhere.
  Even the Arduino libraries have them.
  From Stepper.cpp
  while(steps_left > 0) {
    // move only if the appropriate delay has passed:
    if (millis() - this->last_step_time >= this->step_delay) {
      ...
    }
    ...
  }
  
  In this code, if "this->last_step_time" + "this->step_delay" is greater than 4,294,967,295 (the theoretical maximum value of millis()), it will loop forever.
  Values slightly less can become stuck in the loop, but not always; it depends on the value millis() returns.
  Since millis() returns an unsigned long and counts milliseconds, it only wraps after 4,294,967+ seconds (about 49.7 days), so assuming no program runs that long sounds reasonable.
  However, I was asked to debug a prototype that was malfunctioning. They needed 1/4 ms accuracy and had modified the library to use micros() instead of millis(). The machine would randomly freeze at ~70 min intervals. Being new to Arduino, I looked up micros() and saw "This number will overflow (go back to zero), after approximately 70 minutes." Problem found, and easily fixed.
  Later I found they were having similar trouble with other machines used in production. Although these did not use Arduinos, they would still malfunction at ~50 day intervals. Multimillion-dollar machines with unfixed overflow bugs. Their solution was to reset the machines every month, and they were not interested in fixing the code. "It needs to be cleaned anyway."
  Sad.
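  
  For timer rollover in general, the two common ways of writing the "has the delay passed?" test behave very differently near the wrap. A minimal sketch with invented function names, using `std::uint32_t` as a stand-in for the 32-bit `unsigned long` that `millis()` and `micros()` return on Arduino:
  
  ```cpp
  #include <cstdint>
  
  // Wrap-safe: unsigned subtraction is modulo 2^32, so (now - last) is the
  // true elapsed time as long as fewer than 2^32 ticks have passed.
  constexpr bool delay_passed(std::uint32_t now, std::uint32_t last, std::uint32_t delay) {
      return static_cast<std::uint32_t>(now - last) >= delay;
  }
  
  // Not wrap-safe: (last + delay) can wrap around to a small number, so the
  // comparison gives wrong answers near the rollover.
  constexpr bool delay_passed_naive(std::uint32_t now, std::uint32_t last, std::uint32_t delay) {
      return now >= static_cast<std::uint32_t>(last + delay);
  }
  ```
  
  The subtraction form only stays correct when all three operands have the same unsigned type and width, so the declared types of the timestamp and delay variables are worth checking whenever a timer source is swapped.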
  
