
SoylentNews is people

posted by Fnord666 on Friday July 26 2019, @06:07AM
from the have-you-tried-turning-it-off-and-back-on-again? dept.

Submitted via IRC for Bytram

Airbus A350 software bug forces airlines to turn planes off and on every 149 hours

Some models of Airbus A350 airliners still need to be hard rebooted after exactly 149 hours, despite warnings from the EU Aviation Safety Agency (EASA) first issued two years ago.

In a mandatory airworthiness directive (AD) reissued earlier this week, EASA urged operators to turn their A350s off and on again to prevent "partial or total loss of some avionics systems or functions".

The revised AD, effective from tomorrow (26 July), exempts only those new A350-941s which have had modified software pre-loaded on the production line. For all other A350-941s, operators need to completely power the airliner down before it reaches 149 hours of continuous power-on time.

Concerningly, the original 2017 AD was brought about by "in-service events where a loss of communication occurred between some avionics systems and avionics network" (sic). The impact of the failures ranged from "redundancy loss" to "complete loss on a specific function hosted on common remote data concentrator and core processing input/output modules".

In layman's English, this means that prior to 2017, at least some A350s flying passengers were suffering unexplained failures of potentially flight-critical digital systems.

Not a power of two. I wonder why 149 hours?



 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Informative) by sshelton76 on Friday July 26 2019, @07:06AM (39 children)

    by sshelton76 (7978) on Friday July 26 2019, @07:06AM (#871360)

    Hmmm I've seen this behavior before.

    149 hours is 536,400 seconds, but what is the value in milliseconds?
    What's the range of a signed 32 bit integer? And why signed? Because who has time to really remember that counters should only ever count upwards and do a little range checking, or type out that extra keyword "unsigned" amirite? I mean who ever heard of "unsigning" something, the name tells you right away that's probably going to be slower.

    Now imagine you're a low paid coder in say India or China and you're building this really cool system and for your efforts you are being paid almost enough to buy a box of crackers in the USA, but hey that's rent and food for the next week there.

    You have to put your code up against testing though and that means it has to work even though you have no idea how to engineer software, but you do know how to copy/paste out of stack overflow and well that's gotta count for something, right?

    Because you're forced to get it to pass tests and now you see an overflow happen in testing. You can't fix the overflow because you have no idea what's causing it. But all you have to do is get it to pass testing and then you get paid.

    So instead of changing the value type to a properly large unsigned int type like double, long, or long long (because it's unlikely you had ever heard of these, why would silly Americans want Dragons or Dragon Dragons in their code? It just makes no sense!). So instead of doing that, you just take the absolute value of the counter instead and *poof* the overflow problem is solved and testing suddenly passes. You my friend are now a hero!

    What you don't know is that instead of an underflow at around 3 to 4 days (reasonably easy to test for) or a wraparound at 8 days (harder to test for), it takes almost 149 hours (around 6 days) for the counter / ms sampling timer to reach 0.

    But no one is likely to even notice this since no one runs hard tests for 6 or 7 days straight.
    As far as you're concerned everything is fine.

    But somewhere else along the line is another critical piece that was also farmed out and that piece looks something like...

    z = y/x

    Wherein x is your ms accurate sampling timer.

    Now at about 149 hours you get a divide by zero error as soon as x = 0 and all hell breaks loose.

    Not joking, I fixed exactly this bug on some electronic signage in Vegas that sampled the outdoor lighting conditions and changed the monitor contrast to match.
    It would hard crash every 149 hours like clockwork.

    Also not joking, it was caused by the company accepting the multi-million dollar contract, farming the work out to a "partner", who farmed it out to a subsidiary who farmed it out to some temp workers some in India and some in China they found on a freelancing website, all of whom evidently knew how to talk the talk but had no idea what they were doing. In most cases, they had copied and pasted the code directly from stackoverflow which at least in this case was a remarkably prescient name.

    Anyways, mystery solved!
     

  • (Score: 3, Informative) by coolgopher on Friday July 26 2019, @07:10AM (7 children)

    by coolgopher (1157) on Friday July 26 2019, @07:10AM (#871361)

    double is neither integral, nor unsigned. At least not on any system/language I've ever worked on (and it's been a few by now).

    Other than that, sadly true.

    • (Score: 2) by sshelton76 on Friday July 26 2019, @07:13AM (6 children)

      by sshelton76 (7978) on Friday July 26 2019, @07:13AM (#871362)

      Haha, good catch! I'm tired and writing from memory.

      • (Score: 1, Touché) by Anonymous Coward on Friday July 26 2019, @09:53AM (5 children)

        by Anonymous Coward on Friday July 26 2019, @09:53AM (#871390)

        Yeah, inventing colorful stories from mishmashed memories is tiresome. Like fitting a square peg int type in a double round hole.
        In actual real life stories that kind of lapsus just doesn't happen.

        • (Score: 0) by Anonymous Coward on Friday July 26 2019, @10:37AM

          by Anonymous Coward on Friday July 26 2019, @10:37AM (#871404)

          What are you, nutty? We're reading a story about it happening to an international airline manufacturer, and you say it doesn't happen!

          ROTFL

        • (Score: 0) by Anonymous Coward on Friday July 26 2019, @02:52PM (1 child)

          by Anonymous Coward on Friday July 26 2019, @02:52PM (#871490)

          You're right. No one would ever be stupid enough to farm out work to the retarded monkeys in India.

          • (Score: 0) by Anonymous Coward on Friday July 26 2019, @08:57PM

            by Anonymous Coward on Friday July 26 2019, @08:57PM (#871620)

            Even more retarded - letting the same monkeys INTO other countries and allowing them to TAKE OVER "services"...

        • (Score: 0) by Anonymous Coward on Friday July 26 2019, @03:03PM (1 child)

          by Anonymous Coward on Friday July 26 2019, @03:03PM (#871495)

          in a double round hole.

          Moebius has been here?

          • (Score: 0) by Anonymous Coward on Saturday July 27 2019, @12:03PM

            by Anonymous Coward on Saturday July 27 2019, @12:03PM (#871872)

            But would Möbius' one-sided coin fit in Pythagoras' square circle?

  • (Score: 5, Funny) by MostCynical on Friday July 26 2019, @07:22AM (2 children)

    by MostCynical (2589) on Friday July 26 2019, @07:22AM (#871363) Journal

    Congratulations; your comment is interesting, insightful, informative, and on topic.

    What are you doing posting on SN?

    --
    "I guess once you start doubting, there's no end to it." -Batou, Ghost in the Shell: Stand Alone Complex
    • (Score: 0, Funny) by Anonymous Coward on Friday July 26 2019, @07:42AM (1 child)

      by Anonymous Coward on Friday July 26 2019, @07:42AM (#871368)

      Must be building karma so his Viagra spam gets on the front page Journal list. Or maybe there is an overflow bug he saw in the moderation system with just the right combination that he is hoping to exploit to spread his Viagra spam.

      *FDA disclaimer: Viagra spam is not to be consumed by children, women who are pregnant or may become pregnant, people on medicinal marijuana, people on certain types of blood thinners or take a daily aspirin, people who suffer from depression, or anxiety, people who live in the Bronx, those not on medicinal marijuana, or people who know what the Sun looks like.*

      • (Score: 1, Informative) by Anonymous Coward on Friday July 26 2019, @07:27PM

        by Anonymous Coward on Friday July 26 2019, @07:27PM (#871594)

        Oops, I meant that as a joke, but I guess I was not obvious enough in the disclaimer (or did not take into consideration that a good chunk of people would not read it).

        Also, "flamebait" as the mod? Not sure that is what I think of when I think of that term. Flamebaiting is more like, "Of course he would say that, he's a Jew." Basically trying to prod someone into losing their cool or, like Wiktionary put it: "Content in an online forum, such as a newsgroup, with the intent of provoking anger, resulting in flames and sometimes flamewars."

        Either way, apologies for any offense.

  • (Score: 0) by Anonymous Coward on Friday July 26 2019, @07:34AM (5 children)

    by Anonymous Coward on Friday July 26 2019, @07:34AM (#871365)

    I'm confused. According to Wikipedia, 32-bit integers that are signed have a range of "From −2,147,483,648 to 2,147,483,647, from −(2^31) to 2^31 − 1." So why would 536,000,000 (give or take) be the magic number for it to hit that error? I don't really have to worry about that stuff in Python, except when using ctypes or the like, but libraries always tell you what to put in. Maybe if you could translate your general idea into some iterative loop or something, I might understand this puzzle.
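
The question above can be checked with plain arithmetic. The following is a minimal sketch using only the figures quoted in the thread; the variable names are made up for illustration. It suggests that 149 hours of millisecond ticks is nowhere near the signed 32-bit limit, which a 1 ms tick would only reach after roughly 24.9 days.

```python
# Check whether 149 hours of millisecond ticks can overflow a signed 32-bit int.
INT32_MAX = 2**31 - 1            # 2,147,483,647

hours = 149
seconds = hours * 3600           # 536,400
milliseconds = seconds * 1000    # 536,400,000

# Time needed for a 1 ms tick to step past INT32_MAX, in hours and days.
overflow_hours = (INT32_MAX + 1) / 1000 / 3600
overflow_days = overflow_hours / 24

print(f"149 h = {milliseconds:,} ms; signed 32-bit limit = {INT32_MAX:,}")
print(f"a 1 ms signed 32-bit counter overflows after {overflow_hours:.1f} h "
      f"({overflow_days:.1f} days)")
```

So a millisecond counter alone does not explain 149 hours; some other tick rate has to be involved.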

    • (Score: 4, Insightful) by coolgopher on Friday July 26 2019, @08:14AM (4 children)

      by coolgopher (1157) on Friday July 26 2019, @08:14AM (#871372)

      Fair point, the math is a bit off there. The general gist certainly applies though.

      Assume an unsigned 32-bit integer is used somewhere, say for a timer counter, and said counter is running off an 8kHz crystal. It will wrap around to zero after 149.13 hours. If the counter was stored in a signed 32-bit integer variable, you'd hit the 149-and-a-bit hours with a 4kHz clock (not as common as an 8kHz clock, I'd say). My guess is that someone simply didn't bother coding for the wrap-around/overflow scenario, and there's an 8kHz clock involved here somewhere.
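
The two wrap-around times mentioned here can be reproduced with a short sketch. The helper name `hours_until_wrap` is hypothetical, and the 8 kHz and 4 kHz rates are the commenter's guesses, not anything confirmed about the A350.

```python
def hours_until_wrap(bits: int, tick_hz: float) -> float:
    """Hours until a free-running counter with `bits` usable bits wraps."""
    return (2**bits) / tick_hz / 3600

# Full unsigned 32-bit range at 8 kHz.
unsigned_8khz = hours_until_wrap(32, 8_000)
# Positive half of a signed 32-bit int (31 usable bits) at 4 kHz.
signed_4khz = hours_until_wrap(31, 4_000)

print(f"unsigned 32-bit @ 8 kHz: {unsigned_8khz:.2f} h")
print(f"signed 32-bit   @ 4 kHz: {signed_4khz:.2f} h")
```

Both cases land on the same figure because halving the range and halving the tick rate cancel out.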

      • (Score: 2) by sshelton76 on Friday July 26 2019, @11:08AM (3 children)

        by sshelton76 (7978) on Friday July 26 2019, @11:08AM (#871414)

        It was like 3 AM when I posted, so the math isn't going to be exact and doesn't need to be. It's just something I encountered once and seemed relevant. With a 140 to 160 hour reboot cycle, it's going to be milliseconds accumulating in an int somewhere. The rest is how I know that :)

        • (Score: 3, Informative) by coolgopher on Friday July 26 2019, @01:07PM (2 children)

          by coolgopher (1157) on Friday July 26 2019, @01:07PM (#871434)

          Not ragging on you, I just got curious about the number discrepancy and when I looked at it an 8kHz multiplier fell out of it. Which, on top of milliseconds of course makes it an 8MHz clock, which is rather common for middle-of-the-range mcus. Seems eminently plausible to me.

          Pretty sure anyone who has done any serious amount of work on embedded systems has had to fix up somebody else's timer/counter code that didn't handle wrap around correctly. And their own. Though I had certainly expected better from the avionics industry.

          • (Score: 0) by Anonymous Coward on Friday July 26 2019, @09:50PM

            by Anonymous Coward on Friday July 26 2019, @09:50PM (#871633)

            Ok. That makes more sense. I was thinking of a tick rate of one per some standard or base 10 unit of time. Any other rate didn't occur to me. But yes 2^32 events divided by 8000 events per second is equal to 149 hours, 7 minutes, 50 seconds, and 114/125ths (7296 events).
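
The breakdown above can be verified exactly with rational arithmetic. This is just a sketch of the calculation using Python's `fractions` module; the variable names are illustrative.

```python
from fractions import Fraction

ticks = 2**32        # events until a 32-bit counter wraps
rate_hz = 8_000      # assumed tick rate from the thread

total = Fraction(ticks, rate_hz)            # exact seconds until wrap
whole_seconds, frac = divmod(total, 1)      # integer seconds and fractional part
hours, rem = divmod(whole_seconds, 3600)
minutes, seconds = divmod(rem, 60)
leftover_ticks = frac * rate_hz             # fractional second expressed in ticks

print(hours, minutes, seconds, frac, leftover_ticks)
```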

            Given that, the explanation given makes sense, something counts up somewhere and then divides by zero after rolling over. Although I could have sworn C had a way to detect overflows (or maybe C++ or C#? Don't know, am a mostly-Python guy), I always do a check on ctypes to make sure that the greater than or less than relationship I'm expecting holds. The same holds with the languages I know that do check that anyway. So, I can't believe they would miss something so obvious; but, then again, I've seen people cause all sorts of problems by not understanding the basics (/ returns a float, // returns floored integer, but the code uses both (usually verbatim copies from the internet)).
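
For what it's worth, the ctypes integer types do not detect overflow for you; an out-of-range value is silently wrapped. The sketch below shows that, along with the `/` versus `//` difference mentioned above. `checked_increment` is a hypothetical helper, one possible way to make the overflow explicit rather than a standard function.

```python
import ctypes

INT32_MAX = 2**31 - 1

# ctypes silently wraps an out-of-range value; no exception is raised.
wrapped = ctypes.c_int32(INT32_MAX + 1).value   # wraps to the most negative int32

def checked_increment(counter: int) -> int:
    """Increment, refusing to step past the signed 32-bit maximum."""
    if counter >= INT32_MAX:
        raise OverflowError("32-bit counter would overflow")
    return counter + 1

# True division always yields a float; floor division rounds toward minus infinity.
true_div = 7 / 2
floor_div = 7 // 2
neg_floor_div = -7 // 2
```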

          • (Score: 2) by sshelton76 on Saturday July 27 2019, @03:19AM

            by sshelton76 (7978) on Saturday July 27 2019, @03:19AM (#871740)

            This is exactly correct.

            I should also state that these are distinct hardware components.
            There was an ambient light sensor attached to a box that sampled the sensor and turned it into data for the bus.
            Then there was the receiving end which fetched data off the bus and took action to adjust the brightness and contrast of the sign.

            The fetch rate of the receiver was once per ms. The receiver had the divide by zero error.
            The transmitter had the rollover issue.

            So by the time anyone thought to read data off the bus, we would just see a continuously incrementing (or decrementing) counter value being broadcast and literally no one suspected the receiver wasn't prepared to handle a roll over.

  • (Score: 2, Insightful) by Anonymous Coward on Friday July 26 2019, @08:27AM (10 children)

    by Anonymous Coward on Friday July 26 2019, @08:27AM (#871375)

    Now imagine you're a low paid coder in say India or China and you're building this really cool system and for your efforts you are being paid almost enough to buy a box of crackers in the USA, but hey that's rent and food for the next week there.

    Yes, bullshit. If you actually went to any of these countries, you'd soon realize that food costs the same everywhere. Heck, some poor places have food more expensive than in US while still only making $200 or $300 per month. Now tell that to the elitists in America bitching they can't live off $5000/mo because "food costs"

    Also not joking, it was caused by the company accepting the multi-million dollar contract, farming the work out to a "partner", who farmed it out to a subsidiary who farmed it out to some temp workers some in India and some in China they found on a freelancing website, all of whom evidently knew how to talk the talk but had no idea what they were doing. In most cases, they had copied and pasted the code directly from stackoverflow which at least in this case was a remarkably prescient name.

    Yeah, I think you are full of it. I see these mistakes too, but it has nothing to do with "freelancing" and "stackoverflow". Software is written by people, and people make mistakes. Lack of quality control is where the problem lies, not in your "skillz".

    If I got a dime for every time I see some hot-shot 20-something thinking they are the best and don't make mistakes, I'd be retired already.

    Furthermore, divide by 0 is not a stack overflow issue.

    z = y/x

    Wherein x is your ms accurate sampling timer.

    So hot shot, WTF is the sampling timer doing turning 0? Oh wait, most likely, it turns negative on the overflow here. Seems the bug is in the timing logic, not the divide routine.

    Now if you want an example of actual divide by 0 happening in software that ran on hundreds of millions of machines, just start Windows 98 in a modern VM and see a nice divide by zero BSOD in its delay loop calibration logic....

    • (Score: 3, Insightful) by janrinok on Friday July 26 2019, @09:12AM (8 children)

      by janrinok (52) Subscriber Badge on Friday July 26 2019, @09:12AM (#871384) Journal

      WTF is the sampling timer doing turning 0?

      Go back and read what he wrote. To overcome a testing failure the software was (probably) modified to return the absolute value. At some point in an integer's range it will return 0.

      • (Score: 1, Informative) by Anonymous Coward on Friday July 26 2019, @09:39AM

        by Anonymous Coward on Friday July 26 2019, @09:39AM (#871386)

        WTF is the sampling timer doing turning 0?

        Go back and read what he wrote. To overcome a testing failure the software was (probably) modified to return the absolute value. At some point in an integer's range it will return 0.

        abs(-30) =

        I leave this as an exercise for the reader.

      • (Score: 0) by Anonymous Coward on Friday July 26 2019, @11:44AM (6 children)

        by Anonymous Coward on Friday July 26 2019, @11:44AM (#871419)

        Yeah, the absolute value will return 0 at the exact same point as the original one did. So what?

        • (Score: 5, Informative) by janrinok on Friday July 26 2019, @01:04PM (5 children)

          by janrinok (52) Subscriber Badge on Friday July 26 2019, @01:04PM (#871433) Journal

          Aircraft avionic equipment sends data around the system using a data bus. There are several proprietary busses that are commonly used and it doesn't really matter which one is used, but I believe Airbus use [aviationtoday.com] the Mil-Std-1553 [wikipedia.org] variant.

          When sending data via a bus there has to be some form of data validation, i.e. the system must ensure that data from various sources are received in the order that they are sent and that no data is being lost. Each message from each specific equipment will have some form of identification number, which is commonly simply an increasing counter. Each transmitter on the bus will have its own counter, and counters need not remain in sync with other transmitters. The receiver will process the data from each transmitter and ensure that data is handled correctly. As long as the counter from each specific transmitter is increasing and no values are missing (which would indicate that data is being lost) then the receiver is happy.

          Imagine, in the discussion above, that the counter suddenly goes from a value of 'n' to 0. To the receiver this indicates that something is wrong with the data. After all, the receiver was expecting n+1 but that is not what the transmitter sent. The receiver will take some kind of action to signal that the transmitter is not functioning correctly and will, in many avionic systems, ignore subsequent data from that transmitter. The software in the transmitter is at fault for not complying with some specification or another. But perhaps nobody ever thought that the transmitter would be expected to operate for more than 149 days...

          Why wouldn't this show up in testing? Well, an aircraft that is used for testing usually completes a test flight, records the data, and then lands. The data is downloaded and subsequently analysed. The aircraft is then powered down until its next flight, which might be days or more away. The transmitters are each restarted at the next power-up and so never run continuously for 149 days. But aircraft in service are frequently kept powered up for servicing or maintenance between flights. So it is not until an aircraft is operated in this way that the problem manifests itself. When the problem appears, the usual remedial action is to hot-swap the avionics concerned, which would mean the replacement is starting from a power-down status, the databus system is re-initialised, and everything works as expected. The 'faulty' box is returned for fault investigation but, when restarted, also appears to be working correctly, because the very act of moving it from the aircraft to the workshop has meant that its power has been cycled and thus it works as expected.

          Testing is very difficult because different black boxes report data at different rates and could, conceivably, use different lengths of counters. How long do you run a system for? Maybe 149 days, 365 days, 4 years (to take into account leap years etc.)?

          If you look up my bio details on this site you will see that I spent some of my career writing software for real-time military avionic systems. This sort of problem - which admittedly should never occur - is more common than you might imagine and is very difficult to diagnose.
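
The sequence-counter check described in this comment can be sketched roughly as follows. This is an illustration only: the 16-bit counter width and both function names are assumptions, not details of any real avionics bus.

```python
COUNTER_BITS = 16                 # assumed width, purely for illustration
MODULUS = 2**COUNTER_BITS

def naive_ok(prev: int, current: int) -> bool:
    """Accept only an exact +1 step; this rejects the legitimate wrap to 0."""
    return current == prev + 1

def wrap_aware_ok(prev: int, current: int) -> bool:
    """Accept a +1 step computed modulo the counter width."""
    return current == (prev + 1) % MODULUS

last_before_wrap = MODULUS - 1
print(naive_ok(last_before_wrap, 0))       # receiver flags the transmitter as faulty
print(wrap_aware_ok(last_before_wrap, 0))  # receiver treats the wrap as normal
```

With the naive check, a transmitter that has simply run long enough to wrap looks exactly like one that is losing data. Both ends have to agree on the counter width for the wrap-aware version to work.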

          • (Score: 2) by driverless on Saturday July 27 2019, @08:47AM

            by driverless (4770) on Saturday July 27 2019, @08:47AM (#871818)

            Done the same thing, although in my case it was slightly different, the system used a 64-bit counter and we were supposed to check for overflow. We ended up not checking for overflow because the chances that the check would be screwed up in some way were vastly, vastly higher than that of the counter ever actually overflowing.

            Sometimes less code is more...
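
A rough calculation shows why skipping the overflow check on a 64-bit counter can be a defensible trade-off. The tick rates below are assumptions chosen for illustration; the comment does not say what rate that system used.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_until_wrap(bits: int, tick_hz: float) -> float:
    """Years until a free-running counter of the given width wraps."""
    return (2**bits) / tick_hz / SECONDS_PER_YEAR

years_32_at_8khz = years_until_wrap(32, 8_000)          # about six days
years_64_at_8mhz = years_until_wrap(64, 8_000_000)      # tens of thousands of years
years_64_at_1ghz = years_until_wrap(64, 1_000_000_000)  # still centuries

print(f"32-bit @ 8 kHz: {years_32_at_8khz * 365.25:.1f} days")
print(f"64-bit @ 8 MHz: {years_64_at_8mhz:,.0f} years")
print(f"64-bit @ 1 GHz: {years_64_at_1ghz:,.0f} years")
```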

          • (Score: 2) by driverless on Saturday July 27 2019, @08:51AM

            by driverless (4770) on Saturday July 27 2019, @08:51AM (#871819)

            Oh, another thing, we typically found these odd problems in low-level components that we didn't control. The software itself was written extremely carefully - think something like MISRA on steroids, with some parts accompanied by formal PROMELA/SPIN proofs - but then you'd get some low-level transceiver with a tiny built-in state machine that the model saw as a black box which would end up in an unexpected state under some circumstances. So it was the low-level gunk you didn't directly control that ended up biting you.

            Point is, it may not be something that Airbus has any direct control over that's causing this.

          • (Score: 1) by wArlOrd on Saturday July 27 2019, @01:55PM (2 children)

            by wArlOrd (2142) on Saturday July 27 2019, @01:55PM (#871908)

            If one reviews the archives from the First Oil War, articles describing a similar situation with the Patriot Missile Batteries should be found.

    • (Score: 5, Insightful) by sshelton76 on Friday July 26 2019, @11:04AM

      by sshelton76 (7978) on Friday July 26 2019, @11:04AM (#871413)

      Wow, way to miss the boat completely.

      Yes, bullshit. If you actually went to any of these countries, you'd soon realize that food costs the same everywhere. Heck, some poor places have food more expensive than in US while still only making $200 or $300 per month. Now tell that to the elitists in America bitching they can't live off $5000/mo because "food costs"

      I've lived in a number of countries, your comment is bollocks. I can feed a family of 4 on $40 a week in most of the countries I've lived in outside the USA. (Mostly beans and rice, so they won't be starving at least) But in truth, my comment about food and rent was quasi-sarcastic and meant to drive home a point that you clearly missed. A box of crackers is $15 to $20 if you shop certain places or are buying bulk cuz you're feeding kids... https://www.samsclub.com/p/snack-box-pros-on-the-go-snack-box/prod21122959?xid=plp_product_1_34 [samsclub.com]

      Look on freelancer sometime, see what projects like this are actually bid at, you'll find plenty in the $20 and under range.
      https://www.freelancer.com/projects/firmware/experience-firmware-development-using/ [freelancer.com]

      Why these get filled predominantly by people in India and China is up in the air, but if you work for a living, food and shelter are usually your primary concerns, so if they are bidding that low for a project it must mean they are paying the bills right?

      Yeah, I think you are full of it. I see these mistakes too, but it has nothing to do with "freelancing" and "stackoverflow". Software is written by people, and people make mistakes. Lack of quality control is where the problem lies, not in your "skillz".

      My whole point was about lack of quality control, the fact you could read my post and not see that makes me realize replying to you is probably pointless. Not sure why I'm continuing other than to set the record straight. I never said anything about my "skillz", I only explained what I discovered while working to fix the system. Yeah people make mistakes.

      If I got a dime for every time I see some hot-shot 20-something thinking they are the best and don't make mistakes, I'd be retired already.

      It's cute you think I'm 20 something. I'm neither a hotshot, nor have I been 20 something for a few decades, but umm thanks?

      Furthermore, divide by 0 is not a stack overflow issue.

      Ya think Einstein??? But the counter "overflowed" triggering an eventual divide by zero that went unchecked. From there it all went to hell. Also I was being facetious, about "stackoverflow" being an appropriate name for that place. Which you would have realized if you had the ability to both read and comprehend instead of trying to be a pedant who is evidently overly worried about me casting shade on our industry's tendency to hire people who don't know WTF they are doing, because it's believed they can "still get the job done cheaper even if we have to toss ten of them at the project to get it done on time."

      So hot shot, WTF is the sampling timer doing turning 0? Oh wait, most likely, it turns negative on the overflow here. Seems the bug is in the timing logic, not the divide routine.

      Your lack of reading comprehension is frightening me. At this point the bug is clearly your ability to read and follow along with a description I would expect a High School student to be able to understand, which makes me question why you would feel qualified to comment, but I digress. I already explained this, but it comes down to the fact that the implementer was paid to implement a feature he didn't fully understand, but his contract only required that the code pass tests that were written by someone else.

      He implemented the sampling counter as an "int", because that was what the code he copied and pasted from used. Literally something along the lines of

      return i++;

      During testing he discovered his code was overflowing, producing a negative number. This manifested as the test suite failing and rejecting his output as out of range because it was negative. To get the code to pass testing he returned the absolute value of the counter instead of the raw value, while still keeping the original logic in place.

      return abs(i++);

      No one bothered to test the direction this thing was counting, so internally it was counting something like...
      +2,147,483,645
      +2,147,483,646
      +2,147,483,647
      −2,147,483,647
      −2,147,483,646
      −2,147,483,645
      ...
      But since x, i.e. the value being returned to the consuming function was the absolute value of the counter i++ the consumer was getting...
      2,147,483,645
      2,147,483,646
      2,147,483,647
      2,147,483,647
      2,147,483,646
      2,147,483,645
      ...
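
The behaviour described above can be simulated. The sketch below assumes a 32-bit two's-complement counter that wraps on overflow (strictly speaking, signed overflow is undefined in C, so real compilers may behave differently); `wrap_int32` and `counter_values` are hypothetical helpers. One detail the simulation brings out: the value immediately after +2,147,483,647 is −2,147,483,648, whose absolute value does not fit in a signed 32-bit int at all.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def wrap_int32(value: int) -> int:
    """Reduce a Python int to the signed 32-bit range (two's complement)."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN

def counter_values(start: int, steps: int):
    """Yield (raw, returned) pairs for a wrapping counter read through abs()."""
    raw = start
    for _ in range(steps):
        yield raw, abs(raw)      # models `return abs(i++);`
        raw = wrap_int32(raw + 1)

pairs = list(counter_values(INT32_MAX - 1, 4))
for raw, returned in pairs:
    print(f"{raw:>14,} -> {returned:,}")

# After the wrap the raw counter climbs from INT32_MIN back up toward 0,
# so the value handed to the consumer shrinks until it reaches 0.
ticks_from_wrap_to_zero = 0 - INT32_MIN
```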

      From there it was being used as the divisor in a complex equation meant to debounce sudden changes in luminosity by computing the rolling average.
      But for simplicity's sake let's just say it read

      z = y/x

      Because that's what the error boiled down to. Of course it isn't the entire function, nor even the bulk of the math involved.
      You can't make this shit up, but it does need to be simplified so folks can follow along.
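
The actual equation is not given in the thread, so the following is only a hypothetical stand-in: a plain cumulative mean, in which the sample counter is the divisor. It shows how a counter that reads 0 turns an otherwise harmless averaging step into a crash. `update_mean` is an invented name.

```python
def update_mean(mean: float, sample: float, count: int) -> float:
    """Cumulative mean after `count` samples; `count` is the divisor."""
    return mean + (sample - mean) / count

# Normal operation: the counter starts at 1 and only ever increases.
mean = 0.0
for count, sample in enumerate([10.0, 20.0, 30.0], start=1):
    mean = update_mean(mean, sample, count)

# If the counter ever reports 0, the same call raises ZeroDivisionError.
try:
    update_mean(mean, 40.0, 0)
    crashed = False
except ZeroDivisionError:
    crashed = True
```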

      I was brought in to troubleshoot after the thing had been in and out of service for about 90 days. I had to find the bug and for that I had to have conversations with the actual developers because the code didn't "look" buggy at first gloss, and it passed unit and integration tests that all seemed pretty well thought out.

      But when I analyzed it further I realized the mistake(s) and as I kept analyzing things, I found there were many, many more. I'm only regaling you with the ones relevant to the 149 hour topic.

      I had to find out why they made the decisions they made, because a big part of troubleshooting and recommending fixes is to make damn good and sure you don't recommend a fix that breaks something else. Finding this out requires communication and there were a lot of humorous and not so humorous misunderstandings involved, but at the end of the day the root cause was there were 2 people involved who didn't even know how to code, let alone engineer software and yes these are very different skills.

      Anyways, both items passed muster because...

      There was no ahead of time checking for x = 0, because a core design assumption was that there was no way x could ever be 0, it was just a simple counter after all. Its only job was counting how many times a sensor had been sampled.

      In fact it wasn't until I discovered the sensors were being sampled once a millisecond that I was turned onto the idea there might be an overflow somewhere, because I've seen poorly implemented millisecond accumulators outstrip the capacity for an int to hold them in the past. But this one manifested differently and did so in a way I felt was relevant to the topic at hand.

      As far as tests are concerned, there was also no ahead of time checking to ensure x2 > x1 because another core design assumption was that there was no way that x would fail to increment, because again it's just supposed to be "how many times has the sensor been sampled"
      There was nothing in the test suite checking either of these. Ergo, it passed and the implementer got paid.
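
The two missing checks could look something like the sketch below. This is not the project's test suite, just an illustration of the two assumptions (x is never 0, and x always increases) written as an explicit check; `validate_counter_stream` is a hypothetical name.

```python
def validate_counter_stream(values):
    """Return a list of problems found in a stream of sample-counter readings."""
    problems = []
    previous = None
    for index, value in enumerate(values):
        # Assumption 1: the counter is used as a divisor, so it must be positive.
        if value <= 0:
            problems.append(f"reading {index}: counter is {value}, unusable as a divisor")
        # Assumption 2: each reading must be strictly greater than the last.
        if previous is not None and value <= previous:
            problems.append(f"reading {index}: counter did not increase ({previous} -> {value})")
        previous = value
    return problems

healthy = validate_counter_stream([1, 2, 3])
after_wrap = validate_counter_stream([2147483646, 2147483647, 2147483647, 2147483646])
```

Feeding it the post-wrap values from the comment flags the repeated and decreasing readings long before the counter gets anywhere near 0.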

      The whole system would run trouble free for days, but once a week, the whole system would get cocked up.
      This was a multi-million dollar marquee and the trouble was eventually traced to a recent project to ensure the relative brightness of the sign could be seen in direct sunlight while not completely blinding people at night. But there were many other "upgrades" added at the same time so it did take time to dig in and start ruling things out.
      Nothing elite about it, but skill is skill and I've been at this game a long time.

      Now if you want an example of actual divide by 0 happening in software that ran on hundreds of millions of machines, just start Windows 98 in a modern VM and see a nice divide by zero BSOD in its delay loop calibration logic....

      Does it do it approx every 149 hours? Or do you mean to say the airplane in question is running windows 98?
      Because unless one of those is your point, your point is as irrelevant as this...
      https://www.youtube.com/watch?v=IW7Rqwwth84 [youtube.com]

      My point was only that programming is a bit of a dark art, and troubleshooting even more so; for many people it borders on black magic and they are content to simply "reboot it from time to time".

      If you're someone like me, someone who likes to delve into complex systems and figure out what broke, both in engineering terms and in people terms, then troubleshooting something like this is fun and interesting work, especially when you're talking about the sign over a casino in Las Vegas. Spending a month or so in Vegas on the casino's dime trying to track this down was a blast quite frankly, and it had nothing to do with "skillz" and everything to do with the fact that we were dealing with several people who genuinely had not learned the art of software engineering and a couple who didn't even have rudimentary coding skills. I had these conversations, hence the "long long" to "dragon dragon" translation quirk, which in retrospect is funny but at the time it was frustrating.

      But none of this is funny if we're talking about equipment meant to control systems where lives are on the line. It's life or death when you're talking about avionics, because avionics must be hard real-time and therefore should be done in SPARK/Ravenscar/Ada and not C or any dialect of it. These types of errors would have been caught in the prover if the code compiled at all. Sadly, safety-first, hard real-time languages take planning and real engineering skill to work with. Therefore, if it turns out they farmed this out to the lowest bidder, I'm going to stop flying.

  • (Score: 5, Touché) by isostatic on Friday July 26 2019, @09:02AM (2 children)

    by isostatic (365) on Friday July 26 2019, @09:02AM (#871382) Journal

    536,400 seconds, but what is the value in milliseconds?

    If you can't multiply a number by 1,000 you have no business commenting on complex software
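
    For anyone who wants the multiplication spelled out, here is a rough sketch of the arithmetic. The counter widths and the fixed-tick idea are assumptions for illustration only; nothing in the AD says what actually wraps, and `wrap_hours` and `hypothetical_tick` are made-up names.

```python
# Worked arithmetic for the 149-hour figure.
# ASSUMPTION: a fixed-rate tick counter stored in 32 bits. The AD does not
# state the cause, so this is only a plausibility check, not an explanation.

HOURS = 149
seconds = HOURS * 3600           # 149 h expressed in seconds
milliseconds = seconds * 1000    # the same span in milliseconds

INT32_MAX = 2**31 - 1            # largest signed 32-bit value
UINT32_MAX = 2**32 - 1           # largest unsigned 32-bit value


def wrap_hours(tick_seconds: float, max_count: int) -> float:
    """Hours until a counter bumped once per tick runs past max_count."""
    return (max_count + 1) * tick_seconds / 3600


# A millisecond counter still fits in a signed 32-bit integer at 149 hours.
ms_fits_int32 = milliseconds <= INT32_MAX

# When a millisecond counter would actually wrap, for each width.
hours_ms_int32 = wrap_hours(0.001, INT32_MAX)
hours_ms_uint32 = wrap_hours(0.001, UINT32_MAX)

# Hypothetical tick length that would make a signed 32-bit counter wrap
# at exactly 149 hours (a guess to test, not a documented value).
hypothetical_tick = seconds / (INT32_MAX + 1)

print(f"{seconds:,} s = {milliseconds:,} ms")
print(f"fits in int32 as ms: {ms_fits_int32}")
print(f"ms counter wraps after {hours_ms_int32:.1f} h (int32), "
      f"{hours_ms_uint32:.1f} h (uint32)")
print(f"tick needed for a 149 h int32 wrap: {hypothetical_tick * 1e6:.2f} us")
```

    So plain milliseconds in 32 bits don't get you to 149 hours; you would need a tick of roughly a quarter of a millisecond, which is a guess and not something the AD confirms.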

    • (Score: 0) by Anonymous Coward on Friday July 26 2019, @11:51AM (1 child)

      by Anonymous Coward on Friday July 26 2019, @11:51AM (#871420)

      But are those binary milliseconds, like in RAM, or decimal ones, like storage marketing uses to boost product size?
      And after formatting those milliseconds, how much time do you really have left?

      • (Score: 0) by Anonymous Coward on Saturday July 27 2019, @07:00AM

        by Anonymous Coward on Saturday July 27 2019, @07:00AM (#871805)

        Any proper wielder of the SI system knows to label those milbiseconds.

  • (Score: 2) by DannyB on Friday July 26 2019, @01:58PM

    by DannyB (5839) Subscriber Badge on Friday July 26 2019, @01:58PM (#871467) Journal

    > 149 hours is 536,400 seconds, but what is the value in milliseconds?

    Given all the attempts here to try to fit this into some kind of integer overflow, I'll go a different direction.

    Maybe they've looked at when the overflow would occur, looked at all of the flight schedules, and picked a standard reboot time that just happens NOT to be during any particular airline fright.

    Now we only hope none of the flights are delayed, or early.

    But everyone knows that the proper procedure to fix Windows bugs is to reinstall.

    --
    To transfer files: right-click on file, pick Copy. Unplug mouse, plug mouse into other computer. Right-click, paste.
  • (Score: 1) by VacuumTube on Friday July 26 2019, @04:26PM

    by VacuumTube (7693) on Friday July 26 2019, @04:26PM (#871538) Journal

    I don't know. . . By your calculation the system crashes at exactly 149 hours, and they take it right up to the wire and tell customers to reboot at the last possible ms? You may be right, but I'd like to think that where air travel is concerned they'd leave a little breathing room. Rebooting at 100 hours, or 125, would be safer.

  • (Score: 2, Funny) by Anonymous Coward on Friday July 26 2019, @04:42PM

    by Anonymous Coward on Friday July 26 2019, @04:42PM (#871544)

    why would silly Americans want Dragons or Dragon Dragons in their code?

    I'm an American and have been accused of being silly on multiple occasions. I want Dragons in my code so that I can have a block in any flow chart that says "Here there be Dragons", so much so I'm going to put that in place starting tomorrow.

  • (Score: 3, Funny) by istartedi on Friday July 26 2019, @10:00PM (3 children)

    by istartedi (123) on Friday July 26 2019, @10:00PM (#871637) Journal

    stackoverflow which at least in this case was a remarkably prescient name

    I must take exception to that.

    --
    Appended to the end of comments you post. Max: 120 chars.
    • (Score: 2) by sshelton76 on Saturday July 27 2019, @02:35AM

      by sshelton76 (7978) on Saturday July 27 2019, @02:35AM (#871726)

      Best joke of that day! You owe me a new keyboard. Mine is now covered in coffee.

    • (Score: 2) by driverless on Saturday July 27 2019, @08:55AM (1 child)

      by driverless (4770) on Saturday July 27 2019, @08:55AM (#871821)

      stackoverflow which at least in this case was a remarkably prescient name

      I must take exception to that.

      I can try that approach, but I just can't catch the exception you're referring to.

      • (Score: 3, Touché) by Bot on Saturday July 27 2019, @01:06PM

        by Bot (3902) on Saturday July 27 2019, @01:06PM (#871901) Journal

        finally, joke's over.

        --
        Account abandoned.
  • (Score: 2) by Rivenaleem on Monday July 29 2019, @01:19PM

    by Rivenaleem (3400) on Monday July 29 2019, @01:19PM (#872613)

    Double Long was a great game.