
SoylentNews is people

posted by Fnord666 on Friday July 26 2019, @06:07AM
from the have-you-tried-turning-it-off-and-back-on-again? dept.

Submitted via IRC for Bytram

Airbus A350 software bug forces airlines to turn planes off and on every 149 hours

Some models of Airbus A350 airliners still need to be hard rebooted after exactly 149 hours, despite warnings from the EU Aviation Safety Agency (EASA) first issued two years ago.

In a mandatory airworthiness directive (AD) reissued earlier this week, EASA urged operators to turn their A350s off and on again to prevent "partial or total loss of some avionics systems or functions".

The revised AD, effective from tomorrow (26 July), exempts only those new A350-941s which have had modified software pre-loaded on the production line. For all other A350-941s, operators need to completely power the airliner down before it reaches 149 hours of continuous power-on time.

Concerningly, the original 2017 AD was brought about by "in-service events where a loss of communication occurred between some avionics systems and avionics network" (sic). The impact of the failures ranged from "redundancy loss" to "complete loss on a specific function hosted on common remote data concentrator and core processing input/output modules".

In layman's English, this means that prior to 2017, at least some A350s flying passengers were suffering unexplained failures of potentially flight-critical digital systems.

Not a power of two. I wonder why 149 hours?



 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2, Insightful) by Anonymous Coward on Friday July 26 2019, @08:27AM (10 children)

    by Anonymous Coward on Friday July 26 2019, @08:27AM (#871375)

    Now imagine you're a low paid coder in say India or China and you're building this really cool system and for your efforts you are being paid almost enough to buy a box of crackers in the USA, but hey that's rent and food for the next week there.

    Yes, bullshit. If you had actually been to any of these countries, you'd soon realize that food costs the same everywhere. Heck, some poor places have food more expensive than in the US while people there still make only $200 or $300 per month. Now tell that to the elitists in America bitching they can't live off $5000/mo because of "food costs".

    Also not joking, it was caused by the company accepting the multi-million dollar contract, farming the work out to a "partner", who farmed it out to a subsidiary who farmed it out to some temp workers some in India and some in China they found on a freelancing website, all of whom evidently knew how to talk the talk but had no idea what they were doing. In most cases, they had copied and pasted the code directly from stackoverflow which at least in this case was a remarkably prescient name.

    Yeah, I think you are full of it. I see these mistakes too, but it has nothing to do with "freelancing" and "stackoverflow". Software is written by people, and people make mistakes. Lack of quality control is where the problem lies, not in your "skillz".

    If I got a dime for every time I see some hot-shot 20-something thinking they are the best and don't make mistakes, I'd be retired already.

    Furthermore, divide by 0 is not a stack overflow issue.

    z = y/x

    Wherein x is your ms accurate sampling timer.

    So hot shot, WTF is the sampling timer doing turning 0? Oh wait, most likely, it turns negative on the overflow here. Seems the bug is in the timing logic, not the divide routine.

    Now if you want an example of actual divide by 0 happening in software that ran on hundreds of millions of machines, just start Windows 98 in a modern VM and see a nice divide by zero BSOD in its delay loop calibration logic....

  • (Score: 3, Insightful) by janrinok on Friday July 26 2019, @09:12AM (8 children)

    by janrinok (52) Subscriber Badge on Friday July 26 2019, @09:12AM (#871384) Journal

    WTF is the sampling timer doing turning 0?

    Go back and read what he wrote. To overcome a testing failure the software was (probably) modified to return the absolute value. At some point in an integer's range it will return 0.

    • (Score: 1, Informative) by Anonymous Coward on Friday July 26 2019, @09:39AM

      by Anonymous Coward on Friday July 26 2019, @09:39AM (#871386)

      WTF is the sampling timer doing turning 0?

      Go back and read what he wrote. To overcome a testing failure the software was (probably) modified to return the absolute value. At some point in an integer's range it will return 0.

      abs(-30) =

      I leave this as an exercise for the reader.

    • (Score: 0) by Anonymous Coward on Friday July 26 2019, @11:44AM (6 children)

      by Anonymous Coward on Friday July 26 2019, @11:44AM (#871419)

      Yeah, the absolute value will return 0 at the exact same point as the original one did. So what?

      • (Score: 5, Informative) by janrinok on Friday July 26 2019, @01:04PM (5 children)

        by janrinok (52) Subscriber Badge on Friday July 26 2019, @01:04PM (#871433) Journal

        Aircraft avionic equipment sends data around the system using a data bus. There are several proprietary buses that are commonly used and it doesn't really matter which one is used, but I believe Airbus use [aviationtoday.com] the Mil-Std-1553 [wikipedia.org] variant.

        When sending data via a bus there has to be some form of data validation, i.e. the system must ensure that data from various sources are received in the order that they are sent and that no data is being lost. Each message from each specific equipment will have some form of identification number, which is commonly simply an increasing counter. Each transmitter on the bus will have its own counter, and counters need not remain in sync with other transmitters. The receiver will process the data from each transmitter and ensure that data is handled correctly. As long as the counter from each specific transmitter is increasing and no values are missing (which would indicate that data is being lost) then the receiver is happy.

        Imagine, as in the discussion above, that the counter suddenly goes from a value of 'n' to 0. To the receiver this indicates that something is wrong with the data. After all, the receiver was expecting n+1 but that is not what the transmitter sent. The receiver will take some kind of action to signal that the transmitter is not functioning correctly and will, in many avionic systems, ignore subsequent data from that transmitter. The software in the transmitter is at fault for not complying with some specification or another. But perhaps nobody ever thought that the transmitter would be expected to operate for more than 149 hours...
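
        A minimal sketch of the receiver-side check described above, assuming a 16-bit sequence counter and a receiver that latches a fault on the first mismatch. The names rx_state and rx_accept are hypothetical and are not taken from any real databus standard or avionics API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-transmitter receiver state (illustrative names only). */
typedef struct {
    uint16_t expected;  /* next sequence number the receiver expects */
    bool     faulted;   /* latched once a mismatch has been seen */
} rx_state;

/* Returns true if the message is accepted, false if it is rejected. */
bool rx_accept(rx_state *rx, uint16_t seq) {
    if (rx->faulted) {
        return false;               /* transmitter is being ignored */
    }
    if (seq != rx->expected) {
        rx->faulted = true;         /* gap or unexpected reset: latch fault */
        return false;
    }
    rx->expected = (uint16_t)(seq + 1u);  /* 65535 + 1 wraps to 0 by design */
    return true;
}
```

        In this sketch a legitimate wrap from 65535 to 0 is accepted because both ends count modulo 2^16, while a counter that drops to 0 from any other value latches the fault, and every later message from that transmitter is ignored until the state is re-initialised.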

        Why wouldn't this show up in testing? Well, an aircraft that is used for testing usually completes a test flight, records the data, and then lands. The data is downloaded and subsequently analysed. The aircraft is then powered down until its next flight, which might be days or more away. The transmitters are each restarted at the next power-up and so never run continuously for 149 hours. But aircraft in service are frequently kept powered up for servicing or maintenance between flights. So it is not until an aircraft is operated in this way that the problem manifests itself. When the problem appears, the usual remedial action is to hot-swap the avionics concerned, which means the replacement starts from a powered-down state, the databus system is re-initialised, and everything works as expected. The 'faulty' box is returned for fault investigation but, when restarted, also appears to be working correctly, because the very act of moving it from the aircraft to the workshop has meant that its power has been cycled and thus it works as expected.

        Testing is very difficult because different black boxes report data at different rates and could, conceivably, use counters of different lengths. How long do you run a system for? Maybe 149 hours, 365 days, 4 years (to take into account leap years etc.)?

        If you look up my bio details on this site you will see that I spent some of my career writing software for real-time military avionic systems. This sort of problem - which admittedly should never occur - is more common than you might imagine and is very difficult to diagnose.

        • (Score: 2) by driverless on Saturday July 27 2019, @08:47AM

          by driverless (4770) on Saturday July 27 2019, @08:47AM (#871818)

          Done the same thing, although in my case it was slightly different, the system used a 64-bit counter and we were supposed to check for overflow. We ended up not checking for overflow because the chances that the check would be screwed up in some way were vastly, vastly higher than that of the counter ever actually overflowing.

          Sometimes less code is more...

        • (Score: 2) by driverless on Saturday July 27 2019, @08:51AM

          by driverless (4770) on Saturday July 27 2019, @08:51AM (#871819)

          Oh, another thing, we typically found these odd problems in low-level components that we didn't control. The software itself was written extremely carefully - think something like MISRA on steroids, with some parts accompanied by formal PROMELA/SPIN proofs - but then you'd get some low-level transceiver with a tiny built-in state machine that the model saw as a black box which would end up in an unexpected state under some circumstances. So it was the low-level gunk you didn't directly control that ended up biting you.

          Point is, it may not be something that Airbus has any direct control over that's causing this.

        • (Score: 1) by wArlOrd on Saturday July 27 2019, @01:55PM (2 children)

          by wArlOrd (2142) on Saturday July 27 2019, @01:55PM (#871908)

          If one reviews the archives from the First Oil War, articles describing a similar situation with the Patriot Missile Batteries should be found.
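
          For reference, the commonly cited analysis of the Patriot failure describes clock drift rather than a counter wrap: uptime was counted in tenths of a second and scaled by a value of 0.1 that had been chopped to fit a fixed-point register. The sketch below reproduces that arithmetic. The 23 fractional bits and the function name are assumptions taken from that published analysis, not from this thread.

```c
/* Drift, in seconds, after `hours` of uptime when every 0.1 s tick is
   scaled by 0.1 chopped to 23 fractional bits (assumed register format). */
double patriot_drift_seconds(double hours) {
    const double exact   = 0.1;
    const double chopped = 838860.0 / 8388608.0;  /* floor(0.1 * 2^23) / 2^23 */
    double ticks = hours * 3600.0 * 10.0;         /* ten ticks per second */
    return ticks * (exact - chopped);
}
```

          After 100 hours of continuous operation this works out to roughly a third of a second, and, as with the A350, a power cycle resets it.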

  • (Score: 5, Insightful) by sshelton76 on Friday July 26 2019, @11:04AM

    by sshelton76 (7978) on Friday July 26 2019, @11:04AM (#871413)

    Wow, way to miss the boat completely.

    Yes, bullshit. If you had actually been to any of these countries, you'd soon realize that food costs the same everywhere. Heck, some poor places have food more expensive than in the US while people there still make only $200 or $300 per month. Now tell that to the elitists in America bitching they can't live off $5000/mo because of "food costs".

    I've lived in a number of countries, your comment is bollocks. I can feed a family of 4 on $40 a week in most of the countries I've lived in outside the USA. (Mostly beans and rice, so they won't be starving at least.) But in truth, my comment about food and rent was quasi-sarcastic and meant to drive home a point that you clearly missed. A box of crackers is $15 to $20 if you shop certain places or are buying bulk cuz you're feeding kids... https://www.samsclub.com/p/snack-box-pros-on-the-go-snack-box/prod21122959?xid=plp_product_1_34 [samsclub.com]

    Look on freelancer sometime, see what projects like this are actually bid at, you'll find plenty in the $20 and under range.
    https://www.freelancer.com/projects/firmware/experience-firmware-development-using/ [freelancer.com]

    Why these get filled predominantly by people in India and China is up in the air, but if you work for a living, food and shelter are usually your primary concerns, so if they are bidding that low for a project it must mean they are paying the bills right?

    Yeah, I think you are full of it. I see these mistakes too, but it has nothing to do with "freelancing" and "stackoverflow". Software is written by people, and people make mistakes. Lack of quality control is where the problem lies, not in your "skillz".

    My whole point was about lack of quality control, the fact you could read my post and not see that makes me realize replying to you is probably pointless. Not sure why I'm continuing other than to set the record straight. I never said anything about my "skillz", I only explained what I discovered while working to fix the system. Yeah people make mistakes.

    If I got a dime for every time I see some hot-shot 20-something thinking they are the best and don't make mistakes, I'd be retired already.

    It's cute you think I'm 20 something. I'm neither a hotshot, nor have I been 20 something for a few decades, but umm thanks?

    Furthermore, divide by 0 is not a stack overflow issue.

    Ya think Einstein??? But the counter "overflowed" triggering an eventual divide by zero that went unchecked. From there it all went to hell. Also I was being facetious, about "stackoverflow" being an appropriate name for that place. Which you would have realized if you had the ability to both read and comprehend instead of trying to be a pedant who is evidently overly worried about me casting shade on our industry's tendency to hire people who don't know WTF they are doing, because it's believed they can "still get the job done cheaper even if we have to toss ten of them at the project to get it done on time."

    So hot shot, WTF is the sampling timer doing turning 0? Oh wait, most likely, it turns negative on the overflow here. Seems the bug is in the timing logic, not the divide routine.

    Your lack of reading comprehension is frightening me. At this point the bug is clearly your ability to read and follow along with a description I would expect a High School student to be able to understand, which makes me question why you would feel qualified to comment, but I digress. I already explained this, but it comes down to the fact that the implementer was paid to implement a feature he didn't fully understand, but his contract only required that the code pass tests that were written by someone else.

    He implemented the sampling counter as an "int", because that was what the code he copied and pasted from used. Literally something along the lines of

    return i++;

    During testing he discovered his code was overflowing, producing a negative number. This manifested as the test suite failing and rejecting his output as out of range because it was negative. To get the code to pass testing he returned the absolute value of the counter instead of the raw value, while still keeping the original logic in place.

    return abs(i++);

    No one bothered to test the direction this thing was counting, so internally it was counting something like...
    +2,147,483,645
    +2,147,483,646
    +2,147,483,647
    −2,147,483,647
    −2,147,483,646
    −2,147,483,645
    ...
    But since x, i.e. the value being returned to the consuming function, was the absolute value of the counter i++, the consumer was getting...
    2,147,483,645
    2,147,483,646
    2,147,483,647
    2,147,483,647
    2,147,483,646
    2,147,483,645
    ...
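
    The sequence above can be reproduced with a short sketch. It assumes a 32-bit two's-complement counter and keeps the raw count unsigned, because signed overflow from i++ and abs() of the most negative int are both undefined behaviour in C. The helper names are hypothetical. Strictly, the first wrapped value is −2,147,483,648, which has no positive 32-bit counterpart, so the sketch widens to 64 bits before negating.

```c
#include <stdint.h>

/* Interpret 32 raw counter bits as a two's-complement signed value,
   widened to 64 bits so that every result is representable. */
int64_t as_signed32(uint32_t raw) {
    return (raw & 0x80000000u) ? (int64_t)raw - 4294967296LL
                               : (int64_t)raw;
}

/* What the consumer would see after the abs() workaround. */
int64_t reported_count(uint32_t raw) {
    int64_t s = as_signed32(raw);
    return s < 0 ? -s : s;
}
```

    Fed a steadily increasing raw count, reported_count climbs just past 2,147,483,647, then falls back towards 0, which it reaches once the raw count has wrapped all the way around.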

    From there it was being used as the divisor in a complex equation meant to debounce sudden changes in luminosity by computing the rolling average.
    But for simplicity's sake let's just say it read

    z = y/x

    Because that's what the error boiled down to. Of course it isn't the entire function, nor even the bulk of the math involved.
    You can't make this shit up, but it does need to be simplified so folks can follow along.

    I was brought in to troubleshoot after the thing had been in and out of service for about 90 days. I had to find the bug and for that I had to have conversations with the actual developers because the code didn't "look" buggy at first gloss, and it passed unit and integration tests that all seemed pretty well thought out.

    But when I analyzed it further I realized the mistake(s) and as I kept analyzing things, I found there were many, many more. I'm only regaling you with the ones relevant to the 149 hour topic.

    I had to find out why they made the decisions they made, because a big part of troubleshooting and recommending fixes is to make damn good and sure you don't recommend a fix that breaks something else. Finding this out requires communication and there were a lot of humorous and not so humorous misunderstandings involved, but at the end of the day the root cause was there were 2 people involved who didn't even know how to code, let alone engineer software and yes these are very different skills.

    Anyways, both items passed muster because...

    There was no ahead of time checking for x = 0, because a core design assumption was that there was no way x could ever be 0, it was just a simple counter after all. Its only job was in counting how many times a sensor had been sampled.

    In fact it wasn't until I discovered the sensors were being sampled once a millisecond that I was turned onto the idea there might be an overflow somewhere, because I've seen poorly implemented millisecond accumulators outstrip the capacity for an int to hold them in the past. But this one manifested differently and did so in a way I felt was relevant to the topic at hand.
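
    The arithmetic behind that suspicion is easy to check. The sketch below, with a hypothetical helper name, computes how long a counter lasts before it runs out of bits at a given tick rate. The tick rates are illustrative assumptions, not figures from either system discussed here.

```c
/* Hours until a counter with `bits` usable bits runs out when it is
   incremented `tick_hz` times per second. */
double hours_until_wrap(unsigned bits, double tick_hz) {
    double ticks = 1.0;
    for (unsigned i = 0; i < bits; i++) {
        ticks *= 2.0;               /* ticks = 2^bits */
    }
    return ticks / tick_hz / 3600.0;
}
```

    With 31 usable bits (a signed 32-bit int) and one tick per millisecond, the counter lasts about 596.5 hours, or roughly 24.9 days. The same counter would need about four ticks per millisecond to run out at roughly 149.1 hours.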

    As far as tests are concerned, there was also no ahead of time checking to ensure x2 > x1 because another core design assumption was that there was no way that x would fail to increment, because again it's just supposed to be "how many times has the sensor been sampled"
    There was nothing in the test suite checking either of these. Ergo, it passed and the implementer got paid.
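
    A minimal sketch of the two missing checks, assuming the divisor is a sample count and the test suite can see a series of consecutive readings. The function names are hypothetical and the real rolling-average code is not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the z = y / x step: refuses a zero or negative divisor
   instead of assuming the sample count can never be 0. */
bool safe_average(int64_t sum, int64_t count, int64_t *out) {
    if (count <= 0) {
        return false;
    }
    *out = sum / count;
    return true;
}

/* The check the test suite lacked: consecutive counter readings must
   strictly increase (x2 > x1). */
bool is_strictly_increasing(const int64_t *readings, int n) {
    for (int i = 1; i < n; i++) {
        if (readings[i] <= readings[i - 1]) {
            return false;
        }
    }
    return true;
}
```

    Either check alone would have flagged the abs() workaround: the first when the count reached 0, the second as soon as the reported count stopped rising.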

    The whole system would run trouble free for days, but once a week, the whole system would get cocked up.
    This was a multi-million dollar marquee and the trouble was eventually traced to a recent project to ensure the relative brightness of the sign could be seen in direct sunlight while not completely blinding people at night. But there were many other "upgrades" added at the same time so it did take time to dig in and start ruling things out.
    Nothing elite about it, but skill is skill and I've been at this game a long time.

    Now if you want an example of actual divide by 0 happening in software that ran on hundreds of millions of machines, just start Windows 98 in a modern VM and see a nice divide by zero BSOD in its delay loop calibration logic....

    Does it do it approx every 149 hours? Or do you mean to say the airplane in question is running Windows 98?
    Because unless one of those is your point, your point is as irrelevant as this...
    https://www.youtube.com/watch?v=IW7Rqwwth84 [youtube.com]

    My point was only that programming is a bit of a dark art, and troubleshooting even more so; for many people it borders on black magic and they are content to simply "reboot it from time to time".

    If you're someone like me, someone who likes to delve into complex systems and figure out what broke, both in engineering terms and in people terms, then troubleshooting something like this is fun and interesting work, especially when you're talking about the sign over a casino in Las Vegas. Spending a month or so in Vegas on the casino's dime trying to track this down was a blast quite frankly, and it had nothing to do with "skillz" and everything to do with the fact that we were dealing with several people who genuinely had not learned the art of software engineering and a couple who didn't even have rudimentary coding skills. I had these conversations, hence the "long long" to "dragon dragon" translation quirk, which in retrospect is funny but at the time was frustrating.

    But none of this is funny if we're talking about equipment meant to control systems where lives are on the line. It's life or death when you're talking about avionics because avionics must be hard realtime and therefore should be done in Spark/Ravenscar/Ada and not C nor any dialect of it. These types of errors would have been caught in the prover if the code compiled at all. Sadly, safety first, hard real-time languages take planning and real engineering skill to work with. Therefore if it turns out they farmed this out to the lowest bidder I'm going to stop flying.