
SoylentNews is people

posted by Fnord666 on Friday July 26 2019, @06:07AM
from the have-you-tried-turning-it-off-and-back-on-again? dept.

Submitted via IRC for Bytram

Airbus A350 software bug forces airlines to turn planes off and on every 149 hours

Some models of Airbus A350 airliners still need to be hard rebooted after exactly 149 hours, despite warnings from the EU Aviation Safety Agency (EASA) first issued two years ago.

In a mandatory airworthiness directive (AD) reissued earlier this week, EASA urged operators to turn their A350s off and on again to prevent "partial or total loss of some avionics systems or functions".

The revised AD, effective from tomorrow (26 July), exempts only those new A350-941s which have had modified software pre-loaded on the production line. For all other A350-941s, operators need to completely power the airliner down before it reaches 149 hours of continuous power-on time.

Concerningly, the original 2017 AD was brought about by "in-service events where a loss of communication occurred between some avionics systems and avionics network" (sic). The impact of the failures ranged from "redundancy loss" to "complete loss on a specific function hosted on common remote data concentrator and core processing input/output modules".

In layman's English, this means that prior to 2017, at least some A350s flying passengers were suffering unexplained failures of potentially flight-critical digital systems.

Not a power of two. I wonder why 149 hours?



 
  • (Score: 5, Insightful) by sshelton76 (7978) on Friday July 26 2019, @11:04AM (#871413)

    Wow, way to miss the boat completely.

    Yes, bullshit. If you had actually been to any of these countries, you'd soon realize that food costs the same everywhere. Heck, some poor places have food more expensive than in the US while people there still only make $200 or $300 per month. Now tell that to the elitists in America bitching they can't live off $5000/mo because of "food costs".

    I've lived in a number of countries, and your comment is bollocks. I can feed a family of 4 on $40 a week in most of the countries I've lived in outside the USA. (Mostly beans and rice, so they won't be starving at least.) But in truth, my comment about food and rent was quasi-sarcastic and meant to drive home a point that you clearly missed. A box of crackers is $15 to $20 if you shop certain places or are buying bulk cuz you're feeding kids... https://www.samsclub.com/p/snack-box-pros-on-the-go-snack-box/prod21122959?xid=plp_product_1_34 [samsclub.com]

    Look on freelancer sometime, see what projects like this are actually bid at, you'll find plenty in the $20 and under range.
    https://www.freelancer.com/projects/firmware/experience-firmware-development-using/ [freelancer.com]

    Why these get filled predominantly by people in India and China is up in the air, but if you work for a living, food and shelter are usually your primary concerns, so if they are bidding that low for a project it must mean they are paying the bills right?

    Yeah, I think you are full of it. I see these mistakes too, but it has nothing to do with "freelancing" and "stackoverflow". Software is written by people, and people make mistakes. Lack of quality control is where the problem lies, not in your "skillz".

    My whole point was about lack of quality control. The fact that you could read my post and not see that makes me realize replying to you is probably pointless. Not sure why I'm continuing other than to set the record straight. I never said anything about my "skillz"; I only explained what I discovered while working to fix the system. Yeah, people make mistakes.

    If I got a dime for every time I see some hot-shot 20-something thinking they are the best and don't make mistakes, I'd be retired already.

    It's cute you think I'm 20 something. I'm neither a hotshot, nor have I been 20 something for a few decades, but umm thanks?

    Furthermore, divide by 0 is not a stack overflow issue.

    Ya think, Einstein??? But the counter "overflowed", triggering an eventual divide by zero that went unchecked. From there it all went to hell. Also, I was being facetious about "stackoverflow" being an appropriate name for that place. Which you would have realized if you had the ability to both read and comprehend instead of trying to be a pedant who is evidently overly worried about me casting shade on our industry's tendency to hire people who don't know WTF they are doing, because it's believed they can "still get the job done cheaper even if we have to toss ten of them at the project to get it done on time."

    So hot shot, WTF is the sampling timer doing turning 0? Oh wait, most likely, it turns negative on the overflow here. Seems the bug is in the timing logic, not the divide routine.

    Your lack of reading comprehension is frightening me. At this point the bug is clearly your ability to read and follow along with a description I would expect a high school student to be able to understand, which makes me question why you would feel qualified to comment, but I digress. I already explained this, but it comes down to the fact that the implementer was paid to implement a feature he didn't fully understand, and his contract only required that the code pass tests that were written by someone else.

    He implemented the sampling counter as an "int", because that was what the code he copied and pasted from used. Literally something along the lines of

    return i++;

    During testing he discovered his code was overflowing, producing a negative number. This manifested as the test suite failing and rejecting his output as out of range because it was negative. To get the code to pass testing he returned the absolute value of the counter instead of the raw value, while still keeping the original logic in place.

    return abs(i++);

    No one bothered to test the direction this thing was counting, so internally it was counting something like...
    +2,147,483,645
    +2,147,483,646
    +2,147,483,647
    −2,147,483,647
    −2,147,483,646
    −2,147,483,645
    ...
    But since x, i.e. the value being returned to the consuming function, was the absolute value of the counter i++, the consumer was getting...
    2,147,483,645
    2,147,483,646
    2,147,483,647
    2,147,483,647
    2,147,483,646
    2,147,483,645
    ...
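For anyone who wants to poke at this, here is a minimal C sketch of the counting behavior described above. Everything in it is hypothetical: the names `wrap_to_int32`, `counter_next` and `sampled_count` are made up for illustration and are not from the marquee code. Two assumptions are baked in. The wraparound is modeled through unsigned arithmetic, because overflowing a signed `int` is undefined behavior in standard C. The absolute value is taken in 64 bits, because `abs(INT_MIN)` is undefined as well. On a typical two's-complement machine the first value after the wrap is -2,147,483,648, one step below where the listing above starts.

```c
#include <stdint.h>

/* Map a 32-bit pattern to the int32_t it represents in two's complement,
   without relying on an implementation-defined narrowing cast. */
static int32_t wrap_to_int32(uint32_t u) {
    if (u <= (uint32_t)INT32_MAX) {
        return (int32_t)u;
    }
    return (int32_t)((int64_t)u - 4294967296LL);
}

/* Hypothetical sample counter: returns the current value, then increments
   with 32-bit wraparound (a defined-behavior model of `return i++;`). */
static int32_t counter_next(int32_t *i) {
    int32_t current = *i;
    *i = wrap_to_int32((uint32_t)current + 1u);
    return current;
}

/* What the consumer received after the abs() patch. The result is widened
   to 64 bits (an assumption of this sketch) so |INT32_MIN| is representable. */
static int64_t sampled_count(int32_t *i) {
    int64_t raw = counter_next(i);
    return raw < 0 ? -raw : raw;
}
```

Start the counter near `INT32_MAX` and call `sampled_count` a few times: the value climbs to the maximum, then counts back down toward zero, which is the divisor the next stage eventually receives.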

    From there it was being used as the divisor in a complex equation meant to debounce sudden changes in luminosity by computing the rolling average.
    But for simplicity's sake let's just say it read

    z = y/x

    Because that's what the error boiled down to. Of course it isn't the entire function, nor even the bulk of the math involved.
    You can't make this shit up, but it does need to be simplified so folks can follow along.

    I was brought in to troubleshoot after the thing had been in and out of service for about 90 days. I had to find the bug, and for that I had to have conversations with the actual developers because the code didn't "look" buggy at first glance, and it passed unit and integration tests that all seemed pretty well thought out.

    But when I analyzed it further I realized the mistake(s) and as I kept analyzing things, I found there were many, many more. I'm only regaling you with the ones relevant to the 149 hour topic.

    I had to find out why they made the decisions they made, because a big part of troubleshooting and recommending fixes is to make damn good and sure you don't recommend a fix that breaks something else. Finding this out requires communication, and there were a lot of humorous and not so humorous misunderstandings involved. But at the end of the day the root cause was that there were 2 people involved who didn't even know how to code, let alone engineer software, and yes, these are very different skills.

    Anyways, both items passed muster because...

    There was no ahead-of-time checking for x = 0, because a core design assumption was that there was no way x could ever be 0; it was just a simple counter after all. Its only job was counting how many times a sensor had been sampled.
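The guard that was missing is small. Below is a minimal sketch, assuming the simplified `z = y/x` form used above; `checked_ratio` is a hypothetical name, and the real function was far more involved than a single division.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guarded version of z = y / x.
   Returns false and leaves *z untouched when the sample count x is not a
   positive number, so the caller can decide how to degrade instead of
   dividing by zero. */
static bool checked_ratio(int64_t y, int64_t x, int64_t *z) {
    if (z == NULL || x <= 0) {
        return false;
    }
    *z = y / x;
    return true;
}
```

Rejecting every non-positive count, rather than only zero, is a design choice of this sketch: a sample counter that reports a negative value is already broken, so the failure surfaces before it reaches the math.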

    In fact it wasn't until I discovered the sensors were being sampled once a millisecond that I was turned on to the idea there might be an overflow somewhere, because in the past I've seen poorly implemented millisecond accumulators outstrip the capacity of an int to hold them. But this one manifested differently and did so in a way I felt was relevant to the topic at hand.
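To put rough numbers on how fast a millisecond accumulator runs out of room, here is a small arithmetic sketch. It assumes a counter bumped at a fixed rate; `hours_until_count` is a hypothetical helper, and the limits passed to it are illustrative rather than taken from any real system.

```c
#include <stdint.h>

/* Hours of continuous running before a counter incremented
   samples_per_second times each second has counted `limit` samples. */
static double hours_until_count(uint64_t limit, double samples_per_second) {
    return (double)limit / samples_per_second / 3600.0;
}
```

At 1000 samples per second, a signed 32-bit counter runs out of positive range after 2^31 counts, which `hours_until_count(2147483648ULL, 1000.0)` puts at about 596.5 hours, or just under 25 days. Purely as arithmetic, 2^29 milliseconds comes to about 149.13 hours; nothing in the article says that is what happens inside the A350.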

    As far as tests are concerned, there was also no ahead-of-time checking to ensure x2 > x1, because another core design assumption was that there was no way that x would fail to increment; again, it's just supposed to be "how many times has the sensor been sampled".
    There was nothing in the test suite checking either of these. Ergo, it passed and the implementer got paid.
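The x2 > x1 check is equally small. This is a hedged sketch of the kind of test-suite helper that would have caught the problem; `strictly_increasing` is a hypothetical name, and it assumes the suite can capture a run of consecutive counter readings into an array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when every reading is greater than the one before it.
   An empty or single-element run is trivially increasing. */
static bool strictly_increasing(const int64_t *samples, size_t n) {
    for (size_t k = 1; k < n; k++) {
        if (samples[k] <= samples[k - 1]) {
            return false;
        }
    }
    return true;
}
```

Fed the consumer-side readings from around the wrap (2,147,483,646, then 2,147,483,647 twice, then 2,147,483,646 again), the check fails on the repeated value, before the count has a chance to run back down to zero.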

    The whole system would run trouble free for days, but about once a week it would get cocked up.
    This was a multi-million dollar marquee, and the trouble was eventually traced to a recent project to ensure the sign was bright enough to be seen in direct sunlight while not completely blinding people at night. But there were many other "upgrades" added at the same time, so it did take time to dig in and start ruling things out.
    Nothing elite about it, but skill is skill and I've been at this game a long time.

    Now if you want an example of actual divide by 0 happening in software that ran on hundreds of millions of machines, just start Windows 98 in a modern VM and see a nice divide by zero BSOD in its delay loop calibration logic...

    Does it do it approx every 149 hours? Or do you mean to say the airplane in question is running windows 98?
    Because unless one of those is your point, your point is as irrelevant as this...
    https://www.youtube.com/watch?v=IW7Rqwwth84 [youtube.com]

    My point was only that programming is a bit of a dark art, troubleshooting even more so. For many people it borders on black magic, and they are content to simply "reboot it from time to time".

    If you're someone like me, someone who likes to delve into complex systems and figure out what broke, both in engineering terms and in people terms, then troubleshooting something like this is fun and interesting work, especially when you're talking about the sign over a casino in Las Vegas. Spending a month or so in Vegas on the casino's dime trying to track this down was a blast, quite frankly, and it had nothing to do with "skillz" and everything to do with the fact that we were dealing with several people who genuinely had not learned the art of software engineering and a couple who didn't even have rudimentary coding skills. I had these conversations, hence the "long long" to "dragon dragon" translation quirk, which in retrospect is funny but at the time was frustrating.

    But none of this is funny if we're talking about equipment meant to control systems where lives are on the line. It's life or death when you're talking about avionics, because avionics must be hard real-time and therefore should be done in SPARK/Ravenscar/Ada and not C nor any dialect of it. These types of errors would have been caught in the prover, if the code compiled at all. Sadly, safety-first, hard real-time languages take planning and real engineering skill to work with. Therefore, if it turns out they farmed this out to the lowest bidder, I'm going to stop flying.
