
SoylentNews is people

posted by Fnord666 on Friday July 26 2019, @06:07AM   Printer-friendly
from the have-you-tried-turning-it-off-and-back-on-again? dept.

Submitted via IRC for Bytram

Airbus A350 software bug forces airlines to turn planes off and on every 149 hours

Some models of Airbus A350 airliners still need to be hard rebooted after exactly 149 hours, despite warnings from the EU Aviation Safety Agency (EASA) first issued two years ago.

In a mandatory airworthiness directive (AD) reissued earlier this week, EASA urged operators to turn their A350s off and on again to prevent "partial or total loss of some avionics systems or functions".

The revised AD, effective from tomorrow (26 July), exempts only those new A350-941s which have had modified software pre-loaded on the production line. For all other A350-941s, operators need to completely power the airliner down before it reaches 149 hours of continuous power-on time.

Concerningly, the original 2017 AD was brought about by "in-service events where a loss of communication occurred between some avionics systems and avionics network" (sic). The impact of the failures ranged from "redundancy loss" to "complete loss on a specific function hosted on common remote data concentrator and core processing input/output modules".

In layman's English, this means that prior to 2017, at least some A350s flying passengers were suffering unexplained failures of potentially flight-critical digital systems.

Not a power of two. I wonder why 149 hours?
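
One purely arithmetic observation, offered as speculation rather than anything the AD states: 149 hours is not a power of two, but it is very close to 2**29 milliseconds. The sketch below assumes a hypothetical counter that ticks once per millisecond; the function name `hours_until_wrap` and the `tick_ms` parameter are illustrative and do not come from the directive or from Airbus.

```python
MS_PER_HOUR = 3_600_000  # milliseconds in one hour


def hours_until_wrap(width_bits, tick_ms=1):
    """Hours until an unsigned counter of width_bits wraps, at one tick per tick_ms."""
    return (2 ** width_bits) * tick_ms / MS_PER_HOUR


# Assumed 1 ms tick: a 29-bit range is exhausted after a little over 149 hours.
hours_29 = hours_until_wrap(29)
print(f"2**29 ms is about {hours_29:.2f} hours")
```

Under that assumption a reboot limit of 149 hours would sit just below the wrap point, but the tick rate and counter width here are guesses.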



 
This discussion has been archived. No new comments can be posted.
  • (Score: 3, Insightful) by janrinok on Friday July 26 2019, @09:12AM (8 children)

    by janrinok (52) on Friday July 26 2019, @09:12AM (#871384)

    WTF is the sampling timer doing turning 0?

    Go back and read what he wrote. To overcome a testing failure the software was (probably) modified to return the absolute value. At some point in an integer's range it will return 0.

  • (Score: 1, Informative) by Anonymous Coward on Friday July 26 2019, @09:39AM

    by Anonymous Coward on Friday July 26 2019, @09:39AM (#871386)

    WTF is the sampling timer doing turning 0?

    Go back and read what he wrote. To overcome a testing failure the software was (probably) modified to return the absolute value. At some point in an integer's range it will return 0.

    abs(-30) =

    I leave this as an exercise for the reader.

  • (Score: 0) by Anonymous Coward on Friday July 26 2019, @11:44AM (6 children)

    by Anonymous Coward on Friday July 26 2019, @11:44AM (#871419)

    Yeah, the absolute value will return 0 at the exact same point as the original one did. So what?

    • (Score: 5, Informative) by janrinok on Friday July 26 2019, @01:04PM (5 children)

      by janrinok (52) on Friday July 26 2019, @01:04PM (#871433)

      Aircraft avionic equipment sends data around the system using a data bus. There are several proprietary buses in common use and it doesn't really matter which one is used, but I believe Airbus use [aviationtoday.com] the Mil-Std-1553 [wikipedia.org] variant.

      When sending data via a bus there has to be some form of data validation, i.e. the system must ensure that data from various sources is received in the order in which it was sent and that no data is being lost. Each message from each specific equipment will have some form of identification number, which is commonly simply an increasing counter. Each transmitter on the bus will have its own counter, and the counters need not remain in sync with those of other transmitters. The receiver will process the data from each transmitter and ensure that data is handled correctly. As long as the counter from each specific transmitter is increasing and no values are missing (which would indicate that data is being lost) then the receiver is happy.

      Imagine, in the situation discussed above, that the counter suddenly goes from a value of 'n' to 0. To the receiver this indicates that something is wrong with the data. After all, the receiver was expecting n+1 but that is not what the transmitter sent. The receiver will take some kind of action to signal that the transmitter is not functioning correctly and will, in many avionic systems, ignore subsequent data from that transmitter. The software in the transmitter is at fault for not complying with some specification or another. But perhaps nobody ever thought that the transmitter would be expected to operate for more than 149 hours...
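
To make the counter check concrete, here is a minimal sketch of the receiver behaviour described above. It is not taken from any real avionics bus or from Airbus software: the class `SequenceChecker`, the helper `transmit`, and the tiny 3-bit counter width are all hypothetical, chosen only so that the wrap happens within a few messages.

```python
class SequenceChecker:
    """Receiver-side check of one transmitter's message counter (illustrative only)."""

    def __init__(self):
        self.expected = None   # next counter value expected; None until the first message
        self.faulted = False   # once set, all later data from this transmitter is ignored

    def accept(self, counter):
        """Return True if the message is accepted, False if it is rejected."""
        if self.faulted:
            return False
        if self.expected is not None and counter != self.expected:
            # The receiver wanted n+1 and got something else: flag the transmitter.
            self.faulted = True
            return False
        self.expected = counter + 1
        return True


def transmit(count, width_bits, start=0):
    """Yield count counter values from a transmitter whose counter wraps at 2**width_bits."""
    modulus = 1 << width_bits
    value = start
    for _ in range(count):
        yield value
        value = (value + 1) % modulus


checker = SequenceChecker()
# Assumed 3-bit counter: values run 0..7 and then wrap back to 0.
results = [checker.accept(c) for c in transmit(count=10, width_bits=3)]
print(results)
```

The first eight messages are accepted; the wrap to 0 is rejected and so is everything after it, until the checker is re-created, which is the software equivalent of cycling the power.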

      Why wouldn't this show up in testing? Well, an aircraft that is used for testing usually completes a test flight, records the data, and then lands. The data is downloaded and subsequently analysed. The aircraft is then powered down until its next flight, which might be days or more away. The transmitters are each restarted at the next power-up and so never run continuously for 149 hours. But aircraft in service are frequently kept powered up for servicing or maintenance between flights. So it is not until an aircraft is operated in this way that the problem manifests itself. When the problem appears, the usual remedial action is to hot-swap the avionics concerned, which means the replacement starts from a powered-down state, the databus system is re-initialised, and everything works as expected. The 'faulty' box is returned for fault investigation but, when restarted, also appears to be working correctly, because the very act of moving it from the aircraft to the workshop means that its power has been cycled and thus it works as expected.

      Testing is very difficult because different black boxes report data at different rates and could, conceivably, use counters of different lengths. How long do you run a system for? Maybe 149 hours, 365 days, 4 years (to take into account leap years etc.)?

      If you look up my bio details on this site you will see that I spent some of my career writing software for real-time military avionic systems. This sort of problem - which admittedly should never occur - is more common than you might imagine and is very difficult to diagnose.

      • (Score: 2) by driverless on Saturday July 27 2019, @08:47AM

        by driverless (4770) on Saturday July 27 2019, @08:47AM (#871818)

        Done the same thing, although in my case it was slightly different: the system used a 64-bit counter and we were supposed to check for overflow. We ended up not checking for overflow because the chances that the check would be screwed up in some way were vastly, vastly higher than that of the counter ever actually overflowing.

        Sometimes less code is more...
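
A rough back-of-the-envelope check of why a 64-bit counter is unlikely ever to overflow. The increment rates below are assumptions picked for illustration (the comment above does not say how fast the counter ran), and `seconds_to_overflow` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_to_overflow(width_bits, increments_per_second):
    """Seconds until an unsigned counter of width_bits wraps at the given increment rate."""
    return (2 ** width_bits) / increments_per_second


# Assumed rate: one billion increments per second on a 64-bit counter.
years_64 = seconds_to_overflow(64, 1_000_000_000) / SECONDS_PER_YEAR
# For contrast, an assumed 32-bit counter incremented once per millisecond.
days_32 = seconds_to_overflow(32, 1_000) / SECONDS_PER_DAY

print(f"64-bit at 1 GHz: about {years_64:.0f} years")
print(f"32-bit at 1 kHz: about {days_32:.1f} days")
```

Even at a billion increments per second the 64-bit counter lasts centuries, while a narrower counter at a modest rate can wrap within weeks.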

      • (Score: 2) by driverless on Saturday July 27 2019, @08:51AM

        by driverless (4770) on Saturday July 27 2019, @08:51AM (#871819)

        Oh, another thing, we typically found these odd problems in low-level components that we didn't control. The software itself was written extremely carefully - think something like MISRA on steroids, with some parts accompanied by formal PROMELA/SPIN proofs - but then you'd get some low-level transceiver with a tiny built-in state machine that the model saw as a black box which would end up in an unexpected state under some circumstances. So it was the low-level gunk you didn't directly control that ended up biting you.

        Point is, it may not be something that Airbus has any direct control over that's causing this.

      • (Score: 1) by wArlOrd on Saturday July 27 2019, @01:55PM (2 children)

        by wArlOrd (2142) on Saturday July 27 2019, @01:55PM (#871908)

        If one reviews the archives from the First Oil War, articles describing a similar situation with the Patriot Missile Batteries should be found.