posted by n1 on Thursday October 23 2014, @10:57AM
from the person-you-are-trying-to-reach-is-not-available dept.

Brian Fung reports at the Washington Post that earlier this year emergency services went dark for over six hours for more than 11 million people across seven states. "The outage may have gone unnoticed by some, but for the more than 6,000 people trying to reach help, April 9 may well have been the scariest time of their lives." In a 40-page report (PDF), the FCC found that an entirely preventable software error was responsible for causing 911 service to drop. "It could have been prevented. But it was not," the FCC's report reads. "The causes of this outage highlight vulnerabilities of networks as they transition from the long-familiar methods of reaching 911 to [Internet Protocol]-supported technologies."

On April 9, the software responsible for assigning the identifying code to each incoming 911 call maxed out at a pre-set limit; the counter literally stopped counting at 40 million calls. As a result, the routing system stopped accepting new calls, leading to a bottleneck and a series of cascading failures elsewhere in the 911 infrastructure. Adm. David Simpson, the FCC's chief of public safety and homeland security, says that having a single backup does not provide the kind of reliability that is ideal for 911. “Miami is kind of prone to hurricanes. Had a hurricane come at the same time [as the multi-state outage], we would not have had that failover, perhaps. So I think there needs to be more [distribution of 911 capabilities].”
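
The failure pattern can be sketched roughly as follows; the names and structure below are illustrative assumptions, not Intrado's actual code. A call-ID counter with a hard pre-set ceiling silently stops assigning identifiers once the ceiling is hit, so new calls never reach the routing system.

    # Illustrative only: a call-ID assigner with a hard pre-set ceiling.
    # Once the counter hits the limit, new calls get no identifier and are
    # never handed to the routing system -- the bottleneck that cascaded
    # through the rest of the 911 infrastructure.

    PRESET_LIMIT = 40_000_000  # the ceiling the FCC report says was reached

    class CallIdAssigner:
        def __init__(self, limit: int = PRESET_LIMIT):
            self.limit = limit
            self.counter = 0

        def assign(self):
            """Return a unique ID for an incoming 911 call, or None at the cap."""
            if self.counter >= self.limit:
                return None  # call is silently rejected instead of routed
            self.counter += 1
            return self.counter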

 
  • (Score: 3, Insightful) by FatPhil on Thursday October 23 2014, @01:07PM

    by FatPhil (863) <pc-soylentNO@SPAMasdf.fi> on Thursday October 23 2014, @01:07PM (#109142) Homepage
    A software glitch is unexpected behaviour, typically caused by something out of the control of the software itself.

    If a counter maxed out legitimately, then whoever specified the range of that counter did it inappropriately. This is shoddy design.

    The resolution implies this is the case:
    """
    Intrado implemented a number of new features to fix the original problem with the PTM and to
    prevent recurrence of the same or similar problems. The most important changes include:
            • Significantly increasing the PTM counter limit for both ECMCs (i.e., in Englewood and in
                  Miami) to reduce the possibility of reaching the maximum threshold, and checking the PTM
                  counter value weekly to ensure the value is not nearing the higher, maximum threshold;
    """

    Which makes you wonder why an upper limit is necessary at all. Or if a limit is necessary, why not one that can never physically be achieved? Surely a limit of 2^64-1 would be set-once-and-forget?
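
    As a rough sanity check on the set-once-and-forget idea (my own back-of-the-envelope numbers, nothing from the report): even at an absurd million calls per second, a counter capped at 2^64-1 would take on the order of half a million years to max out.

        # Back-of-the-envelope check, assuming a deliberately absurd call rate.
        LIMIT = 2**64 - 1
        CALLS_PER_SECOND = 1_000_000
        SECONDS_PER_YEAR = 60 * 60 * 24 * 365

        years_to_exhaust = LIMIT / (CALLS_PER_SECOND * SECONDS_PER_YEAR)
        print(f"{years_to_exhaust:,.0f} years")  # roughly 585,000 years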
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  • (Score: 2) by Sir Garlon on Thursday October 23 2014, @02:11PM

    by Sir Garlon (1264) on Thursday October 23 2014, @02:11PM (#109164)

    Which makes you wonder why an upper limit is necessary at all. Or if a limit is necessary, why not one that can never physically be achieved? Surely a limit of 2^64-1 would be set-once-and-forget?

    In my experience, there is often a big gap between what a designer should have done in the first place (setting the limit to 2^64 - 1, or having no limit at all) and what is a practical patch to a running system. The fix needs to be done quickly, without breaking other parts of the system, and at limited cost and downtime. That means it's usually a nasty kluge.

    My concern is that "significantly" increasing the counter limit may, indeed, mean setting it to a satisfyingly large number, or it may mean increasing it by a factor of 2 or 5 ... just long enough for everyone who worked on the fix to retire or change jobs, so the problem can resurface after it's passed from living memory.

    --
    [Sir Garlon] is the marvellest knight that is now living, for he destroyeth many good knights, for he goeth invisible.
    • (Score: 2) by strattitarius on Thursday October 23 2014, @04:28PM

      by strattitarius (3191) on Thursday October 23 2014, @04:28PM (#109237) Journal
      So increase it by 2 or 5 or 2^64, then set up a graceful failure. Let the call through, send a ton of alerts to admins, programmers, hell, send it to 911 itself:

      Operator: "911, what is your emergency?"
      Automated Voice: "The 911 system is broken and you can't identify where calls are coming from - go ahead and check the screen for this call. Error 12AB34."

      In a system like 911, there should always be a way for the main action (the call) to occur, even if all the supporting actions (the locator service, call recording, etc.) fail.
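
      Something like the following sketch is what I mean (the subsystem names are hypothetical, purely for illustration): the call is always routed, and the supporting services that fail just degrade and raise alarms.

          # Illustrative sketch: the main action (routing the call) always
          # happens; failing support services degrade gracefully and alert.
          import logging
          import uuid

          logging.basicConfig(level=logging.INFO)
          log = logging.getLogger("e911")

          class SupportServiceError(Exception):
              """Raised by any non-essential subsystem (tracking, locator, ...)."""

          def assign_tracking_id():
              raise SupportServiceError("PTM counter at maximum")  # stand-in failure

          def locate_caller():
              raise SupportServiceError("locator offline")         # stand-in failure

          def route_to_psap(call_id, location):
              print(f"Connecting call {call_id} (location: {location or 'unknown'})")

          def handle_incoming_call():
              try:
                  call_id = assign_tracking_id()
              except SupportServiceError as exc:
                  call_id = f"fallback-{uuid.uuid4()}"  # degrade, don't drop the call
                  log.critical("Tracking failed (%s); routing anyway", exc)
              try:
                  location = locate_caller()
              except SupportServiceError as exc:
                  location = None
                  log.critical("Locator failed (%s) for call %s", exc, call_id)
              # The one step that must never be skipped: connect caller to an operator.
              route_to_psap(call_id, location)

          handle_incoming_call()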
      --
      Slashdot Beta Sucks. Soylent Alpha Rules. News at 11.
    • (Score: 1) by TK-421 on Thursday October 23 2014, @04:45PM

      by TK-421 (3235) on Thursday October 23 2014, @04:45PM (#109245) Journal

      I agree with your experience and can't resist adding my own. So a limit was set; in and of itself that was a mistake. These mistakes happen and they are going to happen. The fact that the mistake was made doesn't surprise me. What does surprise me is that several hours passed (some of the six hours includes the time spent identifying and handling the problem) between the TDM-to-IP system failing to route calls to the PSAPs and anyone having the thought that it was odd they weren't receiving their normal share of the 40+ million calls coming into the system. That was seven states and an unknown number of agents handling the calls. Let's assume 42 million calls and an even distribution across the seven states. That means roughly 6 million calls per state had been handled and then all of a sudden stopped coming in. I've worked in those environments (contact centers), not as an agent but as a solution provider, and the agents start to notice pretty quickly when they don't get calls. Did they alert their supervisors and get ignored? Did they just think, "Cool! Downtime! I can get paid for sitting!"? I couldn't blame them for the latter; for the amount they get paid I would probably sit when I had the chance, but I would at least alert the supervisor. My conscience would be clear then.

      Back to mistakes happening... was there seriously no monitoring at the Englewood data center? Did no one notice... frick, I had to dig deeper and bother reading the supplemental documents from the TFA links. There was monitoring, and it was badly implemented. The severity of the alerts was left at the default, and in this case inappropriately so. It took a full hour for the alerts to reach the NOC. When the NOC got the alerts, they had no idea what they really meant. This is just the testimony of Englewood. There are contracted carriers galore in this system (again from TFA), and their individual alerts were in no better shape.

      TFA tries to blame the IP migration from TDM (I read it that way at least), and I couldn't disagree more. The larger problem is the Englewood NOC and the management who staffed it. I think we can all agree that 911 is a top-priority service. If your NOC can't get a basic understanding of a problem in less than an hour, then you haven't staffed and/or trained your NOC properly. For crying out loud, escalate the damned thing to the next tier, where hopefully there is someone with the ability to understand the alerts. You cannot contract out what I call the "steady hand on the throttle": the person or persons who have functional knowledge of the whole system and can perform the necessary sanity checks at the right time.
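
      To make that concrete, here is a rough sketch of the alerting discipline I mean (the event names, tiers, and timeouts are my own assumptions, not what Intrado or its carriers actually run): classify alerts by what they threaten instead of leaving a default severity, and escalate automatically when a tier doesn't acknowledge in time.

          # Illustrative alert routing: explicit severities and automatic
          # escalation instead of default-severity alerts that sit for an hour.
          SEVERITY = {
              "call_routing_halted": "critical",   # calls are being dropped right now
              "ptm_counter_near_limit": "major",   # trouble is hours away
              "disk_space_low": "minor",
          }

          ESCALATION_PATH = ["noc_tier1", "noc_tier2", "on_call_engineer"]
          ACK_TIMEOUT_MINUTES = 5                  # not the hour it actually took

          def raise_alert(event, acknowledged_by):
              """Page each tier in turn until someone acknowledges the alert."""
              severity = SEVERITY.get(event, "critical")  # unknown events fail loud
              for tier in ESCALATION_PATH:
                  print(f"[{severity.upper()}] {event}: paging {tier} "
                        f"(ack within {ACK_TIMEOUT_MINUTES} min)")
                  if acknowledged_by(tier):
                      return
              print(f"[{severity.upper()}] {event}: escalation path exhausted")

          # Example: tier 1 can't interpret the alert, tier 2 acknowledges it.
          raise_alert("call_routing_halted", lambda tier: tier == "noc_tier2")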

      • (Score: 2) by Sir Garlon on Thursday October 23 2014, @08:15PM

        by Sir Garlon (1264) on Thursday October 23 2014, @08:15PM (#109347)

        The 911 system is just one example of critical software that was badly implemented and is therefore a house of cards. We cannot blame the government that commissioned the system, because non-engineers can't be expected to properly define the fault-tolerance requirements. As for the senior engineers and project managers who implemented the system, there's some serious incompetence and I daresay negligence there. The system was probably implemented by the lowest bidder.

        If this bothers you (it does me), you might want to take a look at the I Am The Cavalry [iamthecavalry.org] movement.

        --
        [Sir Garlon] is the marvellest knight that is now living, for he destroyeth many good knights, for he goeth invisible.
        • (Score: 1) by TK-421 on Friday October 24 2014, @01:52AM

          by TK-421 (3235) on Friday October 24 2014, @01:52AM (#109442) Journal

          I checked the link, and I wasn't disappointed. Thanks for sharing.

  • (Score: 0) by Anonymous Coward on Thursday October 23 2014, @06:44PM

    by Anonymous Coward on Thursday October 23 2014, @06:44PM (#109296)

    > A software glitch is unexpected behaviour, typically caused by something out of the control of the software itself.

    Your nit-picking is unhelpful because *your* personal definition of a software glitch describes something that does not exist.

    All bugs are created by humans because all software is created by humans, be they typos, logic errors, or design errors. They are all human failures manifested in software.