
SoylentNews is people

Meta
posted by martyb on Monday June 24 2019, @12:37PM
We are aware of issues when trying to access the site. First noticed at approx. 0300 UTC. Our servers look okay. It appears there may be issues with upstream connectivity.

Also, Linode is planning some server reboots over the next week or so. We will try to give advance notice and keep downtime to a minimum.

Update: Everything seems to have quieted down. Many many thanks to NotSanguine for jumping in and lending his expertise to help identify and isolate where things were borked.

Indications are that a bad BGP (Border Gateway Protocol) route was published causing a relatively small AS (Autonomous System) to have all traffic to/from a large fraction of the internet attempt to go through its routers.

Related Stories

Another BGP Outage Thanks to Verizon and a BGP Optimizer 4 comments

Another Monday and another BGP (Border Gateway Protocol) misconfiguration causing large parts of the Internet to stop working. More on the cause and effects from the Cloudflare blog How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline:

[Monday] at 10:30UTC, the Internet had a small heart attack. A small company in Northern Pennsylvania became a preferred path of many Internet routes through Verizon (AS701), a major Internet transit provider. This was the equivalent of Waze routing an entire freeway down a neighborhood street — resulting in many websites on Cloudflare, and many other providers, to be unavailable from large parts of the Internet. This should never have happened because Verizon should never have forwarded those routes to the rest of the Internet. To understand why, read on.

SoylentNews was also affected — alongside other prominent sites.



  • (Score: 2) by NotSanguine on Monday June 24 2019, @12:40PM (4 children)

    Based on discussions on IRC and my own experiences, the upstream network issue appears to be resolved.

    That doesn't mean it will stay that way, but hopefully all is good now.

    Would Soylentils please keep their eyes open for reports about this outage.

    I'd be very interested to hear what the cause might have been and what other networks were impacted.

    --
    No, no, you're not thinking; you're just being logical. --Niels Bohr
    • (Score: 5, Funny) by NPC-131072 on Monday June 24 2019, @01:01PM (3 children)

      by NPC-131072 (7144) on Monday June 24 2019, @01:01PM (#859328) Journal

      Route leak [cloudflarestatus.com] affecting CF, AWS and also big league players like SN.

      • (Score: 2) by inertnet on Monday June 24 2019, @02:21PM (2 children)

        by inertnet (4071) on Monday June 24 2019, @02:21PM (#859352) Journal

        That may explain why I got a cloudflare "I'm not a robot" screen, before I could connect to some site through a VPN. Which I didn't get without using the VPN.

        • (Score: 2, Funny) by NPC-131072 on Monday June 24 2019, @02:37PM (1 child)

          by NPC-131072 (7144) on Monday June 24 2019, @02:37PM (#859358) Journal

          I think that'll just be standard DOS countermeasure due to malicious traffic originating from the VPN netblock.

          • (Score: 3, Touché) by Thexalon on Monday June 24 2019, @11:22PM

            by Thexalon (636) on Monday June 24 2019, @11:22PM (#859537)

            As a Linux advocate, I am happy to see any countermeasures to DOS deployed widely!

            --
            The only thing that stops a bad guy with a compiler is a good guy with a compiler.
  • (Score: 2, Interesting) by Mer on Monday June 24 2019, @12:41PM (17 children)

    by Mer (8009) on Monday June 24 2019, @12:41PM (#859323)

    Seems to be a widespread problem. I can't access some other sites and services.
    I changed DNS, but it seems that's not it since SN is back up but not most of the others.

    --
    Shut up!, he explained.
    • (Score: 2) by Runaway1956 on Monday June 24 2019, @12:49PM (2 children)

      by Runaway1956 (2926) Subscriber Badge on Monday June 24 2019, @12:49PM (#859324) Journal

      From where I sit, traceroute and MTR made it look like Verizon was falling on its face. That may or may not mean anything at all, but that's where my traceroute was ending.

      • (Score: 2) by NotSanguine on Monday June 24 2019, @01:11PM (1 child)

        It's possible that Verizon was the source of the issue.

        However, it's quite possible it wasn't.

        Just because your traceroute died at a Verizon router, that doesn't mean it was Verizon. It's just as likely (perhaps more so) that the route leak (as reported here: https://www.cloudflarestatus.com/incidents/46z55mdhg0t5 [cloudflarestatus.com]) gave Verizon incorrect routing information and it flung your data the wrong way.

        Since this has affected a lot of folks and some of the big boys, we'll likely get some details in the press over the next day or so.

        --
        No, no, you're not thinking; you're just being logical. --Niels Bohr
        • (Score: 2) by pkrasimirov on Monday June 24 2019, @01:20PM

          by pkrasimirov (3358) Subscriber Badge on Monday June 24 2019, @01:20PM (#859331)

          > Verizon was falling on its face
          Perhaps yes.

          > Verizon was the source of the issue
          Not necessarily.

          It is possible to filter out bogus BGP announcements (if that was the problem cause). For example Google was working just fine.

    • (Score: 2) by NotSanguine on Monday June 24 2019, @02:09PM (13 children)

      Yep. DownDetector [downdetector.com] is showing widespread impact all around the same time SN was seeing issues.

      A heat map [downdetector.com] from DownDetector appears to show that the issues were worst in the US and Canada, as well as Western Europe.

      BGR [bgr.com] is also reporting on this.

      More reporting from The Verge [theverge.com]

      --
      No, no, you're not thinking; you're just being logical. --Niels Bohr
      • (Score: 3, Funny) by NPC-131072 on Monday June 24 2019, @02:28PM (12 children)

        by NPC-131072 (7144) on Monday June 24 2019, @02:28PM (#859356) Journal

        For those interested:

        Lawnmower man who just managed to light up the cell phone of every network and sysadmin in the world with network alerts is / was likely employed by a customer or peer of Level3.

        • (Score: 5, Informative) by NotSanguine on Monday June 24 2019, @03:06PM (11 children)

          Reading through the HN link posted [ycombinator.com], it appears that this *broad* outage was the result of a BGP [wikipedia.org] mis-configuration by Allegheny Technologies (ASN 396531).

          Apparently, they began broadcasting a route for their /24 network, but someone apparently fat-fingered the mask to be /4 instead of /24.

          For the less technical, that means they were broadcasting that a huge swath of the internet should be routed through their routers.
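Taking the /4-versus-/24 description above at face value, a quick sketch shows the difference in scale. The prefix used here is a documentation range picked purely for illustration, not the announcement from the incident, and `share_of_ipv4` is a made-up helper name.

```python
import ipaddress

# Hypothetical prefix (192.0.2.0/24 is reserved for documentation);
# it only illustrates the arithmetic, not the real announcement.
intended = ipaddress.ip_network("192.0.2.0/24")

# The same address with the mask mistyped as /4. strict=False lets the
# module zero the host bits, which yields 192.0.0.0/4.
mistyped = ipaddress.ip_network("192.0.2.0/4", strict=False)


def share_of_ipv4(net):
    """Fraction of the whole 32-bit IPv4 space covered by a network."""
    return net.num_addresses / 2**32


print(intended, intended.num_addresses)
print(mistyped, mistyped.num_addresses, share_of_ipv4(mistyped))
```

A /24 covers 256 addresses, while a /4 covers 2^28 of them, which is one sixteenth of all IPv4 space.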

          Apparently, their upstream ISP (there was some mention of Verizon in the link above, but I haven't seen confirmation of this) didn't do any validation and rebroadcast the incorrect route.

          Other upstream network providers apparently did so as well, causing network traffic for a significant portion of the internet to be pushed at a small company in Pittsburgh, PA.

          Once there, it was unceremoniously dropped in the bit bucket and likely overloaded their network interfaces as well.

          --
          No, no, you're not thinking; you're just being logical. --Niels Bohr
          • (Score: 2, Informative) by NPC-131072 on Monday June 24 2019, @03:49PM (8 children)

            by NPC-131072 (7144) on Monday June 24 2019, @03:49PM (#859378) Journal

            NANOG Writeup [nanog.org]

            • (Score: 4, Informative) by NotSanguine on Monday June 24 2019, @04:11PM (7 children)

              So it wasn't specifically Allegheny's fault.

              From the post you linked [nanog.org]:

              It appears that one of the implicated ASNs, AS 33154 "DQE Communications
              LLC" is listed as customer on Noction's website:
              https://www.noction.com/clients/dqe [noction.com]

              I suspect AS 33154's [peeringdb.com] customer AS 396531 [ipinfo.io] turned up a new circuit with Verizon, but didn't have routing policies to prevent sending routes from
              33154 to 701 [peeringdb.com] and vice versa, or their router didn't have support for RFC 8212 [ietf.org].

              So a cluster-fuck by two ISPs who should know better, and small company that didn't and relied on the "pros" to handle things properly. Lovely.

              --
              No, no, you're not thinking; you're just being logical. --Niels Bohr
              • (Score: 3, Funny) by NPC-131072 on Monday June 24 2019, @05:07PM (5 children)

                by NPC-131072 (7144) on Monday June 24 2019, @05:07PM (#859409) Journal

                Yup, working theory is that ATI Metals (Allegheny) had a new circuit connected, multihomed with DQE (existing) and Verizon (new). DQE ran a BGP optimizer, and these more specific routes then leaked from DQE via ATI to Verizon (who further propagated).
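Why leaked more-specific routes attract traffic comes down to longest-prefix matching: a router forwards on the most specific prefix that covers the destination, whichever path it arrived on. The following is only a sketch of that selection rule with a hypothetical table (documentation prefixes, invented next-hop labels), not a model of any real router.

```python
import ipaddress


def best_route(table, destination):
    """Return the (prefix, next_hop) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    # The longest prefix length wins, regardless of where the route came from.
    return max(matches, key=lambda entry: entry[0].prefixlen)


# Hypothetical table: one legitimate /24 plus two leaked /25 halves of it.
table = [
    (ipaddress.ip_network("198.51.100.0/24"), "legitimate-origin"),
    (ipaddress.ip_network("198.51.100.0/25"), "leaked-path"),
    (ipaddress.ip_network("198.51.100.128/25"), "leaked-path"),
]

print(best_route(table, "198.51.100.10"))
```

With the two /25s present, every address in the /24 follows the leaked path; drop them from the table and the same lookup returns the legitimate route.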

                Anybody expect the "fix" to this problem will "amazingly" be more centralized control of the internet?

                • (Score: 3, Insightful) by NotSanguine on Monday June 24 2019, @05:47PM (4 children)

                  Anybody expect the "fix" to this problem will "amazingly" be more centralized control of the internet?

                  Actually, no.

                  I'd expect requiring stricter adherence to RFC 8212, as well as something akin to Cloudflare's rPKI. In addition to those steps, I'd hope to see some standards around verifying that advertised routes actually *make sense* in their context before re-advertising them.

                  With the first and third items above, the major outage today would never have happened, and ATI Metals, DQE and Verizon would have dealt with this *before* any routes were advertised to the rest of the 'net.

                  Once such workable solutions are in place, with some lead time, I'd expect peers to reject BGP routes that aren't in compliance, thus eliminating most inadvertent *and* malicious BGP advertisements. No centralization required.
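As a rough illustration of the RFC 8212 idea mentioned above: on an EBGP session with no explicitly configured policy, nothing is accepted and nothing is advertised. The `Session` class, the function names and the string-based policy callables are all hypothetical; letting a policy-less IBGP session permit everything is an assumption of this sketch, since RFC 8212 only covers EBGP.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A policy is any callable that takes a prefix string and returns True to permit it.
Policy = Callable[[str], bool]


@dataclass
class Session:
    is_ebgp: bool
    import_policy: Optional[Policy] = None
    export_policy: Optional[Policy] = None


def may_import(session: Session, prefix: str) -> bool:
    """Decide whether a received route is eligible for selection."""
    if session.import_policy is None:
        # RFC 8212 behaviour: EBGP without an import policy accepts nothing.
        # Assumption: IBGP without a policy accepts everything.
        return not session.is_ebgp
    return session.import_policy(prefix)


def may_export(session: Session, prefix: str) -> bool:
    """Decide whether a route may be advertised to this neighbour."""
    if session.export_policy is None:
        # RFC 8212 behaviour: EBGP without an export policy advertises nothing.
        return not session.is_ebgp
    return session.export_policy(prefix)


# A freshly turned-up EBGP circuit with no policy leaks nothing by default.
new_circuit = Session(is_ebgp=True)
print(may_export(new_circuit, "198.51.100.0/25"))
```

Under this default, a new circuit brought up without routing policy would pass no routes in either direction until someone wrote one.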

                  There would be some corner cases that would likely still crop up, but we wouldn't be seeing this nearly as often.

                  Besides, it would be quite difficult, if not impossible to centralize something like BGP, since the whole point of the protocol is that it's decentralized.

                  What's more, there's no way you'd get the IETF to even *try* to centralize BGP. Have you ever been to an IETF meeting or participated in a working group? Not gonna happen.

                  --
                  No, no, you're not thinking; you're just being logical. --Niels Bohr
                  • (Score: 2) by NPC-131072 on Monday June 24 2019, @09:11PM (3 children)

                    by NPC-131072 (7144) on Monday June 24 2019, @09:11PM (#859503) Journal

                    There's a reason IANA doesn't have the root cert, isn't there? 5 RIR roots (trust anchors) would still be centralizing control over routing authority. The legislative reach argument about regional vs. national Internet Registries and balkanization ignores that any CA is by definition a centralized point of failure. WoT [wikipedia.org] to do?

                    • (Score: 2) by NotSanguine on Monday June 24 2019, @09:58PM (2 children)

                      Note, that I said "something akin to rPKI" not rPKI.

                      Cryptographic signatures can be useful *without* centralization.

                      Especially since verification needs to be done *between* peers/upstream/downstream providers, with signatures being confirmed to be valid by each peer, then updated again before being forwarded to the next set of peers. Which does not require anything top-down or centralized, just verification and trust between peers.

                      Why don't you write a protocol spec using RFC 7353 [ietf.org] that can be conformed with RFC 8212 rather than playing "gotcha" with me?

                      I'm sure we'll all appreciate your hard work. I look forward to reading your Internet Draft when you're done.

                      --
                      No, no, you're not thinking; you're just being logical. --Niels Bohr
                      • (Score: 2) by NPC-131072 on Tuesday June 25 2019, @12:33AM (1 child)

                        by NPC-131072 (7144) on Tuesday June 25 2019, @12:33AM (#859556) Journal

                        Note, that I said "something akin to rPKI" not rPKI.

                        Noted.

                        Why don't you write a protocol spec using RFC 7353 [ietf.org] that can be conformed with RFC 8212 rather than playing "gotcha" with me?

                        Wasn't playing "gotcha" but we've gone from origin to path validation. [ietf.org] Giving RIRs (or LIRs) the technical ability to revoke certs will surely make their politicization inevitable?
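For readers unfamiliar with the distinction, origin validation only asks whether the originating AS is authorized for a prefix of that length. Here is a minimal sketch in the spirit of RFC 6811, assuming a hypothetical ROA list with documentation prefixes and private-use AS numbers; the `ROA` type and `validate_origin` are illustrative names, not a real validator's API.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    prefix: ipaddress.IPv4Network  # authorized prefix
    max_length: int                # longest prefix length allowed
    asn: int                       # authorized origin AS


def validate_origin(roas, prefix, origin_asn):
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    # A ROA covers the route if the route sits inside the ROA's prefix.
    covering = [r for r in roas if route.subnet_of(r.prefix)]
    if not covering:
        return "not-found"
    for r in covering:
        if r.asn == origin_asn and route.prefixlen <= r.max_length:
            return "valid"
    return "invalid"


# Hypothetical data: AS 64500 may originate 203.0.113.0/24 and nothing longer.
roas = [ROA(ipaddress.ip_network("203.0.113.0/24"), 24, 64500)]

print(validate_origin(roas, "203.0.113.0/24", 64500))
print(validate_origin(roas, "203.0.113.0/25", 64500))
```

A more-specific longer than the maximum length is flagged even when the origin AS is unchanged, but an unmodified prefix leaked along the wrong path would still pass, which is the gap path validation is meant to close.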

                        • (Score: 2) by NotSanguine on Tuesday June 25 2019, @03:05AM

                          Wasn't playing "gotcha" but we've gone from origin to path validation. [ietf.org] Giving RIRs (or LIRs) the technical ability to revoke certs will surely make their politicization inevitable?

                          Fair enough.

                          But I haven't *gone* anywhere. I think you misunderstand me.

                          It behooves peers, as well as upstream/downstream providers, to play it straight with each other, *especially* when it comes to BGP, given that they *need* each other to carry/forward their network traffic as expeditiously and efficiently as possible.

                          There's no profit in refusing to verify a BGP update signature via the public key provided by a peer. Once such signature is verified, the receiving peer *should* verify that the routes make sense WRT routes currently being advertised by other peers (whose signatures they *also* verify). Once such a BGP update has been validated, the receiving peer needs to *re-sign* the update with its own private key, with the public key associated with it having been securely shared with *its* peers, then forward that update to its peers.

                          The next hop should do the same thing. Ad infinitum.
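The verify, sanity-check and re-sign loop described above can be sketched roughly as follows. One loud caveat: Python's standard library has no public-key signatures, so this stand-in uses HMAC with a shared key per link instead of the private/public key pairs described; a real design would sign with a private key and verify with the peer's public key. All names (`sign`, `verify`, `relay`, the keys, the update string) are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, update: bytes) -> bytes:
    """Stand-in signature: HMAC-SHA256 over the update with a per-link key."""
    return hmac.new(key, update, hashlib.sha256).digest()


def verify(key: bytes, update: bytes, tag: bytes) -> bool:
    """Constant-time check that the tag matches the update under this key."""
    return hmac.compare_digest(sign(key, update), tag)


def relay(update, tag, key_from_sender, own_key, looks_sane):
    """Verify the sender's tag, sanity-check the routes, then re-sign for the next hop.

    Returns the new tag, or None if the update should be dropped.
    """
    if not verify(key_from_sender, update, tag):
        return None  # not signed by the peer we trust on this link
    if not looks_sane(update):
        return None  # routes do not make sense in context
    return sign(own_key, update)


# Hypothetical keys for links A-B and B-C, and a toy update.
key_ab, key_bc = b"link-a-b", b"link-b-c"
update = b"announce 192.0.2.0/24"
tag_from_a = sign(key_ab, update)

# Hop B verifies A's tag, applies a (trivial) sanity check, and re-signs for C.
tag_for_c = relay(update, tag_from_a, key_ab, key_bc, lambda u: True)
print(tag_for_c is not None and verify(key_bc, update, tag_for_c))
```

Each hop needs key material only for its direct neighbours, which is the point being argued: trust is established link by link rather than from a single root.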

                          Given that these peers have a vested interest in maintaining those relationships, it's unclear to me why they would, unless it's warranted (e.g., malicious route updates, repeated errors in route updates, etc.), revoke the public key of a peer.

                          What "political" advantage would *anyone* get by doing so? All you'll end up doing is cutting off your nose to spite your face.

                          --
                          No, no, you're not thinking; you're just being logical. --Niels Bohr
              • (Score: 1, Interesting) by Anonymous Coward on Tuesday June 25 2019, @01:58AM

                by Anonymous Coward on Tuesday June 25 2019, @01:58AM (#859575)

                So a cluster-fuck by two ISPs who should know better, and small company that didn't and relied on the "pros" to handle things properly. Lovely.

                Additional information: DQE is the new (again) name for Duquesne Light, the incumbent electricity supplier to the Greater Pittsburgh area. The electric company has tons of right of way and easements to move cables around. An age ago I worked for their regional neighbor, who was also trying to play ISP at the time.

          • (Score: 0) by Anonymous Coward on Tuesday June 25 2019, @08:09AM (1 child)

            by Anonymous Coward on Tuesday June 25 2019, @08:09AM (#859645)

            So is this suspicious, or isn't it, because Russia and China aren't involved? ;)
            https://arstechnica.com/information-technology/2018/11/strange-snafu-misroutes-domestic-us-internet-traffic-through-china-telecom/ [arstechnica.com]

            • (Score: 2) by NotSanguine on Tuesday June 25 2019, @02:09PM

              Why don't you ask these folks [atimetals.com]? They're a pretty suspicious looking bunch, I'd say.

              Alex Jones is reporting that moments after this happened, large funds transfers from both Russian and Chinese banks were made to a day-care center in Lakeland, FL. and Netcraft confirms it! Definitely something evil going on, if you ask them.

              I'd take a look at this [zdnet.com] and save it offline somewhere, as it's likely to be taken off-line pretty soon to protect sources and methods.

              Since I cannot verify your security clearances, I can neither confirm nor deny the type of milk (skim, 1%, 2% or whole) I put in my coffee. That assumes I use milk at all. Black? With cream? Half and half? Do I even drink coffee?

              I'm sorry. You'll need to go through your handler for a query like this.

              I'll answer your question by citing Secret UN Resolution NWO-666 [wikipedia.org].

              Don't contact me again!

              --
              No, no, you're not thinking; you're just being logical. --Niels Bohr