Too many services depend not just on one cloud provider, but on one location:
Analysis Amazon's US-EAST-1 region outage caused widespread chaos, taking websites and services offline even in Europe and raising some difficult questions. After all, cloud operations are supposed to have some built-in resiliency, right?
The problems began just after midnight US Pacific Time today when Amazon Web Services (AWS) noticed increased error rates and latencies for multiple services running within its home US-EAST-1 region.
Within a couple of hours, Amazon's techies had identified DNS as a potential root cause of the issue – specifically the resolution of the DynamoDB API endpoint in US-EAST-1 – and were working on a fix.
However, it was affecting other AWS services, including global services and features that rely on endpoints operating from AWS' original region, such as IAM (Identity and Access Management) updates and DynamoDB global tables.
While Amazon worked to fully resolve the problem, the issue was already causing widespread chaos to websites and online services beyond the Northern Virginia locale of US-EAST-1, and even outside of America's borders.
As The Register reported earlier, Amazon.com itself was down for a time, while the company's Alexa smart speakers and Ring doorbells stopped working. But the effects were also felt by messaging apps such as Signal and WhatsApp, while in the UK, Lloyds Bank and even government services such as tax agency HMRC were impacted.
According to a BBC report, outage monitor Downdetector indicated there had been more than 6.5 million reports globally, with upwards of 1,000 companies affected.
How could this happen? Amazon has a global footprint, and its infrastructure is split into regions: physical locations, each housing a cluster of datacenters. Each region consists of a minimum of three isolated and physically separate availability zones (AZs), each with independent power, connected via redundant, ultra-low-latency networks.
Customers are encouraged to design their applications and services to run in multiple AZs to avoid being taken down by a failure in one of them.
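That multi-AZ guidance can be illustrated with a toy placement model. This is a minimal sketch, not how AWS actually schedules anything: the AZ names follow the real naming pattern, but the round-robin placement logic is invented for illustration. It shows why losing one AZ still leaves a majority of replicas running, and why that protection says nothing about a region-wide control-plane failure.

```python
# Minimal sketch of the multi-AZ guidance: spread replicas round-robin across
# a region's availability zones so losing any single AZ leaves the majority
# running. AZ names follow the real pattern; the placement logic is invented.

AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]

def place_replicas(count):
    """Assign each replica to an AZ, round-robin."""
    return [AZS[i % len(AZS)] for i in range(count)]

def surviving(placement, failed_az):
    """Replicas left running after one AZ fails."""
    return [az for az in placement if az != failed_az]

placement = place_replicas(6)
print(placement)
print(len(surviving(placement, "us-east-1a")))  # 4 of 6 replicas survive
```

The catch, as the rest of the article explains, is that this only protects against an AZ-level failure, not against a dependency on another region entirely.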
Sadly, it seems that the entire edifice has an Achilles heel that can cause problems regardless of how much redundancy you design into your cloud-based operations, at least according to the experts we asked.
"The issue with AWS is that US East is the home of the common control plane for all of AWS locations except the federal government and European Sovereign Cloud. There was an issue some years ago when the problem was related to management of S3 policies that was felt globally," Omdia Chief Analyst Roy Illsley told us.
He explained that US-EAST-1 can cause global issues because many users and services default to using it since it was the first AWS region, even if they are in a different part of the world.
Certain "global" AWS services or features are run from US-EAST-1 and are dependent on its endpoints, and this includes DynamoDB Global Tables and the Amazon CloudFront content delivery network (CDN), Illsley added.
Sid Nag, president and chief research officer for Tekonyx, agreed.
"Although the impacted region is in the AWS US East region, many global services (including those used in Europe) depend on infrastructure or control-plane / cross-region features located in US-EAST-1. This means that even if the European region was unaffected in terms of its own availability zones, dependencies could still cause knock-on impact," he said.
"Some AWS features (for example global account-management, IAM, some control APIs, or even replication endpoints) are served from US-EAST-1, even if you're running workloads in Europe. If those services go down or become very slow, even European workloads may be impacted," he added.
Any organization whose resiliency plans extend to duplicating resources across two or more different cloud platforms will no doubt be feeling smug right now, but that level of redundancy costs money, and don't the cloud providers keep telling us how reliable they are?
The upshot of this is that many firms will likely be taking another look at the assumptions underpinning their cloud strategy.
"Today's massive AWS outage is a visceral reminder of the risks of over-reliance on two dominant cloud providers, an outage most of us will have felt in some way," said Nicky Stewart, Senior Advisor at the Open Cloud Coalition.
Cloud services in the UK are largely dominated by AWS and Microsoft's Azure, with Google Cloud coming a distant third.
"It's too soon to gauge the economic fallout, but for context, last year's global CrowdStrike outage was estimated to have cost the UK economy between £1.7 and £2.3 billion ($2.3 and $3.1 billion). Incidents like this make clear the need for a more open, competitive and interoperable cloud market; one where no single provider can bring so much of our digital world to a standstill," she added.
"The AWS outage is yet another reminder of the weakness of centralised systems. When a key component of internet infrastructure depends on a single US cloud provider, a single fault can bring global services to their knees - from banks to social media, and of course the likes of Signal, Slack and Zoom," said Amandine Le Pape, Co-Founder of Element, which provides sovereign and resilient communications for governments.
But there could also be compensation claims in the offing, especially where financial transactions may have failed or missed deadlines because of the incident.
"An outage such as this can certainly open the provider and its users to risk of loss, especially businesses that rely on its infrastructure to operate critical services," said Henna Elahi, Senior Associate at Grosvenor Law.
Elahi added that it would, of course, depend on a number of factors, such as the terms of service and any service-level agreements between the business and AWS, and the specific causes of the outage and its severity and length.
"The impacts on Lloyds Bank, for example, could have very serious implications for the end user. Key payments and transfers that are being made may fail and this could lead to far reaching issues for a user such as causing breaches of contracts, failure to complete purchases and failure to provide security information. This may very well lead to customer complaints and attempts to recover any loss caused by the outage from the business," she said.
At 15.13 UTC today, AWS updated its Health Dashboard:
"We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."
Thirty minutes later, it added:
"We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services. We have also identified and are applying next steps to mitigate throttling of new EC2 instance launches."
(Score: 3, Insightful) by Revek on Friday October 24, @01:23PM (4 children)
It's always DNS.
(Score: 5, Informative) by krishnoid on Friday October 24, @02:32PM (2 children)
If you can get your server's IP address, it's always useful to keep it on hand. Whether or not you can ping it by IP separates the problem into two very different domains: DNS issues can typically be solved by editing a config file, while ping problems require changing a routing table or fiddling with actual hardware and/or tracing cables.
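That triage split can be sketched as a quick script. The classification strings and the known_ip fallback are illustrative, not a standard tool; the only real call is stdlib name resolution.

```python
import socket

def diagnose(host, known_ip=None):
    """Classify a 'site is down' report into DNS trouble vs network trouble.

    known_ip is the address you wrote down while things were healthy.
    """
    try:
        resolved = socket.gethostbyname(host)
    except socket.gaierror:
        # Name didn't resolve: a resolver/config-file problem. Fall back to
        # the IP you kept on hand, if you have one.
        return "dns-problem" if known_ip is None else f"dns-problem, try {known_ip}"
    if known_ip and resolved != known_ip:
        return f"dns-stale: got {resolved}, expected {known_ip}"
    # Name resolves fine; if pinging `resolved` still fails, suspect routing
    # tables, hardware, or cables rather than DNS.
    return f"resolves-to {resolved}"

print(diagnose("localhost", "127.0.0.1"))
print(diagnose("no-such-host.invalid"))  # .invalid never resolves
```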
(Score: 4, Informative) by VLM on Friday October 24, @05:50PM (1 child)
Cross protocol can be SO MUCH FUN
Say you can ping your DNS server at 1.2.3.4 and it correctly returns server address 5.6.7.8 and 2001::5678, and you can ping 5.6.7.8, so why no work? Using ipv4 it's all good.
Well, can you ping6 your DNS server at 2001::1234, and does it resolve over ipv6, and assuming it actually returns 2001::5678, can you ping6 the response 2001::5678?
Remember, BGP don't care: you can flap/suppress just one protocol's prefix if you want. BGP don't care if it advertises 1.2.0.0/16 but flap-suppresses 2001::/64, whatever.
And the infra CAN be multi-protocol, two address protocols on each interface, but there's no need: you can totally have an ipv4 DNS server and an ipv6 DNS server giving somewhat different responses, super hilarious when that happens.
Or for some insane reason someone set differing weird MED values on ipv4 and ipv6 prefixes so they route in different paths to the same destination and one path is broken ha ha funny.
Personally I dislike cross-protocol DNS servers because it's just asking for routing trouble. If you ask for an ipv4 addr over ipv6 it's just tempting fate. But then that opens a lovely opportunity for inconsistent responses where only ipv6 queries give the wrong old response, because someone restored a backup incorrectly or something.
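The inconsistent-answer scenario above can be sketched with fake zone data. Names and addresses here are invented; a real check would pull answers with dig or getaddrinfo against each server over each transport and compare them the same way.

```python
# Sketch of the consistency check: does the same name give matching answers
# for its A (IPv4) and AAAA (IPv6) records across two DNS servers?
# All zone data below is fake, using documentation prefixes.

ZONE_V4_SERVER = {  # answers from the IPv4-reachable DNS server
    "www.example.com": {"A": "5.6.7.8", "AAAA": "2001:db8::5678"},
}
ZONE_V6_SERVER = {  # answers from the IPv6-reachable DNS server
    "www.example.com": {"A": "5.6.7.8", "AAAA": "2001:db8::dead"},  # stale!
}

def inconsistencies(name, *zones):
    """Return the record types where the servers disagree."""
    bad = []
    for rtype in ("A", "AAAA"):
        answers = {zone[name][rtype] for zone in zones}
        if len(answers) > 1:
            bad.append((rtype, sorted(answers)))
    return bad

print(inconsistencies("www.example.com", ZONE_V4_SERVER, ZONE_V6_SERVER))
# the AAAA answers disagree, so that record type gets flagged
```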
If you think cross-protocol DNS is funny, try anycast headaches. Normally, having physically separate servers claim the same IP address is a bug, except when you do it on purpose and call it anycast.
It's almost traditional to set up DNS and NTP as anycast, but I've also seen ELK stacks "in the wild" (my former employer would be too embarrassed to admit they did this publicly, LOL, it was insane). My elasticsearch load balancer ingress at 1.2.3.4 is down but YOUR ES ingress at 1.2.3.4 in your building is up, ha ha, this is funny. It's all one happy ES cluster; there are little load-balanced clusters of nodes doing anycast in different buildings for client traffic on port 9200, although transport between nodes was done sanely over a different VLAN on port 9300, as is traditional for large ES clusters. It worked well until you had to troubleshoot outages at 2am.
(Score: 2) by gnuman on Friday October 24, @09:06PM
But that's a standard query? You ask for A and AAAA over whatever channel you want. As for old records and crap like that, DNSSEC FTW -- if you can keep the zone signed correctly, of course.
(Score: 2) by driverless on Saturday October 25, @03:14AM
Not always. The rest of the time it's certificates expiring.
(Score: 2) by VLM on Friday October 24, @05:32PM
Billing. My little consulting company uses AWS for my minimal cloudy needs (and for fun), and I think billing comes from US-EAST-1. Can't charge you for XYZ service if billing is down.
On one hand you don't want to accidentally multi-charge a customer for the same service, on the other hand you don't want to undercharge the customer.
Error-free distributed databases are hard, and in the case of AWS this involves BIG money. It's like losing a billing tape at AT&T back before divestiture.
I have no inside track but I bet things went bonkers when it was impossible to bill. Bet there's nice local caching and buffering that gradually filled up resulting in random distributed services slowly shutting down. If you had a map of service outages by location and time you'd have a map of product sales, as presumably every S3 site (etc) has the same size buffer for billing entries.
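That guess, and it is only a guess, can be sketched as a bounded buffer. Everything here is speculative illustration: the buffer size, the flush behavior, and the stop-serving rule are invented, not anything AWS has described.

```python
# Purely speculative sketch of the comment's guess: a service buffers billing
# records locally while the central biller is unreachable, and stops
# accepting new work once the buffer fills. All behavior here is invented.
from collections import deque

class Service:
    def __init__(self, buffer_size):
        self.billing_buffer = deque()
        self.buffer_size = buffer_size
        self.accepting = True

    def handle_request(self, billing_central_up):
        if billing_central_up:
            self.billing_buffer.clear()  # flush buffered records upstream
            self.accepting = True
        elif len(self.billing_buffer) >= self.buffer_size:
            self.accepting = False       # can't bill, so stop serving
        if self.accepting:
            self.billing_buffer.append("billing-record")
        return self.accepting

svc = Service(buffer_size=3)
results = [svc.handle_request(billing_central_up=False) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Under this toy model, busier services exhaust an identically sized buffer sooner, which is exactly why a map of outage times would sketch a map of sales volume.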
(Score: 5, Insightful) by Thexalon on Friday October 24, @06:45PM (2 children)
This division of AWS had layoffs about 3 months ago, and more of the organization has been losing its white-collar workforce by pointlessly demanding that staff go work in an office or else. I'm pretty sure that had something to do with it, and I doubt that will be listed in their internal root cause analysis, because for some reason "The boss was an idiot" isn't a valid reason for why something bad happened in most companies.
"Think of how stupid the average person is. Then realize half of 'em are stupider than that." - George Carlin
(Score: 2) by driverless on Saturday October 25, @03:16AM
Does anyone have figures for what percentage of the workforce there has been replaced by stochastic parrots marketed as "AI"?
(Score: 2) by pkrasimirov on Sunday October 26, @09:03AM
Not sure how much this is true: AI changed "acm-validations.aws" to "acm-validations.aws."
Totally fits with your observation on the layoffs.
(Score: 3, Insightful) by janrinok on Saturday October 25, @03:28AM (2 children)
If a potential enemy concentrates on US-East-1 and disables it, then a good proportion of the western world's internet falls down. Has nobody at AWS ever heard the adage "don't put all your eggs in one basket?"
Somebody deserves a serious ass-kicking for this structure.
[nostyle RIP 06 May 2025]
(Score: 0) by Anonymous Coward on Saturday October 25, @01:12PM
Centralization of infrastructure and monopolistic behavior by US tech corps are all by design, at the request of their client. If your eggs aren't in their basket, they start feeling very insecure.
(Score: 1) by khallow on Saturday October 25, @02:55PM
You would need to know the vulnerability exists first - which is now the case. My bet is that this slid through the cracks until now. Whether your observation stays true in the long term will depend on whether Amazon addresses the vulnerability.
A lot of engineering and similar fields work the hard way. They find the expensive, deadly problems when something expensive collapses and kills people. And then address it. I'm not so concerned about Amazon AWS system having a large vulnerability - unfortunately in complex systems, this happens. I'm more concerned about whether they'll learn from experience and address the whole class of problems.
If we see further centralization failures of this sort then that will be solid sign that they did not.
(Score: 2) by turgid on Saturday October 25, @08:35AM (2 children)
We did due diligence and best practice. We streamlined and outsourced. We did what was best for our investors. It was cheaper and better and more reliable than doing it ourselves. Then the whole company had to sit around twiddling its thumbs for a whole day. All we got back was $25 and not even an apology.
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 3, Touché) by Unixnut on Saturday October 25, @09:56AM (1 child)
To be fair, if "It was cheaper and better and more reliable than doing it ourselves", then even with this outage you would be better off than running your own infra. Outages can and do happen to everyone, it is just a matter of whether you have fewer or more outages with your chosen infrastructure compared to alternatives.
I don't know how true it is that the cloud is more reliable and better than on-prem infra. In my experience companies have had fewer problems running things in house than in the cloud. The difference is cloud is faster to scale, and they don't have to hire infra employees (or even have an "IT department" anymore). Considering how unpopular IT departments seem to be at non-tech companies, I would imagine they would be happy to pay more just not to have to deal with it.
(Score: 2) by turgid on Saturday October 25, @10:18AM
Suppose you have a team of Software Engineers working on Linux stuff and you tell them they must use AWS or Azure, end of?
I refuse to engage in a battle of wits with an unarmed opponent [wikipedia.org].
(Score: 2) by sjames on Saturday October 25, @08:13PM
...the easier it is to stop up the drain.