
posted by janrinok on Wednesday May 22 2024, @01:01PM
from the D'oh! dept.

https://arstechnica.com/gadgets/2024/05/google-cloud-accidentally-nukes-customer-account-causes-two-weeks-of-downtime/

Buried under the news from Google I/O this week is one of Google Cloud's biggest blunders ever: Google's Amazon Web Services competitor accidentally deleted a giant customer account for no reason. UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service. UniSuper thankfully had some backups with a different provider and was able to recover its data, but according to UniSuper's incident log, downtime started May 2, and a full restoration of services didn't happen until May 15.

UniSuper's website is now full of must-read admin nightmare fuel about how this all happened. First is a wild page posted on May 8 titled "A joint statement from UniSuper CEO Peter Chun, and Google Cloud CEO, Thomas Kurian." This statement reads, "Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper's Private Cloud services ultimately resulted in the deletion of UniSuper's Private Cloud subscription. This is an isolated, 'one-of-a-kind occurrence' that has never before occurred with any of Google Cloud's clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again."

[...] A June 2023 press release touted UniSuper's big cloud migration to Google, with Sam Cooper, UniSuper's Head of Architecture, saying, "With Google Cloud VMware Engine, migrating to the cloud is streamlined and extremely easy. It's all about efficiencies that help us deliver highly competitive fees for our members."

[...] The second must-read document in this whole saga is the outage update page, which contains 12 statements as the cloud devs worked through this catastrophe. The first update is May 2 with the ominous statement, "You may be aware of a service disruption affecting UniSuper's systems." UniSuper immediately seemed to have the problem nailed down, saying, "The issue originated from one of our third-party service providers, and we're actively partnering with them to resolve this." On May 3, Google Cloud publicly entered the picture with a joint statement from UniSuper and Google Cloud saying that the outage was not the result of a cyberattack.

[...] The joint statement and the outage updates are still not a technical post-mortem of what happened, and it's unclear if we'll get one. Google PR confirmed in multiple places it signed off on the statement, but a great breakdown from software developer Daniel Compton points out that the statement is not just vague, it's also full of terminology that doesn't align with Google Cloud products. The imprecise language makes it seem like the statement was written entirely by UniSuper. It would be nice to see a real breakdown of what happened from Google Cloud's perspective, especially when other current or potential customers are going to keep a watchful eye on how Google handles the fallout from this.

Anyway, don't put all your eggs in one cloud basket.


Original Submission

 
This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 5, Interesting) by VLM on Wednesday May 22 2024, @02:14PM (1 child)

    by VLM (445) on Wednesday May 22 2024, @02:14PM (#1357808)

    an inadvertent misconfiguration during provisioning

    Yeah pretty vague and unclear who's to blame. Note how carefully they don't assign blame in the press release or even the journalist coverage. I got a feeling this is one of those he-said she-said disputes about someone's cloud-hosted K8S clusters.

    Anyone else mess with K8S? I do. For money, even.

    The problem with K8S is its CLI is modal like vi. So test cluster #3 is all F-ed up and you want to wipe and start over from scratch. So you use one of a zillion CLI methods to connect to test cluster #3 and use kubectl or a helpful wrapper or maybe a script to utterly wipe the F-ing thing. Whoop-a-daisy step number 1 failed and you're actually still connected to the Prod cluster and the backup collection which you just perma-wiped. Everyone who does K8S stuff is either a noob who's faking it, or lying, or willing to admit that at least one time they ran 'kubectl delete' on the wrong namespace or even the entirely wrong cluster.
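
    To make the modal trap concrete, the failure looks something like this (the cluster name and the wipe script are made up for illustration, and the guard is just a sketch):

        # Step 1: switch contexts. If this errors out (typo, expired creds, missing kubeconfig entry)...
        kubectl config use-context test-cluster-3
        # Step 2: ...this still runs against whatever context was already current, e.g. prod.
        ./wipe-cluster.sh

        # A cheap guard: refuse to run the wipe unless the current context is the one you meant to nuke.
        current="$(kubectl config current-context)"
        [ "$current" = "test-cluster-3" ] || { echo "refusing: connected to $current"; exit 1; }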

    Another good way to blow up a K8S system is to extensively automate the F out of a deployment so you can go from bare empty K8S to a reasonable, scalable production cluster. On the test cluster you scale or autoscale to exactly one pod per deployment to save resources, but in production you'd have the autoscaler spawn off like 50 pods for each deployment, all good, right? So you mess around with the PROD cluster for whatever reason, next task is redeploying a fresh test cluster, so you connect to the test cluster, run your bare-metal wipe-and-deploy for test clusters; oh shit, my modal connection to testcluster.yaml failed, I'm still connected to PROD. Well, that's a mess alright. That's why you keep backups or don't have your storage class auto-delete old PVCs. Unless you also wiped your backups (because who needs backups of temporary test clusters?) and your storage class wipes old PVs for security or "saving money".
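
    The PV angle is the quiet killer: whether the actual disk survives a wipe comes down to the reclaim policy. Roughly (the PV name here is made up):

        # reclaimPolicy Delete throws away the underlying disk as soon as the PVC goes;
        # Retain keeps it, so a fat-fingered wipe only deletes the K8S objects, not the data.
        kubectl get storageclass -o custom-columns=NAME:.metadata.name,RECLAIM:.reclaimPolicy

        # Flip an already-bound PV to Retain so a namespace/cluster wipe can't take the data with it:
        kubectl patch pv prod-data-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'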

    The more productivity you gain from extensive automation and CI/CD type stuff, the more time you eventually lose when it fails and it's hyper-complicated or impossible to fix. K8S and similar tools are elephant-sweepers: they're optimized for sweeping elephant-sized piles of complexity under the rug, and they work great until they don't. Then you need to pay me or someone like me $$$$ to come in and fix it, or frankly just to figure out what happened sometimes. The more complicated something is, the more time someone like me gets paid to diagnose it; the actual fixing, at least as best you can, usually isn't that hard.

    The other elephant-sweeping problem with K8S is that running it yourself on your own bare metal is easy until it's hard. This client with strange security issues (NDA situation) has a medium-size K8S cluster (like less than triple-digit nodes), and JUST last week they tried upgrading to v1.28.9, because I donno why they do stuff like that, and everyone knows UPSes are less reliable than grid power (at least in civilized areas where I live), so they managed to lose power to the entire F-ing cluster during the upgrade. Really, you decided to do a K8S upgrade DURING a severe thunderstorm warning, you kind-of get what you deserve...

    K8S will bravely try to fix itself, but IIRC the DaemonSet for Canal (it's sort of a network layer run on every node in the cluster) blew itself up in some spectacular fashion during the power outage, such that JUST that one node's Canal wouldn't come up properly and ended up in crashloop state, and even deleting the pod and letting K8S try to deploy a new one didn't work. Of course it took my entire 3-hour minimum to figure out wtf was going on, because the symptom report is "it's all F-ed up after a site-wide power failure". The fix was about 60 seconds: IIRC I did a rollback on the Canal DaemonSet and the node came right up on the cached older version of Canal, and presumably after I was done they upgraded again or redeployed or I donno, not my problem (yet).

    That brings up the other issue: a system as huge as K8S is nearly impossible for most people to entirely understand (I myself am no expert on some of the more obscure StorageClass stuff), and the world's full of K8S "admins" or at least "users" who don't even know what they don't know.
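
    For the curious, the fix was roughly this shape; the namespace and DaemonSet name are what a stock Canal install uses, not necessarily what this client had:

        kubectl -n kube-system get pods -o wide | grep canal       # spot the one stuck in CrashLoopBackOff
        kubectl -n kube-system rollout history daemonset/canal     # see which revisions exist
        kubectl -n kube-system rollout undo daemonset/canal        # roll back to the previous revision
        kubectl -n kube-system rollout status daemonset/canal      # watch the pod on that node come back healthy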

    If you don't know what K8S Canal is, it's a networking layer that's kind of like Flannel and Calico had a baby. It actually works pretty well at "hiding" VXLAN complexity from the users/admins, which works really well until it doesn't, as usual. Flannel is like: we're going to hide simple VXLAN behind a bunch of yaml and magic and you don't have to know anything about networking until it's notworking. Calico doesn't have a single job; it's like Netbox-for-IPAM for internal cluster addresses that decided to also wrangle inter-host VLAN security, and it wants to be a software firewall, kind of. This is all hand-wavy and I'm sure more than a one-liner for each would be more precise, but if you don't do K8S this might clarify what the components do.
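
    If you want to see the VXLAN that Flannel/Canal is hiding from you, it's sitting right there on each node (interface name assumes the default vxlan backend):

        # On any node running Canal/Flannel with the default vxlan backend:
        ip -d link show flannel.1      # the VXLAN device (VNI, UDP port) Flannel quietly set up
        # And the per-node pod CIDRs the overlay is stitching together:
        kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'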

    Anyway, that wall of text was how I spent a day on a semi-self-inflicted K8S failure, and in summary K8S is kind of like an FPS pistol with an auto-aim hack that's perfectly happy aiming at your own foot, although it usually works pretty well. It would seem the UniSuper peeps discovered this on a somewhat larger scale.

    There are numerous middlemen trying to sell nifty web UIs and hand-holding services to sit between the sysadmins and K8S, but the problem is they mostly introduce even more bugs and possible failure mechanisms, and fewer people use them, so if you're running "wtf v0.1.2" and have a normal K8S problem, almost no one knows what "wtf v0.1.2" is actually doing WRT the problem, so best of luck.

  • (Score: 4, Funny) by Ox0000 on Wednesday May 22 2024, @08:09PM

    by Ox0000 (5111) on Wednesday May 22 2024, @08:09PM (#1357848)

    The problem with K8S is its CLI is modal like vi. So test cluster #3 is all F-ed up and you want to wipe and start over from scratch. So you use one of a zillion CLI methods to connect to test cluster #3 and use kubectl or a helpful wrapper or maybe a script to utterly wipe the F-ing thing. Whoop-a-daisy step number 1 failed and you're actually still connected to the Prod cluster and the backup collection which you just perma-wiped.

    Who would want to do that? That sounds unpleasant. ... I don't think I know anybody like that [xkcd.com]