Buried under the news from Google I/O this week is one of Google Cloud's biggest blunders ever: Google's Amazon Web Services competitor accidentally deleted a giant customer account for no reason. UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service. UniSuper thankfully had some backups with a different provider and was able to recover its data, but according to UniSuper's incident log, downtime started May 2, and a full restoration of services didn't happen until May 15.
UniSuper's website is now full of must-read admin nightmare fuel about how this all happened. First is a wild page posted on May 8 titled "A joint statement from UniSuper CEO Peter Chun, and Google Cloud CEO, Thomas Kurian." This statement reads, "Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper's Private Cloud services ultimately resulted in the deletion of UniSuper's Private Cloud subscription. This is an isolated, 'one-of-a-kind occurrence' that has never before occurred with any of Google Cloud's clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again."
[...] A June 2023 press release touted UniSuper's big cloud migration to Google, with Sam Cooper, UniSuper's Head of Architecture, saying, "With Google Cloud VMware Engine, migrating to the cloud is streamlined and extremely easy. It's all about efficiencies that help us deliver highly competitive fees for our members."
[...] The second must-read document in this whole saga is the outage update page, which contains 12 statements as the cloud devs worked through this catastrophe. The first update is May 2 with the ominous statement, "You may be aware of a service disruption affecting UniSuper's systems." UniSuper immediately seemed to have the problem nailed down, saying, "The issue originated from one of our third-party service providers, and we're actively partnering with them to resolve this." On May 3, Google Cloud publicly entered the picture with a joint statement from UniSuper and Google Cloud saying that the outage was not the result of a cyberattack.
[...] The joint statement and the outage updates are still not a technical post-mortem of what happened, and it's unclear if we'll get one. Google PR confirmed in multiple places it signed off on the statement, but a great breakdown from software developer Daniel Compton points out that the statement is not just vague, it's also full of terminology that doesn't align with Google Cloud products. The imprecise language makes it seem like the statement was written entirely by UniSuper. It would be nice to see a real breakdown of what happened from Google Cloud's perspective, especially when other current or potential customers are going to keep a watchful eye on how Google handles the fallout from this.
Anyway, don't put all your eggs in one cloud basket.
Related Stories
Previously on SoylentNews: "Unprecedented" Google Cloud Event Wipes Out Customer Account and its Backups - 20240521
Earlier this month, Google Cloud experienced one of its biggest blunders ever, when UniSuper, a $135 billion Australian pension fund, had its Google Cloud account wiped out due to some kind of mistake on Google's end. At the time, UniSuper indicated it had lost everything it had stored with Google, even its backups, and that caused two weeks of downtime for its 647,000 members. There were joint statements from the Google Cloud CEO and UniSuper CEO on the matter, a lot of apologies, and presumably a lot of worried customers who wondered if their retirement fund had disappeared.
[...] Two weeks later, Google Cloud's internal review of the problem is finished, and the company has a blog post up detailing what happened.
Google has a "TL;DR" at the top of the post, and it sounds like a Google employee got an input wrong.
During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer's GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period. The incident trigger and the downstream system behavior have both been corrected to ensure that this cannot happen again.
[...] In its post-mortem, Google now says, "Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and, along with third party backup software, were instrumental in aiding the rapid restoration." It's hard to square that with UniSuper's earlier claim that its backups were wiped out too, especially given the two-week recovery period. The point of a backup is to be restored quickly, so either UniSuper's backups survived but weren't very effective, leading to two weeks of downtime, or they would have been effective had they not been partially or completely wiped out.
[...] Google says Cloud still has "safeguards in place with a combination of soft delete, advance notification, and human-in-the-loop, as appropriate," and it confirmed these safeguards all still work.
(Score: 5, Insightful) by Runaway1956 on Wednesday May 22 2024, @01:28PM
This is a customer with enough clout to make waves. I can't help wondering how Google would have responded to the deletion of a single private citizen, or a small business with no resources to fight for what they've lost. All animals are equal, but some animals are more equal. We don't ever want to forget that. It is quite possible that hundreds, or even thousands, of little nobodies have been deleted in the past, but we never heard of them because they lack influence.
“I have become friends with many school shooters” - Tampon Tim Walz
(Score: 5, Informative) by Kell on Wednesday May 22 2024, @01:53PM
As a customer affected by this, we got emails on day 1 and pretty much every day after with blow-by-blow descriptions of everything going on. I shudder to think of what would have happened to my super balance if it had been truly wiped (can you say "class action"?), but I am actually extremely satisfied with how UniSuper handled it all. I wish other organisations would be as up-front and honest with their customers.
Scientists ask questions. Engineers solve problems.
(Score: 5, Touché) by stratified cake on Wednesday May 22 2024, @01:59PM (5 children)
You can't touch it, you can't control it and sooner or later it sheds its moisture and it's gone.
(Score: 3, Funny) by Gaaark on Wednesday May 22 2024, @09:27PM (4 children)
Now do it as a Haiku! ;)
--- Please remind me if I haven't been civil to you: I'm channeling MDC. ---Gaaark 2.0 ---
(Score: 5, Touché) by mrpg on Wednesday May 22 2024, @09:41PM (3 children)
Your dreams in the cloud
a gentle breeze
gone
(Score: 3, Funny) by Gaaark on Wednesday May 22 2024, @09:56PM
Clouds cry your dreams down
Gravity does its job now
Google falls down, down
--- Please remind me if I haven't been civil to you: I'm channeling MDC. ---Gaaark 2.0 ---
(Score: 3, Interesting) by DECbot on Wednesday May 22 2024, @10:12PM (1 child)
Nice, but for a true haiku you need a 5-7-5 syllable count, a seasonal reference (kigo), and a cutting word (kireji), which doesn't really exist in English.
Treasure stored in clouds
Tengu blows away account
Admins cry Spring rains
cats~$ sudo chown -R us /home/base
(Score: 3, Insightful) by mrpg on Thursday May 23 2024, @02:51AM
Learn all the rules then forget them.
(Score: 5, Interesting) by VLM on Wednesday May 22 2024, @02:14PM (1 child)
Yeah pretty vague and unclear who's to blame. Note how carefully they don't assign blame in the press release or even the journalist coverage. I got a feeling this is one of those he-said she-said disputes about someone's cloud-hosted K8S clusters.
Anyone else mess with K8S? I do. For money, even.
The problem with K8S is that its CLI is modal, like vi. So test cluster #3 is all F-ed up and you want to wipe it and start over from scratch. You use one of a zillion CLI methods to connect to test cluster #3, then use kubectl or a helpful wrapper or maybe a script to utterly wipe the F-ing thing. Whoops-a-daisy, step number 1 failed and you're actually still connected to the Prod cluster and the backup collection, which you just perma-wiped. Everyone who does K8S stuff is either a noob who's faking it, or lying, or will admit that at least one time they ran 'kubectl delete' on the wrong namespace or even the entirely wrong cluster.
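One cheap mitigation, and this is just a hedged sketch of mine (the context names and the scratch namespace are invented, not anything from the article): make the destructive script verify the live kubectl context before it touches anything, and pin --context on the delete itself.

    # Refuse to wipe anything unless the live context matches what you think you're pointed at.
    EXPECTED_CONTEXT="test3"    # the cluster you actually intend to destroy (invented name)
    CURRENT_CONTEXT="$(kubectl config current-context)"
    if [ "$CURRENT_CONTEXT" != "$EXPECTED_CONTEXT" ]; then
        echo "Refusing to wipe: connected to '$CURRENT_CONTEXT', expected '$EXPECTED_CONTEXT'" >&2
        exit 1
    fi
    # Pin the context explicitly so a stale 'current-context' can't redirect the blast radius.
    kubectl --context "$EXPECTED_CONTEXT" delete namespace scratch

It won't save you from every failure mode, but it turns "silently nuked prod" into "script refused to run".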
Another good way to blow up a K8S system is to extensively automate the F out of a deployment so you can go from bare empty K8S to a reasonably scalable production cluster. On the test cluster you scale or autoscale to exactly one pod per deployment to save resources, but in production you'd have the autoscaler spawn off like 50 pods for each deployment, all good, right? So you mess around with the PROD cluster for whatever reason, then the next task is redeploying a fresh test cluster, so you connect to the test cluster and run your bare-metal wipe-and-deploy for test clusters; oh shit, my modal connection to testcluster.yaml failed, I'm still connected to PROD. Well, that's a mess alright. That's why you keep backups, or don't have your storage class auto-delete old PVCs. Unless you also wiped your backups (because who needs backups of temporary test clusters?) and auto-wipe old PVs for security or "saving money".
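On the PV point: if you want the underlying volumes to survive a cluster or namespace wipe, the usual trick (this one is straight out of the Kubernetes docs, though the PV name here is a placeholder) is to flip the reclaim policy on the PersistentVolumes to Retain before anything gets a chance to delete the PVCs.

    # List volumes and their current reclaim policy.
    kubectl get pv
    # Flip one PV to Retain so deleting its PVC (or the whole namespace) leaves the disk intact.
    # "pvc-1234" is a placeholder volume name.
    kubectl patch pv pvc-1234 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

You still have to clean up and rebind retained volumes by hand later, but that beats explaining where the data went.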
The more productivity you gain from extensive automation and CI/CD type stuff, the more time you eventually lose when it fails and it's hyper-complicated or impossible to fix. K8S and similar tools are elephant-sweepers: they are optimized for sweeping elephant-sized piles of complexity under the rug, and they work great until they don't. Then you need to pay me, or someone like me, $$$$ to come in and fix it, or frankly sometimes just to figure out what happened. The more complicated something is, the more time someone like me gets paid to diagnose it; the actual fixing, at least as best as you can, usually isn't that hard.
The other elephant-sweeping problem with K8S is that running it yourself on your own bare metal is easy until it's hard. This client with strange security issues (NDA situation) has a medium-size K8S (fewer than triple-digit nodes), and JUST last week they tried upgrading their K8S to v1.28.9, because I dunno why they do stuff like that, and everyone knows UPSes are less reliable than grid power (at least in civilized areas where I live), so they managed to lose power to the entire F-ing cluster during the upgrade. Really, you decided to do a K8S upgrade DURING a severe thunderstorm warning, you kind of get what you deserve... K8S will bravely try to fix itself, but IIRC one of the DaemonSet pods for Canal (it's sort of a network layer that runs on every node in the cluster) blew itself up in some spectacular fashion during the power outage, such that JUST that one node's Canal wouldn't come up properly and ended up in a crashloop state, and even deleting the pod and letting K8S try to deploy a new one didn't work.

Of course it took my entire 3-hour minimum to figure out wtf was going on, because the symptom report is "it's all F-ed up after a site-wide power failure". The fix was about 60 seconds: IIRC I did a rollback on the Canal deployment and the node came right up on the cached older version of Canal, and presumably after I was done they upgraded again or redeployed, or I dunno, not my problem (yet). That brings up the other issue: a system as huge as K8S is nearly impossible for most people to entirely understand (I myself am no expert on some of the more obscure StorageClass stuff), and the world's full of K8S "admins" or at least "users" who don't even know what they don't know.
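For the curious, the 60-second fix is roughly this shape; a hedged reconstruction from memory rather than an exact transcript (Canal usually lives in kube-system, but that's an assumption about this particular cluster):

    # See what revisions exist, then roll the DaemonSet back to the previous one.
    kubectl -n kube-system rollout history daemonset/canal
    kubectl -n kube-system rollout undo daemonset/canal
    # Watch the crashlooping node's pod come back on the older image.
    kubectl -n kube-system get pods -o wide -w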
If you don't know what K8S Canal is, it's a networking layer that's kind of like Flannel and Calico had a baby. It actually works pretty well at "hiding" VXLAN complexity from the users/admins, which works really well until it doesn't, as usual. Flannel is like: we're going to hide simple VXLAN behind a bunch of yaml and magic, and you don't have to know anything about networking until it's notworking. Calico doesn't have a single job; it's like Netbox-style IPAM for internal cluster addresses that also decided to wrangle inter-host network security, and it wants to be a software firewall, kind of. This is all hand-wavy and I'm sure more than a one-liner for each would be more precise, but if you don't do K8S this might clarify what the components do.
Anyway, that wall of text was how I spent a day on a semi-self-inflicted K8S failure. In summary, K8S is kind of like an FPS pistol with an auto-aim hack that's perfectly happy aiming at your own foot, although it usually works pretty well. It would seem the UniSuper peeps discovered this on a somewhat larger scale.
There are numerous middlemen trying to sell nifty web UIs and hand-holding services to sit between the sysadmins and K8S, but the problem is they mostly introduce even more bugs and possible failure mechanisms, and fewer people use them, so if you're running "wtf v0.1.2" and have a normal K8S problem, almost no one knows what "wtf v0.1.2" is actually doing WRT the problem, so best of luck.
(Score: 4, Funny) by Ox0000 on Wednesday May 22 2024, @08:09PM
Who would want to do that? That sounds unpleasant. ... I don't think I know anybody like that [xkcd.com]
(Score: 5, Interesting) by bzipitidoo on Wednesday May 22 2024, @02:22PM (5 children)
For a year, I worked as a sysadmin for a small tech company, and we had our problems. Had a db admin and a network admin too. The devs frequently used admins as scapegoats. Particularly galling when it was a problem they had caused by ignoring us.
The worst incident was one of those. The db admin wanted to password-protect the database, not for security against hackers, but just to prevent accidents. The devs didn't want to be inconvenienced, and overruled him. One of them had also written a badly designed script that was used to install updates. The bad design was that you ran this script while logged into the computer holding the source code, passing it a parameter to tell it which server to update, and the update process was not a minimal one but a near-complete reinstallation. No rsync, nosirree, this script used rm -rf and scp!
And then, it happened. The dev who usually ran the script accidentally pointed it at the production servers when he meant to point it at the test servers. Now, for provisioning the test servers, the script did a little additional action: dropped all the tables in the database. And because the production database had not been password protected, this wiped out all the data in the production db. Goodbye to everything. Everything! Our internal notes, and every written thing every one of all our customers had ever entrusted to us.
We were discussing some other relatively trivial problem when a refresh of the company's web page, to read a comment about it, didn't work. The company's web site had just gone down! That started the mad scramble. Had we just been hacked? We leapt into action. I started checking which servers I could still reach, and on those I could reach (all of them), started examining the logs for evidence of intrusion or anything else to give us info on what had happened. I was not finding anything wrong, but I kept looking. The db admin was stressing to the max, trying to contain the damage. He checked the backups, only to find out that for lack of storage space, another of the devs had turned off the database backups a week before! The system had been programmed to make a backup every day. Soon, the boss learned that something was wrong, and got in touch with all of us to monitor the situation. He urged everyone not to panic, even as he himself started showing signs of panic. About 5 minutes later, the dev who had caused the crisis confessed, thus laying to rest fears that a hack had been perpetrated and might still be ongoing.
The db admin pulled off a miracle. He was able to restore the database from the most recent backup, then a week old, and bring it up to date by running all the transactions that had been done since that backup, because those transactions had all been logged, and those logs were still present. It took most of the day to restore the website to functionality though not with the most recent activity, and then, another 3 weeks to run all the logged transactions.
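The story above doesn't say which database this was, so purely as a hedged illustration of the same recovery shape, here's what it looks like on MySQL with binary logging enabled (all names, paths, and dates below are made up): restore the last dump, then replay every transaction logged since.

    # Restore last week's dump (placeholder file and database names).
    mysql -u root -p production_db < backup_last_week.sql
    # Replay the binary logs recorded since that dump to roll forward to the present.
    mysqlbinlog --start-datetime="2024-05-01 00:00:00" \
        /var/log/mysql/binlog.000042 /var/log/mysql/binlog.000043 | mysql -u root -p production_db

The principle is the same as in the story: an old backup plus complete transaction logs equals a full, if slow, recovery.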
In the aftermath, I made a completely new script to handle the process of updating. The way I designed it, you logged into the target machine, then pulled the update to it. No more pushing of updates from the source; now it was a pull to the target. That way, I hoped to ensure no such mistake would ever happen again. The user could not fail to know which server they were on, or so I hoped. Another design change I made was to build everything in a newly created directory, and only when that was done, rename the directories: the production directory got renamed to "old", and the "new" directory to production. That way, if there was any problem, we could easily go back to the previous version. It also meant the service could stay up and running on the previous version while the new version was being installed, whereas previously, almost the first step had been that "rm -rf", taking the site down for the approximately 20 minutes it took to do an install and making it impossible to revert.
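Sketched out, the pull-and-swap update looks roughly like this; a hedged reconstruction of the idea rather than the actual script, with all paths and the "buildhost" name invented for illustration:

    # Run ON the target server, so there's no question which machine you're updating.
    set -e
    cd /srv/app
    rm -rf new old                                  # clear leftovers from any previous run
    rsync -a buildhost:/srv/releases/latest/ new/   # pull the new build into a fresh directory
    mv current old                                  # keep the running version around...
    mv new current                                  # ...and swap the new one into place
    # Rollback, if something is wrong:  mv current broken && mv old current

The swap is two quick renames instead of a 20-minute rm-and-reinstall, and the old version is sitting right there if you need it back.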
(Score: 2) by RS3 on Thursday May 23 2024, @10:14PM (4 children)
30 or so years ago I worked at a "systems integrator" (yes, too generic). Basically a bunch of engineers doing PLC and industrial control system design and programming. 2 _very_ bright guys were doing some kind of big C/Unix project. Worked all day every day for months.
Things were working well, and at some point near the end of the project they decided to do a backup. Of course the guy brain-farted, gave the parameters in reverse, and wrote linear tape to /dev/hda (or whatever it was on that system).
They stopped fairly quickly, but obviously it was trashed. I'm not aware of there being any good file recovery or filesystem rebuild tools in those days for *nix filesystems.
Being a bit of a hacker I offered to scan the drive by sectors and extract as much as possible, but they declined and spent about 2 weeks recreating the entire project.
No direct impact on me, but whenever I'm in a position that could cause a big problem, I double and triple-check what I've typed before hitting 'enter'.
(Score: 2) by bzipitidoo on Friday May 24 2024, @02:20AM (3 children)
Yes, swapping the parameters is one of the tragic mistakes computers too easily empower. All the worse that commands aren't consistent on the ordering they expect. You've likely heard that "dd" stands for "disk destroyer". Ever meant to type in a capital A, and instead of hitting the shift key, you hit the ctrl key, which in some editors and applications selects everything, and then your next keypress replaces everything with the single character you just typed? If the editor can undo that mistake, no big deal, but occasionally, it can't be undone.
One of the worst systems for too easily losing text is the text boxes our browsers use to write these very comments. I've gotten into the habit of using that same dangerous ctrl-a followed by ctrl-c to copy my comments to the clipboard, in case something goes wrong when I mash that submit button, and then, when I go back, the browser presents me with a text box empty of the comment I had just made.
(Score: 2) by RS3 on Friday May 24 2024, @03:17AM (1 child)
Yes, I've done the ctrl-a ctrl-c when commenting here and other places. For a while this site would stop responding for a minute or many. Other times I've had browser crash. Sometimes I'm just not 'feeling it' and don't wish to post, maybe come back to it. However, every now and then the browser surprises me by remembering what I've written and filling in the text box with what I had written, like if I accidentally hit ctrl-w, which is a very common keystroke for me.
You mentioned 'dd', but I think dd is safer because you have to specify 'if' and 'of', which hopefully makes the person think a bit more. I'm pretty sure the aforementioned C/Unix guys used cp.
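Put in dd terms, purely as an illustration (the device names are examples, and those guys probably used cp as noted), the difference is one transposition:

    # What the backup was supposed to do: disk -> tape
    dd if=/dev/hda of=/dev/st0 bs=64k
    # The brain-fart version, one swap away, streams the tape over the disk: tape -> disk
    dd if=/dev/st0 of=/dev/hda bs=64k

At least with dd the direction is spelled out on the command line; with cp or cat onto a raw device, the only hint is the argument order.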
(Score: 2) by bzipitidoo on Friday May 24 2024, @02:15PM
The thing about having a "select all" function is the use cases. When would anyone want to "select all" in order to replace everything with a single character? If it's not "never", it's certainly rare enough that it ought to be changed. No one wants to do that when there are other options such as opening a new file. So why do editors facilitate such an action? After pressing ctrl-a, the main and almost only keypress that does something functional without erasing everything is ctrl-c. The only other keypresses that don't erase everything are the arrow keys, which deselect, and shift plus an arrow key, which merely shrinks the selection. A change to that functionality could save a lot of pain. Maybe ctrl-a should be "copy everything to the clipboard and don't select anything", instead of "select all". Or maybe it should be "select all, for copying only", and the input of any visible character should deselect everything then just append that character to the end of the existing text, not replace all that text. If the user really does want it all deleted, let them use the backspace or the delete key for that.
In the early days of the MMORPG Everquest, 'a' was for "attack", and soon earned a reputation as the "'a' key of death". The player would be in a town facing a friendly NPC that was much more powerful than the player, and be chatting with other people, but missed prefacing their latest chat with the command to tell the system it was chat text, and the next time they used a word with the letter 'a', the system would take this as a wish to suicidally attack that friendly NPC, instantly turning it into an enemy that would squash the player in one mighty blow. Everquest eventually moved that "attack" command from 'a' to, IIRC, ctrl-a.
(Score: 2) by RS3 on Friday May 24 2024, @03:19AM
While I'm thinking of dd, if you're ever trying to recover a damaged hard drive, trying to dd sick drive to a good one or image file, dd will halt on sector errors. Check out ddrescue.
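Typical invocation, for anyone filing that away (device and file names below are placeholders):

    # First pass: grab everything that reads cleanly, skipping the bad areas.
    ddrescue -n /dev/sdb disk.img rescue.map
    # Second pass: go back for the bad sectors with direct I/O and 3 retry passes.
    ddrescue -d -r3 /dev/sdb disk.img rescue.map

The mapfile is the important part: it records what has been recovered so far, so ddrescue can resume instead of re-reading a dying drive from scratch.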
(Score: 5, Insightful) by SomeGuy on Wednesday May 22 2024, @06:00PM (2 children)
I'm honestly surprised this sort of thing does not happen more often. It seems like everything is falling apart these days, nothing really works any more, and nobody wants to hire anyone to make sure things keep running - just half-assed patch up whatever little problem and begone.
Don't put all your eggs in one cloud basket? As far as anyone in charge is ever concerned, there is only ONE basket, and that IS the cloud, and everything must go in there, because it makes their dicks look big. No need to worry about where anything is located, the cloud takes care of that. No need to concern yourself with how many resources are needed, it's all unicorn magic in teh cloudz!
(Score: 1, Insightful) by Anonymous Coward on Wednesday May 22 2024, @06:48PM
> It seems like everything is falling apart these days, nothing really works any more, ....
Wild ass guess: SomeGuy is younger than 40, perhaps quite a bit younger?
Looking from my late 60s, it's pretty clear that everything has been falling apart and only patched back together--for most of human history. With 24/7 internet nooze, you might hear about more examples these days?
(Score: 4, Touché) by stormreaver on Wednesday May 22 2024, @09:04PM
It doesn't. This is the only time anything like this has ever happened, and even then I'm supremely confident that it is nothing more than mass hysteria. After all, we were assured that cloud providers are way more qualified to handle our business than we are, have way more people to ensure our data is safe, have redundant backups that we could only dream of, etc.
Nothing to see here, citizen. Move along.
(Score: 3, Insightful) by darkfeline on Thursday May 23 2024, @07:30AM
I dunno if any more details have been publicized, but I can tell you that a surprising number of people have deleted their cloud storage buckets and then realized, oh no! that contains lots of important info, can it be undeleted please?
What's even scarier is that I've heard of multiple cases of people clicking the delete button, clicking past the confirmation and warning that deletes are irreversible, who report that they did this because they didn't know it would actually delete it. I mean, can't hurt to test what the button does, right?
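If you're on GCS, one cheap piece of self-defense (the bucket name here is made up, and it won't protect you from someone force-deleting the bucket itself) is to turn on object versioning, so a deleted or overwritten object sticks around as a noncurrent version:

    # Keep noncurrent versions of objects around after deletes/overwrites.
    gsutil versioning set on gs://my-important-bucket
    # List all versions, including the "deleted" ones.
    gsutil ls -a gs://my-important-bucket/somefile

Pair it with a lifecycle rule so old versions eventually expire, or the storage bill becomes its own incident.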
Recall that 50% of the population have below 100 IQ, and read the wisdom of Terry Pratchett:
Some humans would do anything to see if it was possible to do it. If you put a large switch in some cave somewhere, with a sign on it saying 'End-of-the-World Switch. PLEASE DO NOT TOUCH', the paint wouldn't even have time to dry.
Join the SDF Public Access UNIX System today!
(Score: 4, Insightful) by Nuke on Thursday May 23 2024, @12:03PM (5 children)
That this shit can happen is yet another reason against "going paperless", as my bank and everyone else keeps nagging and hyping. At least a paper statement from the bank is proof of what I have at some point.
(Score: 0) by Anonymous Coward on Thursday May 23 2024, @12:27PM
Yep, I keep seven years (or more) of paper banking and other investment records. Also the original account-opening paperwork.
Keeping paper is cheaper than it used to be--with the big "paperless" movement I haven't had to pay for a metal filing cabinet in many years. They are pretty easy to find around here for free (out with the trash, etc).
(Score: 2) by Joe Desertrat on Friday May 24 2024, @01:55AM (3 children)
There's no reason to print this stuff on paper. You can easily download statements from your bank (if you can't, get another bank). Of course, you will have to back this up yourself, but that is simple enough these days, and no matter how much space it takes on your hard drive, external drive and/or disks, it will still take up far less physical space than piles of paper will.
(Score: 2) by Nuke on Friday May 24 2024, @09:35AM (2 children)
Having a paper statement from the bank is forensic proof that you had that money in the bank at least at that point in time. Of course it could be a few weeks old, except that most of my money is in longer-term bonds or accounts that need some notice of withdrawal, so the statement will show the amount up to the present date.
This news item itself demonstrates a reason, one of several. Another is being able to annotate statements yourself.
How do you download statements from your bank if they have lost all their data in a cloud, like this pension fund did?
Bank statements don't take up much space, although it could be an issue if you lived in a broom cupboard, which I don't. I have far more space taken up by books, bikes, furniture, PCs, mementos, tools, food, pet equipment, boxes of USB leads in every combination, tins of paint waiting for me to get round to doing some decorating, boxes of ceramic tiles ditto, I could go on and on. A 10mm-thick stack of A4 bank statements in a pocket file is the last thing I worry about finding space for.
(Score: 2) by Joe Desertrat on Saturday June 01 2024, @12:33AM (1 child)
How do you get paper statements from them if they have lost all their data? You can download your statement each month (or however often you like), usually well before any paper statements have gone out in the mail. Save it on YOUR computer, with backups on reliable media of your choosing, and you will never have to worry.
(Score: 2) by Nuke on Saturday June 01 2024, @09:20AM
I would have all the paper statements* from before they lost all data, up to two weeks ago as I write this. Of course if they lost all data today they could claim I had withdrawn and spent everything in the last two weeks, like I was hiding a new car or yacht somewhere, but those statements are better than nothing and most of my money is in long-term bonds, or limited withdrawal saving accounts, anyway. With my having paper statements the only account the bank could argue with me about would be my current account (= "checking account" in USA?) which never has a large amount in it for very long.
I do that anyway.
Not forensically valid. Digital data can be tampered with and often is**. That's why you cannot open a bank account (in the UK anyway) without showing proof of residence etc on paper, digital version unacceptable - ironically by the very same banks that nag you to go paperless.
* Actually I keep most of them for up to 5 years ago, and the annual summaries for ever.
** As in the current UK Post Office Horizon scandal.