Buried under the news from Google I/O this week is one of Google Cloud's biggest blunders ever: Google's Amazon Web Services competitor accidentally deleted a giant customer account for no reason. UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service. UniSuper thankfully had some backups with a different provider and was able to recover its data, but according to UniSuper's incident log, downtime started May 2, and a full restoration of services didn't happen until May 15.
UniSuper's website is now full of must-read admin nightmare fuel about how this all happened. First is a wild page posted on May 8 titled "A joint statement from UniSuper CEO Peter Chun, and Google Cloud CEO, Thomas Kurian." This statement reads, "Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper's Private Cloud services ultimately resulted in the deletion of UniSuper's Private Cloud subscription. This is an isolated, 'one-of-a-kind occurrence' that has never before occurred with any of Google Cloud's clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again."
[...] A June 2023 press release touted UniSuper's big cloud migration to Google, with Sam Cooper, UniSuper's Head of Architecture, saying, "With Google Cloud VMware Engine, migrating to the cloud is streamlined and extremely easy. It's all about efficiencies that help us deliver highly competitive fees for our members."
[...] The second must-read document in this whole saga is the outage update page, which contains 12 statements as the cloud devs worked through this catastrophe. The first update is May 2 with the ominous statement, "You may be aware of a service disruption affecting UniSuper's systems." UniSuper immediately seemed to have the problem nailed down, saying, "The issue originated from one of our third-party service providers, and we're actively partnering with them to resolve this." On May 3, Google Cloud publicly entered the picture with a joint statement from UniSuper and Google Cloud saying that the outage was not the result of a cyberattack.
[...] The joint statement and the outage updates are still not a technical post-mortem of what happened, and it's unclear if we'll get one. Google PR confirmed in multiple places it signed off on the statement, but a great breakdown from software developer Daniel Compton points out that the statement is not just vague, it's also full of terminology that doesn't align with Google Cloud products. The imprecise language makes it seem like the statement was written entirely by UniSuper. It would be nice to see a real breakdown of what happened from Google Cloud's perspective, especially when other current or potential customers are going to keep a watchful eye on how Google handles the fallout from this.
Anyway, don't put all your eggs in one cloud basket.
(Score: 5, Interesting) by bzipitidoo on Wednesday May 22 2024, @02:22PM (5 children)
For a year, I worked as a sysadmin for a small tech company, and we had our problems. We had a db admin and a network admin too. The devs frequently used the admins as scapegoats, which was particularly galling when the problem was one they had caused by ignoring us.
The worst incident was one of those. The db admin wanted to password protect the database, not for security against hackers, but just to prevent accidents. The devs didn't want to be inconvenienced, and overruled him. One of them had also written a badly designed script that was used to install updates. The bad design: the user logged into the machine holding the source code, ran the script there, and passed it a parameter naming which server to update, and the update process was not an incremental one but a near-complete reinstallation. No rsync, nosirree, this script used rm -rf and scp!
And then, it happened. The dev who usually ran the script accidentally pointed it at the production servers when he meant to point it at the test servers. Now, when provisioning the test servers, the script did one additional thing: it dropped all the tables in the database. And because the production database had not been password protected, this wiped out all the data in the production db. Goodbye to everything. Everything! Our internal notes, and everything every one of our customers had ever entrusted to us.
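(For illustration only, here's a hypothetical sketch of roughly what a script like that might look like. The paths, the mysql command, and the provisioning flag are my guesses, not the actual code; the point is just that nothing in it distinguishes a test target from production.)

    #!/bin/sh
    # deploy.sh -- hypothetical reconstruction of a push-style installer,
    # run from the machine that holds the source code.
    # Usage: ./deploy.sh <target-host> [--provision]
    TARGET="$1"

    # "Update" = near-complete reinstall: wipe the remote tree, copy it all back.
    ssh "$TARGET" 'rm -rf /srv/app/*'       # site is down from here until the copy finishes
    scp -r ./build/* "$TARGET":/srv/app/

    # Extra step meant only for provisioning test servers: reset the database
    # by dropping every table. Nothing checks that $TARGET is actually a test
    # server, and an unprotected database will happily run this on production.
    if [ "$2" = "--provision" ]; then
        ssh "$TARGET" 'mysql appdb < /srv/app/sql/drop_all_tables.sql'
    fi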
We were discussing some other, relatively trivial problem when a refresh of the company's web page, to read a comment about it, failed. The company's web site had just gone down! That started the mad scramble. Had we just been hacked? We leapt into action. I started checking which servers I could still reach, and on those I could reach (all of them) I started examining the logs for evidence of intrusion or anything else that might tell us what had happened. I wasn't finding anything wrong, but I kept looking. The db admin was stressing to the max, trying to contain the damage. He checked the backups, only to find out that, for lack of storage space, another of the devs had turned off the database backups a week before! The system had been programmed to make a backup every day. Soon the boss learned that something was wrong and got in touch with all of us to monitor the situation. He urged everyone not to panic, even as he himself started showing signs of panic. About 5 minutes later, the dev who had caused the crisis confessed, laying to rest fears that a hack had been perpetrated and might still be ongoing.
The db admin pulled off a miracle. He was able to restore the database from the most recent backup, by then a week old, and bring it up to date by replaying all the transactions that had been done since that backup, because those transactions had all been logged and the logs were still present. It took most of the day to restore the website to functionality, though without the most recent activity, and then another 3 weeks to replay all the logged transactions.
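The story doesn't say which database this was, but the technique is classic point-in-time recovery: restore the last full backup, then replay the logged transactions up to the moment of the disaster. A minimal sketch, assuming a MySQL-style setup with binary logs (all names and timestamps below are placeholders):

    # 1. Restore the most recent (week-old) full dump.
    mysql appdb < weekly_backup.sql

    # 2. Replay every logged transaction from the backup time up to just
    #    before the statement that dropped the tables.
    mysqlbinlog --start-datetime="$BACKUP_TIME" \
                --stop-datetime="$DISASTER_TIME" \
                binlog.000101 binlog.000102 | mysql appdb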
In the aftermath, I wrote a completely new script to handle the update process. The way I designed it, you logged into the target machine and pulled the update to it. No more pushing updates from the source; now it was a pull on the target. That way, I hoped to ensure no such mistake would ever happen again: the user could not fail to know which server they were on, or so I hoped. The other design change I made was to build everything in a newly created directory and, when that was done, rename the directories: the production directory got renamed to "old", and the "new" directory became production. That way, if there was any problem, we could easily go back to the previous version. It also meant the service could stay up and running on the previous version while the new version was being installed, whereas previously almost the first step had been that "rm -rf", taking the site down for the roughly 20 minutes an install took and making it impossible to revert.
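A minimal sketch of that pull-and-swap approach, run on the target server itself (host names and paths are assumptions, not the original script):

    #!/bin/sh
    # update.sh -- hypothetical pull-style installer, run ON the target server
    set -e
    SRC_HOST="buildbox"                     # assumed name of the source/build machine

    # Stage the new version in a fresh directory while the live site keeps serving.
    rm -rf /srv/app.new
    scp -r "$SRC_HOST:/srv/build" /srv/app.new

    # Swap directories: keep the old tree around so a bad release can be
    # reverted just by renaming "old" back into place.
    rm -rf /srv/app.old
    mv /srv/app /srv/app.old
    mv /srv/app.new /srv/app

Rolling back is then just the last two renames in reverse, and the site is only "in between" for the instant the renames take.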
(Score: 2) by RS3 on Thursday May 23 2024, @10:14PM (4 children)
30 or so years ago I worked at a "systems integrator" (yes, too generic). Basically a bunch of engineers doing PLC and industrial control system design and programming. Two _very_ bright guys were doing some kind of big C/Unix project and worked all day, every day, for months.
Things were working well, and at some point near the end of the project they decided to do a backup. Of course the guy brain-farted, gave the parameters in reverse, and wrote linear tape to /dev/hda (or whatever it was on that system).
They stopped it fairly quickly, but obviously the disk was trashed. I'm not aware of any good file recovery or filesystem rebuild tools for *nix filesystems in those days.
Being a bit of a hacker I offered to scan the drive by sectors and extract as much as possible, but they declined and spent about 2 weeks recreating the entire project.
No direct impact on me, but whenever I'm in a position that could cause a big problem, I double and triple-check what I've typed before hitting 'enter'.
(Score: 2) by bzipitidoo on Friday May 24 2024, @02:20AM (3 children)
Yes, swapping the parameters is one of the tragic mistakes computers too easily empower. All the worse that commands aren't consistent in the ordering they expect. You've likely heard that "dd" stands for "disk destroyer". Ever meant to type a capital A, and instead of hitting the shift key, you hit the ctrl key, which in some editors and applications selects everything, so that your next keypress replaces everything with the single character you just typed? If the editor can undo that mistake, no big deal, but occasionally it can't be undone.
One of the worst systems for too easily losing text is the text boxes our browsers use to write these very comments. I've gotten into the habit of using that same dangerous ctrl-a, followed by ctrl-c, to copy my comment to the clipboard, in case something goes wrong when I mash the submit button and, when I go back, the browser presents me with a text box empty of the comment I had just written.
(Score: 2) by RS3 on Friday May 24 2024, @03:17AM (1 child)
Yes, I've done the ctrl-a ctrl-c when commenting here and other places. For a while this site would stop responding for a minute or many. Other times I've had the browser crash. Sometimes I'm just not 'feeling it' and don't wish to post, and maybe come back to it later. Every now and then, though, the browser surprises me by remembering what I wrote and refilling the text box with it, like after I accidentally hit ctrl-w, which is a very common keystroke for me.
You mentioned 'dd', but I think dd is safer because you have to specify 'if' and 'of', which hopefully makes the person think a bit more. I'm pretty sure the aforementioned C/Unix guys used cp.
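To illustrate the difference (the device names here are placeholders, not whatever that system actually used):

    # cp's positional arguments: reverse them and the backup becomes a restore.
    cp /dev/hda /dev/rmt0     # intended: disk -> tape
    cp /dev/rmt0 /dev/hda     # reversed: tape contents overwrite the disk

    # dd's named operands at least state the direction explicitly:
    dd if=/dev/hda of=/dev/rmt0 bs=64k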
(Score: 2) by bzipitidoo on Friday May 24 2024, @02:15PM
The thing about having a "select all" function is the use cases. When would anyone want to "select all" in order to replace everything with a single character? If it's not "never", it's certainly rare enough that the behavior ought to be changed. No one wants to do that when there are other options, such as opening a new file. So why do editors facilitate it? After pressing ctrl-a, about the only keypress that does something useful without erasing everything is ctrl-c. The arrow keys merely "select none", and shift plus the left or up arrow keys "select not quite all"; nearly everything else wipes the text. A change to that behavior could save a lot of pain. Maybe ctrl-a should be "copy everything to the clipboard and don't select anything" instead of "select all". Or maybe it should be "select all, for copying only", and typing any visible character should deselect everything and just append that character to the end of the existing text, not replace all of it. If the user really does want it all deleted, let them use the backspace or delete key for that.
In the early days of the MMORPG Everquest, 'a' was the key for "attack", and it soon earned a reputation as the "'a' key of death". A player would be standing in town facing a friendly NPC much more powerful than themselves, chatting with other people, and forget to preface their latest message with the command telling the system it was chat text. The next time they typed a word containing the letter 'a', the system would take it as a wish to suicidally attack that friendly NPC, instantly turning it into an enemy that would squash the player in one mighty blow. Everquest eventually moved the "attack" command from 'a' to, IIRC, ctrl-a.
(Score: 2) by RS3 on Friday May 24 2024, @03:19AM
While I'm thinking of dd: if you're ever trying to recover a damaged hard drive by dd'ing the sick drive to a good one or to an image file, dd will halt on sector errors. Check out ddrescue.
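For example (device and file names are placeholders):

    # Plain dd gives up at the first read error unless told otherwise, and even
    # with conv=noerror it doesn't remember which regions it had to skip:
    dd if=/dev/sdb of=disk.img bs=64k conv=noerror,sync

    # ddrescue reads the good areas first, records bad regions in a mapfile,
    # and can be re-run later to retry just those regions:
    ddrescue -d /dev/sdb disk.img disk.map
    ddrescue -d -r3 /dev/sdb disk.img disk.map   # retry bad sectors up to 3 times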