posted by janrinok on Saturday June 01 2024, @09:28AM   Printer-friendly
from the D'oh! dept.

https://arstechnica.com/gadgets/2024/05/google-cloud-explains-how-it-accidentally-deleted-a-customer-account/

Previously on SoylentNews: "Unprecedented" Google Cloud Event Wipes Out Customer Account and its Backups - 20240521

Earlier this month, Google Cloud experienced one of its biggest blunders ever, when UniSuper, a $135 billion Australian pension fund, had its Google Cloud account wiped out due to some kind of mistake on Google's end. At the time, UniSuper indicated it had lost everything it had stored with Google, even its backups, and that caused two weeks of downtime for its 647,000 members. There were joint statements from the Google Cloud CEO and UniSuper CEO on the matter, a lot of apologies, and presumably a lot of worried customers who wondered if their retirement fund had disappeared.

[...] Two weeks later, Google Cloud's internal review of the problem is finished, and the company has a blog post up detailing what happened.

Google has a "TL;DR" at the top of the post, and it sounds like a Google employee got an input wrong.

During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer's GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period. The incident trigger and the downstream system behavior have both been corrected to ensure that this cannot happen again.
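In other words, this is a defaults trap: an optional field that, when left blank, is read as "use a fixed term" rather than "no expiry." As a purely hypothetical sketch (none of the names or the one-year figure below come from Google's post), the failure mode looks something like this:

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical illustration of the failure mode Google describes: an
# optional provisioning parameter that, when left blank, silently falls
# back to a fixed term with automatic deletion at the end of it.
DEFAULT_TERM_DAYS = 365  # assumed default; the real value was not disclosed

@dataclass
class PrivateCloudRequest:
    customer_id: str
    term_days: Optional[int] = None  # the operator left this blank

def provision(request: PrivateCloudRequest) -> dict:
    # The dangerous interpretation: "blank" means "use the default fixed
    # term" rather than "no expiry", so a deletion date gets scheduled
    # without anyone asking for one.
    term = request.term_days if request.term_days is not None else DEFAULT_TERM_DAYS
    delete_at = datetime.now(timezone.utc) + timedelta(days=term)
    return {"customer": request.customer_id, "delete_at": delete_at.isoformat()}

print(provision(PrivateCloudRequest(customer_id="example-fund")))
# The private cloud now carries a deletion date roughly a year out,
# even though nobody chose a term.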

[...] In its post-mortem, Google now says, "Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and, along with third party backup software, were instrumental in aiding the rapid restoration." It's hard to square that with UniSuper's earlier indication that its backups were lost, especially given the two-week recovery period. The goal of a backup is to enable a quick restoration, so either UniSuper's backups survived the deletion but weren't effective, leading to two weeks of downtime, or they would have been effective had they not been partially or completely wiped out.

[...] Google says Cloud still has "safeguards in place with a combination of soft delete, advance notification, and human-in-the-loop, as appropriate," and it confirmed these safeguards all still work.
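Google doesn't spell out how those safeguards fit together. As a generic, hypothetical sketch of the pattern (the 30-day window and all names here are assumptions, not Google's), a deletion first soft-deletes and notifies the owner, and data is only destroyed after the window passes and a human signs off:

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=30)  # assumed grace period, not Google's actual figure

@dataclass
class Resource:
    name: str
    soft_deleted_at: Optional[datetime] = None
    approved_by: Optional[str] = None

def request_deletion(res: Resource, notify) -> None:
    # Soft delete: mark the resource and tell the owner; keep the data.
    res.soft_deleted_at = datetime.now(timezone.utc)
    notify(f"{res.name} is scheduled for deletion on "
           f"{(res.soft_deleted_at + RETENTION).date()}; reply to cancel.")

def hard_delete(res: Resource) -> bool:
    # Human-in-the-loop: data is destroyed only after the retention
    # window has elapsed *and* a named person has signed off.
    if res.soft_deleted_at is None or res.approved_by is None:
        return False
    if datetime.now(timezone.utc) < res.soft_deleted_at + RETENTION:
        return False
    print(f"hard-deleting {res.name}, approved by {res.approved_by}")
    return True

vm = Resource("gcve-private-cloud")
request_deletion(vm, notify=print)
vm.approved_by = "operator@example.com"
print(hard_delete(vm))  # False: the 30-day window has not elapsed yet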


Original Submission

Related Stories

“Unprecedented” Google Cloud Event Wipes Out Customer Account and its Backups 26 comments

https://arstechnica.com/gadgets/2024/05/google-cloud-accidentally-nukes-customer-account-causes-two-weeks-of-downtime/

Buried under the news from Google I/O this week is one of Google Cloud's biggest blunders ever: Google's Amazon Web Services competitor accidentally deleted a giant customer account for no reason. UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service. UniSuper thankfully had some backups with a different provider and was able to recover its data, but according to UniSuper's incident log, downtime started May 2, and a full restoration of services didn't happen until May 15.

UniSuper's website is now full of must-read admin nightmare fuel about how this all happened. First is a wild page posted on May 8 titled "A joint statement from UniSuper CEO Peter Chun, and Google Cloud CEO, Thomas Kurian." This statement reads, "Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper's Private Cloud services ultimately resulted in the deletion of UniSuper's Private Cloud subscription. This is an isolated, 'one-of-a-kind occurrence' that has never before occurred with any of Google Cloud's clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again."

[...] A June 2023 press release touted UniSuper's big cloud migration to Google, with Sam Cooper, UniSuper's Head of Architecture, saying, "With Google Cloud VMware Engine, migrating to the cloud is streamlined and extremely easy. It's all about efficiencies that help us deliver highly competitive fees for our members."

[...] The second must-read document in this whole saga is the outage update page, which contains 12 statements as the cloud devs worked through this catastrophe. The first update is May 2 with the ominous statement, "You may be aware of a service disruption affecting UniSuper's systems." UniSuper immediately seemed to have the problem nailed down, saying, "The issue originated from one of our third-party service providers, and we're actively partnering with them to resolve this." On May 3, Google Cloud publicly entered the picture with a joint statement from UniSuper and Google Cloud saying that the outage was not the result of a cyberattack.

[...] The joint statement and the outage updates are still not a technical post-mortem of what happened, and it's unclear if we'll get one. Google PR confirmed in multiple places it signed off on the statement, but a great breakdown from software developer Daniel Compton points out that the statement is not just vague, it's also full of terminology that doesn't align with Google Cloud products. The imprecise language makes it seem like the statement was written entirely by UniSuper. It would be nice to see a real breakdown of what happened from Google Cloud's perspective, especially when other current or potential customers are going to keep a watchful eye on how Google handles the fallout from this.

Anyway, don't put all your eggs in one cloud basket.


Original Submission

This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 3, Interesting) by Anonymous Coward on Saturday June 01 2024, @10:35AM (3 children)

    by Anonymous Coward on Saturday June 01 2024, @10:35AM (#1358962)

    Maybe "move fast and fuck things up" coding needs to handle service input errors better then.

    • (Score: -1, Troll) by Anonymous Coward on Saturday June 01 2024, @01:28PM

      by Anonymous Coward on Saturday June 01 2024, @01:28PM (#1358982)

      Maybe "move fast and fuck things up" coding needs to handle service input errors better then...

      SoylentNews?

    • (Score: 1) by khallow on Monday June 03 2024, @03:28PM (1 child)

      by khallow (3766) Subscriber Badge on Monday June 03 2024, @03:28PM (#1359189) Journal

      Was that the problem here? I grant that Google has a huge field of things that moved fast, broke, and are no longer supported. But this doesn't seem to be one of their broken toys?

      • (Score: 2) by evilcam on Wednesday June 05 2024, @06:08AM

        by evilcam (3239) on Wednesday June 05 2024, @06:08AM (#1359389)

        From reading a few sources and Mastodon threads a few weeks back, I think what happened is that the input that was left blank was something like an Expiry date within the API to create a Google Cloud customer account.

        Normally this is set to null or some time very far in the future, but for whatever reason, when the Cloud VM Engine does its thing and sees a null value, it sets it to a year. I don't know why that would be the default limit; maybe the intent of the product is to spin up a greenfield staging ground before the hosts are moved to an established Google Cloud tenancy... But yeah, an engineer didn't understand how one of the thousands of tools Google Cloud engineers must use actually works, and it bit the customer hard. They are extremely fortunate in that the information necessary to rebuild those systems (as well as the data itself) was available. UniSuper has about $125Bn AUD in assets under management in trust for people's retirement - this ranks it in the top 25 US pension funds [pionline.com] (we do things differently in the antipodes) at about $83Bn USD in assets.
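        A minimal sketch of the safer behaviour described above, assuming the blank field really was an expiry date (none of these names are real GCVE API parameters): reject a blank value or require an explicit "never", rather than silently substituting a fixed term.

        from datetime import datetime
        from typing import Optional

        NEVER = datetime.max  # sentinel meaning "no scheduled deletion"

        def parse_expiry(raw: Optional[str]) -> datetime:
            # Fail closed on destructive behaviour: a blank field should never
            # quietly turn into "delete in a year".
            if raw is None or raw.strip() == "":
                raise ValueError("expiry must be set explicitly; use 'never' for no expiry")
            if raw.strip().lower() == "never":
                return NEVER
            return datetime.fromisoformat(raw)

        print(parse_expiry("never"))       # datetime.max, i.e. no scheduled deletion
        print(parse_expiry("2025-06-01"))  # an explicit, auditable term
        # parse_expiry("") raises ValueError instead of silently defaulting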

        I think the part that is honestly understated is the need to update your DR plans after moving to the cloud to mitigate these sorts of data loss and disaster scenarios. There's a tendency to say "It's in the cloud, Google/Microsoft/Amazon/someone else is managing our DR" because there's an SLA associated with availability or RTO/RPO. What I suspect many service owners fail to understand is that the shared responsibility model isn't all-encompassing, and there are still risks associated with using a third party's data centre. As a professional in this domain, it's all too common to ask questions during a project like "How is the system backed up?" and get an answer like "It's SaaS, the vendor handles that." Which is, like, cool, but how do you know it's going to protect you? Are you accepting the risk? Because if your tenancy does get shanghaied irrecoverably, then what? Your business could be kaput. In the weeks it takes you to recover, what happens to your customers? To your staff? To your shareholders? As a board member you might be held personally liable. As a CTO, this could completely destroy your reputation. You may have some recourse through the courts, but all those EULAs and contracts that many of us never read will have carve-outs for liability, and they won't be for the benefit of the customer...

        Ultimately, we as IT professionals need to have robust DR and backup strategies. And we need to advocate for the funding to put those mitigations in place, because the impact of a disaster like the above is catastrophic and, for many businesses, an existential threat. And if the firm you work for refuses to heed these warnings, we as employees must decide whether remaining in its employ is worth the risk to our careers.
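        On the DR point above, restorability is something to test rather than assume. Here is a generic, hypothetical restore drill (stubbed providers, no real vendor API) that checks a recent backup copy exists with an independent provider and that a sample restore actually round-trips:

        import hashlib
        from datetime import datetime, timedelta, timezone

        MAX_AGE = timedelta(days=1)  # assumed RPO for the drill

        def verify_backup(fetch_latest, restore_sample, expected_sha256):
            # fetch_latest() -> (timestamp, backup_id) of the newest backup copy
            # restore_sample(backup_id) -> bytes of a known test object restored from it
            taken_at, backup_id = fetch_latest()
            if datetime.now(timezone.utc) - taken_at > MAX_AGE:
                print(f"FAIL: newest backup {backup_id} is older than the agreed RPO")
                return False
            restored = restore_sample(backup_id)
            if hashlib.sha256(restored).hexdigest() != expected_sha256:
                print(f"FAIL: restored sample from {backup_id} does not match the source")
                return False
            print(f"OK: backup {backup_id} is fresh and restorable")
            return True

        # Drill with stubbed providers standing in for the second backup vendor:
        payload = b"known test object"
        verify_backup(
            fetch_latest=lambda: (datetime.now(timezone.utc), "backup-0042"),
            restore_sample=lambda backup_id: payload,
            expected_sha256=hashlib.sha256(payload).hexdigest(),
        )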
