posted by Fnord666 on Friday February 03 2017, @06:39AM   Printer-friendly
from the 5-backup-strategies-weren't-enough dept.

Ruby Paulson at BlogVault reports

GitLab, the online tech hub, is facing issues as a result of an accidental database deletion that happened in the wee hours of last night. A tired, frustrated system administrator thought that deleting a database would solve the lag-related issues that had cropped up... only to discover too late that he'd executed the command for the wrong database.

[...] It's certainly freaky that all five of the backup solutions GitLab had were ineffective, but this incident demonstrates that a number of things can go wrong with backups. The real aim of any backup solution is to be able to restore data with ease... but simple oversights could render backup solutions useless.
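
A recurring lesson from incidents like this is that a backup is untested until it has actually been restored somewhere. Below is a minimal sketch of an automated restore test, assuming a PostgreSQL custom-format dump and the standard createdb/pg_restore/psql client tools; the paths, database names, and sanity-check table are placeholders, not GitLab's actual setup.

    #!/usr/bin/env python3
    # Automated restore test: a backup that has never been restored is not
    # yet a backup. Paths, database names, the custom-format dump, and the
    # sanity-check table are placeholders, not GitLab's actual setup.
    import subprocess
    import sys

    DUMP_FILE = "/backups/latest.dump"  # hypothetical newest pg_dump custom-format dump
    SCRATCH_DB = "restore_test"         # throwaway database used only for this check

    def run(cmd):
        # Fail loudly so a broken restore cannot go unnoticed.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def main():
        # Recreate the scratch database from nothing on every run.
        run(["dropdb", "--if-exists", SCRATCH_DB])
        run(["createdb", SCRATCH_DB])

        # Restore the most recent dump into the scratch database.
        run(["pg_restore", "--no-owner", "--dbname", SCRATCH_DB, DUMP_FILE])

        # Minimal sanity check: the restored data should not be empty.
        out = subprocess.run(
            ["psql", "-d", SCRATCH_DB, "-t", "-A",
             "-c", "SELECT count(*) FROM projects;"],  # 'projects' is a placeholder table
            check=True, capture_output=True, text=True,
        )
        rows = int(out.stdout.strip())
        if rows == 0:
            sys.exit("Restore test FAILED: dump restored but contained no rows")
        print(f"Restore test passed: {rows} rows in sanity-check table")

    if __name__ == "__main__":
        main()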

Computer Business Review adds

The data loss took place when a system administrator accidentally deleted a directory on the wrong server during a database replication process. A folder containing 300GB of live production data was completely wiped.

[...] The last potentially useful backup was taken six hours before the issue occurred.

However, that snapshot is of limited help: snapshots are normally taken only every 24 hours, and because the deletion occurred six hours after the previous one, roughly six hours of data were lost.

David Mytton, founder and CEO [of] Server Density, said: "This unfortunate incident at GitLab highlights the urgent need for businesses to review and refresh their backup and incident handling processes to ensure data loss is recoverable, and teams know how to handle the procedure."

GitLab has been updating a Google Doc with info on the ongoing incident.

Additional coverage at:
TechCrunch
The Register


Original Submission

 
  • (Score: 4, Interesting) by bzipitidoo (4388) on Friday February 03 2017, @06:22PM (#462518) Journal

    I was working for a small startup as a sysadmin. We had a db admin and a network admin. The rest of the technical people were all developers.

    To make their lives a tiny bit more convenient, the devs demanded that the db not be protected with passwords. Even though they would all know the passwords, they didn't want the bother. The db admin protested, but management overruled him. So he did what he could to protect us: he had snapshots backed up daily, and had logs stored on a separate server. One of the devs had cobbled together an update system in which he logged into the server holding the source code, aimed it at the server to be updated merely by typing its IP address on the command line, and hit enter to start the process. One of its options was to erase the db. You can see where all this is going, I'm sure.

    Yep, it happened. One ordinary work morning, I was discussing some minor technical matter with our db admin, when suddenly the company's website quit working. There was a mad scramble to figure out what had happened. Had we just been hacked? Our db admin immediately saw that our production database was gone. I saw all the servers were still working, and started checking if any intruders had somehow managed to penetrate, while trying to ensure I didn't get kicked off. If intruders had gained root access, they could kick me off by killing any number of processes, like the bash shell or the ssh client. Soon however, the dev who'd made the update system confessed. He meant to update the test servers with a fresh, clean database, and had accidentally aimed at the production servers instead. Oops.

    Then we discovered the next problem. Thanks to a shortage of hard drive space, another dev had disabled the daily db backup! Our most recent db backup was a week old. Took our db admin all day to recover enough data to get us back online. Once he had the databases working again, he turned to the log files. Took him 3 weeks to find the points in the logs where the week-old backup left off, and to run all the commands that had accumulated since.
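
    What the db admin did by hand is essentially point-in-time recovery: restore the last good snapshot, then replay everything logged after it. A minimal sketch of that replay step follows; the log format (ISO timestamp, a tab, then the SQL) and the use of sqlite3 are assumptions made only to keep the example self-contained, since the real recovery was three weeks of manual work against a production database.

        #!/usr/bin/env python3
        # Sketch of the replay step: with the week-old snapshot already restored,
        # re-run every statement logged after it, in order. The log format and
        # the use of sqlite3 are assumptions for the sake of a runnable example.
        import sqlite3
        from datetime import datetime

        SNAPSHOT_TIME = datetime.fromisoformat("2017-01-20T03:00:00")  # when the restored backup was taken
        LOG_FILE = "statements.log"   # hypothetical statement log kept on the separate server
        DB_FILE = "restored.db"       # database already restored from the week-old snapshot

        def replay():
            conn = sqlite3.connect(DB_FILE)
            replayed = 0
            with open(LOG_FILE) as log:
                for line in log:
                    # Each line: "<ISO timestamp>\t<SQL statement>"
                    stamp, _, sql = line.rstrip("\n").partition("\t")
                    if not sql:
                        continue  # skip malformed lines
                    if datetime.fromisoformat(stamp) <= SNAPSHOT_TIME:
                        continue  # already contained in the snapshot
                    conn.execute(sql)
                    replayed += 1
            conn.commit()
            conn.close()
            print(f"Replayed {replayed} statements logged after {SNAPSHOT_TIME}")

        if __name__ == "__main__":
            replay()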

    I took over the design of the update system. Started from scratch and created one that was a pull rather than a push. Log into the servers to be updated, and run my scripts from there. Made a lot of other improvements, so that, for instance, we no longer had 20 minutes of downtime during an update. Downtime was less than 1 second the way I set things up, and we could as quickly revert to the previous version, whereas the original update system destroyed the previous installation. In short, the first update system was a terrible, rushed hack and it was easy to do far better.
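
    One common way to get sub-second cutover and instant rollback from a pull-style deploy like this is to unpack each release into its own directory and flip a "current" symlink atomically. The sketch below follows that pattern; all paths, the repository URL, and the release layout are invented for illustration and are not the actual system described here.

        #!/usr/bin/env python3
        # Pull-style deploy sketch: each release gets its own directory and a
        # "current" symlink is flipped atomically, so cutover is sub-second and
        # rolling back is just another flip. Paths, the repository URL, and the
        # layout are invented for illustration.
        import os
        import subprocess
        import time

        RELEASES_DIR = "/srv/app/releases"        # hypothetical layout
        CURRENT_LINK = "/srv/app/current"         # what the web server actually serves
        REPO_URL = "https://example.com/app.git"  # placeholder source repository

        def point_current_at(release_dir):
            # Build the new symlink next to the old one, then rename over it.
            # os.replace() is an atomic rename on POSIX, so readers never see
            # a half-updated tree.
            tmp_link = CURRENT_LINK + ".new"
            if os.path.lexists(tmp_link):
                os.remove(tmp_link)
            os.symlink(release_dir, tmp_link)
            os.replace(tmp_link, CURRENT_LINK)

        def deploy():
            # Pull the code onto *this* host -- no pushing at a typed-in IP.
            release = os.path.join(RELEASES_DIR, time.strftime("%Y%m%d%H%M%S"))
            subprocess.run(["git", "clone", "--depth", "1", REPO_URL, release], check=True)
            point_current_at(release)
            print(f"Now serving {release}")

        def rollback():
            # Older releases stay on disk, so reverting is one more symlink flip.
            releases = sorted(os.listdir(RELEASES_DIR))
            if len(releases) < 2:
                raise SystemExit("No earlier release to roll back to")
            # Assume the newest directory is the one currently being served.
            point_current_at(os.path.join(RELEASES_DIR, releases[-2]))
            print(f"Rolled back to {releases[-2]}")

        if __name__ == "__main__":
            deploy()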

    After that, our db admin was allowed to protect the database with passwords. As a colleague at another company put it, they didn't deserve to get their data back after such reckless handling. Nevertheless, lesson learned. A further point is that maybe our db admin should have protested more strenuously, even going so far as to threaten to quit if they didn't change their minds on allowing password protection. You never want to turn to the nuclear threat of quitting, but for that issue I feel it was justified. The lack of that tiny little inconvenience was a major threat to his ability to do his job.
