Meta
posted by martyb on Friday May 21 2021, @12:25AM

As many of you noticed, we had a site crash today, lasting from around 1300 until 2200 UTC (2021-05-20).

A HUGE thank you goes to mechanicjay, who spent the whole time trying to get our NDB cluster working again. It's an uncommon configuration, which made recovery especially challenging... there's just not a lot of documentation about it on the web.

I reached out and got hold of The Mighty Buzzard on the phone, then put him in touch with mechanicjay, who got us back up and running from backups.

Unfortunately, we had to go all the way back to April 14 to find a working backup. (I don't know all the details, but it appears something went sideways on neon.)

We're all wiped out right now. When we have rested and had a chance to discuss things, we'll post an update.

In the meantime, please join me in thanking mechanicjay and TMB for all they did to get us up and running again!

 
This discussion has been archived. No new comments can be posted.
  • (Score: 5, Interesting) by bzipitidoo (4388) on Friday May 21 2021, @04:52PM (#1137588) Journal

    You ever done any sysadmin work? I have. Data safety takes more space than you seem to realize. There is never enough space for all the things you ought to do to keep the data safe. You have to make compromises. RAIDs protect you from hard drive failures, but little else. They aren't backups, no matter how much some people feel they are. You need to keep daily backups, real ones, actual complete copies of the data, on separate machines, preferably in separate facilities, because there are disasters that can wipe out or disable an entire server farm.
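
    A minimal sketch of the "real copy, on a separate machine" part, assuming a MySQL-style database; the database name, directories, and remote host are made-up placeholders, not anything SoylentNews actually runs:

        #!/usr/bin/env python3
        """Nightly dump-and-ship sketch: take a full copy of the data and move it
        to a different machine, because RAID on the database box protects you
        from a dead disk and little else."""

        import datetime
        import subprocess

        DB_NAME = "slashdb"                                   # hypothetical database
        DUMP_DIR = "/var/backups/mysql"                       # local staging directory
        REMOTE = "backup@offsite.example.org:/srv/backups/"   # separate facility

        def nightly_backup() -> str:
            stamp = datetime.date.today().isoformat()
            dump_path = f"{DUMP_DIR}/{DB_NAME}-{stamp}.sql"

            # Full, consistent dump of the database to a local file.
            with open(dump_path, "wb") as out:
                subprocess.run(["mysqldump", "--single-transaction", DB_NAME],
                               stdout=out, check=True)

            # Ship the copy off-box; a disaster that takes out this server
            # (or the whole rack) can't take the backup with it.
            subprocess.run(["rsync", "-a", dump_path, REMOTE], check=True)
            return dump_path

        if __name__ == "__main__":
            print("backed up to", nightly_backup())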

    No, you DON'T throw away yesterday's backup the moment you have finished creating today's. You keep it for a while. What you want to do is keep 7 days' worth of backups, and as each one passes the age of 7 days, delete it, keeping one per week to be your weekly backup. Keep 4 weekly backups, promote one to be the monthly backup, and delete the other 3 after a month. That way, when someone discovers a mistake, perhaps 3 weeks later, that they accidentally deleted the wrong file or edited out some crucial verbiage, you can go back and get a copy from before the mistake was made. At any moment you're going to have at least a dozen or two dozen backups on hand, and that takes a lot of storage space. A lot of companies, particularly small ones, can't do it; they don't have the expertise, the equipment, or the money.
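
    One reasonable reading of that 7-daily / 4-weekly / monthly rotation, sketched as a pruning pass over a directory of dated dump files; the file naming scheme, the choice of Sunday as the weekly copy, and keeping monthlies indefinitely are all assumptions:

        #!/usr/bin/env python3
        """Retention sketch: keep every backup under a week old, keep one per
        week (Sundays here) for about a month, and keep one per month beyond
        that.  Everything else gets deleted."""

        import datetime
        import pathlib
        import re

        BACKUP_DIR = pathlib.Path("/var/backups/mysql")         # hypothetical
        NAME_RE = re.compile(r".*-(\d{4}-\d{2}-\d{2})\.sql$")    # e.g. slashdb-2021-05-20.sql

        def to_keep(dates, today):
            """Return the subset of backup dates worth keeping."""
            keep = set()
            for d in dates:
                age = (today - d).days
                if age < 7:
                    keep.add(d)        # daily tier: anything under a week old
                elif age < 28 and d.weekday() == 6:
                    keep.add(d)        # weekly tier: the Sunday copy, up to ~4 weeks
                elif d.day == 1:
                    keep.add(d)        # monthly tier: the first-of-month copy
            return keep

        def prune(today=None):
            today = today or datetime.date.today()
            backups = {}
            for path in BACKUP_DIR.glob("*.sql"):
                m = NAME_RE.match(path.name)
                if m:
                    backups[datetime.date.fromisoformat(m.group(1))] = path
            keep = to_keep(backups.keys(), today)
            for d, path in sorted(backups.items()):
                if d not in keep:
                    print("pruning", path)
                    path.unlink()

        if __name__ == "__main__":
            prune()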

    What do you do when you run out of space? I was at a company when that happened, and the person who ran out of space took matters into his own hands. He deleted one of the backups, and TURNED OFF THE BACKUP PROCESSES. He had the authority to do that. I wasn't notified, nor did I receive any warnings from the monitoring software that alerts people when something is wrong with the backups, because he had turned that off too. He didn't tell anyone, and he should have. That was the state of affairs for a week. Then it happened. Disaster. Another developer screwed up and accidentally erased our entire production database. How did that happen, you might wonder? The developers had demanded, for their convenience and against the database admin's strenuous protests, no password protection on the database. The admin wanted to at least have that, but he was overruled.

    So this developer thought he was erasing and rebuilding the test database, and didn't realize until too late that he had his little script pointed at production. DROP TABLE on all the tables. Asking for a password would have stopped it. I was on a call with the database admin when it happened. We turned to check some little something on the website, only to discover it was gone! It had been working fine just a moment before. Then there was a mad scramble to find out what the H was going on. Our first thought was that we'd somehow been hacked. But it soon became clear that if it was a hack job, it was a very weird one: all the servers were left intact and running, and all the login credentials were unchanged. I found no evidence of intrusion anywhere. Soon the guilty developer confessed. Then it was discovered that the backups had been turned off. Our newest backup was a week old. The only thing that let the database admin recover was that all the database activity had been logged on yet another machine. It took him 3 weeks to replay the logs and bring the week-old backup up to date, but he did it.
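
    The password prompt the admin wanted is one tripwire; another cheap one is having the rebuild script itself refuse destructive SQL against anything not on an explicit test-host allowlist. A sketch, with invented hostnames:

        #!/usr/bin/env python3
        """Sketch of a tripwire for rebuild scripts: destructive statements are
        only allowed against hosts on an explicit test allowlist.  The hostnames
        are invented for the example."""

        import sys

        TEST_HOSTS = {"db-test.internal", "localhost"}        # hypothetical allowlist
        DESTRUCTIVE = ("drop ", "truncate ", "delete from ")

        def guard(host: str, statement: str) -> None:
            """Abort the whole script if a destructive statement targets a non-test host."""
            if statement.strip().lower().startswith(DESTRUCTIVE) and host not in TEST_HOSTS:
                sys.exit(f"REFUSING {statement.split()[0]!r} against {host!r}: "
                         f"destructive SQL is only allowed on {sorted(TEST_HOSTS)}")

        if __name__ == "__main__":
            guard("db-test.internal", "DROP TABLE comments")   # fine: test host
            guard("db-prod.internal", "DROP TABLE comments")   # exits before any damage

    And if those activity logs happened to be MySQL binary logs, the admin's three-week slog amounts to what MySQL calls point-in-time recovery: restore the old dump, then replay the logged statements forward.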

    Starting Score: 1 | Moderation: +3 (Interesting=3) | Karma Bonus: +1 | Total Score: 5