SoylentNews is people

posted by martyb on Sunday May 23 2021, @12:16PM
from the slow-but-steady dept.

This is a follow-up to our site crash.

If you are just tuning in, SoylentNews experienced a database crash on 2021-05-20. We tried to restore from recent backups, but found they were corrupted and unusable. Thanks to heroic work by mechanicjay the site is back up and running!

Many thanks to mechanicjay for his 16-hour(!) day on Thursday to get us back up and then move us to a single back-end configuration. He didn't stop there, but has continued on gathering information and guiding work to get us to a more stable foundation. We'll keep you posted as to our progress.

Read on if interested... otherwise, another story will be along presently.

First off, as I understand it, we had previously been running with a cluster database (DB) configuration using ndb [1] and MySQL on two servers (fluorine and hydrogen). This has been troublesome in the past. Our options going forward appeared to be either to add yet another server (more redundancy apparently makes ndb less cranky) or to slim down, eliminate the cluster, and run with just a single server. At least for the time being, we have decided to go with the latter under the K.I.S.S. way of thinking.

It appears our database backups had issues because we had some free space on our server... but not enough to permit full and clean backups.

We now have a daily report e-mailed out to staff that lists: server name, disk space total, disk space used, and disk space available. This should help prevent a repeat occurrence of out-of-space happening on our servers.
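
For the curious, a report like that takes very little code. What follows is only a minimal sketch in Python, not the script staff actually runs: the function names, the choice of paths, and the 80% warning threshold are all assumptions made for illustration, and the e-mailing step is left out.

```python
import shutil
import socket


def format_bytes(n: int) -> str:
    """Render a byte count in GiB with one decimal place."""
    return f"{n / 2**30:.1f} GiB"


def disk_report(paths=("/",), warn_fraction=0.80):
    """Return one line per path: server name, total, used, and free space.

    warn_fraction is a hypothetical threshold; lines at or above it are flagged.
    """
    host = socket.gethostname()
    lines = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        used_fraction = usage.used / usage.total if usage.total else 0.0
        flag = "  <-- LOW SPACE" if used_fraction >= warn_fraction else ""
        lines.append(
            f"{host} {path}: total={format_bytes(usage.total)} "
            f"used={format_bytes(usage.used)} free={format_bytes(usage.free)} "
            f"({used_fraction:.0%} used){flag}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(disk_report())
```

Run from cron once a day, the printed text could be piped to whatever mail command a server already uses. Flagging the low-space lines keeps the one piece of information that matters from being buried in a report that looks the same every morning.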

Secondly, the DB restore basically forgot everything that happened from 2021-04-14 onward. (Insert pithy Monty Python Dead Parrot joke here.) Yeah, it stinks losing all those stories, journals, comments, and moderations. fnord666 and I had each lost ~150 stories we had edited and pushed out to the site.

So, one of the things we found out that disappeared was people's site subscriptions. By a happy coincidence, I just happened to have a screen up on the site listing the most recent subscriptions. We had a long discussion among staff as to how to proceed. First off, when the site is back and stable, we need a high priority code change: we need to log each subscription to someplace in addition to the DB.
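
To make the "log it somewhere else too" idea concrete, here is a rough sketch in Python of an append-only log kept alongside the DB. It is only an illustration and not the planned change: the site's own code is not Python, and the file format (one JSON record per line), the function names, and the field names are all assumptions.

```python
import json
import os
from datetime import datetime, timezone


def log_subscription(log_path, uid, amount_usd, days, processor):
    """Append one subscription record to a JSON-lines file and return it."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "uid": uid,
        "amount_usd": str(amount_usd),  # kept as text to avoid float rounding
        "days": days,
        "processor": processor,  # e.g. "stripe" or "paypal"
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to push the line to disk now
    return record


def read_subscriptions(log_path):
    """Read every logged record back, e.g. to replay them after a DB restore."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A log like this only helps if it lives somewhere the DB's problems cannot reach, so in practice it would need to be copied off the database server as well.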

Just to make things more interesting, since the site came back up, we have had some new subscriptions come in. The easiest and safest approach to restoring the lost subscriptions turned out to be straightforward, albeit tedious.

I went through all known subs that got dropped, and gave a "gift" subscription to replace them. These gifts were based on the minimums listed on the Subscribe page. For example, if your subscription was for $20.00 or more, you were gifted with a 365-day subscription. At least $12.00 but less than $20.00 would get a 180-day subscription. Lastly, any subscription for less than $12.00 received a 30-day gift subscription.
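
Spelled out as code, the rule above looks like this. It is just a sketch in Python of the thresholds as stated here (the function name is made up, and this was done by hand rather than by script):

```python
from decimal import Decimal


def gift_days(amount_usd) -> int:
    """Map a lost subscription's dollar amount to the length of the gift.

    $20.00 or more -> 365 days; $12.00 up to (not including) $20.00 -> 180 days;
    anything less than $12.00 -> 30 days.
    """
    amount = Decimal(str(amount_usd))  # Decimal avoids float rounding at the edges
    if amount >= Decimal("20.00"):
        return 365
    if amount >= Decimal("12.00"):
        return 180
    return 30
```

For example, `gift_days("19.99")` gives 180 and `gift_days("12.00")` also gives 180, since the lower bound of each tier is inclusive.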

There's one complication we are still dealing with. Two gift subscriptions were made between 2021-04-14 and 2021-05-20. I know in whose name and UID the gift was made, but not that of the user actually making the gift. (We know the "giftee" but not the "gifter".) In one case we *do* have an email address; I've reached out to that person via e-mail for more information. Sadly, I have no other identifying information for the other person.

tl;dr: please check your subscription. If you find a discrepancy, please send an e-mail to admin (at) soylentnews (dot) org and I will personally look into it for you. Please provide whatever information you can: date, amount, Stripe vs PayPal, etc.

Lastly, I've seen a strong increase in story submissions among other very encouraging signs. It's as if there's a change in attitude from "what can I get from the site" to "what can I contribute". Thanks everybody for all you've done!

[1] What is an NDB Cluster?


Original Submission

Related Stories

Meta: SoylentNews.org had a Site Crash and We're Back! (mostly) 184 comments

As many of you noticed, we had a site crash today, from around 1300 until 2200 UTC (2021-05-20).

A HUGE thank you goes to mechanicjay who spent the whole time trying to get our ndb (cluster) working again. It's an uncommon configuration, which made recovery especially challenging... there's just not a lot of documentation about it on the web.

I reached out and got hold of The Mighty Buzzard on the phone, then put him in touch with mechanicjay, who got us back up and running using backups.

Unfortunately, we had to go all the way back to April 14 to get a working backup. (I don't know all the details, but it appears something went sideways on neon.)

We're all wiped out right now. When we have rested and had a chance to discuss things, we'll post an update.

In the meantime, please join me in thanking mechanicjay and TMB for all they did to get us up and running again!

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Funny) by Anonymous Coward on Sunday May 23 2021, @12:48PM (4 children)

    by Anonymous Coward on Sunday May 23 2021, @12:48PM (#1137939)

    Hello, I am Prince Njubidu of Nigeria. I have sent a gift subscription to many of your subscribers last month and would like to verify that they received it. Please send me their Stripe, Paypal or other information and I will hasten to expedite the review.

    • (Score: 5, Insightful) by captain normal on Sunday May 23 2021, @01:50PM (3 children)

      by captain normal (2205) on Sunday May 23 2021, @01:50PM (#1137947)

      Is this the same AC that bashed TMB so much yesterday? What a waste of a seemingly good brain. I may not agree with some of Buzzy's politics, but he has given many hours (that he could have been fishing) on keeping the site going. I salute him for that.

      --
      When life isn't going right, go left.
      • (Score: 0) by Anonymous Coward on Sunday May 23 2021, @08:31PM (2 children)

        by Anonymous Coward on Sunday May 23 2021, @08:31PM (#1138020)

        Yes, the same. We are one.

        • (Score: 2) by Gaaark on Sunday May 23 2021, @08:45PM (1 child)

          by Gaaark (41) on Sunday May 23 2021, @08:45PM (#1138026) Journal

          One of us! One of us! One of US!

          --
          --- Please remind me if I haven't been civil to you: I'm channeling MDC. ---Gaaark 2.0 ---
          • (Score: 2) by kazzie on Monday May 24 2021, @06:04PM

            by kazzie (5309) Subscriber Badge on Monday May 24 2021, @06:04PM (#1138290)

            Sigh... Time to dig out the gramophone record of "Monkey on a String" on the Hammond Organ *again* [comedy.co.uk]...

  • (Score: 2, Funny) by Anonymous Coward on Sunday May 23 2021, @01:43PM (2 children)

    by Anonymous Coward on Sunday May 23 2021, @01:43PM (#1137946)

    All of this could have been avoided if you had only turned on the machine that goes 'Bing'. That's what it's for.

    • (Score: 4, Funny) by maxwell demon on Sunday May 23 2021, @05:32PM (1 child)

      by maxwell demon (1608) on Sunday May 23 2021, @05:32PM (#1137983) Journal

      The problem of course was that the backup got its favourite colour wrong, therefore it could not pass the bridge to the backup server.

      --
      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 3, Funny) by Gaaark on Sunday May 23 2021, @08:49PM

        by Gaaark (41) on Sunday May 23 2021, @08:49PM (#1138028) Journal

        Should'a stopped at the Castle of Aaargh!
        (Where you be seeing Gaaargh!)

        --
        --- Please remind me if I haven't been civil to you: I'm channeling MDC. ---Gaaark 2.0 ---
  • (Score: -1, Troll) by Anonymous Coward on Sunday May 23 2021, @03:07PM (6 children)

    by Anonymous Coward on Sunday May 23 2021, @03:07PM (#1137960)

    Inquiring minds want to know

    I smell betrayal

    • (Score: 1, Funny) by Anonymous Coward on Sunday May 23 2021, @04:00PM

      by Anonymous Coward on Sunday May 23 2021, @04:00PM (#1137965)

      That's your underpants. Always wear clean underwear.

    • (Score: 2, Funny) by Anonymous Coward on Sunday May 23 2021, @04:08PM

      by Anonymous Coward on Sunday May 23 2021, @04:08PM (#1137968)

      He has become The Mighty BUZZER, which is pressed only when something goes wrong. :-)

    • (Score: 3, Insightful) by Tork on Sunday May 23 2021, @05:34PM

      by Tork (3914) Subscriber Badge on Sunday May 23 2021, @05:34PM (#1137984)

      Obvious troll is obvious.

      --
      🏳️‍🌈 Proud Ally 🏳️‍🌈
    • (Score: 0) by Anonymous Coward on Sunday May 23 2021, @07:45PM (2 children)

      by Anonymous Coward on Sunday May 23 2021, @07:45PM (#1138009)

      We sent him to a farm in upstate New York.

      • (Score: 0) by Anonymous Coward on Sunday May 23 2021, @08:15PM (1 child)

        by Anonymous Coward on Sunday May 23 2021, @08:15PM (#1138015)

        funny farm?

        • (Score: 0) by Anonymous Coward on Monday May 24 2021, @01:03AM

          by Anonymous Coward on Monday May 24 2021, @01:03AM (#1138089)

          Now that he is there, it isn't funny anymore.

  • (Score: 0) by Anonymous Coward on Sunday May 23 2021, @04:19PM

    by Anonymous Coward on Sunday May 23 2021, @04:19PM (#1137973)

    heh, lol my browser (cache) still shows link to my (now in nirvana) comment ...

  • (Score: 5, Insightful) by choose another one on Sunday May 23 2021, @08:30PM (9 children)

    by choose another one (515) Subscriber Badge on Sunday May 23 2021, @08:30PM (#1138019)

    It appears our database backups had issues because we had some free space on our server... but not enough to permit full and clean backups.

    We now have a daily report e-mailed out to staff that lists: server name, disk space total, disk space used, and disk space available. This should help prevent a repeat occurrence of out-of-space happening on our servers.

    As someone who's had to deal with more than one enterprise customer having lost, in some cases weeks worth, valuable (I mean not that our commentary isn't valuable but...), data due to very similar occurrences I can categorically state that an emailed daily report is not enough. In fact a big red message all over the system UI (can you guess why we put that in?) when an admin logs in... is not enough. The slap-the-admin-user-around-the-face-with-a-wet-haddock feature was unfortunately put on hold due to protests from the haddock union who said the job was degrading (they'd seen some of the admins). We were considering optional interfacing with fire alarm systems but customers seemed to think their admins would fill up disks with porn junk to trigger extra fag breaks.

    There isn't really any solution, as the following possibly relevant Murphy's Law variants show:

    1. Murphy's Law of Transparent Failover: Any system that is redundant by failover so seamless that users do not notice, will failover without anyone noticing.*
    2. Murphy's Law of automated backups: Any backup system that completes without any user or admin action will fail without user or admin taking any action.**

    *You can add an explicit repeat-until-redundancy=0 around this per your favourite programming language, but it is implicit.

    **You can add "repeatedly" to both sides for emphasis, and also there is a subclause something like "and this failure will be at worst point of the regular test-restore schedule" - but apparently that doesn't apply in every customer environment (because test restore checks are not implemented).

    You can print these out and pin them to the server-room wall (not that anyone has a server room anymore). It won't stop it happening, but it may give some solace when it does that (a) many others have been there before you and (b) many others will go there after you.

    • (Score: 3, Interesting) by Anonymous Coward on Sunday May 23 2021, @08:54PM (4 children)

      by Anonymous Coward on Sunday May 23 2021, @08:54PM (#1138030)

      I second this. There is no value in a monitoring system that cries wolf every day, the end result is obvious: every admin will sooner or later null-route these e-mails, and you will be just as blind as before. Push-based messaging should be reserved for emergencies only.

      Here's what you will do:

      • create a simple site status page, showing only the most essential items: system uptime, 95-percentile request latency, ratio of 200 vs 500 http responses. Colour-coded like a traffic light seems a good start, based on your target metrics. You do have target metrics, don't you?
        All other monitoring data is stored for reference only, for ad-hoc querying if the admin needs it.
      • automate your restore tests and run them regularly. The status page will show a line for "time since last successful backup" and one for "time since last successful restore".
      • disable all sources of noise. The monitoring system sends a push message only when either a status page item turns red, or the status page hasn't been accessed for X hours/days.

      Trust me, the peanut gallery knows best.

      • (Score: 3, Insightful) by vux984 on Sunday May 23 2021, @09:49PM (3 children)

        by vux984 (5045) on Sunday May 23 2021, @09:49PM (#1138043)

        Push-based messaging should be reserved for emergencies only.

        The trouble with push-based messaging that only happens in emergencies is that, like anything, it breaks. But nobody knows it's broken until the emergency it was supposed to warn about happens, and that's when you discover the notifications weren't working.

        Unless you have regular emergencies; that'll keep your emergency notification system maintained, but then you clearly have other problems. :D

        Honestly, the above poster's suggestion of a color-coded 'dashboard' page is possibly the best solution -- just make sure you stick the widget on some page you all log into daily. If you have to remember to go look at the widget, you've lost.

        The only other possible suggestion I'd make is: make it someone else's problem. :)

        I use AWS Aurora (mysql) for some of our projects, for example, because I really don't have time to do the job of managing database (or email) servers properly. So instead I get clustering, read replicas, database backups, availability zones, snapshots, point-in-time recovery, and *someone else* is doing most of the work to keep it healthy. Sure, I still need to take responsibility for testing the backups from time to time to make sure they work, etc.

        It's orders of magnitude better than running a dedicated database virtual machine doing mysql dumps and sending them off somewhere once a day.

        Obviously cost is a huge factor here, and may be the deal breaker, but it's worth considering.

        • (Score: 1, Interesting) by Anonymous Coward on Monday May 24 2021, @01:06AM

          by Anonymous Coward on Monday May 24 2021, @01:06AM (#1138091)

          That is one reason why we simulate failures without telling the notified party that we are simulating failures.

        • (Score: 2) by PiMuNu on Monday May 24 2021, @08:39AM (1 child)

          by PiMuNu (3823) on Monday May 24 2021, @08:39AM (#1138182)

          If you don't test it, it doesn't work.

          • (Score: 0) by Anonymous Coward on Monday May 24 2021, @01:13PM

            by Anonymous Coward on Monday May 24 2021, @01:13PM (#1138206)

            Yes, this, 100%. If you want to guarantee the delivery of push-based monitoring messages, you include it in your monitoring setup. You can either set up an automated monitor account that's supposed to receive a push message every X hours, or you make it an admin action to push a button and receive a message, and push another button if the message has been received. Then your status page can show a "time since last successful push-message test" and you can stop worrying about it.

    • (Score: 2, Offtopic) by fakefuck39 on Sunday May 23 2021, @09:05PM (1 child)

      by fakefuck39 (6620) on Sunday May 23 2021, @09:05PM (#1138035)

      This is complete bs you're spouting here. I've been doing this for 23 years - storage and compute w/ a specialty in BC/DR. What you describe is a rare event because of someone's incompetence. It's not only not unavoidable, it's in fact hard to create that scenario, in an enterprise environment - a bunch of people had to purposely fuck up.

      Since you're clearly not in the field, let me explain the standard solution every enterprise deploys - I've been at most fortune 50 companies, and had probably 200 customers, where I design, sell, and administer storage solutions.

      You have snapshots of your prod disk. This is either on the array or on the host/hypervisor level. The box I was working with today has a snap every 15min, of the whole ~1PB of file shares. Those rotate every two days. Then there is a bi-hourly snap that rotates every week. Then there is a daily snap that rotates monthly. This all gets replicated to the DR site. That daily snap goes to a VTL (disk-based virtual tape library), gets deduped and compressed. That virtual tape library is replicated to the DR VTL. Weekly, a tape is cut and shipped offsite, where it stays for 2 years, and a monthly tape that stays for 7 years.

      This is standard and normal for an enterprise. It actually doesn't take up that much space, because it's snapshots/differentials, and they're dedupped and compressed against every freaking block on the array and the VTL.

      There are no daily email reports of status. There are traps set for alerts, such as 20% space left, and it shows up a red alarm, on a bunch of TV screens in the DC, that people are paid to sit and look at 24/7. Yes, an email also goes out - just when that disk is almost full. That array I worked on - it has hard and soft quotas. Meaning your alarm goes off when it's 80% full, then it gets to 100% full, but it can actually go to 200% full because it's oversubscribed via thin provisioning.

      So, in short, yes, let's take what you said, print it, and pin it to the "server-room" wall, so people who have been in a server room and know what they're talking about, can laugh at you, an amateur who has once taken a tour of a small commercial cage at a colo, does deskside support for a living, and decided to "chime in" with some clown-level advice. bbye now.

      • (Score: 0) by Anonymous Coward on Friday May 28 2021, @02:32PM

        by Anonymous Coward on Friday May 28 2021, @02:32PM (#1139634)

        Are you sure Raid 10 is not good enough?

    • (Score: 0) by Anonymous Coward on Sunday May 23 2021, @11:04PM

      by Anonymous Coward on Sunday May 23 2021, @11:04PM (#1138066)

      I vote for angry sticky notes on the fridge in ALL CAPS!!!

    • (Score: 2) by PiMuNu on Monday May 24 2021, @08:37AM

      by PiMuNu (3823) on Monday May 24 2021, @08:37AM (#1138181)

      I understand there was once a dev shop which had a usb-powered nerf cannon fire at anyone who broke the build. Trains them to duck every time they submit a patch I guess...

  • (Score: 2) by hendrikboom on Sunday May 23 2021, @08:55PM

    by hendrikboom (1125) Subscriber Badge on Sunday May 23 2021, @08:55PM (#1138031) Homepage Journal

    At this point I'm happy to tell you that you didn't lose my subscription. I was almost ready to start one, but I'll wait a few days until things are normal again.

  • (Score: 1, Interesting) by Anonymous Coward on Sunday May 23 2021, @09:28PM (3 children)

    by Anonymous Coward on Sunday May 23 2021, @09:28PM (#1138038)

    Based on my reading, NDB is for speed (kinda like raid 0) and Galera is made for redundancy (kinda like raid 1). So far, based on what I've learned, with the current site load here, soylentnews might be best served by going with Galera if you guys are looking for the best fault tolerance. Additionally you guys might want to unify ALL of your public facing services onto a single box and then replicate that box across at least a few data centers. You should have as many instances/copies of your services as you have VPSs and NO TWO VPSs IN THE SAME DATACENTER unless you have a strong need for it. At Linode, putting more than one VPS at the same datacenter might land you with multiple virtualized instances on the same physical host! You can imagine how this would be bad if that host were to fail. From there you should also have full images of each replicated host, and periodic copies made of any dynamic data (like your DBs). These should be made to and stored on NON public facing machines. These can also be VPSs, but if you go that route, you should have at least 2 VPS in different data centers, each holding a complete copy of all of your backups.

    Lastly, you need someone working on that stuff full time, who logs into the back-ends of these boxes to monitor things on a REGULAR BASIS. LOGGING INTO YOUR ADMIN CONTROL PANELS ISN'T ENOUGH!!!!! This person should be comfortable with ssh, vim, and raw logfiles. As another poster pointed out, there is no such thing as set it and forget it in the IT world, unless of course you are fans of unplanned downtime and enjoy regular occurrences of random data loss...

    • (Score: 0) by Anonymous Coward on Sunday May 23 2021, @11:11PM

      by Anonymous Coward on Sunday May 23 2021, @11:11PM (#1138067)

      NDB and Galera are more similar than they are different, which makes sense because they came from the same roots in MySQL. NDB and Galera increase both speed and redundancy. The problem is that they didn't set it up properly. I don't know exactly what they mean by "add yet another server" because I haven't read any of the IRC exchanges and because they are already in a fully redundant setup, which has five nines uptime guarantees when set up correctly. But they don't have it set up correctly, so they do not have said uptime, and I don't know which type of node(s) they are thinking they need on that new server so I can't really say if it even would help. Not that it matters anymore, since they are migrating away from it.

    • (Score: 1) by shrewdsheep on Tuesday May 25 2021, @07:50AM (1 child)

      by shrewdsheep (5215) on Tuesday May 25 2021, @07:50AM (#1138497)

      From a total site perspective, I disagree. Bring the infrastructure cost down. Given that most (all?) downtime was not due to hardware failures, consolidate the server to a single instance. But back up and stream the database logs, not the database itself. Automate recovery from the database logs. Maybe take a yearly database backup to increase recovery speed. Then a full site restore is possible within a few hours at most. No server "poking and bouncing" needed that we read about so often.

      • (Score: 0) by Anonymous Coward on Tuesday May 25 2021, @09:49AM

        by Anonymous Coward on Tuesday May 25 2021, @09:49AM (#1138509)

        FWIW, they had to do the "bouncing" before they originally switched to the NDB cluster. It isn't completely uncommon when using a perl stack similar to theirs but most people cron it or watchdog it so they don't have to think about it.

  • (Score: 2) by corey on Sunday May 23 2021, @09:34PM

    by corey (2202) on Sunday May 23 2021, @09:34PM (#1138040)

    Due to being super busy, I didn’t tune in on Thursday (Friday here), but thanks to mechanicjay for his/her effort!

  • (Score: 3, Informative) by DeVilla on Sunday May 23 2021, @10:19PM (4 children)

    by DeVilla (5354) on Sunday May 23 2021, @10:19PM (#1138054)

    Is it worth trying to restore old articles from RSS? It won't recreate the magical conversations that were had, but I can probably pull the articles from feedly if we don't wait too long.

    • (Score: 0) by Anonymous Coward on Sunday May 23 2021, @11:56PM (2 children)

      by Anonymous Coward on Sunday May 23 2021, @11:56PM (#1138074)

      Hasn't the site been indexed by Google too? It won't bring it all back easily, but the info might be there. The Wayback Machine might have it too.

    • (Score: 0) by Anonymous Coward on Monday May 24 2021, @05:32AM

      by Anonymous Coward on Monday May 24 2021, @05:32AM (#1138152)

      Pull it and decide later.

      Inverting it is maybe a day's script. Maybe nobody will care. Maybe someone will!

  • (Score: 2) by DannyB on Monday May 24 2021, @09:50PM (1 child)

    by DannyB (5839) Subscriber Badge on Monday May 24 2021, @09:50PM (#1138365) Journal


    --
    People today are educated enough to repeat what they are taught but not to question what they are taught.
    • (Score: 1, Interesting) by Anonymous Coward on Tuesday May 25 2021, @12:52AM

      by Anonymous Coward on Tuesday May 25 2021, @12:52AM (#1138422)

      I love the juxtaposition of your comment subject and comment signature.
