The Mighty Buzzard writes:
Yeah, so, failure to babysit the db node that was scheduled for a reboot on the 5th resulted in a bit of database FUBAR that left us temporarily losing everything from then to now. Fortunately we had a backup less than six hours old, restored from it, and appear to be copacetic now. Except for the missing five hours and change.
I'd usually make some sort of dumb joke here but it was already four hours past my bedtime when I found out about the problem. My brain is no work good anymore. Fill in whatever dad joke or snark about getting a do-over for a change strikes your fancy.
(Score: 2, Interesting) by Anonymous Coward on Sunday August 09 2020, @09:43AM (10 children)
I don't know if this is confirmation bias, y'all being more public about this, or an actual increase, but you guys seem to keep having problems related to the database processes as of late. Perhaps you should think about adding a watchdog daemon to your system, giving the database itself some maintenance and optimization, making sure everything is up to date, and checking your logs for some sort of attack on your system.
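A watchdog daemon along those lines boils down to a polling loop: check the service, and after a few consecutive failures run a recovery action. A minimal, self-contained sketch, where the check and recovery callables are stand-ins (a real one would probe mysqld and restart it through the init system):

```python
import time

def watchdog(check, recover, rounds, max_failures=3, interval=0):
    """Poll check() up to `rounds` times; after `max_failures`
    consecutive failures, call recover() and reset the counter.
    Returns how many times recovery was triggered."""
    failures = recoveries = 0
    for _ in range(rounds):
        if check():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                recover()  # e.g. restart the db service, page someone
                recoveries += 1
                failures = 0
        if interval:
            time.sleep(interval)
    return recoveries

# Toy demo: a "database" that stays down until recovery kicks it.
state = {"up": False}
hits = watchdog(lambda: state["up"],
                lambda: state.update(up=True),
                rounds=5, max_failures=3)
print(hits)  # → 1 (recovered once, then the remaining checks pass)
```

A production version would check with a real connection attempt (something like `mysqladmin ping`) and sleep a sane interval between polls instead of flipping a flag.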
(Score: 5, Interesting) by The Mighty Buzzard on Sunday August 09 2020, @01:45PM (9 children)
Funny how the db clustering system that's supposed to save us headaches has caused significant data loss twice now when boring old master/slave replication never did, ain't it? I'd have to do the math to see if occasionally restoring from backups has cost us more downtime than actually having to down the site when maintenance was required, but I know for sure it's more annoying.
My rights don't end where your fear begins.
(Score: 0) by Anonymous Coward on Sunday August 09 2020, @07:18PM (4 children)
is there a post or posts that describe how everything is set up for SN? would make for an interesting read and other admins could weigh in with their 2 cents/$denomination.
(Score: 4, Funny) by The Mighty Buzzard on Monday August 10 2020, @04:31AM
The other admins have better sense than to talk to users. I'm the dumb one.
My rights don't end where your fear begins.
(Score: 2) by The Mighty Buzzard on Monday August 10 2020, @05:13AM (2 children)
Oh, if you really want to know the detailed network setup, drop me an email to remind me (I don't care if it's a real address. Throwaway is fine.) and I'll post it up as a journal entry when I get time. I've been running on busy days and four hours or so of sleep a night for what seems like about thirty years though, so don't go thinking I've forgotten about it unless it doesn't show up within a week.
My rights don't end where your fear begins.
(Score: 2) by martyb on Monday August 10 2020, @10:14PM (1 child)
Consider me interested. :)
If I may suggest, if you follow through in writing up something... put it up on the Wiki and then link to that in your journal. (There's probably some stuff up there to start from, anyway!)
/me wishes there were a way to auto-explore and document (textually and graphically) connections between servers and the processes that run on each one.
Wit is intellect, dancing.
(Score: 2) by The Mighty Buzzard on Tuesday August 11 2020, @02:47AM
It's already on the wiki [soylentnews.org]. It's not entirely up to date but I'm not putting Aluminum up there until it's actually in service doing things.
My rights don't end where your fear begins.
(Score: 0) by Anonymous Coward on Sunday August 09 2020, @10:13PM (1 child)
Are you anywhere close to the load limit on a replication setup? And a two-node cluster is basically worthless because you can't get a quorum with only two nodes. Another benefit of a replication scheme in your case seems to be that in the current setup, failure requires manual intervention anyway. So you could STONITH with a watchdog and degrade to read-only on the replica at the first sign of trouble, or during maintenance, until you sort it out.
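To put numbers on the quorum point (simple majority arithmetic, independent of any particular cluster stack):

```python
def quorum(n):
    """Minimum votes for a majority in an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many nodes can fail while the rest still hold quorum."""
    return n - quorum(n)

for n in (2, 3, 5):
    print(f"{n} nodes: quorum {quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 2 nodes: quorum 2, tolerates 0 failure(s)
# 3 nodes: quorum 2, tolerates 1 failure(s)
# 5 nodes: quorum 3, tolerates 2 failure(s)
```

With two nodes, quorum is both of them, so a single failure (or a network split) stalls the cluster; three nodes is the smallest setup that survives losing one.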
(Score: 2) by The Mighty Buzzard on Monday August 10 2020, @04:56AM
Two nodes is plenty for our purposes. Our network load vs. the bandwidth between our boxes makes replication essentially instant unless you have to completely restore a node, so mostly what we need is for the web frontends to not have to give a shit what db server they're dealing with in the event that one of them crashes. If we were looking to fail to read-only, we'd have stuck with master/slave. We consider read-only to be failure though.
My rights don't end where your fear begins.
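The "frontends shouldn't have to care which db server they hit" behavior is essentially try-the-next-host failover. A hypothetical sketch (not how rehash actually wires it up; `connect` stands in for whatever driver call is in use):

```python
def connect_any(hosts, connect):
    """Try each DB host in order; return the first live connection.
    Raises the last error if every host is down."""
    last_err = None
    for host in hosts:
        try:
            return connect(host)
        except ConnectionError as err:
            last_err = err  # remember why this host failed, try the next
    raise last_err

# Toy demo: db1 is down, db2 answers.
def fake_connect(host):
    if host == "db1":
        raise ConnectionError("db1 down")
    return f"connected:{host}"

print(connect_any(["db1", "db2"], fake_connect))  # → connected:db2
```

In practice this logic usually lives in a proxy or the driver's multi-host connection string rather than application code, but the effect is the same: the frontend keeps working when one node crashes.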
(Score: 2) by gawdonblue on Monday August 10 2020, @02:40AM (1 child)
Yeah, in the last 3 years we've had to restart the DB at work twice because of "high-availability" clustering getting out of sync. These are the only fatal DB software failures that we have had.
Seems the more dependencies you add the more brittle things become.
(Score: 2) by The Mighty Buzzard on Monday August 10 2020, @04:58AM
Yeah, I'm sure there must be cluster ninjas out there that know every pitfall ahead of time and never have these problems but there aren't any on staff here.
My rights don't end where your fear begins.