
SoylentNews is people

posted on Sunday August 09 2020, @05:51AM
from the SNAFU dept.

The Mighty Buzzard writes:

Yeah, so, failure to babysit the db node that was scheduled for a reboot on the 5th resulted in a bit of database FUBAR that left us temporarily losing everything from then to now. Fortunately we had a backup less than six hours old, restored from it, and appear to be copacetic now. Except for the missing five hours and change.

I'd usually make some sort of dumb joke here but it was already four hours past my bedtime when I found out about the problem. My brain is no work good anymore. Fill in whatever dad joke or snark about getting a do-over for a change strikes your fancy.

 
  • (Score: 1, Interesting) by Anonymous Coward on Monday August 10 2020, @05:37AM (7 children)

    by Anonymous Coward on Monday August 10 2020, @05:37AM (#1034223)

    I understand the concerns about adding or moving servers, but I'd strongly suggest adding another management node on one of your existing machines. The resources required are minimal and it would help quite a bit with your cluster availability.

  • (Score: 2) by The Mighty Buzzard on Monday August 10 2020, @03:23PM (6 children)

    Cheers, I'll see if I can find a suitable place to put one or two more. Like you said, if it's just the management daemons it doesn't take up much space or memory.

    --
    My rights don't end where your fear begins.
    • (Score: 0) by Anonymous Coward on Monday August 10 2020, @08:49PM (4 children)

      by Anonymous Coward on Monday August 10 2020, @08:49PM (#1034518)

      You really only need one more. That way you still have a quorum if one fails. With that in place, the database nodes should be able to survive a rolling restart without being "babysat" the whole time. However, the one located on its own machine should be the arbitrator. It might also help to define unused slots in your cluster for all node types. That makes future expansion easier, since you can assign new nodes to those slots instead of doing a rolling restart of the whole cluster.

      I hope my attempts to help don't feel like I'm piling on or intentionally demeaning.
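
The quorum point in the comment above can be illustrated with generic majority arithmetic. This is a minimal sketch, not tied to any particular cluster software: the function names are hypothetical and the node counts are purely illustrative, since the thread does not say how many management nodes the site runs.

```python
def majority_quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a strict majority still remains."""
    return voters - majority_quorum(voters)


if __name__ == "__main__":
    # Illustrative counts only; the real cluster layout is not given in the thread.
    for n in (1, 2, 3):
        print(f"{n} voter(s): quorum={majority_quorum(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Under this model, going from two voters to three is the step that first allows one failure without losing the majority.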

      • (Score: 2) by The Mighty Buzzard on Tuesday August 11 2020, @02:53AM (3 children)

        by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Tuesday August 11 2020, @02:53AM (#1034696) Homepage Journal

        Nah. One server assigned as arbitrator = single point of failure. Not having that is the entire reason we're clustered to begin with, so it's a non-starter.

        • (Score: 0) by Anonymous Coward on Tuesday August 11 2020, @07:16AM (2 children)

          by Anonymous Coward on Tuesday August 11 2020, @07:16AM (#1034791)

          If you didn't disable arbitration, they elect one. You want a management node that isn't on the data nodes to be the preferred arbitrator with the highest rank, but you can set one of the others as a fallback and even include your SQL nodes if you are paranoid. That way it takes at least two failures, and possibly as many as four, before you lose data or the cluster degrades to the point of complete failure. But you say that is a SPOF. That, the fact that your other data node stayed up, and your comments lead me to believe that either you disabled arbitration, which isn't a good idea because it leaves you with no protection from split brain and other problems, or I wasn't clear enough.
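
The ranked election with fallback described above might look roughly like the following. It is a simplified model under stated assumptions, not the cluster's actual algorithm: the rank values (1 = preferred, 2 = fallback, 0 = never arbitrate) are assumed for illustration, loosely following the `ArbitrationRank` setting in MySQL NDB Cluster, which the thread never names explicitly. The `Candidate` class, `elect_arbitrator` function, and node labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str    # hypothetical node label
    rank: int    # assumed scheme: 1 = preferred, 2 = fallback, 0 = never arbitrate
    alive: bool  # whether the node is currently reachable


def elect_arbitrator(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the live, eligible candidate with the best rank, or None."""
    eligible = [c for c in candidates if c.alive and c.rank in (1, 2)]
    # Lower rank number wins; min() returns the first candidate on a tie.
    return min(eligible, key=lambda c: c.rank, default=None)


if __name__ == "__main__":
    nodes = [
        Candidate("mgmt-standalone", 1, alive=False),   # preferred arbitrator is down
        Candidate("mgmt-on-data-host", 2, alive=True),  # fallback takes over
        Candidate("sql-1", 2, alive=True),
    ]
    chosen = elect_arbitrator(nodes)
    print(chosen.name if chosen else "no arbitrator available")
```

In this model, losing the preferred arbitrator alone does not leave the cluster without one; that only happens when every eligible candidate is down.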

          • (Score: 2) by The Mighty Buzzard on Tuesday August 11 2020, @10:11PM (1 child)

            by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Tuesday August 11 2020, @10:11PM (#1035207) Homepage Journal

            Point being, if there is only one non-data management node and it fails or is otherwise down for some reason, we're right back where we are now. I thought not being where we are right now was the entire idea.

            • (Score: 0) by Anonymous Coward on Wednesday August 12 2020, @02:13AM

              by Anonymous Coward on Wednesday August 12 2020, @02:13AM (#1035350)

              No, if it fails, there is no arbitrator but you still have a quorum, which means they just elect a new one or continue without one. But you also would have survived the incident you described automatically, without the resulting split brain, had you had one. I could go on, but you are going to do whatever you want anyway, so I'm not going to bother anymore.

    • (Score: 0) by Anonymous Coward on Tuesday August 11 2020, @01:14AM

      by Anonymous Coward on Tuesday August 11 2020, @01:14AM (#1034657)

      I meant primary and not arbitrator. It should definitely be the arbitrator too. But it is recommended practice to have the primary management node not be a data node as well. It helps with solving problems with quorum, split brain, and other issues, while also allowing centralized service management from a single node.