
SoylentNews is people

Meta
posted by martyb on Friday October 16 2020, @03:15PM
from the if-it-were-still-down... dept.

We had an outage this morning -- "Internal Server Error" would appear when trying to load the main site.

I noticed this at about 0945 UTC from my mobile phone and immediately TXTed a message to "The Mighty Buzzard" (aka TMB) alerting him to the situation. Of course, it being 0545 EDT, he was sound asleep like any sane person would be.

I then booted up my computer and accessed "#Soylent" on IRC, and discovered others were already aware. It appears to have been first noted at 05:42:57 UTC by "SoyCow8732". That was followed not long after by "c0lo" and "lld". Soon after, "chromas" was on the scene and tried bouncing the front ends, but no joy. He sleuthed around and concluded it was likely a MySQL error, but our configuration is... interesting and it was non-obvious how to restart things.

My hands were mostly tied as only a few days ago I managed to mess up Windows on my main system and would get a BSOD whenever I tried to boot it. I looked on from a system booted from an Ubuntu Live CD (well actually, a USB stick).

Eventually, TMB appeared, took stock of the situation, and was able to get things running again in pretty short order. Thanks Buzz!

Synopsis (AIUI): our installation of MySQL is set up so that there are redundant copies of the DB running on two different servers. The intent is to provide redundancy so that if one instance goes down, the other can take over and carry things along until the failing system is recovered. That's great in theory, but not so good in practice. Thankfully, it does [mostly] work. We are continuing to monitor the situation. Be assured this is working its way up the priority queue! I mean, who likes to wake up and debug server issues before their first cup of coffee?
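
The intent described above, one copy taking over when the other goes down, can be sketched roughly as follows. This is only an illustration under stated assumptions: the host names `db-a` and `db-b` and the `is_up` health check are hypothetical, and the real setup is more involved than this.

```python
from typing import Callable, Optional, Sequence


def pick_database(hosts: Sequence[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first host that passes the health check, or None if all are down."""
    for host in hosts:
        if is_up(host):
            return host
    return None


# Hypothetical host names, not the site's real ones.
HOSTS = ["db-a", "db-b"]

# Simulated health states for the demo: the first copy is down.
status = {"db-a": False, "db-b": True}

chosen = pick_database(HOSTS, lambda host: status[host])
print(chosen)
```

A scheme like this covers one copy failing; it offers no help when every copy ends up in the same bad state.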

So, that's my take on it. I'll leave it to TMB to add details/corrections should he deem it necessary.



This discussion has been archived. No new comments can be posted.
  • (Score: 5, Informative) by The Mighty Buzzard on Friday October 16 2020, @03:53PM (22 children)

    It wasn't either of the data nodes having an issue or the management servers arguing, it was both of the management servers deciding there was something screwy with the in-RAM database housed on both of them that both management nodes agreed was the correct version. All it took to fix was downing everything and letting it re-pull from the version on disk that was perfectly sound and up to date. I'm going to go with cosmic rays flipping the wrong bit in RAM since there was absolutely nothing in the logs that gave the slightest clue what specifically had happened, only that the resulting state of non-copaceticness existed.

    --
    My rights don't end where your fear begins.
    • (Score: 2) by RS3 on Friday October 16 2020, @04:11PM (16 children)

      by RS3 (6367) on Friday October 16 2020, @04:11PM (#1065440)

      When that happens does it do a RAM to dump.sql file (that might be an impossible pain to compare to on-disk database)? Cosmic ray might be what caused it, or some other software bug.

      Very annoying that the servers / mgt. sw. were unhappy, but no useful log info? Maybe some error logging VARIABLE needs to be turned on? Perhaps something took too long, timed out, and mgt. decided there was a sync problem?

      I don't think I've ever seen that problem, but I'm not running dual (ing) databases either. :)

      • (Score: 3, Informative) by The Mighty Buzzard on Friday October 16 2020, @07:02PM (15 children)

        DB became inconsistent (Paraphrased, I don't care about the exact message since it was utterly useless.) somehow or other, with nothing but normal, happy log messages leading up to it.

        --
        My rights don't end where your fear begins.
        • (Score: 3, Interesting) by RS3 on Friday October 16 2020, @07:31PM (12 children)

          by RS3 (6367) on Friday October 16 2020, @07:31PM (#1065541)

          Useless error messages, whether popups or logs, are my #1 complaint about software- especially OSes. Tiring. And I've done software development, and I can't say that my software would give useful error messages either. But, my software never breaks. :)

          One place I worked- EEG monitoring- the main programmer for one of the product lines was a bit of a nut. Just messing around I pressed some random keys on the keyboard while the system was running (simulated patient EEG of course) and the software LOCKED UP. When I told him what I did, he said: "why would anyone ever do that?" Did I mention it was an EEG machine? Used, among other things, to monitor someone while they're under anesthesia to make sure they don't 1) wake up, or 2) die? I probably should have tried to anonymously report it to FDA or someone. That was like 22 years ago and I otherwise liked the job and company and wasn't into making waves.

          That and the Anonymous Coward hadn't been invented yet.

          • (Score: 3, Funny) by aristarchus on Friday October 16 2020, @07:54PM (10 children)

            by aristarchus (2645) on Friday October 16 2020, @07:54PM (#1065546) Journal

            My fav, from Micro$ift, "There has been an undetectable error in your system." Always wondered, how did they manage to detect an undetectable error? I mean, Linus once bragged that Linux could do an infinite loop in 5 seconds, but that is nothing compared to detecting the undetectable!

            • (Score: 2) by RS3 on Friday October 16 2020, @08:27PM (1 child)

              by RS3 (6367) on Friday October 16 2020, @08:27PM (#1065561)

              That's hilarious. Well, in a qualified way. It's also horrible. Do you remember which Windump version? I gotta wonder if that was a clerical error, or an "Easter Egg".

              • (Score: 0, Offtopic) by aristarchus on Friday October 16 2020, @08:42PM

                by aristarchus (2645) on Friday October 16 2020, @08:42PM (#1065564) Journal

                Last Windoze I ran was Win95, so, either that, or 3.1 on DOS of some version. It's probably still in the source code, somewhere, just waiting to be activated. When there is an undetectable error, again. And, wait, did this not all start from martyb's windblows machine? Curious!

            • (Score: 0) by Anonymous Coward on Friday October 16 2020, @08:34PM (6 children)

              by Anonymous Coward on Friday October 16 2020, @08:34PM (#1065562)

              I've commented on those before, but undetectable errors are those cases where each step in a process has apparently completed successfully but the result fails verification. In that case, something somewhere went wrong but you can't detect where.

              • (Score: 0) by Anonymous Coward on Friday October 16 2020, @08:44PM (5 children)

                by Anonymous Coward on Friday October 16 2020, @08:44PM (#1065567)

                That just means the error is unidentified. But it has definitely been detected. See, this is why IT needs English majors, semantics are crucial.

                • (Score: 0) by Anonymous Coward on Friday October 16 2020, @08:45PM (1 child)

                  by Anonymous Coward on Friday October 16 2020, @08:45PM (#1065568)

                  Have you tried turning it off and turning it back on, yet?

                  • (Score: 0) by Anonymous Coward on Friday October 16 2020, @09:34PM

                    by Anonymous Coward on Friday October 16 2020, @09:34PM (#1065594)

                    Turn it off, don't turn it back on. That will fix many of your problems.

                • (Score: 1, Interesting) by Anonymous Coward on Friday October 16 2020, @10:23PM (2 children)

                  by Anonymous Coward on Friday October 16 2020, @10:23PM (#1065619)

                  I'll give you an example of sending data over a wire from A to B using an ECC and message verification. Said message gets hit by a lightning bolt and some bits get flipped. The worst kind of error is an undetectable and unidentified error that skates past your error system completely and passes subsequent verification so you never see it. Next is the detectable but unidentified error where the ECC fails but it can't tell you what is wrong with the message. Then there is the undetectable but identified error that skates past ECC but fails JSON verification because you know where the error in the message is but everything leading up to it was a "success" because it couldn't detect it. Finally is the detectable and identifiable error where the ECC catches it and knows what the error is so it can correct it. Perhaps you are mistaking the undetectability and unidentifiedness properties of the error itself and where it happens in the process and that errors can have different values depending on the scope of your discussion.
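
The distinction drawn above between detecting an error and identifying it can be illustrated with two classic schemes. This is a minimal sketch, not anything the commenter posted: a single even-parity bit can tell that an odd number of bits flipped but not which one (and a double flip slips past entirely), while a Hamming(7,4) code can locate and correct a single flipped bit. The function names are illustrative.

```python
def add_parity(bits):
    """Append one even-parity bit so the total number of 1s is even."""
    return list(bits) + [sum(bits) % 2]


def parity_ok(word):
    """True if the word still has an even number of 1s."""
    return sum(word) % 2 == 0


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word):
    """Return (corrected_word, position); position 0 means no error was seen."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s4 = word[3] ^ word[4] ^ word[5] ^ word[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the flipped bit
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


def flip(word, index):
    """Return a copy of word with the bit at the given 0-based index inverted."""
    out = list(word)
    out[index] ^= 1
    return out


data = [1, 0, 1, 1]

# Parity: detects a single flip but cannot say which bit it was.
word = add_parity(data)
one_flip = flip(word, 2)
two_flips = flip(one_flip, 0)
print("parity check after one flip: ", parity_ok(one_flip))
print("parity check after two flips:", parity_ok(two_flips))

# Hamming(7,4): detects a single flip, identifies its position, and repairs it.
code = hamming74_encode(data)
damaged = flip(code, 4)
fixed, position = hamming74_correct(damaged)
print("flipped position found:", position, "repaired:", fixed == code)
```

The two-flip parity case is the "skates past your error system" situation: the check reports success even though the data is wrong, so only a later, independent verification could catch it.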

                  • (Score: 0) by Anonymous Coward on Friday October 16 2020, @10:41PM (1 child)

                    by Anonymous Coward on Friday October 16 2020, @10:41PM (#1065627)

                    So you are saying there are known knowns, and known unknowns, and unknown unknowns, and that the WMDs are to the north, or the northwest, or just the west, and maybe we will definitely find them in Syria! Rumsfeld!!! I knew it!!

                    But, still, there are no unknown knowns, or we would know about them, much like when Windowz detects an undetectable error.

                    • (Score: 0) by Anonymous Coward on Friday October 16 2020, @10:47PM

                      by Anonymous Coward on Friday October 16 2020, @10:47PM (#1065629)

                      Of course there are unknown knowns. Ever have a song pop into your head and you wonder where on Earth you heard it? That song, celebrity's identity when you see their face, and many more examples are things you knew but you don't know you know it.

            • (Score: 1) by fustakrakich on Friday October 16 2020, @09:32PM

              by fustakrakich (6150) on Friday October 16 2020, @09:32PM (#1065591) Journal

              It's Windows' version of a "UFO". It knows it saw something.

              --
              La politica e i criminali sono la stessa cosa..
          • (Score: -1, Troll) by Anonymous Coward on Saturday October 17 2020, @03:06AM

            by Anonymous Coward on Saturday October 17 2020, @03:06AM (#1065708)

            What a stupid irresponsible idiot. Don't go to the FDA. That might get something done but probably not in the United States of Trump, where we let women serve on SCOTUS but only if they have a strange BDSM fetish based on a desert religion that makes 50 Shades of Grey look tame.

            We should fabricate a rape accusation against him instead.

        • (Score: 2) by RS3 on Friday October 16 2020, @07:33PM (1 child)

          by RS3 (6367) on Friday October 16 2020, @07:33PM (#1065542)

          PS: if someone had the time and passion they could search for the particular useless message in the code and maybe figure out what was supposed to be happening, but that could be a rabbit hole.

    • (Score: 1) by shrewdsheep on Friday October 16 2020, @04:36PM (4 children)

      by shrewdsheep (5215) on Friday October 16 2020, @04:36PM (#1065462)

      Are there any periodic offline backups of the main database?

  • (Score: 5, Interesting) by RS3 on Friday October 16 2020, @03:58PM

    by RS3 (6367) on Friday October 16 2020, @03:58PM (#1065428)

    It's not said nearly enough, but thank you all who built and maintain this site.

    My hands were mostly tied as only a few days ago I managed to mess up Windows on my main system and would get a BSOD whenever I tried to boot it.

    Windows is a house of cards and you sneezed is all. :)

    You may be able to boot into "safe mode", or "Repair Your Computer", or worst-case do a "Repair Reinstallation".

    If those fail, buy another HD, partition it for (at least) dual boot Windows and Linux, install, configure, customize, etc., and put your old HD in a 2nd slot or an external USB drive case to access your data.

  • (Score: 0) by Anonymous Coward on Friday October 16 2020, @04:00PM (3 children)

    by Anonymous Coward on Friday October 16 2020, @04:00PM (#1065430)

    Who needs Space Force when we have Cosmic Ray Defense!!

    • (Score: 2) by looorg on Friday October 16 2020, @04:34PM (2 children)

      by looorg (578) on Friday October 16 2020, @04:34PM (#1065460)

      Perhaps we should have all the server encased in a protective film of mylar? To prevent future cosmic events.

      • (Score: 4, Touché) by SomeGuy on Friday October 16 2020, @06:26PM (1 child)

        by SomeGuy (5632) on Friday October 16 2020, @06:26PM (#1065520)

        It's in the cloud. Up in the clouds you get lots of cosmic rays.

        • (Score: 0) by Anonymous Coward on Friday October 16 2020, @10:26PM

          by Anonymous Coward on Friday October 16 2020, @10:26PM (#1065620)

          Then hire servers wrapped in moar mylar.

  • (Score: 5, Funny) by jelizondo on Friday October 16 2020, @06:13PM

    by jelizondo (653) Subscriber Badge on Friday October 16 2020, @06:13PM (#1065512) Journal

    Dude, you don't even look in the general direction of a server before your first cup of coffee otherwise the thing gets FUBARed pronto!

    Thanks to all for the quick response and please talk some sense into martyb: no coffee, no workee.

  • (Score: 1, Funny) by Anonymous Coward on Friday October 16 2020, @07:07PM (3 children)

    by Anonymous Coward on Friday October 16 2020, @07:07PM (#1065537)

    It protecc against cosmic rays.

  • (Score: 2, Funny) by aristarchus on Friday October 16 2020, @08:10PM (5 children)

    by aristarchus (2645) on Friday October 16 2020, @08:10PM (#1065554) Journal

    Even when down, SoylentNews managed to reject aristarchus submissions from IRC! Now THAT is what you call system architecture!!

    • (Score: 2, Informative) by Anonymous Coward on Friday October 16 2020, @09:35PM

      by Anonymous Coward on Friday October 16 2020, @09:35PM (#1065595)

      Situation Normal, All's Fine Uptown.

    • (Score: 3, Touché) by The Mighty Buzzard on Saturday October 17 2020, @12:02AM (3 children)

      Well, your submissions are so reliable I could hardcode it but I just can't see making life easier on the eds on purpose.

      --
      My rights don't end where your fear begins.
      • (Score: 2, Troll) by aristarchus on Saturday October 17 2020, @12:14AM (1 child)

        by aristarchus (2645) on Saturday October 17 2020, @12:14AM (#1065659) Journal

        Donno, Buzz! The regex of "alt-right" is passe, because there is really not much news about them anymore. It is like they have vanished, or just turned into regular white supremacists or Neo-Nazis, or Michigan militia kidnapper groups. So I have been forced to diversify. Looking to cover things like tax-evasion, mass disinformation programs, astronomy, and racism in tech. Of course, we will always have Peter Thiel.

      • (Score: 2) by Fnord666 on Saturday October 17 2020, @04:36AM

        by Fnord666 (652) on Saturday October 17 2020, @04:36AM (#1065728) Homepage

        Well, your submissions are so reliable I could hardcode it but I just can't see making life easier on the eds on purpose.

        Um, thanks I guess?

  • (Score: 0) by Anonymous Coward on Saturday October 17 2020, @09:28AM (3 children)

    by Anonymous Coward on Saturday October 17 2020, @09:28AM (#1065767)

    Looks like the Let's Encrypt cert expired.

  • (Score: 2) by wirelessduck on Monday October 19 2020, @12:43AM

    by wirelessduck (3407) on Monday October 19 2020, @12:43AM (#1066275)

    What's the time/effort involved in updating the codebase to run on PostgreSQL? Is it just going through and updating hardcoded DBI queries?
