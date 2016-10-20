from the if-it-were-still-down... dept.
We had an outage this morning -- "Internal Server Error" would appear when trying to load the main site.
I noticed this at about 0945 UTC from my mobile phone and immediately TXTed a message to "The Mighty Buzzard" (aka TMB) alerting him of the situation. Of course, it being 0545 EDT, he was sound asleep like any sane person would be.
I then booted up my computer and accessed "#Soylent" on IRC; discovered others were already aware. It appears to have been first noted at 05:42:57 UTC by "SoyCow8732". That was followed not long after by "c0lo" and "lld". Soon after, "chromas" was on the scene and tried bouncing the front ends, but no joy. He sleuthed around and concluded it was likely a MySQL error, but our configuration is... interesting and it was not obvious how to restart things.
My hands were mostly tied as only a few days ago I managed to mess up Windows on my main system and would get a BSOD whenever I tried to boot it. I looked on from a system booted from an Ubuntu Live CD (well actually, a USB stick).
Eventually, TMB appeared, took stock of the situation, and was able to get things running again in pretty short order. Thanks Buzz!
Synopsis (AIUI): our installation of MySQL is set up so that there are redundant copies of the DB running on two different servers. The intent is to provide redundancy so that if one instance goes down, the other can take over and carry things along until the failing system is recovered. That's great in theory, but not so good in practice. Thankfully, it does [mostly] work. We are continuing to monitor the situation. Be assured this is working its way up the priority queue! I mean, who likes to wake up and debug server issues before their first cup of coffee?
So, that's my take on it. I'll leave it to TMB to add details/corrections should he deem it necessary.
(Score: 5, Informative) by The Mighty Buzzard on Friday October 16, @03:53PM
It wasn't either of the data nodes having an issue or the management servers arguing, it was both of the management servers deciding there was something screwy with the in-RAM database housed on both of them that both management nodes agreed was the correct version. All it took to fix was downing everything and letting it re-pull from the version on disk that was perfectly sound and up to date. I'm going to go with cosmic rays flipping the wrong bit in RAM since there was absolutely nothing in the logs that gave the slightest clue what specifically had happened, only that the resulting state of non-copaceticness existed.
(Score: 2) by RS3 on Friday October 16, @04:11PM
When that happens does it do a RAM to dump.sql file (that might be an impossible pain to compare to on-disk database)? Cosmic ray might be what caused it, or some other software bug.
Very annoying that the servers / mgt. sw. were unhappy, but no useful log info? Maybe some error logging VARIABLE needs to be turned on? Perhaps something took too long, timed out, and mgt. decided there was a sync problem?
I don't think I've ever seen that problem, but I'm not running dual (ing) databases either. :)
(Score: 3, Informative) by The Mighty Buzzard on Friday October 16, @07:02PM
DB became inconsistent (Paraphrased, I don't care about the exact message since it was utterly useless.) somehow or other, with nothing but normal, happy log messages leading up to it.
(Score: 3, Interesting) by RS3 on Friday October 16, @07:31PM
Useless error messages, whether popups or logs, are my #1 complaint about software- especially OSes. Tiring. And I've done software development, and I can't say that my software would give useful error messages either. But, my software never breaks. :)
One place I worked- EEG monitoring- the main programmer for one of the product lines was a bit of a nut. Just messing around I pressed some random keys on the keyboard while the system was running (simulated patient EEG of course) and the software LOCKED UP. When I told him what I did, he said: "why would anyone ever do that?" Did I mention it was an EEG machine? Used, among other things, to monitor someone while they're under anesthesia to make sure they don't 1) wake up, or 2) die? I probably should have tried to anonymously report it to FDA or someone. That was like 22 years ago and I otherwise liked the job and company and wasn't into making waves.
That and the Anonymous Coward hadn't been invented yet.
(Score: 3, Funny) by aristarchus on Friday October 16, @07:54PM
My fav, from Micro$ift, "There has been an undetectable error in your system." Always wondered, how did they manage to detect an undetectable error? I mean, Linus once bragged that Linux could do an infinite loop in 5 seconds, but that is nothing compared to detecting the undetectable!
(Score: 2) by RS3 on Friday October 16, @08:27PM
That's hilarious. Well, in a qualified way. It's also horrible. Do you remember which Windump version? I gotta wonder if that was a clerical error, or an "Easter Egg".
(Score: 0, Offtopic) by aristarchus on Friday October 16, @08:42PM
Last Windoze I ran was Win95, so, either that, or 3.1 on DOS of some version. It's probably still in the source code, somewhere, just waiting to be activated. When there is an undetectable error, again. And, wait, did this not all start from martyb's windblows machine? Curious!
(Score: 0) by Anonymous Coward on Friday October 16, @08:34PM
I've commented on those before, but undetectable errors are those cases where each step in a process has apparently completed successfully but the result fails verification. In that case, something somewhere went wrong but you can't detect where.
(Score: 0) by Anonymous Coward on Friday October 16, @08:44PM
That just means the error is unidentified. But it has definitely been detected. See, this is why IT needs English majors, semantics are crucial.
(Score: 0) by Anonymous Coward on Friday October 16, @08:45PM
Have you tried turning it off and turning it back on, yet?
(Score: 0) by Anonymous Coward on Friday October 16, @09:34PM
Turn it off, don't turn it back on. That will fix many of your problems.
(Score: 1, Interesting) by Anonymous Coward on Friday October 16, @10:23PM
I'll give you an example of sending data over a wire from A to B using an ECC and message verification. Said message gets hit by a lightning bolt and some bits get flipped. The worst kind of error is an undetectable and unidentified error that skates past your error system completely and passes subsequent verification so you never see it. Next is the detectable but unidentified error where the ECC fails but it can't tell you what is wrong with the message. Then there is the undetectable but identified error that skates past ECC but fails JSON verification because you know where the error in the message is but everything leading up to it was a "success" because it couldn't detect it. Finally is the detectable and identifiable error where the ECC catches it and knows what the error is so it can correct it. Perhaps you are mistaking the undetectability and unidentifiedness properties of the error itself and where it happens in the process and that errors can have different values depending on the scope of your discussion.
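A rough sketch of two of those cases, assuming a single even-parity bit stands in for the ECC (plain parity can only detect, never locate or correct, so this is a simplification). The function names are made up for illustration.

```python
def parity(bits):
    """Return the even-parity bit for a list of 0/1 values."""
    return sum(bits) % 2


def send(bits):
    """Append a parity bit so the receiver can check the message."""
    return bits + [parity(bits)]


def check(frame):
    """True when the frame (data plus parity bit) passes the parity check."""
    return sum(frame) % 2 == 0


def flip(frame, positions):
    """Return a copy of the frame with the given bit positions flipped."""
    out = list(frame)
    for i in positions:
        out[i] ^= 1
    return out


message = [1, 0, 1, 1, 0, 0, 1, 0]
frame = send(message)

one_flip = flip(frame, [2])      # one bit hit by the "lightning bolt"
two_flips = flip(frame, [2, 5])  # two bits hit

print(check(frame))              # clean frame passes
print(check(one_flip))           # detected, but parity cannot say which bit
print(check(two_flips))          # passes the check although the data is wrong
print(two_flips[:-1] == message)
```

The one-bit flip is the detectable but unidentified case. The two-bit flip is the one that skates past the check, and it only shows up if some later verification (the JSON parse, say) happens to trip over it.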
(Score: 0) by Anonymous Coward on Friday October 16, @10:41PM
So you are saying there are known knowns, and known unknowns, and unknown unknowns, and that the WMDs are to the north, or the northwest, or just the west, and maybe we will definitely find them in Syria! Rumsfeld!!! I knew it!!
But, still, there are no unknown knowns, or we would know about them, much like when Windowz detects an undetectable error.
(Score: 0) by Anonymous Coward on Friday October 16, @10:47PM
Of course there are unknown knowns. Ever have a song pop into your head and you wonder where on Earth you heard it? That song, a celebrity's identity when you see their face, and many more examples are things you know but don't know you know.
(Score: 1) by fustakrakich on Friday October 16, @09:32PM
It's Windows' version of a "UFO". It knows it saw something.
(Score: 2) by RS3 on Friday October 16, @07:33PM
PS: if someone had the time and passion they could search for the particular useless message in the code and maybe figure out what was supposed to be happening, but that could be a rabbit hole.
(Score: 2) by The Mighty Buzzard on Friday October 16, @07:42PM
Likely thirty different things with 300 different causes. That generally seems to be the case when they give you fuck-all for information.
(Score: 1) by shrewdsheep on Friday October 16, @04:36PM
Are there any periodic offline backups of the main database?
(Score: 3, Informative) by The Mighty Buzzard on Friday October 16, @07:03PM
Yup, wasn't any data loss though.
(Score: 1) by shrewdsheep on Friday October 16, @10:10PM
Just out of curiosity, how big is the current dump?
(Score: 2, Funny) by Anonymous Coward on Friday October 16, @11:47PM
I usually flush without looking. Next time I'll look, just for you.
Flush buffers, of course. What did you think I meant?
(Score: 2) by The Mighty Buzzard on Friday October 16, @11:59PM
Round about 2GB.
(Score: 5, Interesting) by RS3 on Friday October 16, @03:58PM
It's not said nearly enough, but thank you all who built and maintain this site.
Windows is a house of cards and you sneezed is all. :)
You may be able to boot into "safe mode", or "Repair Your Computer", or worst-case do a "Repair Reinstallation".
If those fail, buy another HD, partition it for (at least) dual boot Windows and Linux, install, configure, customize, etc., and put your old HD in a 2nd slot or an external USB drive case to access your data.
(Score: 0) by Anonymous Coward on Friday October 16, @04:00PM
Who needs Space Force when we have Cosmic Ray Defense!!
(Score: 2) by looorg on Friday October 16, @04:34PM
Perhaps we should have all the servers encased in a protective film of mylar? To prevent future cosmic events.
(Score: 4, Touché) by SomeGuy on Friday October 16, @06:26PM
It's in the cloud. Up in the clouds you get lots of cosmic rays.
(Score: 0) by Anonymous Coward on Friday October 16, @10:26PM
Then hire servers wrapped in moar mylar.
(Score: 5, Funny) by jelizondo on Friday October 16, @06:13PM
Dude, you don't even look in the general direction of a server before your first cup of coffee otherwise the thing gets FUBARed pronto!
Thanks to all for the quick response and please talk some sense into martyb: no coffee, no workee.
(Score: 1, Funny) by Anonymous Coward on Friday October 16, @07:07PM
It protecc against cosmic rays.
(Score: 4, Funny) by The Mighty Buzzard on Friday October 16, @07:44PM
I forget which the cosmic ones are. Charles, Stephens, Liotta, Palmer?
(Score: 1, Informative) by Anonymous Coward on Friday October 16, @08:05PM
Liotta
(Score: 2) by The Mighty Buzzard on Saturday October 17, @12:03AM
Stanz would be my personal choice but he's technically fictional.
(Score: 2, Funny) by aristarchus on Friday October 16, @08:10PM
Even when down, SoylentNews managed to reject aristarchus submissions from IRC! Now THAT is what you call system architecture!!
(Score: 2, Informative) by Anonymous Coward on Friday October 16, @09:35PM
Situation Normal, All's Fine Uptown.
(Score: 3, Touché) by The Mighty Buzzard on Saturday October 17, @12:02AM
Well, your submissions are so reliable I could hardcode it but I just can't see making life easier on the eds on purpose.
(Score: 2, Troll) by aristarchus on Saturday October 17, @12:14AM
Donno, Buzz! The regex of "alt-right" is passe, because there is really not much news about them anymore. It is like they have vanished, or just turned into regular white supremacists or Neo-Nazis, or Michigan militia kidnapper groups. So I have been forced to diversify. Looking to cover things like tax-evasion, mass disinformation programs, astronomy, and racism in tech. Of course, we will always have Peter Thiel.
(Score: 2) by The Mighty Buzzard on Saturday October 17, @10:55AM
Just make sure and stay away from science and tech so you don't make things difficult on the poor eds and force them to at least skim the headlines before they reject.
(Score: 2) by Fnord666 on Saturday October 17, @04:36AM
Um, thanks I guess?
(Score: 0) by Anonymous Coward on Saturday October 17, @09:28AM
Looks like the Let's Encrypt cert expired.
(Score: 3, Insightful) by The Mighty Buzzard on Saturday October 17, @10:56AM
Yeah. We suck. Mea culpa.
(Score: 0) by Anonymous Coward on Saturday October 17, @02:23PM
I was a bit surprised this morning when it happened, isn't that auto-renewed?
(Score: 2) by The Mighty Buzzard on Sunday October 18, @03:51AM
Nah, I don't trust zone file changes to scripting and we gotta use dns-01 challenges since we use wildcard certs.
(Score: 2) by wirelessduck on Monday October 19, @12:43AM
What's the time/effort involved in updating the codebase to run on PostgreSQL? Is it just going through and updating hardcoded DBI queries?