[Updated to correct time of neon CPU's spiking. --martyb]
We experienced an unexpected outage of the site this morning (20190110 00:15-07:45 UTC). At approximately 0415 UTC, CPU usage on neon suddenly pegged at 400% and things went downhill from there. I am not sure at this point what happened between 0015 and 0415.
Root cause is being investigated, but for now it seems the site is back up and working. Please let us know if you have any issues.
Note: you may need to have your browser ignore its cache (e.g. refresh with Ctrl+F5) and bring down everything fresh.
FWIW, system came back up after we rebooted neon (using the Linode manager page), and then bounced varnishd on fluorine and hydrogen (/home/bob/bin/bounce on each.)
Many thanks go to SemperOSS and cosurgi for problem determination and steps to rectify, and to FatPhil for his cheerleading!
[Update: TMB] So, the deal was that some unknown time in the past the ndb database node on helium had gone down. This wasn't a problem since we run a clustered database but nobody noticing it was. Then last night something caused neon to lose its cheese. Since it hosts the other node of the db, we had no db for a while. Bytram(martyb) has sysadmin powers for when unpleasant substances of various types hit the fan and thankfully he knew enough to get the neon db node back up and bounce apache/varnish on the web frontends, so kudos to him and all the folks who were backseat driving at the time due to lack of admin perms on their parts.
My brain's currently fried from going from asleep to OMGWTFBBQ without so much as a cup of coffee and a cigarette first, so I'm not going to dig into the root causes until it unfries itself but as a stopgap we have four more staff with shiny, new admin access that I'll be emergency bootcamping in the very near future. There's also going to be some monitoring reimplemented very soon so we notice this kind of nonsense before it blows up in our faces again. I'll either update and bump this story or post a new one if we manage to figure out what the root causes were but at the moment the logs aren't being particularly helpful.
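The failure mode described above (one node of a redundant pair quietly dead while the site keeps working) is the case a "is the site up?" check misses. Here is a minimal sketch of the idea, assuming something else has already collected an up/down flag per db node. The names `ClusterState`, `classify_cluster` and `should_alert` are made up for illustration and are not part of the site's code or of any monitoring tool.

```python
from enum import Enum
from typing import Mapping


class ClusterState(Enum):
    OK = "ok"              # every node is up
    DEGRADED = "degraded"  # service still works, but redundancy is gone
    DOWN = "down"          # no node is up


def classify_cluster(node_up: Mapping[str, bool]) -> ClusterState:
    """Turn per-node up/down flags into one overall cluster state."""
    if not node_up:
        raise ValueError("no nodes given")
    up_count = sum(1 for is_up in node_up.values() if is_up)
    if up_count == len(node_up):
        return ClusterState.OK
    if up_count == 0:
        return ClusterState.DOWN
    return ClusterState.DEGRADED


def should_alert(state: ClusterState) -> bool:
    """Alert on anything short of full health, not only on a total outage."""
    return state is not ClusterState.OK


if __name__ == "__main__":
    # Hypothetical flags matching the situation before neon fell over.
    state = classify_cluster({"helium": False, "neon": True})
    print(state.value, should_alert(state))
```

The point of the three-way split is that DEGRADED pages somebody while there is still a working node left to lean on.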
(Score: 0) by Anonymous Coward on Thursday January 10 2019, @01:04PM
Thank you for taking care of it!
(Score: 4, Insightful) by Anonymous Coward on Thursday January 10 2019, @01:24PM
Please keep us informed on the root cause analysis; those meta posts are my favorite.
Thanks for the work
(Score: 5, Insightful) by realDonaldTrump on Thursday January 10 2019, @01:29PM (15 children)
And possibly many people thought so. But, it wasn't. Very important lesson!!!!
The modern digital is something you can't count on. Something always goes wrong. And you almost have to be Einstein to figure it out. Crazy!
(Score: 4, Funny) by bzipitidoo on Thursday January 10 2019, @01:54PM (9 children)
Russian hackers? If anyone would know, it's those who are colluding with them. How much money did they ask? What's SolyentNews worth?
(Score: 4, Funny) by ewk on Thursday January 10 2019, @02:40PM (7 children)
"What's SolyentNews worth?"
Not sure, but SoylentNews is priceless :-)
I don't always react, but when I do, I do it on SoylentNews
(Score: 2) by Runaway1956 on Thursday January 10 2019, @03:36PM (3 children)
I think it's worth a buck two-eighty.
(Score: 2) by bob_super on Thursday January 10 2019, @05:44PM (2 children)
Loch Ness monster offers three-fifty
All those meta posts recently, it's like website management is hard, or something. Haven't you considered yet that outsourcing it to some Indians would be better for the balance sheet and my stock?
(Score: 2) by Runaway1956 on Thursday January 10 2019, @05:48PM (1 child)
But - but - but - I thought Buzzard was Indian? Surely he's not faking it like certain congress critters?
(Score: 2) by maxwell demon on Thursday January 10 2019, @07:03PM
You mean, SoylentNews was outsourced to India? :-)
The Tao of math: The numbers you can count are not the real numbers.
(Score: 2) by DeathMonkey on Thursday January 10 2019, @05:46PM (2 children)
Not sure, but SoylentNews is priceless :-)
Wait, what, I thought it was people? Sheesh, how am I supposed to eat priceless!
(Score: 2) by coolgopher on Thursday January 10 2019, @11:37PM
Well, at least for everything else there's Mastercard...
(Score: 2) by The Mighty Buzzard on Friday January 11 2019, @03:03AM
Slavery's illegal now, thus people are priceless. Dig in.
My rights don't end where your fear begins.
(Score: 3, Insightful) by Gaaark on Thursday January 10 2019, @04:41PM
What's it worth?
Less than $3000 for those who haven't subscribed?!
SUBSCRIBE!
--- Please remind me if I haven't been civil to you: I'm channeling MDC. ---Gaaark 2.0 ---
(Score: 2) by FatPhil on Thursday January 10 2019, @02:14PM
Anyway, on a serious note - remember that the IRC channels exist at times like these.
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 3, Touché) by DannyB on Thursday January 10 2019, @02:46PM (3 children)
Who needs Russian hackers when we've got systemd and Intel Management Engine?
To transfer files: right-click on file, pick Copy. Unplug mouse, plug mouse into other computer. Right-click, paste.
(Score: 0) by Anonymous Coward on Thursday January 10 2019, @03:28PM
Devuan users on Raspberry Pi?
(Score: 0) by Anonymous Coward on Thursday January 10 2019, @08:33PM (1 child)
They're running gentoo so I'm thinking openrc must have shit the bed while it was supposed to be managing processes.
(Score: 2) by The Mighty Buzzard on Friday January 11 2019, @03:04AM
Nah, the db nodes haven't been swapped over to Gentoo yet. Still Ubuntu.
My rights don't end where your fear begins.
(Score: 5, Touché) by pTamok on Thursday January 10 2019, @02:53PM
I guess since you only got 71.3% funded in the last 6 months, you only need to be up 71.3% of the time, so you are still ahead of the game...
Thank you for sorting things out and continuing with a poorly rewarded effort. I appreciate it.
(Score: 1, Funny) by Anonymous Coward on Thursday January 10 2019, @04:13PM (2 children)
Anything to do with your "late X-mas present"? ;-)
(Score: 3, Informative) by The Mighty Buzzard on Thursday January 10 2019, @05:25PM (1 child)
Nah, I'll update the story shortly as to what we've tracked down so far. Right now my brain hurts and my cup and coffee pot are both empty though. I'll get to it after those are all resolved.
My rights don't end where your fear begins.
(Score: 2) by edIII on Thursday January 10 2019, @11:33PM
Totally understandable :)
Thank you for helping.
Technically, lunchtime is at any moment. It's just a wave function.
(Score: 4, Funny) by RandomFactor on Thursday January 10 2019, @06:40PM
it was the Index of the OPID that caused it!
В «Правде» нет известий, в «Известиях» нет правды
(Score: 1, Insightful) by Anonymous Coward on Thursday January 10 2019, @07:25PM (3 children)
If a service is down that shouldn't be, you should have been notified. You may want to set up a nagios service to send you messages (with a critical path outside of the infrastructure you are monitoring).
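A rough sketch of the "critical path outside the infrastructure" part, assuming the script runs on a box that shares nothing with the monitored hosts. This is a plain TCP probe using only the Python standard library, not a nagios plugin; `port_is_open`, `failed_checks` and the host names in the usage example are hypothetical.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Target = Tuple[str, int]  # (host, port)


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


def failed_checks(
    targets: Iterable[Target],
    probe: Callable[[str, int], bool] = port_is_open,
) -> List[Target]:
    """Return the targets whose probe failed; an empty list means all good."""
    return [(host, port) for host, port in targets if not probe(host, port)]


if __name__ == "__main__":
    # Hypothetical targets; substitute the real front ends and db nodes.
    down = failed_checks([("web1.example.org", 443), ("db1.example.org", 3306)])
    for host, port in down:
        print(f"ALERT: {host}:{port} is not answering")
```

Run it from cron on the outside box and have the alert go out over something that does not depend on the monitored servers, otherwise the alarm dies together with the thing it watches.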
(Score: 2) by The Mighty Buzzard on Thursday January 10 2019, @07:39PM (2 children)
We used to have monitoring software (icinga). No idea why we don't anymore. I vaguely remember hearing paulej72 and NCommander bitching about it being a pain in the ass and giving up on it, but not the specifics. By the time I got roped into doing any admin work at all it was long gone and I was the junior-most admin.
My rights don't end where your fear begins.
(Score: 0) by Anonymous Coward on Friday January 11 2019, @02:04AM (1 child)
Whatever happened to NCommander?
(Score: 2) by The Mighty Buzzard on Friday January 11 2019, @03:06AM
He's mega busy but still around if needed, just not necessarily on immediate notice.
My rights don't end where your fear begins.
(Score: 3, Touché) by DannyB on Thursday January 10 2019, @07:28PM
The shop will be adopting the latest hipster trend of FOP.
(Failure Oriented Programming, or maybe Fear Oriented Programming?)
Fortunately, several new frameworks were hastily written, and one of them will be randomly selected by management.
The proof of value in these new frameworks and this new methodology is how quickly and efficiently a single developer can create a Hello World website. The default configuration and plumbing take care of most of the work. So it must be good.
To transfer files: right-click on file, pick Copy. Unplug mouse, plug mouse into other computer. Right-click, paste.