
SoylentNews is people

posted by martyb on Thursday January 10 2019, @06:07PM
from the we're-baaaaaack! dept.

[Updated to correct time of neon CPU's spiking. --martyb]

We experienced an unexpected outage of the site this morning (20190110 00:15-07:45 UTC). At approximately 0415 UTC, CPU usage on neon suddenly pegged at 400% and things went downhill from there. I am not sure at this point what happened between 0015 and 0415.

Root cause is being investigated, but for now it seems the site is back up and working. Please let us know if you have any issues.

Note: you may need to have your browser ignore its cache (e.g. refresh with Ctrl+F5) and bring down everything fresh.

FWIW, the system came back up after we rebooted neon (using the Linode manager page) and then bounced varnishd on fluorine and hydrogen (/home/bob/bin/bounce on each).

Many thanks go to SemperOSS and cosurgi for problem determination and steps to rectify and FatPhil for his cheerleading!

[Update: TMB] So, the deal was that at some unknown time in the past the ndb database node on helium had gone down. This wasn't a problem, since we run a clustered database, but nobody noticing it was. Then last night something caused neon to lose its cheese. Since it hosts the other node of the db, we had no db for a while. Bytram (martyb) has sysadmin powers for when unpleasant substances of various types hit the fan, and thankfully he knew enough to get the neon db node back up and bounce apache/varnish on the web frontends, so kudos to him and all the folks who were backseat driving at the time due to lack of admin perms on their parts.

My brain's currently fried from going from asleep to OMGWTFBBQ without so much as a cup of coffee and a cigarette first, so I'm not going to dig into the root causes until it unfries itself, but as a stopgap we have four more staff with shiny, new admin access that I'll be emergency bootcamping in the very near future. There's also going to be some monitoring reimplemented very soon so we notice this kind of nonsense before it blows up in our faces again. I'll either update and bump this story or post a new one if we manage to figure out what the root causes were, but at the moment the logs aren't being particularly helpful.
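
For the curious, here is a rough sketch of the sort of check that would flag a two-node cluster quietly running on one leg. It is purely illustrative and not what we actually run: the function names, the example hostnames, and the port are all made up for the example, and a bare TCP connect is only a crude stand-in for a real health check.

```python
import socket


def tcp_alive(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A crude liveness probe: an open port does not prove the service is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def cluster_state(node_up):
    """Classify a redundant cluster from a mapping of node name -> is-up flag.

    Returns (state, down_nodes) where state is:
      "ok"       - every node is up
      "degraded" - at least one node is down but the service still has a node
      "down"     - no node is up
    """
    down = sorted(name for name, up in node_up.items() if not up)
    if not down:
        return "ok", down
    if len(down) < len(node_up):
        return "degraded", down
    return "down", down


if __name__ == "__main__":
    # Hypothetical addresses and port, chosen only for illustration.
    nodes = {
        "helium": ("helium.example.invalid", 9999),
        "neon": ("neon.example.invalid", 9999),
    }
    status = {name: tcp_alive(host, port) for name, (host, port) in nodes.items()}
    state, down = cluster_state(status)
    if state != "ok":
        # "degraded" is the case worth shouting about: the site still works,
        # so nobody notices that redundancy is already gone.
        print(f"ALERT: cluster {state}; nodes down: {', '.join(down)}")
```

The point of the "degraded" state is that it gets reported while the site is still up, which is exactly the window we missed this time.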



 
This discussion has been archived. No new comments can be posted.
  • (Score: 0) by Anonymous Coward on Thursday January 10 2019, @08:33PM (1 child)

    by Anonymous Coward on Thursday January 10 2019, @08:33PM (#784660)

    They're running gentoo so I'm thinking openrc must have shit the bed while it was supposed to be managing processes.

  • (Score: 2) by The Mighty Buzzard on Friday January 11 2019, @03:04AM

    by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Friday January 11 2019, @03:04AM (#784844) Homepage Journal

    Nah, the db nodes haven't been swapped over to Gentoo yet. Still Ubuntu.

    --
    My rights don't end where your fear begins.