
SoylentNews is people

posted by martyb on Thursday January 10 2019, @06:07PM
from the we're-baaaaaack! dept.

[Updated to correct time of neon CPU's spiking. --martyb]

We experienced an unexpected outage of the site this morning (20190110 00:15-07:45 UTC). At approximately 0415 (UTC), CPU usage on neon suddenly pegged at 400% and things went downhill from there. I am not sure at this point what happened between 0015 and 0415.

Root cause is being investigated, but for now it seems the site is back up and working. Please let us know if you have any issues.

Note: you may need to have your browser ignore its cache (e.g. refresh with Ctrl+F5) and bring down everything fresh.

FWIW, the system came back up after we rebooted neon (using the Linode manager page) and then bounced varnishd on fluorine and hydrogen (/home/bob/bin/bounce on each).

Many thanks go to SemperOSS and cosurgi for problem determination and steps to rectify, and to FatPhil for his cheerleading!

[Update: TMB] So, the deal was that at some unknown time in the past the ndb database node on helium had gone down. This wasn't a problem, since we run a clustered database, but nobody noticing it was. Then last night something caused neon to lose its cheese. Since it hosts the other node of the db, we had no db for a while. Bytram (martyb) has sysadmin powers for when unpleasant substances of various types hit the fan, and thankfully he knew enough to get the neon db node back up and bounce apache/varnish on the web frontends, so kudos to him and all the folks who were backseat driving at the time due to lack of admin perms on their parts.

My brain's currently fried from going from asleep to OMGWTFBBQ without so much as a cup of coffee and a cigarette first, so I'm not going to dig into the root causes until it unfries itself, but as a stopgap we have four more staff with shiny, new admin access that I'll be emergency bootcamping in the very near future. There's also going to be some monitoring reimplemented very soon so we notice this kind of nonsense before it blows up in our faces again. I'll either update and bump this story or post a new one if we manage to figure out what the root causes were, but at the moment the logs aren't being particularly helpful.
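
For illustration only, here is a rough sketch of the kind of check that would have flagged the helium node going down while neon was still carrying the db. This is not the monitoring that will actually be deployed: the function name, the status strings, and the way node state gets collected are all assumptions made up for the example.

```python
from typing import Dict, List


def check_cluster(node_status: Dict[str, str], min_healthy: int = 2) -> List[str]:
    """Return alert messages for a clustered database.

    node_status maps a host name to "up" or "down" (hypothetical format).
    min_healthy is how many nodes must be up for the cluster to stay redundant.
    """
    alerts = []
    down = sorted(name for name, state in node_status.items() if state != "up")
    healthy = len(node_status) - len(down)

    # Report every node that is not up, even if the site still works.
    for name in down:
        alerts.append(f"WARNING: db node on {name} is down")

    if healthy == 0:
        alerts.append("CRITICAL: no db nodes up, site has no database")
    elif healthy < min_healthy:
        alerts.append("WARNING: cluster has lost redundancy, one more failure takes the db down")

    return alerts


if __name__ == "__main__":
    # Illustrative status only; real values would have to come from the
    # cluster's own management tooling.
    for line in check_cluster({"helium": "down", "neon": "up"}):
        print(line)
```

The point of the sketch is that a degraded-but-working cluster should still raise a warning, since that was the state nobody noticed here.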


  • (Score: 1, Funny) by Anonymous Coward on Thursday January 10 2019, @04:13PM (2 children)

    by Anonymous Coward on Thursday January 10 2019, @04:13PM (#784531)

    Anything to do with your "late X-mas present"? ;-)

  • (Score: 3, Informative) by The Mighty Buzzard on Thursday January 10 2019, @05:25PM (1 child)

    by The Mighty Buzzard (18) <themightybuzzard@proton.me> on Thursday January 10 2019, @05:25PM (#784564)

    Nah, I'll update the story shortly as to what we've tracked down so far. Right now my brain hurts and my cup and coffee pot are both empty though. I'll get to it after those are all resolved.

    --
    My rights don't end where your fear begins.
    • (Score: 2) by edIII on Thursday January 10 2019, @11:33PM

      by edIII (791) on Thursday January 10 2019, @11:33PM (#784723)

      Totally understandable :)

      Thank you for helping.

      --
      Technically, lunchtime is at any moment. It's just a wave function.