We had a minor site hiccup today. All seems to be working now.
We have always been open and upfront about the site, so in the interests of full disclosure here is a summary of the problem and steps taken to fix it.
tl;dr: Comment counts shown for each story on the main page stopped updating at about midnight this morning; they appear to be working now. Our apologies to anyone who was inconvenienced.
Read on past the fold for details.
Problem: Comment counts on the main page showed "0" comments on recent stories, but opening a story showed the correct number of comments for it.
Actions Taken:
1.) Try bouncing the front-end servers to restart apache (This is a low-risk step that seems to fix a surprising number of issues).
No joy.
2.) Ask for help on the #dev channel on IRC.
Ncommander replied asking if slashd (an over-seeing daemon for the site) was running.
Looked through my log files and on the site wiki; determined that slashd should be running on server: fluorine
ps -AF | grep slashd | wc showed 32 processes
Ncommander suggested: killall -9 slashd
Try: killall -9 slashd
"No process found."
Inspection of the output of ps -AF suggested this one-liner should do it:
$(ps -AF | grep slashd | awk '{print "kill -9 " $2}' )
Got most of the processes, but there still seemed to be some stragglers.
/etc/init.d/./slash stop
/etc/init.d/./slash restart
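A note on the steps above, for whoever hits this next. killall matches on the process name, so one possible reason for "No process found" is that slashd appears only in the command line of its processes rather than as the process name (an assumption; this was not verified on fluorine). The grep in the pipeline also matches itself, so the count of 32 was probably one too high. Below is a minimal sketch of a more careful version. The helper name slashd_pids is hypothetical, and the commented-out usage lines assume GNU xargs for the -r flag.

```shell
#!/bin/sh
# Hypothetical helper (not part of slashcode): read `ps -AF` output on
# stdin and print the PID (column 2) of every line that mentions slashd,
# skipping the header line and the grep/awk processes doing the search.
slashd_pids() {
    awk 'NR > 1 && /slashd/ && !/grep|awk/ { print $2 }'
}

# Usage, left commented out so that sourcing this file kills nothing:
#
#   ps -AF | slashd_pids | wc -l             # how many are running?
#   ps -AF | slashd_pids | xargs -r kill     # ask politely first (TERM)
#   sleep 5
#   ps -AF | slashd_pids | xargs -r kill -9  # then force any stragglers
```

Sending TERM before KILL is a general habit rather than a documented slashd procedure. On systems with procps installed, pgrep -f slashd and pkill -f slashd match against the full command line and should do much the same job.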
Conclusion:
Looked like it might have worked... reloaded main page... see updated comment counts!
Looks like all is working again.
It's a credit to the staff here that the site has been running so smoothly and without crashing or hiccups for... I can't remember when we last had an outage. Given that in the early days of the site we had maybe a few hours of uptime between crashes, we have come a long way!
I'm going to assume this is one of those "have you tried turning it off and back on again" kind of problems, and unless the problem re-occurs, assume it is solved.
Need to hurry to get to work, so I apologize for the brevity of this posting.
--martyb
(Score: 3, Informative) by The Mighty Buzzard on Wednesday April 24 2019, @12:37PM (2 children)
Just stopping and restarting the apache/varnish processes. I even wrote a script named "bounce" so folks not comfortable with the system can do everything properly and in the correct order.
The actual cause was slashd though, which is basically a silly-assed reinvention of a cron daemon by the folks who wrote slashcode in the first place. It only takes about a minute to fix, even counting sshing in and such, if you've done it a time or three. If it gave us problems more than once a year or so, I'd look into fixing it. I was out fishing/camping this time or it wouldn't have caused poor martyb any headaches.
My rights don't end where your fear begins.
(Score: 2) by RS3 on Wednesday April 24 2019, @02:57PM (1 child)
> I was out fishing/camping this time...
Good! I need to make time for something outdoors more than an hour here and there.
Someday the really big fish will reel YOU in. Then martyb, et al, will learn what "we're fu....." means!
slashd is the systemd of slashcode? Someday I'm gonna download that slashcode and marvel...
What would happen if you ran your "bounce" script in cron.weekly or monthly just for the heck of it? Or maybe the problem occurs randomly, well, due to an unknown problem and bounce needs run on demand. So maybe a cron script that scans maybe every 5 minutes for whatever the problem's symptoms are and calls (or does) bounce?
(Score: 2) by The Mighty Buzzard on Wednesday April 24 2019, @11:43PM
Well, bounce didn't need run this time. slashd is an independent daemon that almost never gives us any trouble but requires someone who knows what/how to bitchslap, or at least someone able to figure it out.
My rights don't end where your fear begins.