As debugging efforts continue, the community should expect a daily story on our effort to return the site to normality for a while yet. We did another bugfix rollout to production to continue cleaning up the error logs, and Paul and I continue to make modifications to improve site performance. As you may have already noticed, we've managed a serious coup in the battle for low page load times by replacing Linode's NodeBalancer product with a self-rolled nginx front-end proxy. The vast majority of page loads are now sub-second.
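For the curious, the pattern here is simply nginx acting as a reverse proxy in front of the web frontends. A minimal sketch of that setup looks something like the following; the hostnames, ports, and timeouts are placeholders for illustration, not our actual production config:

    # Illustrative reverse-proxy sketch only; backend names and tunables are placeholders.
    upstream rehash_backends {
        server web01.internal:80 max_fails=3 fail_timeout=10s;
        server web02.internal:80 max_fails=3 fail_timeout=10s;
    }

    server {
        listen 80;
        server_name soylentnews.org;

        location / {
            proxy_pass http://rehash_backends;
            # Preserve the original client address and host for the backend logs.
            proxy_set_header Host            $host;
            proxy_set_header X-Real-IP       $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            # Fail over to the next backend quickly instead of hanging the request.
            proxy_connect_timeout 2s;
            proxy_next_upstream error timeout http_502 http_503;
        }
    }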
Here's the quick list of stuff we changed over the weekend:
Rehash 15.05.3 - Changelog
Although we've yet to formally locate and eliminate the cause of the 500s and HASH(*) entries you sometimes get on page load, I now have a working theory on what's going on.
During debugging, I noticed we'd almost universally get a bad load from the cache if varnish or the load balancer burped for a moment. As best I can tell from running backwards through traces and various error logs, the 500s and bad page loads come down to one of two causes: either we're getting bad data back from memcached on a cache read, or we're loading bad data into the cache from the database. Of the two, I'm leaning towards the former, since if we were loading bad data into memcached, we'd see consistent 500s once the cache was corrupted.
It's somewhat difficult to accept that memcached is responsible for our corruption issues; we've been using it since go-live a year and a half ago. However, given the lack of other leads, I flipped the memcached config switch to off, then load tested the site to see how bad the performance drop would be. Much to my surprise, the combination of query optimizations, the faster Apache 2 codebase, and (for the most part) the increased responsiveness from having a multi-master database seems able to pick up the slack for the time being. As of writing, memcached has been disabled for several hours, and I've yet to see any of the telltale signs of corruption in error_log. I should also note that it's the middle of the night for the vast majority of our users, so this may just be the calm before the storm.
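For those following along at home: rehash inherits slashcode's convention of keeping runtime configuration in a vars table, so assuming that convention still holds and the variable is still named memcached, flipping the switch off amounts to roughly the following, followed by an Apache restart so the frontends pick up the change:

    -- Illustrative only: assumes the slashcode-style vars table and a 'memcached' var.
    UPDATE vars SET value = 0 WHERE name = 'memcached';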
I dislike using production as something of a guinea pig, but given the very transitory nature of this bug, and our inability to reliably reproduce it on our dev systems, we're left between a rock and a hard place. I'd like to thank the SoylentNews community for their understanding over the last week, and I hope normality may finally return to this site :)
~ NCommander
(Score: 3, Informative) by sudo rm -rf on Monday June 08 2015, @12:53PM
Note that I am no expert in server setups, but lowering MaxRequestsPerChild to 10000 seems quite reasonable to me. At the moment, when I load this page, my browser (Firefox) sends 18 requests (of which, btw, 17 result in a 304 Not Modified - very nice job there!), so killing and starting a new child process every 10000 requests will probably not affect the end-user experience at all.
Also, lowering MaxClients to 15 is a good idea, I think, given that the amount of response data is pretty low (only a few KB worth of comments), so the number of connections waiting in the queue should stay short most of the time (because individual requests are processed quickly).
And lastly, if those two settings also solve the corruption problems - even better!
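For context, the two directives under discussion live in Apache's MPM configuration; with the prefork MPM (a common choice for mod_perl setups like rehash), the values mentioned above would sit roughly like this. This is an illustrative snippet, not the actual production httpd.conf, and the other numbers are stock defaults:

    # Illustrative prefork MPM snippet using the values discussed above.
    <IfModule mpm_prefork_module>
        StartServers          5
        MinSpareServers       5
        MaxSpareServers      10
        # Cap on simultaneously serving child processes:
        MaxClients           15
        # Recycle each child process after it has served this many requests:
        MaxRequestsPerChild  10000
    </IfModule>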