
SoylentNews is people

posted by NCommander on Monday June 08 2015, @10:00AM   Printer-friendly
from the boat-anchor-detached dept.

As debugging efforts continue, the community should expect a daily story on our effort to return the site to normality. We did another bugfix rollout to production to continue cleaning up the error logs, and Paul and I continue to make modifications to improve site performance. As you may have already noticed, we've managed a serious coup in the battle for low page load times by replacing Linode's NodeBalancer product with a self-rolled nginx frontend proxy. The vast majority of page loads are now sub-second.
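For the curious, an SSL-terminating, load-balancing nginx frontend of this general shape looks roughly like the following. This is a minimal sketch with placeholder backend addresses and certificate paths, not sodium's actual configuration:

```nginx
upstream rehash_backend {
    # Backend Apache instances (addresses are placeholders)
    server 10.0.0.11:80;
    server 10.0.0.12:80;
    keepalive 32;                         # reuse backend connections
}

server {
    listen 443 ssl;
    server_name soylentnews.org;

    ssl_certificate     /etc/nginx/ssl/fullchain.pem;   # placeholder path
    ssl_certificate_key /etc/nginx/ssl/privkey.pem;     # placeholder path
    ssl_session_cache   shared:SSL:10m;   # TLS session reuse cuts handshake cost
    keepalive_timeout   65;               # persistent client connections

    location / {
        proxy_pass http://rehash_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # required for backend keepalive
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

The SSL session cache and keepalive settings are what make repeat page loads cheap: returning browsers skip the full TLS handshake and reuse open connections.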

Here's the quick list of stuff we changed over the weekend:

Rehash 15.05.3 - Changelog

  • Optimized several slow queries in the frontend which were causing timeouts
  • Drastically improved rehash's reporting to error_log
  • Rolled out new frontend server (sodium) to handle load-balancing and SSL termination
  • SSL by-default re-enabled; Firefox now loads SoylentNews extremely quickly due to the new frontend supporting SSL keepalive
  • Disabled memcached on production temporarily for now (see past the break for details)

Although we've yet to formally locate and disable the cause of the 500s and HASH(*) entries you sometimes get on page load, I now have a working theory on what's going on.

During debugging, I noticed we'd almost universally get a bad load from the cache if varnish or the load balancer burped for a moment. As best I can tell from working backwards through traces and various error logs, the 500s and bad page loads come down to one of two causes: either we're getting bad data from memcached on a cache read, or a bad load into the cache from the database. Of the two, I'm leaning toward the former, since if we were loading bad data into memcached, we'd see consistent 500s once the cache was corrupted.

It's somewhat difficult to accept that memcached is responsible for our corruption issues; we've been using it since go-live a year and a half ago. However, given the lack of other leads, I flipped the memcached config switch to off, then load-tested the site to see how bad the performance drop would be. Much to my surprise, the combination of query optimizations, the faster Apache 2 codebase, and (for the most part) the increased responsiveness of having a multi-master database seems able to cover the slack for the time being. As of writing, memcached has been disabled for several hours, and I've yet to see any of the telltale signs of corruption in error_log. I should also note that it's the middle of the night for the vast majority of our users, so this may just be the calm before the storm.

I dislike using production as something of a guinea pig, but given the very transitory nature of this bug and our inability to reliably reproduce it on our dev systems, we're left between a rock and a hard place. I would like to thank the SoylentNews community for their understanding over the last week, and I hope normality may finally return to this site :)

~ NCommander

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 4, Informative) by paulej72 on Monday June 08 2015, @11:09AM

    by paulej72 (58) on Monday June 08 2015, @11:09AM (#193599) Journal

    One other lead we had on the corruption is that mod_perl's memory gets corrupted. As part of its optimizations, rehash saves some vars into the working memory of the Apache thread. Since several Apache threads are running at any given moment, it is possible to cycle through ones that have a bad local cache of the data in memory. When moving over to Apache 2, we forgot to set the same configs we had in Apache 1 that prevented this from happening.

    The two settings we fixed were MaxClients, changed from 150 to 15 to lower memory needs, and MaxRequestsPerChild, changed from a default of unlimited to 10000. The latter causes Apache to kill a thread once it has served 10000 requests and start a new one, which frees the memory the thread used along with any corruption that may have crept in.
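    In Apache configuration terms, the two directives look like this. The values are the ones described above; the surrounding <IfModule> wrapper is illustrative, not a copy of our actual config:

```apache
# mod_perl prefork tuning (values from the comment above)
<IfModule mpm_prefork_module>
    MaxClients           15      # cap concurrent children to bound memory use
    MaxRequestsPerChild  10000   # recycle a child after 10k requests, discarding
                                 # its memory and any per-process corruption
</IfModule>
```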

    As the corruption seemed to get worse as time progressed, this may be the real fix for the corruption issues. I was able to set this yesterday, before NCommander turned off memcached, and what I saw in the logs was a vast improvement in the crap being reported, although this is not conclusive, as I did not have a chance to run it long enough to see what prolonged use without a restart would do.

    --
    Team Leader for SN Development
    Starting Score:    1  point
    Moderation   +2  
       Informative=2, Total=2
    Extra 'Informative' Modifier   0  
    Karma-Bonus Modifier   +1  

    Total Score:   4  
  • (Score: 3, Informative) by sudo rm -rf on Monday June 08 2015, @12:53PM

    by sudo rm -rf (2357) on Monday June 08 2015, @12:53PM (#193618) Journal

    Note I am no expert in server setups, but lowering MaxRequestsPerChild to 10000 seems quite reasonable to me. At the moment, when I load this page, my browser (Firefox) sends 18 requests (of which, btw, 17 result in a 304 Not Modified - very nice job there!), so killing and starting a new thread every 10000 requests will probably not affect end-user experience at all.

    Also lowering MaxClients to 15 is a good idea, I think, given that the amount of response-data is pretty low (only a few KB worth of comments) and so the number of connections in the queue should be pretty low most of the time (because of fast processing of individual requests).

    And lastly, if those two settings solve the corruption problems - even better!

  • (Score: 3, Interesting) by kbahey on Monday June 08 2015, @02:26PM

    by kbahey (1147) on Monday June 08 2015, @02:26PM (#193653) Homepage

    If it helps, I see the 500 error occasionally, maybe once a day. It happens when I am trying to moderate. When I retry, things work (possibly because I am routed to a different server or Apache instance).

    Lower that 10,000 even more. Maybe 2,000 to 3,000 or so. Apache will not be spawning new children that often even with such a lower number.

    Also, regarding memcached being disabled: did the application use memcached for state information (so Apache/Perl instances on multiple machines can communicate/coordinate with each other)? If so, is the state info now in the database? If it does not affect performance, then it is a moot point.

    • (Score: 2) by NCommander on Monday June 08 2015, @08:32PM

      by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Monday June 08 2015, @08:32PM (#193804) Homepage Journal

      Has it happened since this article was posted? I've been keeping an eye on error_log since I woke up and haven't seen any of the telltale signs of data corruption we were seeing.

      --
      Still always moving
      • (Score: 2) by kbahey on Monday June 08 2015, @09:01PM

        by kbahey (1147) on Monday June 08 2015, @09:01PM (#193816) Homepage

        Yes, error_log will show the 500 error if it happens.

        I have not seen the 500 error today.

        • (Score: 2) by NCommander on Monday June 08 2015, @09:11PM

          by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Monday June 08 2015, @09:11PM (#193820) Homepage Journal

          Yeah, I think we found the cause of our corruption issue. Now we need to figure out why it happens. My guess is the problem is in Cache::Memcached::Fast and not something we did, as I think it's unlikely the issue is with memcached itself.

          --
          Still always moving
          • (Score: 2) by kbahey on Monday June 08 2015, @09:28PM

            by kbahey (1147) on Monday June 08 2015, @09:28PM (#193823) Homepage

            Here is a guess: are you trying to cache large items? Perhaps you need to raise the default item size limit from 1M to something more (using -I [that is an uppercase "i"]).
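            For example, an invocation raising the limit might look like this (illustrative flags; -m sets total cache memory in MB, -d daemonizes):

```
memcached -d -m 64 -I 4m
```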

            • (Score: 2) by NCommander on Monday June 08 2015, @10:01PM

              by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Monday June 08 2015, @10:01PM (#193834) Homepage Journal

              We didn't change the memcached configuration or any of the caching-layer code during the upgrade. Generally, a bunch of core functions such as getStory, getComments, or getUser try to pull the data from memcache first. If it's not there, they run a SELECT on the database to get it dynamically, then load it into memcache for future loads. It's possible that migrating from Perl 5.12 -> 5.20.1 broke the API, though I have no idea how ...
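              That read-through flow can be sketched language-neutrally. This is a Python sketch of the Perl logic described above; the class and names are illustrative, not rehash's actual code:

```python
class DictCache:
    """Minimal dict-backed stand-in for a memcached client (illustrative)."""
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


def get_story(sid, cache, db):
    """Cache-aside read: try the cache first, fall back to the
    database on a miss, and prime the cache for future loads."""
    story = cache.get(sid)
    if story is None:          # cache miss
        story = db[sid]        # stands in for the SELECT
        cache.set(sid, story)  # prime the cache for the next request
    return story
```

              The failure mode under discussion is what happens when `cache.get` hands back garbage instead of `None` or a good copy: every subsequent read is served the bad value until the entry expires or is evicted.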

              --
              Still always moving
  • (Score: 2) by FatPhil on Monday June 08 2015, @09:01PM

    Good work guys! The nice thing about being able to get away without memcached is that there's pretty much no downside to it. If it really isn't giving a noticeable boost, it might be best to just stick with the simpler configuration. If the bugs don't go away, you've now got a simpler system to debug; and if they do go away, then problem solved.
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves