We just deployed a new point upgrade to rehash today to fix a bunch of small bugs that have been with us since the rehash upgrade and a few that were around longer than that. Here are the highlights:
We were able to kill off about 10 high-priority bugs with this mini release. Current issues and feature requests can be found on GitHub, and you can submit new issues or feature requests here if you have a GitHub account. We also welcome bugs and requests sent via email to admin@soylentnews.org or left in the comments below.
Our goal for the next major update is more of the same: bug hunting and killing, with a few features added here and there. Again, I would like to thank you for your patience with the growing pains we have had with the 15_05 rehash upgrade. This update should bring us mostly back to where we were before the upgrade, with most of the broken features restored.
As debugging efforts continue, the community should expect a daily story on our efforts to return the site to normality. We did another bugfix rollout to production to continue cleaning up the error logs, and Paul and I continue to make modifications to improve site performance. As you may have already noticed, we've managed a serious coup in the battle for low page load times by replacing Linode's NodeBalancer product with a self-rolled nginx frontend proxy. The vast majority of page loads are now sub-second.
Here's the quick list of stuff we changed over the weekend:
Rehash 15.05.3 - Changelog
Although we've yet to formally locate and disable the cause of the 500s and HASH(*) entries you sometimes get on page load, I now have a working theory on what's going on.
During debugging, I noticed we'd almost universally get a bad load from the cache if varnish or the load balancer burped for a moment. As best I can tell from working backwards through traces and various error logs, the 500s and bad page loads come down to one of two causes: either we're getting bad data from memcached on a cache read, or bad data is being loaded into the cache from the database. Of the two, I'm leaning towards the former, since if we were loading bad data into memcached, we'd see consistent 500s once the cache was corrupted.
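If that theory holds, one mitigation (independent of simply turning memcached off, below) is to treat any cache hit that doesn't deserialize into the structure we expect as a miss and fall back to the database. Here's a minimal sketch of the idea in Perl, using Cache::Memcached and a hypothetical fetch_story_from_db() helper; this is an illustration, not the actual rehash code:

    use strict;
    use warnings;
    use Cache::Memcached;

    my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

    sub get_story_cached {
        my ($stoid) = @_;
        my $key   = "story:$stoid";
        my $story = $memd->get($key);

        # Treat anything that isn't a sane-looking hashref as a cache miss;
        # a stringified "HASH(0x...)" or partial record falls through to the DB.
        unless (ref($story) eq 'HASH' && defined $story->{stoid}) {
            $story = fetch_story_from_db($stoid);   # hypothetical DB helper
            $memd->set($key, $story, 300);          # re-prime the cache for 5 minutes
        }
        return $story;
    }

That obviously papers over whatever is corrupting the cache rather than fixing it, but it would turn a 500 into a slightly slower page load.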
It's somewhat difficult to accept that memcached is responsible for our corruption issues; we've been using it since go-live a year and a half ago. However, given the lack of other leads, I flipped the memcached config switch to off and then load-tested the site to see how bad the performance drop would be. Much to my surprise, the combination of query optimizations, the faster Apache 2 codebase, and (for the most part) the increased responsiveness from having a multi-master database seems able to take up the slack for the time being. As of writing, memcached has been disabled for several hours, and I've yet to see any of the telltale signs of corruption in error_log. I should also note that it's the middle of the night for the vast majority of our users, so this may just be the calm before the storm.
I dislike using production as something of a guinea pig, but given the very transitory nature of this bug, and our inability to reliably reproduce it on our dev systems, we're left between a rock and a hard place. I would like to thank the SoylentNews community for their understanding over the last week, and I hope normality may finally return to this site :)
~ NCommander
We are having some difficulties with the site at the moment. We are aware that an error is being presented about our certificate having expired. We are working on it and appreciate your patience while we iron this out.
Interim measure #1: try to use the non-secure link to the site http://soylentnews.org
Interim measure #2: Accept the expired cert, but be sure to uncheck the 'Permanently store this exception' checkbox (this may not be available to you on very recent versions of Firefox).
If you have other suggestions for workarounds, please submit them as a comment. Rest assured we are doing all we can to get the site back up and running!
[Update] NCommander reports that: "New SSL certificate has been installed and the site no longer generates certificate errors. I'm continuing to do work to get performance for Firefox to be decent."
[Update 2]: A revised frontend proxy is nearly operational and ready to go into service, which should resolve the performance issues for Firefox.
Earlier tonight, I modified our varnish rules to redirect all traffic to https://soylentnews.org if it comes in as plain HTTP. Unfortunately, because we dropped SSLv3 support to prevent POODLE attacks, IE6 clients will no longer be able to reach SoylentNews. If this seriously inconveniences a large number of users, we may go through the trouble of whitelisting IE6 to drop down to HTTP only.
In addition, I applied an experimental update to production to clear as many errors as possible from the Apache error logs, in an attempt to continue isolating any remaining bugs and slowdowns. I also ripped out more dead code related to FireHose, Achievements, and Tags. As such, site performance appears to be roughly back to where it should be, and I have yet to see any 500 errors post-upgrade (though I concede that said update has only been up for about two hours at this point).
Tor traffic is set to bypass HTTPS because there is no way to prevent a self-signed certificate warning, and by design Tor both encrypts and authenticates hosts when connecting to them. A few lingering issues with the Tor proxy were fixed with the most recent code push, and the onion site should be back to functioning normally.
P.S. I'm aware that the site is generating warnings because we use a SHA-1-based certificate. We will be changing out the certificate as soon as reasonably possible.
Moving on from frontend stuff, I'm getting to the point where I want to dig in deep and rewrite most of the database layer of SoylentNews/rehash. For those unfamiliar with our codebase, it's primarily written in Perl, with fairly strict adherence to the MVC model, going as far as installing system modules for code shared between the frontend and backend daemons. Slash was written with database portability in mind; at least historically, a PostgreSQL port existed in the late 90s/early 2000s, and there was some legacy Oracle code authored by VA Linux as well. That code had bitrotted to the point of unusability, leaving the MySQL backend as the only functional mode; I deleted the legacy code from our git repo about a year ago.
However, migrating away from MySQL has remained on my personal TODO list for a long time, due to its unreliability, data corruption, and configuration pain. The obvious choice from where I'm sitting is PostgreSQL. For those who aren't super familiar with the pros and cons of MySQL, this article by Elnur Abdurrakhimov has a pretty good summary and a list of links explaining in depth why MySQL is not a good solution for any large site. We've also hit a lot of pain in the nearly 1.5 years SN has been up due to limitations in the database layer, forcing us to use a clustering solution to get any sort of real redundancy for our backend. Although I'm familiar with database programming, I'm not a DBA by trade, so I'm hoping to tap into the collective knowledge of the SoylentNews community to work out a reasonable migration plan and design.
[More after the break...]
Besides my personal dislike of MySQL, there's a more important reason to migrate away from it. MySQL's support for stored procedures is incredibly poor, which means raw SQL has to be written in the application layer. rehash reduces the danger of injection by providing a set of wrapper functions such as select/insert/update which take four arguments: table, from clause, where clause, and anything extra if necessary; these parameters are assembled into a full query which is in turn properly escaped to prevent the most obvious attacks from working. Extensive whitelists are used to sanitize parameters, but by design rehash uses a single namespace, with a single user account which has full SELECT/INSERT/UPDATE/DELETE permissions across the board. If any single point is compromised, the entire database is toast. Furthermore, because of MySQL's poor support for views, massive JOINs litter the codebase, with some queries reaching across 5-6 tables (the most horrific example I can think of being the modbomb SELECT, which reaches across almost every user and comment table in the database). This makes debugging and optimizing anything a slow and painful experience.
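To make the shape of the problem concrete, here's a rough illustration of that wrapper pattern; this is a sketch in the style described above, not the actual rehash code, and the table and column names are only examples:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=rehash', $ENV{DBUSER}, $ENV{DBPASS},
                           { RaiseError => 1 });

    # Illustrative wrapper: the caller supplies SQL fragments, the wrapper
    # assembles and runs the query with the site-wide, full-privilege account.
    sub sql_select {
        my ($columns, $from, $where, $other) = @_;
        my $sql = "SELECT $columns FROM $from";
        $sql .= " WHERE $where" if $where;
        $sql .= " $other"       if $other;
        return $dbh->selectall_arrayref($sql, { Slice => {} });
    }

    # A made-up multi-table join in the style the paragraph above describes.
    my $rows = sql_select(
        'stories.stoid, story_text.title, users.nickname',
        'stories, story_text, users',
        'stories.stoid = story_text.stoid AND stories.uid = users.uid AND stories.neverdisplay = 0',
        'ORDER BY stories.time DESC LIMIT 10'
    );

Every caller that builds fragments like that is a place where a whitelist has to be right, and the single full-privilege account means one mistake exposes everything.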
What I want to do is move as much code as possible out of the application layer and down the stack into the database. Each function in Slash/DB/MySQL/MySQL.pm should be replicated with a stored procedure which, at a minimum, executes the query, and where possible relocates the query-processing logic into PL/Perl. This should be relatively straightforward to implement, and allows a high degree of code reuse because almost all of rehash's methods live in Perl modules, not in individual .pl scripts. The upshot is that the only permission the DB account requires is EXECUTE on the stored procedures; if possible, I'd also like to localize which Perl function can call which procedure, i.e., the getStories() function can only call the procedures related to it, rather than having access to all stored procedures in the database.
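As a rough sketch of what I have in mind (hypothetical names and a deliberately trivial query, not a finished design), the SQL would live in the database and the application role would only be granted EXECUTE:

    use strict;
    use warnings;
    use DBI;

    # Connect as an administrative role to (re)define the procedure; the web
    # frontend would use a separate role that only has EXECUTE on it.
    my $dbh = DBI->connect('dbi:Pg:dbname=rehash', $ENV{PGUSER}, $ENV{PGPASSWORD},
                           { RaiseError => 1 });

    # Run once at deploy time: the SQL lives in the database, not the app.
    # (Separately: GRANT EXECUTE ON FUNCTION get_story(integer) TO the web role.)
    $dbh->do(q{
        CREATE OR REPLACE FUNCTION get_story(p_stoid integer)
        RETURNS TABLE (stoid integer, title text, uid integer) AS $$
            SELECT s.stoid, st.title, s.uid
              FROM stories s
              JOIN story_text st ON st.stoid = s.stoid
             WHERE s.stoid = p_stoid;
        $$ LANGUAGE sql STABLE;
    });

    # What the application-layer method shrinks down to.
    sub getStory {
        my ($stoid) = @_;
        return $dbh->selectrow_hashref('SELECT * FROM get_story(?)', undef, $stoid);
    }

With something like that in place, the web frontend never holds SELECT/INSERT/UPDATE/DELETE on the underlying tables at all.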
This would greatly reduce the reach of any SQL injection attack, as well as harden the site against compromise; unrestricted access to the database would require breaching one of the DB servers directly instead of getting lucky via a bug in rehash. As I've stated before, no security can be perfect, but I would love to get this site to the point where only a dedicated, targeted attack would even stand a chance of succeeding. That being said, while it sounds good on paper, I'm not 100% sure this type of design is reasonable. Very few FOSS projects seem to take advantage of stored procedures, triggers, views and other such functionality, and I'm wondering if others have tried and failed to implement this level of functionality.
So, knowing what you want to do is good, but knowing how to do it is just as important. I think the first step needs to be a basic straight port of the site from MySQL to PostgreSQL, along with a better schema upgrade system. Right now, our upgrade "system" is writing queries in a text file and just executing them when a site upgrade has to be done: very low-tech, and error-prone for a large number of queries. I don't have a lot of experience managing very large schemas, so one thing I'd like advice on is whether there's a good, pre-existing framework that could simplify our database upgrades. The ideal scenario would be a single script that can intelligently upgrade the database from release to release.
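Even without a full framework, a tiny runner that applies numbered .sql files exactly once and records them in a bookkeeping table would beat the current text file. Here's a minimal sketch, assuming migrations live under sql/migrations/ as 0001_description.sql, 0002_description.sql and so on (the paths and table name are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=rehash', $ENV{PGUSER}, $ENV{PGPASSWORD},
                           { RaiseError => 1, AutoCommit => 1 });

    # Bookkeeping table: which migrations have already been applied.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS schema_migrations (
            version    text PRIMARY KEY,
            applied_at timestamptz NOT NULL DEFAULT now()
        )
    });

    my %applied = map { $_->[0] => 1 }
        @{ $dbh->selectall_arrayref('SELECT version FROM schema_migrations') };

    for my $file (sort glob 'sql/migrations/*.sql') {
        (my $version = $file) =~ s{.*/}{};
        next if $applied{$version};

        open my $fh, '<', $file or die "can't read $file: $!";
        my $sql = do { local $/; <$fh> };

        # Apply each migration atomically together with its bookkeeping row.
        $dbh->begin_work;
        $dbh->do($sql);
        $dbh->do('INSERT INTO schema_migrations (version) VALUES (?)', undef, $version);
        $dbh->commit;
        print "applied $version\n";
    }

Off-the-shelf tools exist in this space too (Sqitch, for example, is itself written in Perl) and may well be the better answer; suggestions welcome.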
Once both of these pieces are in place, a slow but steady migration of code from the application layer to the database layer would let us handle the transition in a sane manner. Database auditing can be used to keep track of the most frequently used queries and function calls, and to keep an eye on our overall progress towards an EXECUTE-only world.
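On the PostgreSQL side, the pg_stat_statements extension already collects per-query call counts, so the auditing piece may be mostly a matter of reading it out. A hedged sketch (the extension has to be enabled, and the exact column names vary between PostgreSQL versions):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=rehash', $ENV{PGUSER}, $ENV{PGPASSWORD},
                           { RaiseError => 1 });

    # Requires: CREATE EXTENSION pg_stat_statements; plus the matching
    # shared_preload_libraries setting in postgresql.conf.
    my $rows = $dbh->selectall_arrayref(q{
        SELECT query, calls, total_time
          FROM pg_stat_statements
         ORDER BY calls DESC
         LIMIT 20
    }, { Slice => {} });

    printf "%10d  %10.0fms  %s\n", $_->{calls}, $_->{total_time}, $_->{query}
        for @$rows;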
That's everything in a nutshell. I want to know what you guys think, and as always, I'll be reading and replying to comments down below!
~ NCommander
Contact me, or paulej72 on IRC, or post a comment below if you're interested in helping.
Rehash 15.05.1 - Release Notes
The primary cause of the slowdown was that rehash did large JOIN operations on text columns in MySQL. This is bad practice in general for performance reasons, but it causes a drastic slowdown on MySQL Cluster, because it prevents the query optimizer from doing what's known as a "pushdown", which would let the query execute on the NDB data nodes. Instead, each problematic query had to do multiple pulls from the database and assemble the data on the frontend, a process that took 4-5 seconds per query. This made article loads O(n*m), where n was the number of articles in the database and m the number of articles with the neverdisplay attribute set; the revised queries now load in O(1). The problem was compounded by the fact that only a limited number of httpd daemons are running at any given moment, and any database pull that hit a problematic query (which lived in index.pl and article.pl) would cause resource exhaustion.
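For illustration only, and not the exact change that was made, the general technique is to stop joining on text columns and instead resolve the small neverdisplay set with an indexed lookup, then filter the main query by integer keys (table and column names here are simplified):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=rehash', $ENV{DBUSER}, $ENV{DBPASS},
                           { RaiseError => 1 });

    # Before (schematically): a JOIN whose condition involves text columns,
    # which MySQL Cluster cannot push down to the NDB data nodes, so candidate
    # rows get shipped to the mysqld frontend and filtered there.
    #
    #   SELECT ... FROM stories JOIN story_param
    #          ON story_param.sid = stories.sid      -- text-column join
    #       WHERE story_param.name = 'neverdisplay' ...
    #
    # After (schematically): fetch the small neverdisplay set with an indexed
    # lookup, then select stories by their integer primary key.
    my $hidden = $dbh->selectcol_arrayref(
        q{SELECT stoid FROM story_param WHERE name = 'neverdisplay' AND value = '1'}
    );

    my $where = @$hidden
        ? 'stoid NOT IN (' . join(',', ('?') x @$hidden) . ')'
        : '1=1';

    my $stories = $dbh->selectall_arrayref(
        qq{SELECT stoid, sid, title FROM stories WHERE $where ORDER BY time DESC LIMIT 10},
        { Slice => {} },
        @$hidden,
    );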
Fortunately, our load balancer and varnish cache have a fairly generous timeout while waiting for an httpd to become available, which kept the site from soyling itself under high load or during an Apache restart, and prevented SN from going down entirely. Thank you, everyone, for your patience with this matter :).
~ NCommander
I am happy to announce we have reached our funding goal of $4500 for the first half of this year! From the bottom of our hearts, a big thank you to everyone for all your support. A few of you chose to give far more than we ever expected; they know who they are, and I would literally like to buy them a beer.
Continuing with the good news, our goal for July-December is less than half of our first-half goal, and we are a month ahead of schedule — details to follow on that. Though we won't all be dining on champagne and caviar, we are hopeful we will be able to continue paying the bills for the foreseeable future. For now, I just want to say that you are all one awesome community and that you continue to surprise and inspire us; we'll keep doing what we can to make this place the best it can be.
This was by far one of the most painful upgrades we've ever done to this site, and it resulted in nearly three hours of downtime. Even as of writing, we're still not back to 100% due to unexpected breakage that did not show up in dev. As I need a break from trying to debug rehash, let me write up what's known, what's new, and what went pear-shaped.
Rehash 15.05 - What's New
I want to re-state that this upgrade is by far the most invasive one we've ever done. Nearly every file and function in rehash had to be modified due to changes in the mod_perl infrastructure, and more than a few ugly hacks had to be written to emulate the original API in places. We knew going into this upgrade that it was going to be painful, but we hit a load of unexpected hiccups and headaches. Even as I write this, the site is still limping due to some of that breakage. Read on past the break for a full understanding of what has been going on.
Way back at go-live, we identified quite a few goals we needed to reach if we wanted the site to be maintainable in the long run. One of these was getting to modern versions of Apache and Perl; slashcode (and rehash) is tightly tied to the Apache API for performance reasons, and historically only ran against Apache 1.3 and mod_perl 1. This put us in the unfortunate position of running on a codebase that had long been EOLed when we launched in 2014. We took precautions to protect the site, such as running everything through AppArmor and adhering to the smallest set of permissions possible, but no matter how you looked at it, we were stuck on a dead platform. As such, this was something that *had* to get done for the sake of maintainability, security and support.
This was further complicated by a massive API break between mod_perl 1 and 2, with many (IMHO) unnecessary changes to data structures and the like that made the upgrade an all-or-nothing affair; there was no way we could move the site to the new API piecemeal. We had a few previous attempts at this port, all of them going nowhere, but over a long weekend in March, I sat down with rehash and our dev server, lithium, and got to the point where the main index could be loaded under mod_perl 2. From there, we tried to hammer down whatever bugs we could, but we were effectively maintaining both the legacy slashcode codebase and the newer rehash codebase. Due to limited development time, most bug fixes landed on rehash once it reached a state of functionality, shoehorned in with the stack of bugs we were already fixing. I took the opportunity to try and clear out as many of the long-standing wishlist bugs as possible, such as IPv6 support.
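To give a sense of why this had to be all-or-nothing, here's a minimal, generic example (not rehash code) of the same trivial handler under both APIs; module names, constants, and even basics like sending headers all change:

    # mod_perl 1 style
    package My::Handler;
    use strict;
    use warnings;
    use Apache::Constants qw(OK);

    sub handler {
        my $r = shift;                      # an Apache request object
        $r->send_http_header('text/html');  # explicit header send
        $r->print("Hello from mod_perl 1\n");
        return OK;
    }

    # mod_perl 2 style: different modules, different constants, no send_http_header
    package My::Handler2;
    use strict;
    use warnings;
    use Apache2::RequestRec ();
    use Apache2::RequestIO  ();
    use Apache2::Const -compile => qw(OK);

    sub handler {
        my $r = shift;                      # an Apache2::RequestRec object
        $r->content_type('text/html');      # headers are sent automatically
        $r->print("Hello from mod_perl 2\n");
        return Apache2::Const::OK;
    }

    1;

Multiply that kind of churn across every handler, constant, and request-object call in the codebase and you get an idea of the scope.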
In our year and a half of dealing with slashcode, we had also identified several pain points; for example, if the database went down even for a second, the site would lock up, and httpd would hang to the point that it was necessary to kill -9 the process. Although slashcode supports the native master-slave replication built into MySQL, it has no support for failover. Furthermore, MySQL's native replication is extremely lacking when it comes to reliability. Until very recently, there was no support for dynamically changing the master database in case of failure, and the manual process is exceedingly slow and error-prone. While MySQL 5.6 has improved the situation with global transaction IDs (GTIDs), it still requires code support in the application to handle failover, plus a dedicated monitoring daemon to manage the process, in effect creating a new single point of failure. It also continues to lack any functionality to heal or otherwise recover from replication failures. In my research, I found there were simply bad and worse options for handling replication and failover with vanilla MySQL. As such, I started looking seriously into MySQL Cluster, which adds multi-master replication to MySQL at the cost of some backwards compatibility.
I was hesitant to make such a large change to the system, but short of rewriting rehash to use a different RDBMS, there weren't a lot of options. After another weekend of hacking, dev.soylentnews.org was running on a two-system cluster, which provided the basis for further development. This required removing all the FULLTEXT indexes from the database and rewriting the entire search engine to use Sphinx Search. Unfortunately, there's no trivial way to migrate from vanilla MySQL to Cluster. To keep a long story from getting even longer: to perform the migration, the site would have to be offlined, a modified schema loaded into the database, and then the data re-imported in two separate transactions. Furthermore, MySQL Cluster needs to know in advance how many attributes and such are being used in the cluster, adding another tuning step to the entire process. This quirk of Cluster caused significant headaches when it came time to import the production database.
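On the application side, dropping FULLTEXT means search queries go through the Sphinx client API instead of the database handle. Here's a rough sketch using the Sphinx::Search module from CPAN; the server, port, and index name are placeholders, not our actual configuration:

    use strict;
    use warnings;
    use Sphinx::Search;

    my $sph = Sphinx::Search->new();
    $sph->SetServer('localhost', 9312);      # searchd host/port (placeholder)
    $sph->SetMatchMode(SPH_MATCH_ALL);       # require all search terms
    $sph->SetLimits(0, 20);                  # first 20 hits

    my $results = $sph->Query('systemd', 'stories_index')   # placeholder index name
        or die 'Sphinx query failed: ' . $sph->GetLastError();

    # Sphinx returns document IDs and weights; the matching rows still come
    # from the database afterwards.
    for my $match (@{ $results->{matches} || [] }) {
        printf "doc=%d weight=%d\n", $match->{doc}, $match->{weight};
    }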
To understand why things went so pear-shaped on this cluster**** of an upgrade, a little information is needed on how we do upgrades. Normally, after the code has baked for a while on dev, our QA team (Bytram) gives us an ACK when he feels it's ready. If the devs feel we're also up to scratch to deploy, one person, usually me or Paul, will push the update out to production. Normally, this is a quick process: git tag/pull and then deploy. Unfortunately, due to the massive amount of infrastructure change required by this upgrade, more work than normal was needed. In preparation, I set up our old web frontend, hydrogen, which had been down for an extended period following a system breakage, with the new Perl, Apache 2, etc., and loaded a copy of rehash onto it. The upgrade would then just be a matter of moving the database over to Cluster, changing the load balancer to point to hydrogen, and then upgrading the current web frontend, fluorine.

At 20:00 EDT, I offlined the site to handle the database migration, dumping the schema and tables. Unfortunately, MaxNoOfAttributes and the other tuning variables were set too low to handle two copies of the database, and thus the initial import failed. Due to difficulties with internal configuration changes and other headaches (such as forgetting to exclude CREATE TABLE statements from the original database dump), it took nearly two hours to simply begin importing the 700 MiB SQL file, and another 30 or so minutes for the import to finish. I admit I nearly gave up on the upgrade at this point, but was encouraged to soldier on. In hindsight, I could have tested this procedure better and gotten all the snags out of the way prior to the upgrade; the blame for the extended downtime lies solely with me. Once the database was imported, I quickly got the mysqld frontend on hydrogen up and running, as well as Apache 2, only to learn I had more problems as the site returned to the internet nearly three hours later.
What I didn't realize at the time was that hydrogen's earlier failure had not been resolved as I thought, and it gave truly abysmal performance, with 10+ second page loads. As soon as this was realized, I quickly pressed fluorine, our 'normal' frontend server, into service, and site performance went from horrific to merely bad. A review of the logs showed that some of the internal caches used by rehash were throwing errors; this wasn't an issue we had seen on dev, and it was sending excessive amounts of traffic to the database and causing Apache to hang as the system tried to keep up with the load. Two hours of debugging have yet to reveal the root cause of the failure, so I've taken a break to write this up before digging into it again.
As I write this, site performance remains fairly poor, as the frontend is hammering the database. Several features which worked on dev broke when the site was rolled out to production, and I find myself feeling that I'm responsible for hosing the site. I'm going to keep working for as long as I can stay awake to fix as many issues as I can, but it may be a day or two before we're back to business as usual. I truly apologize to the community; this entire site update has gone horribly pear-shaped, and I don't like looking incompetent. All I can do now is try to pick up the pieces and get us back to where we were. I'll keep this post updated.
~ NCommander