We always have a place for talented people, visit the Get Involved section on the wiki to see how you can make SoylentNews better.
There are two things we've always strived to do well here: (1) listen to community feedback and adjust our plans and processes based on that input, and (2) communicate changes with the community. We are working our way through a rather painful upgrade from slash to rehash, in which we've gone through four point releases just to get mostly back to where we started. A lot of folks questioned the necessity of such a large-scale change, and while we posted changelogs summarizing each release, I'd like to provide a broader picture of what happened and how we're going to fix it.
Dissecting The Rehash Upgrade
Check past the break for more.
Rehash was by far the largest and most invasive upgrade to the site, requiring modifications to nearly every component. To understand what went wrong, some context on the rehash upgrade is important, so I apologize in advance if this is a bit wordy; much of this information was present in previous upgrade posts and comments, but I want to put it all in one canonical spot for ease of reference.
For those unfamiliar with mod_perl, the original Slash codebase was coded against mod_perl 1, which in turn was tied to Apache 1.3 — which, by the time we launched this site, had been unsupported for years. Thus it was known from day one that if we were serious about both continuing with the legacy Slash codebase and keeping SoylentNews itself going, this port was something that was going to have to happen. Not if, but when.
As a stopgap, I wrote and applied AppArmor profiles to try and mitigate any potential damage, but we were in the web equivalent of Windows XP: forever vulnerable to zero-day exploits. With this stopgap in place, our first priorities were to improve site usability, sort out moderation (which of course is always a work in progress), and continue with the endless tweaks and changes we've made since go-live. During this time, multiple members of the dev team (myself included) tried to do a quick and dirty port, with no success. The API changes in mod_perl were extensive; besides the obvious API calls, many of the data structures changed, and even environment flags changed. In short, every place the codebase interacted with $r (the mod_perl request object) had to be modified, and in other places logic had to be reworked to handle changes in behavior, such as this rather notable example in how form variables are handled. Finally, after an intensive weekend of hacking and pain, I managed to get index.pl to properly render with Apache2, but it became clear that, with so little backwards compatibility, the port was all-or-nothing; there was no way to do the upgrade piecemeal.
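To give a feel for the scale of the change, here's a minimal, purely illustrative handler (not taken from rehash) showing the same task written against both APIs; the module and constant names are the standard mod_perl ones, everything else is made up for the example.

```perl
package My::Handler;    # illustrative only -- not actual rehash code

use strict;
use warnings;

# mod_perl 1 (Apache 1.3) version, for comparison:
#   use Apache::Constants qw(OK);
#   sub handler {
#       my $r = shift;                     # Apache request object
#       $r->send_http_header('text/html'); # explicit header send
#       $r->print("hello");
#       return OK;
#   }

# mod_perl 2 (Apache 2.x): different module names, compiled constants,
# and no send_http_header() at all -- the header goes out automatically.
use Apache2::RequestRec ();               # $r->content_type(), etc.
use Apache2::RequestIO  ();               # $r->print()
use Apache2::Const -compile => qw(OK);

sub handler {
    my $r = shift;
    $r->content_type('text/html');
    $r->print("hello");
    return Apache2::Const::OK;
}

1;
```

Multiply that by every handler, every $r method call, and every constant in the codebase, and you have a rough idea of why the port dragged on.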
Furthermore, over time, we'd also noticed other aspects which were problematic. As coded, Slash used MySQL replication for performance, but had no support for multi-master operation or failover. This was compounded by the fact that if the database went down, even momentarily, the entire stack would completely hang; apache and slashd would have to be manually kill -9ed and restarted for the site to be usable. It was further complicated by the fact that, in practice, MySQL replication leaves a lot to be desired; there's no consistency check to confirm that the master and slave actually hold identical data. Unless the entire frontend was shut down and master->slave replication verified, it was trivial to lose data due to an ill-timed shutdown or crash (and as time has shown, our host, Linode, sometimes has to restart our nodes to apply security updates to their infrastructure). In practice, this meant that failing over the master was a slow and error-prone process, and after failover, replication would have to be manually re-established in the reverse direction to bring the former master up to date, then failed back by hand.
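To make the manual nature of that process concrete, here's a rough sketch (not our actual tooling) of the kind of check an admin had to run by hand before trusting a failover; the host names and credentials are placeholders, and even when it passes it only proves the slave thinks it has applied everything, not that the data is truly identical.

```perl
use strict;
use warnings;
use DBI;

# Placeholder host/credentials -- connect to the slave and ask it how
# replication is doing before even thinking about promoting it.
my $slave = DBI->connect('DBI:mysql:host=db-slave.example', 'monitor', 'secret',
                         { RaiseError => 1 });

my $status = $slave->selectrow_hashref('SHOW SLAVE STATUS');

die "Replication is not running\n"
    unless $status->{Slave_IO_Running}  eq 'Yes'
       and $status->{Slave_SQL_Running} eq 'Yes';

die "Slave is $status->{Seconds_Behind_Master}s behind the master\n"
    if $status->{Seconds_Behind_Master};

print "Slave claims to be caught up; promotion is merely risky, not reckless.\n";
```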
While MySQL 5.6 implemented GTID-based replication to remove some of that pain, it still wasn't a workable solution for us. Although it is possible to get multi-master replication in vanilla MySQL, it would require serious tweaks to how AUTO_INCREMENT is handled in the codebase, and it violates what little ACID compliance MySQL has. As an aside, I found out that the other site uses this form of multi-master replication. For any discussion-based website, the database is the most important piece of mission-critical infrastructure. Thus, rehash's genesis had two explicit goals attached to it: getting off the dead Apache 1.3/mod_perl 1 platform, and fixing database redundancy and failover.
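For readers wondering what those AUTO_INCREMENT tweaks look like, this is the standard vanilla-MySQL trick for circular multi-master setups, shown purely as an illustration (host names and credentials are made up): each master hands out ids from a disjoint series, which breaks any code that assumes ids are contiguous.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection to the first master.
my $dbh = DBI->connect('DBI:mysql:host=db-master-1.example', 'admin', 'secret',
                       { RaiseError => 1 });

# Master 1 gets ids 1, 3, 5, ...
$dbh->do('SET GLOBAL auto_increment_increment = 2');
$dbh->do('SET GLOBAL auto_increment_offset    = 1');

# Master 2 would use the same increment with offset = 2 (ids 2, 4, 6, ...),
# so concurrent inserts on both masters can never collide -- at the cost of
# gaps and interleaving that application code has to tolerate.
```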
The MySQL redundancy problem proved to be the thornier one, and short of porting the site wholesale to another database engine, the only solution I could find that would keep our database ACID-compliant was MySQL Cluster. In a cluster configuration, the data itself is stored by a backend daemon known as ndbd, and instances of mysqld act as frontends to the underlying NDB datastore, which in turn keeps everything consistent across nodes. Unfortunately, MySQL Cluster is not 100% compatible with vanilla MySQL: FULLTEXT indexes aren't supported under cluster, and some types of queries involving JOINs have considerably longer execution times if they cannot be parallelized.
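In practice, "moving a table to cluster" boils down to something like the following sketch (the table and index names are illustrative, not our real schema): any FULLTEXT index has to be dropped first, because NDB simply won't accept it, and then the table is switched to the NDB storage engine.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection to one of the mysqld frontends.
my $dbh = DBI->connect('DBI:mysql:database=soylent;host=sql-frontend.example',
                       'admin', 'secret', { RaiseError => 1 });

# FULLTEXT indexes are not supported by NDB, so they must go first...
$dbh->do('ALTER TABLE story_text DROP INDEX body_fulltext');

# ...then the table itself is handed over to the cluster data nodes.
$dbh->do('ALTER TABLE story_text ENGINE = NDBCLUSTER');
```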
As an experiment, I initialized a new cluster instance and moved the development database onto it to see if the idea was even practical. Surprisingly, with the exception of the site search engine, everything appeared at first glance to be more or less functional under cluster. This gave us the basis for the site upgrade.
One limitation of our dev environment is that we do not have the infrastructure to properly load test changes before deployment. We knew there were going to be bugs and issues with the site upgrade, but we were also getting to the point that if we didn't deploy rehash, there was a good chance we never would. I subscribe to the notion of release early, release often. We believed that the site would be mostly functional post-upgrade, and that any issues encountered would be relatively minor. Unfortunately, we were wrong, and it took four site upgrades, many rewritten queries, active debugging on production, and a fair bit of luck to get back to normality. All things considered, not a great situation to be in.
Because of this, I want to work out a fundamental plan that prevents a repeat of such a painful upgrade and keeps the site from destabilizing even if we make further large-scale changes.
On the most basic level, good documentation goes a long way towards keeping things both maintainable and usable. Unfortunately, a large part of the technical documentation on the site is over 14 years old. As such, I've made an effort to go through and bring the PODs up to date, including, but not limited to, a revised README, updated INSTALL instructions, notes on some of the quirkier parts of the site, and so forth. While I don't expect a huge uptake of people running rehash for their personal sites, being able to run a development instance of the site will hopefully lower the barrier to drive-by contributions. As of right now, I've implemented a "make build-environment" target which automatically downloads and installs local copies of Apache, Perl, and mod_perl, plus all the CPAN modules required for rehash. This makes it easier both to update the site and to roll in security fixes from upstream.
With luck, we can get to the point that someone can simply read the docs and have a full understanding of how rehash goes together; as always, understanding is the first step towards succeeding at anything.
One thing that came out of the aftermath of the rehash upgrade is that the underlying schema and configuration tables on production and dev have drifted apart. The reason for this is fairly obvious: our method of doing database upgrades is crude at best. rehash has no automated method of updating the database; instead, the queries to be executed get written to sql/mysql/upgrades and run by hand during a site upgrade, while the initial schema is updated separately for new installs. The practical end result is that the installation scripts, the dev database, and production all have slightly different layouts due to human error. Wherever possible, we should limit the amount of manual effort required to manage and administer SoylentNews. If anyone knows of a good pre-existing framework we can use to do database upgrades, I'm all ears. Otherwise, I'll be looking at building one from scratch, roughly along the lines sketched below, and integrating it into our development cycle.
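Nothing like this exists in rehash yet, so treat the following as a thumbnail sketch of the idea rather than a design: track applied upgrades in a table, keep one file per change (here assumed to live under a hypothetical sql/mysql/upgrades.d/ directory rather than the single file we use today), and apply whatever hasn't been run yet, in order.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use File::Basename qw(basename);

my $dbh = DBI->connect('DBI:mysql:database=soylent;host=localhost',
                       'slash', 'secret', { RaiseError => 1 });

# Bookkeeping table: which upgrade files have already been applied.
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS schema_migrations (
    filename   VARCHAR(255) NOT NULL PRIMARY KEY,
    applied_at DATETIME     NOT NULL
)
SQL

my %applied = map { $_->[0] => 1 }
    @{ $dbh->selectall_arrayref('SELECT filename FROM schema_migrations') };

# Hypothetical layout: one .sql file per change, named so they sort in order.
for my $file (sort glob 'sql/mysql/upgrades.d/*.sql') {
    my $name = basename($file);
    next if $applied{$name};

    open my $fh, '<', $file or die "Can't read $file: $!";
    my $sql = do { local $/; <$fh> };

    # Naive split on semicolons; a real implementation needs smarter parsing.
    $dbh->do($_) for grep { /\S/ } split /;\s*\n/, $sql;

    $dbh->do('INSERT INTO schema_migrations (filename, applied_at) VALUES (?, NOW())',
             undef, $name);
    print "applied $name\n";
}
```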
For anyone who has worked on a large project before, unit testing can be a developer's best friend. It lets you know that your API is doing what you want and behaving as expected. In normal circumstances, unit testing a web application is difficult or impossible, as much of the logic is not exposed in a way that makes testing easy, requiring tools like Windmill to simulate page inputs and outputs. Based on previous projects I've done, I'd normally say that represents more effort than it's worth, since you frequently have to update the tests even for minor UI changes. In our case, we have a more realistic option. A quirk of rehash's heritage is that approximately 95% of it lives in global Perl modules that are installed either in the site_perl directory or in the plugins/ directory. As a result, rehash adheres strongly to the Model-View-Controller design.
That means we have a clear and (partially) documented API to code against, which allows us to write simple tests and check the returned data structures instead of trying to parse HTML to know if something is good or bad. Such a test suite would have made porting the site to mod_perl 2 much simpler, and it will come in useful if we ever change database engines or operating system platforms. As such, I've made it a high priority to at least get the core libraries covered by unit tests to ensure consistent behavior in light of any updates we may make. This is going to be a considerable amount of effort, but I strongly suspect it will reduce our QA workload and make our upgrades close to a non-event.
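As a sketch of the style of test I have in mind, the following exercises the module API directly with Test::More and checks the result, rather than scraping HTML; the specific helper and mode are quoted from memory, so take them as illustrative rather than a guaranteed part of the current API.

```perl
use strict;
use warnings;
use Test::More;

# Load one of the core modules; failure here is itself a useful test result.
use_ok('Slash::Utility');

# Pure helper functions are the easy wins: known input, known output.
my $clean = stripByMode('<script>alert(1)</script>hello', 'nohtml');
unlike($clean, qr/<script>/i, 'stripByMode strips markup in nohtml mode');
like($clean, qr/hello/, 'the actual text survives');

done_testing();
```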
The rehash upgrade was a wakeup call that we need to improve our processes and methodology, as well as automate aspects of the upgrade process. Even though we're all volunteers and operate on a best-effort basis, destabilizing the site for a week is not something I personally consider acceptable, and I accept full responsibility, as I was the one who both pushed for the upgrade and deployed it to production. As a community, you've been incredibly tolerant, but I have no desire to test your collective patience. In practice, our next development cycle will be very low key as we work to build the systems outlined in this document and further tighten up and polish rehash. To all, keep being awesome, and we'll do our best to meet your expectations.
~ NCommander
We just deployed a new point upgrade to rehash today to fix a bunch of small bugs that have been with us since the rehash upgrade and a few that were around longer than that. Here are the highlights:
We were able to kill off about 10 high priority bugs with this mini release. Current issues and feature requests can be found on GitHub and you can submit new issues or feature requests here if you have a GitHub account. We also welcome bugs and requests sent via email to admin@soylentnews.org or left in the comments below.
Our goal for the next major update is more of the same bug hunting and killing, with a few features added here and there. Again, I would like to thank you for your patience with the growing pains we have had with the 15_05 rehash upgrade. This update should bring us mostly back to where we were before in terms of broken features.
As debugging efforts continue, the community should expect us to run a daily story on our effort to return the site to normality. We did another bugfix rollout to production to continue cleaning up the error logs, and Paul and I continue to make modifications to improve site performance. As you may have already noticed, we've managed a serious coup in the battle for low page-load times by replacing Linode's NodeBalancer product with a self-rolled nginx frontend proxy. The vast majority of page loads are now sub-second.
Here's the quick list of stuff we changed over the weekend:
Rehash 15.05.3 - Changelog
Although we've yet to definitively locate and fix the cause of the 500s and HASH(*) entries you sometimes get on page load, I now have a working theory on what's going on.
During debugging, I noticed we'd almost universally get a bad load from the cache if varnish or the load balancer burped for a moment. As best I can tell from running backwards through traces and various error logs, the underlying cause of the 500s and bad page loads is one of two things: either we're getting bad data back from memcached on a cache read, or bad data is being loaded into the cache from the database. Of the two, I'm leaning towards the former, since if we were loading bad data into memcached, we'd see consistent 500s once the cache was corrupted.
It's somewhat difficult to accept that memcached is responsible for our corruption issues; we've been using it since go-live a year and a half ago. However, given the lack of other leads, I flipped the memcached config switch to off and then load-tested the site to see how bad the performance drop would be. Much to my surprise, the combination of query optimizations, the faster Apache 2 codebase, and (for the most part) the increased responsiveness of a multi-master database seems able to take up the slack for the time being. As of writing, memcached has been disabled for several hours and I've yet to see any of the telltale signs of corruption in error_log. I should also note that it's the middle of the night for the vast majority of our users, so this may just be the calm before the storm.
I dislike using production as something of a guinea pig, but given the very transitory nature of this bug, and our inability to reliably reproduce it on our dev systems, we're left between a rock and a hard place. I would like to thank the SoylentNews community for their understanding over the last week, and I hope normality may finally return to this site :)
~ NCommander
We are having some difficulties with the site at the moment. We are aware that there is an error about our certificate having expired. We are working on it and appreciate your patience while we iron this out.
Interim measure #1: try to use the non-secure link to the site http://soylentnews.org
Interim measure #2: Accept the expired cert, but be sure to uncheck the 'Permanently store this exception' checkbox (this may not be available to you on very recent versions of Firefox).
If you have other suggestions for workarounds, please submit them as a comment. Rest assured we are doing all we can to get the site back up and running!
[Update] NCommander reports that: "New SSL certificate has been installed and the site no longer generates certificate errors. I'm continuing to do work to get performance for Firefox to be decent."
[Update 2]: A revised frontend proxy is nearly operational and ready to go into service, which will resolve the performance issues for Firefox.
Earlier tonight, I modified our varnish rules to redirect all traffic that comes in as plain HTTP to https://soylentnews.org. Unfortunately, because we dropped SSLv3 support to prevent POODLE attacks, IE6 clients will no longer be able to reach SoylentNews. If this seriously inconveniences a large number of users, we may go through the trouble of whitelisting IE6 to drop down to plain HTTP.
In addition, I applied an experimental update to production to try and clear as many errors as possible from the Apache error logs, in an attempt to continue isolating any remaining bugs and slowdowns. I also ripped out more dead code related to FireHose, Achievements, and Tags. As such, site performance appears to roughly be back to where it should be, and I have yet to see any 500 errors post-upgrade (though I concede that said update has only been up for about 2 hours at this point).
Tor traffic is set to bypass HTTPS, both because there is no way to prevent a self-signed certificate warning and because, by design, Tor already encrypts and authenticates hosts when connecting to them. A few lingering issues with the Tor proxy were fixed with the most recent code push, and the onion site should be back to functioning normally.
P.S. I'm aware that the site is generating warnings due to the fact we use a SHA-1 based certificate. We will be changing out the certificate as soon as reasonably possible.
Moving on from frontend stuff, I'm getting to the point that I want to dig in deep and rewrite most of the database layer of SoylentNews/rehash. For those who are unfamiliar with our codebase, it's primarily written in Perl, with fairly strict adherence to the MVC model, going as far as installing system modules for code shared between the frontend and backend daemons. Slash was written with database portability in mind; a port to PostgreSQL existed in the late 90s/early 2000s, and there was some legacy Oracle code authored by VA Linux as well. That code bitrotted to the point of unusability, leaving the MySQL backend as the only functional mode; I deleted the legacy code from our git repo about a year ago.
However, migrating away from MySQL has been on my personal TODO list for a long time, due to unreliability, data corruption, and configuration pain. The obvious choice from where I'm sitting is PostgreSQL. For those who aren't super familiar with the pros and cons of MySQL, this article by Elnur Abdurrakhimov has a pretty good summary and a list of links explaining in depth why MySQL is not a good solution for any large site. We've also hit a lot of pain in the nearly 1.5 years SN has been up due to limitations in the database layer, forcing us to use a clustering solution to provide any sort of real redundancy for our backend. Although I'm familiar with database programming, I'm not a DBA by trade, so I'm hoping to tap into the collective knowledge of the SoylentNews community to work out a reasonable migration plan and design.
[More after the break...]
Besides my personal dislike of MySQL, there's a more important reason to migrate away from it. MySQL's support for stored procedures is incredibly poor, which means raw SQL has to be written in the application layer. rehash reduces the danger of injection by providing a set of wrapper functions such as select/insert/update which take four arguments: the table, the from clause, the where clause, and anything extra if necessary; these parameters are assembled into a full query which is in turn properly escaped to prevent most obvious attacks from working. Extensive whitelists are used for sanitizing parameters, but by design, rehash uses a single namespace, with a single user account which has full SELECT/INSERT/UPDATE/DELETE permissions across the board. If any single point is compromised, the entire database is toast. Furthermore, because of MySQL's poor support for views, massive JOINs litter the codebase, with some queries reaching across 5-6 tables (the most horrific example I can think of being the modbomb SELECT, which reaches across almost every user and comment table in the database). This makes debugging and optimizing anything a slow and painful experience.
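For readers who haven't seen the codebase, the wrapper pattern looks roughly like the simplified sketch below; the real methods in Slash/DB/MySQL/MySQL.pm carry more escaping and error handling, and the internal field names here are invented for the example.

```perl
# Simplified sketch of the query-wrapper pattern, not the exact rehash code.
sub sqlSelect {
    my ($self, $select, $from, $where, $other) = @_;

    my $sql  = "SELECT $select FROM $from";
    $sql    .= " WHERE $where" if $where;
    $sql    .= " $other"       if $other;      # ORDER BY, LIMIT, ...

    my $sth = $self->{_dbh}->prepare($sql);    # {_dbh} is invented for the sketch
    $sth->execute;
    my @row = $sth->fetchrow_array;
    $sth->finish;
    return @row;
}

# Caller side -- the query is still raw SQL assembled in the application,
# with values escaped by a quoting helper before they're interpolated:
#   my ($nick) = $slashdb->sqlSelect('nickname', 'users',
#                                    'uid = ' . $slashdb->sqlQuote($uid));
```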
What I want to do is move as much code as possible out of the application layer and down the stack into the database. Each function in Slash/DB/MySQL/MySQL.pm should be replicated by a stored procedure which, at a minimum, executes the query, and where possible as much of the query-processing logic as we can should be relocated into PL/Perl modules. This should be relatively straightforward to implement, and it allows high code reuse because almost all of rehash's methods exist in Perl modules rather than in individual .pl scripts. The upshot is that the only permission the DB account requires is EXECUTE on the stored procedures; if possible, I'd also like to localize which Perl function can call which procedure; i.e., the getStories() function could only call the procedures related to story retrieval, rather than having access to every stored procedure in the database.
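As a rough sketch of the shape this could take under PostgreSQL (every name here is hypothetical; this is not a worked design): the function owns the query, the application account gets EXECUTE on it and nothing else, and the Perl side never touches the tables directly.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical application account with EXECUTE-only rights.
my $dbh = DBI->connect('DBI:Pg:dbname=soylent;host=localhost',
                       'rehash_app', 'secret', { RaiseError => 1 });

# One-time setup, run as the schema owner rather than the app account:
#   CREATE FUNCTION get_story(p_stoid integer)
#       RETURNS SETOF stories
#       LANGUAGE sql SECURITY DEFINER
#       AS $$ SELECT * FROM stories WHERE stoid = p_stoid $$;
#   REVOKE ALL ON stories FROM rehash_app;
#   GRANT EXECUTE ON FUNCTION get_story(integer) TO rehash_app;

# Application side: the Perl layer only ever calls the function.
my $stoid = 42;    # example story id
my $story = $dbh->selectrow_hashref('SELECT * FROM get_story(?)', undef, $stoid);
```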
This would greatly reduce the reach of any SQL injection attack, as well as harden the site against compromise; unrestricted access to the database would require breaching one of the DB servers directly instead of getting lucky via a bug in rehash. As I've stated before, no security can be perfect, but I would love to get this site to the point that only a dedicated, targeted attack would even stand a chance of succeeding. That being said, while it sounds good on paper, I'm not 100% sure this type of design is reasonable. Very few FOSS projects seem to take advantage of stored procedures, triggers, views, and other such functionality, and I'm wondering whether others have tried this level of separation and failed.
So, knowing what you want to do is good, but knowing how to do it is just as important. I think the first step needs to be a basic straight port of the site from MySQL to PostgreSQL, along with a better schema upgrade system. As of right now, our upgrade "system" is writing queries in a text file and just executing them when a site upgrade has to be done. Very low-tech, and error-prone for a large number of queries. I don't have a lot of experience managing very large schemas, so one thing I'd like advice on is whether there's a good, pre-existing framework that could simplify our database upgrades. The ideal scenario would be a single script which can intelligently upgrade the database from release to release.
Once both these pieces are in play, a slow but steady migration of code from the application layer to the database layer would allow us to handle the transition in a sane manner. Database auditing can be used to keep track of the most frequently used queries and function calls, and keep an eye on our total progress towards reaching an EXECUTE-only world.
That's everything in a nutshell. I want to know what you guys think, and as always, I'll be reading and replying to comments down below!
~ NCommander
Contact me, or paulej72 on IRC, or post a comment below if you're interested in helping.
Rehash 15.05.1 - Release Notes
The primary cause of the slowdown was that rehash did large JOIN operations on text columns in MySQL. This is bad practice in general for performance reasons, but under MySQL Cluster it causes a drastic slowdown, because it prevents the query optimizer from performing what's known as a "pushdown" and letting the query execute directly on the NDB data nodes. Instead, mysqld had to do multiple pulls from the data nodes and assemble the results on the frontend, a process that took 4-5 seconds per problematic query. This made article loads O(n*m), where n was the number of articles in the database and m was the number of articles with the neverdisplay attribute set; the revised queries now load in O(1). The problem was compounded by the limited number of httpd processes available at any given moment: any request that hit a problematic query (they lived in index.pl and article.pl) would cause resource exhaustion.
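As a purely hypothetical before/after in the same spirit (these are not the actual rehash queries, and the table layout is simplified): the slow form buries a text comparison in the JOIN, which the cluster engine can't push down, while the rewrite keeps the hot path on indexed integer keys and resolves the text-valued attribute in a separate, cheap lookup.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=soylent;host=localhost',
                       'slash', 'secret', { RaiseError => 1 });

# Before (slow under cluster): the text attribute check rides along in the
# join, so every candidate story row drags story_param text into the plan.
my $old_query = q{
    SELECT s.stoid
    FROM   stories s
    LEFT JOIN story_param p
           ON p.stoid = s.stoid AND p.name = 'neverdisplay'
    WHERE  p.stoid IS NULL
    ORDER  BY s.time DESC
};

# After: fetch the (short) list of hidden story ids once...
my $hidden = $dbh->selectcol_arrayref(
    q{SELECT stoid FROM story_param WHERE name = 'neverdisplay'});
my %hide = map { $_ => 1 } @$hidden;

# ...and keep the main query on integer keys only, filtering in Perl.
my $recent = $dbh->selectcol_arrayref(
    q{SELECT stoid FROM stories ORDER BY time DESC LIMIT 100});
my @visible = grep { !$hide{$_} } @$recent;
```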
Fortunately, our load balancer and varnish cache have a fairly generous timeout while waiting for an httpd process to become available, which kept the site from soyling itself under high load or during an Apache restart, and kept SN from going down entirely. Thank you, everyone, for your patience with this matter :).
~ NCommander
I am happy to announce we have reached our funding goal of $4500 for the first half of this year! From the bottom of our hearts, a big thank you to everyone for all your support. A few of you chose to give far more than we ever expected; they know who they are, and I would quite literally like to buy them a beer.
Continuing with the good news, our goal for July-December is less than half of our first-half goal, and we are a month ahead of schedule — details to follow on that. Though we won't all be dining on champagne and caviar, we are hopeful we will be able to continue paying the bills for the foreseeable future. For now, I just want to say that you are all one awesome community and that you continue to surprise and inspire us; we'll keep doing what we can to make this place the best it can be.
This was by far one of the most painful upgrades we've ever done to this site, and it resulted in nearly three hours of downtime. Even as of writing, we're still not back to 100% due to unexpected breakage that did not show up in dev. As I need a break from trying to debug rehash, let me write up what's known, what's new, and what went pear-shaped.
Rehash 15.05 - What's New
I want to re-state that this upgrade is by far the most invasive one we've ever done. Nearly every file and function in rehash had to be modified due to changes in the mod_perl infrastructure, and more than a few ugly hacks had to be written to emulate the original API in places. We knew going into this upgrade that it was going to be painful, but we still hit a load of unexpected hiccups and headaches. Even as I write this, the site is still limping due to some of that breakage. Read more past the break for a full understanding of what has been going on.
Way back at go-live, we identified quite a few goals we needed to reach if we wanted the site to be maintainable in the long run. One of these was getting to a modern version of Apache and Perl; slashcode (and rehash) is tightly tied to the Apache API for performance reasons, and historically only ran against Apache 1.3 and mod_perl 1. This put us in the unfortunate position of running on a codebase that had long been EOLed by the time we launched in 2014. We took precautions to protect the site, such as running everything through AppArmor and trying to adhere to the smallest set of permissions possible, but no matter how you looked at it, we were stuck on a dead platform. As such, this was something that *had* to get done for the sake of maintainability, security, and support.
This was further complicated by a massive API break between mod_perl 1 and 2, with many (IMHO) unnecessary changes to data structures and the like, which meant the upgrade was an all-or-nothing affair. There was no way we could upgrade the site to the new API piecemeal. We had a few previous attempts at this port, all of them going nowhere, but over a long weekend in March, I sat down with rehash and our dev server, lithium, and got to the point where the main index could be loaded under mod_perl 2. From there, we tried to hammer down whatever bugs we could, but we were effectively maintaining both the legacy slashcode codebase and the newer rehash codebase. Due to limited development time, once rehash reached a functional state most bug fixes landed there, shoehorned in alongside the stack of bugs we were already fixing. I took the opportunity to try and clear out as many long-standing wishlist bugs as possible, such as IPv6 support.
In our year and a half of dealing with slashcode, we had also identified several pain points; for example, if the database went down even for a second, the site would lock up, and httpd would hang to the point that it was necessary to kill -9 the process. Although slashcode supports the native master-slave replication built into MySQL, it has no support for failover. Furthermore, MySQL's native replication is extremely lacking in the area of reliability. Until very recently, there was no support for dynamically changing the master database in case of failure, and the manual process is exceedingly slow and error-prone. While MySQL 5.6 has improved the situation with global transaction IDs (GTIDs), it still requires code support in the application to handle failover, plus a dedicated monitoring daemon to manage the process, in effect creating a new single point of failure. It also continues to lack any functionality to heal or otherwise recover from replication failures. In my research, I found that there were simply bad and worse options with vanilla MySQL for handling replication and failover. As such, I started looking seriously into MySQL Cluster, which adds multi-master replication to MySQL at the cost of some backwards compatibility.
I was hesitant to make such a large change to the system, but short of rewriting rehash to use a different RDBMS, there weren't a lot of options. After another weekend of hacking, dev.soylentnews.org was running on a two-system cluster, which provided the basis for further development. This required removing all the FULLTEXT indexes in the database and rewriting the entire search engine to use Sphinx Search. Unfortunately, there's no trivial way to migrate from vanilla MySQL to cluster. To keep a long story from getting even longer: to perform the migration, the site would have to be taken offline, a modified schema loaded into the database, and then the data re-imported in two separate transactions. Furthermore, MySQL Cluster needs to know in advance how many attributes and such are being used in the cluster, adding another tuning step to the entire process. This quirk of cluster caused significant headaches when it came time to import the production database.
To understand why things went so pear-shaped on this cluster**** of an upgrade, a little information is needed on how we do upgrades. Normally, after the code has baked for a while on dev, our QA team (Bytram) gives us an ACK when he feels it's ready. If the devs also feel we're up to scratch, one person, usually me or Paul, will push the update out to production. Normally, this is a quick process: git tag/pull and then deploy. Unfortunately, due to the massive amount of infrastructure change required by this upgrade, more work than normal was required. In preparation, I set up our old web frontend, hydrogen, which had been down for an extended period following a system breakage, with the new Perl, Apache 2, etc., and loaded a copy of rehash onto it. The upgrade would then just be a matter of moving the database over to cluster, changing the load balancer to point to hydrogen, and then upgrading the current web frontend, fluorine. At 20:00 EDT, I offlined the site to handle the database migration, dumping the schema and tables. Unfortunately, MaxNoOfAttributes and the other tuning variables were too low to handle two copies of the database, and thus the initial import failed. Due to difficulty with internal configuration changes and other headaches (such as forgetting to exclude CREATE TABLE statements from the original database), it took nearly two hours to simply begin importing the 700 MiB SQL file, and another 30 or so minutes for the import to finish. I admit I nearly gave up the upgrade at this point, but was encouraged to soldier on. In hindsight, I could have tested this procedure better and gotten all the snags out of the way prior to the upgrade; the blame for the extended downtime lies solely with me. Once the database was updated, I quickly got the mysqld frontend on hydrogen up and running, as well as Apache 2, only to learn I had more problems as the site returned to the internet nearly three hours later.
What I didn't realize at the time was that hydrogen's earlier failure had not been resolved as I had thought, and it gave truly abysmal performance, with 10+ second page loads. As soon as this was realized, I quickly pressed fluorine, our 'normal' frontend server, into service, and site performance went from horrific to merely bad. A review of the logs showed that some of the internal caches used by rehash were throwing errors; this wasn't an issue we had seen on dev, and it was causing excessive traffic to hit the database and Apache to hang as the system tried to keep up with the load. Two hours of debugging have yet to reveal the root cause of the failure, so I've taken a break to write this up before digging into it again.
As I write this, site performance remains fairly poor, as the frontend is hammering the database excessively. Several features which worked on dev broke when the site was rolled out to production, and I find myself feeling that I'm responsible for hosing the site. I'm going to keep working for as long as I can stay awake to fix as many issues as I can, but it may be a day or two before we're back to business as usual. I truly apologize to the community; this entire site update has gone horribly pear-shaped, and I don't like looking incompetent. All I can do now is try to pick up the pieces and get us back to where we were. I'll keep this post updated.
~ NCommander