Title: Learning From Our Mistakes (Or How To Prevent Another Painful Upgrade)
Date: Friday June 19 2015, @12:45PM
|from the hoping-not-to-break-the-site-again-in-2015 dept.|
There are two things we've always strived to do well here: (1) listen to community feedback and adjust our plans and processes based on that input, and (2) communicate changes to the community. We are working our way through a rather painful upgrade from slash to rehash, whereby we've gone through four point releases to get mostly back to where we started. A lot of folks questioned the necessity of such a large-scale change, and while we posted changelogs summarizing changes, I'd like to provide a broad picture of what happened and how we're going to fix it.
Dissecting The Rehash Upgrade
Rehash was by far the largest and most invasive upgrade to the site, requiring modifications to nearly every component. To understand what went wrong, it's important to understand the full context of the rehash upgrade, so I apologize in advance if this is a bit wordy; much of this information was present in previous upgrade posts and comments, but I want to put it all in one canonical spot for ease of reference.
For those unfamiliar with mod_perl: the original Slash codebase was coded against mod_perl 1, which in turn was tied to Apache 1.3, which had been unsupported for years by the time we launched this site. Thus we knew from day one that if we were serious about both continuing with the legacy Slash codebase and keeping SoylentNews itself going, a port to a supported Apache would have to be done. Not if, but when.
As a stopgap, I wrote and applied AppArmor profiles to try to mitigate any potential damage, but we were in the web equivalent of Windows XP: forever vulnerable to zero-day exploits. With this stopgap in place, our first priorities were to improve site usability, sort out moderation (which of course is always a work in progress), and continue with the endless tweaks and changes we've made since go-live. During this time, multiple members of the dev team (myself included) tried to do a quick and dirty port, with no success. The API changes in mod_perl were extensive; besides the obvious API calls, many of the data structures had changed, and even environment flags were different. In short, every place the codebase interacted with $r (the mod_perl request object) had to be modified, and in other places logic had to be reworked to handle changes in behavior, a rather notable example being how form variables are handled. Finally, after an intensive weekend of hacking and pain, I managed to get index.pl to render properly with Apache 2, but it became clear that, due to very limited backwards compatibility, the port was all-or-nothing; there was no way to do the upgrade piecemeal.
Furthermore, over time, we'd noticed other aspects of the codebase that were problematic. As coded, Slash used MySQL replication for performance, but had no support for multi-master or failover. This was compounded by the fact that if the database went down, even momentarily, the entire stack would completely hang; Apache and slashd would have to be manually kill -9ed and restarted for the site to be usable. This was further complicated in that, in practice, MySQL replication leaves a lot to be desired; there's no consistency check to confirm that the master and slave have 1:1 data. Unless the entire frontend was shut down and master->slave replication was verified, it was trivial to lose data due to an ill-timed shutdown or crash (and as time has shown, our host, Linode, sometimes has to restart our nodes to apply security updates to their infrastructure). In practice, this meant that failing over the master was a slow and error-prone process, and after failover, replication would have to be manually re-established in the reverse direction to bring the former master up to date, then failed back by hand.
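To make the verification step above concrete, here is a minimal sketch of a master/replica comparison. It is an illustration only: it is written in Python with in-memory SQLite databases standing in for the real MySQL nodes, and the `comments` table and the helper names (`table_digest`, `in_sync`, `make_db`) are hypothetical, not anything that exists in Slash or rehash.

```python
import hashlib
import sqlite3


def table_digest(conn, table, key="id"):
    """Hash every row of `table`, read in `key` order, into one hex digest."""
    digest = hashlib.sha256()
    # Table and column names are hard-coded by the caller here, never user input.
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def in_sync(master, replica, table):
    """True when both connections hold identical rows for `table`."""
    return table_digest(master, table) == table_digest(replica, table)


def make_db(rows):
    """Build an in-memory stand-in database with a tiny comments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)
    conn.commit()
    return conn


master = make_db([(1, "first post"), (2, "second post")])
replica = make_db([(1, "first post")])  # simulates a write lost in a crash

before_resync = in_sync(master, replica, "comments")
replica.execute("INSERT INTO comments VALUES (2, 'second post')")
after_resync = in_sync(master, replica, "comments")
```

A comparison like this is only meaningful while writes are stopped, which is exactly why the frontend had to be shut down before replication could be trusted.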
While MySQL 5.6 implemented GTID-based replication to remove some of the pain, it still failed to be a workable solution for us. Although it is possible to get multi-master replication in vanilla MySQL, it would require serious tweaks to how AUTO_INCREMENT works in the codebase, and would violate what little ACID compliance MySQL has. As an aside, I found out that the other site uses this form of multi-master replication. For any discussion-based website, the database is the most important piece of mission-critical infrastructure. Thus, rehash's genesis had two explicit goals attached to it: move the codebase off mod_perl 1 and Apache 1.3, and solve the MySQL redundancy problem.
The MySQL redundancy problem proved difficult, and short of porting the site wholesale to another DB engine, the only solution I could find that would keep our database ACID compliant was MySQL Cluster. In a cluster configuration, the database itself is stored by a backend daemon known as ndbd; instances of mysqld act as frontends to the underlying NDB datastore, which in turn keeps everything consistent. Unfortunately, MySQL Cluster is not 100% compatible with vanilla MySQL. FULLTEXT indexes aren't supported under Cluster, and some types of queries involving JOINs take considerably longer to execute if they cannot be parallelized.
As an experiment, I initialized a new cluster instance, and moved the development database to it, to see if the idea was even practical. Surprisingly, with the exception of the site search engine, at first glance, everything appeared to be more or less functional under cluster. As such, this provided us with the basis for the site upgrade.
One limitation of our dev environment is that we do not have the infrastructure to properly load test changes before deployment. We knew that there were going to be bugs and issues with the site upgrade, but we were also getting to the point that if we didn't deploy rehash, there was a good chance we wouldn't do it at all. I subscribe to the notion of release early, release often. We believed that the site would be mostly functional post-upgrade, and that any issues encountered would be relatively minor. Unfortunately, we were wrong, and it took four site upgrades to get us back to normality, which entailed rewriting many queries, performing active debugging on production, and a lot of luck. All things considered, not a great situation to be in.
Because of this, I want to work out a fundamental plan to prevent a repeat of such a painful upgrade, and to keep the site from destabilizing even if we make further large-scale changes.
On the most basic level, good documentation goes a long way in keeping stuff both maintainable and usable. Unfortunately, a large part of the technical documentation on the site is over 14 years old. As such, I've made an effort to go through and bring the PODs up to date, including, but not limited to, a revised README, updated INSTALL instructions, notes on some of the quirkier parts of the site, and so forth. While I don't expect a huge uptake of people running rehash for their personal sites, being able to run a development instance of the site will hopefully increase the number of drive-by contributions. As of right now, I've implemented a "make build-environment" feature which automatically downloads and installs local copies of Apache, Perl, and mod_perl, plus all CPAN modules required for rehash. This makes it easier for us both to update the site and to get security fixes from upstream rolled in.
With luck, we can get to the point that someone can simply read the docs and have a full understanding of how rehash goes together, and understanding is always the first step towards succeeding at anything.
One thing that came out of the aftermath of the rehash upgrade is that the underlying schema and configuration tables in production and dev have diverged from each other. The reason for this is fairly obvious: our method of doing database upgrades is crude at best. Rehash has no automated method of updating the database; instead, queries to be executed get written to sql/mysql/upgrades and executed by hand during a site upgrade, and the initial schema is updated for new installs. The practical end result is that the installation scripts, the dev database, and production all have slightly different layouts due to human error. Wherever possible, we should limit the amount of manual effort required to manage and administer SoylentNews. If anyone knows of a good pre-existing framework we can use to do database upgrades, I'm all ears. Otherwise, I'll be looking at building one from scratch and integrating it into our development cycle.
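For discussion purposes, here is one possible shape for an automated upgrade step. It is a sketch under stated assumptions, not a design decision: it is written in Python against an in-memory SQLite database rather than rehash's Perl and MySQL, and the `schema_migrations` table, the migration names, and the `stories` table are all hypothetical. The idea is simply that the database records which upgrade steps it has already seen, so the same ordered list can be run against dev, production, or a fresh install and end at the same layout.

```python
import sqlite3

# Hypothetical ordered upgrade steps. In practice these might be files
# under sql/mysql/upgrades, named so that they sort in application order.
MIGRATIONS = [
    ("001_create_stories",
     "CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT)"),
    ("002_add_story_score",
     "ALTER TABLE stories ADD COLUMN score INTEGER DEFAULT 0"),
]


def applied_migrations(conn):
    """Return the set of migration names already recorded in this database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    return {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}


def upgrade(conn, migrations=MIGRATIONS):
    """Apply every migration not yet recorded, in order; return names applied."""
    done = applied_migrations(conn)
    newly_applied = []
    for name, sql in migrations:
        if name in done:
            continue  # this database has already seen the step
        conn.execute(sql)
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        conn.commit()
        newly_applied.append(name)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = upgrade(conn)   # fresh database: both steps are applied
second_run = upgrade(conn)  # already current: nothing left to apply
```

Because running the upgrade twice is harmless, the same command could serve both new installs and existing sites, which would remove the hand-maintained second copy of the schema.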
For anyone who has worked on a large project before, unit testing can be a developer's best friend. It lets you know that your API is doing what you want and acting as expected. Now, in normal circumstances, unit testing a web application is difficult to impossible, as much of the logic is not exposed in a way that makes testing easy, requiring tools like Windmill to simulate page inputs and outputs. Based on previous projects I've done, I'd normally say this represents more effort than is warranted, since you frequently have to update the tests even for minor UI changes. In our case, we have a more realistic option. A quirk of rehash's heritage is that approximately 95% of it exists in global Perl modules that are installed either in the site_perl directory or in the plugins/ directory. As such, rehash strongly adheres to the Model-View-Controller design and methodology.
This gives us a clear and (partially) documented API to code against, which allows us to write simple tests and confirm the output of the data structures instead of trying to parse HTML to know if something is good or bad. Such a test suite would have made porting the site to mod_perl 2 much simpler, and will come in useful if we ever change database engines or operating system platforms. I've therefore designated it a high priority to at least get the core libraries covered by unit tests to ensure consistent behavior in light of any updates we may make. This is going to be a considerable amount of effort, but I strongly suspect it will reduce our QA workload, and make our upgrades close to a non-event.
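To illustrate the difference between testing data structures and parsing HTML, here is a small sketch. It is written in Python purely for illustration (rehash's libraries are Perl), and `summarize_story` and its fields are hypothetical stand-ins for a model-layer call, not part of the real rehash API. The test asserts on the returned structure directly, so a template or UI change would not break it.

```python
import unittest


def summarize_story(story):
    """Hypothetical model-layer function: returns plain data, no HTML."""
    return {
        "title": story["title"].strip(),
        "comment_count": len(story.get("comments", [])),
        "is_commentable": not story.get("locked", False),
    }


class SummarizeStoryTest(unittest.TestCase):
    def test_counts_comments_and_trims_title(self):
        story = {"title": "  Rehash Upgrade  ", "comments": ["a", "b"]}
        self.assertEqual(
            summarize_story(story),
            {"title": "Rehash Upgrade", "comment_count": 2, "is_commentable": True},
        )

    def test_locked_story_is_not_commentable(self):
        summary = summarize_story({"title": "Archived", "locked": True})
        self.assertFalse(summary["is_commentable"])
        self.assertEqual(summary["comment_count"], 0)


def run_tests():
    """Run the suite in-process and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SummarizeStoryTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

Tests of this kind pin down what the library returns, so the same suite could be rerun unchanged after a port to a new Apache, database engine, or platform.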
The rehash upgrade was a wake-up call for us that we need to improve our processes and methodology, as well as automate aspects of the upgrade process. Even though we're all volunteers, and operate on a best-effort basis, destabilizing the site for a week is not something I personally consider acceptable, and I accept full responsibility, as I was the one who both pushed for the upgrade and deployed it to production. As a community, you've been incredibly tolerant, but I have no desire to test your collective patience. As such, our next development cycle will be very low key as we work to build the systems outlined in this document, and further tighten up and polish rehash. To all, keep being awesome, and we'll do our best to meet your expectations.
printed from SoylentNews, Learning From Our Mistakes (Or How To Prevent Another Painful Upgrade) on 2022-10-05 05:13:24