This was by far one of the most painful upgrades we've ever done to this site, resulting in nearly three hours of downtime. Even as of this writing, we're still not back to 100% due to unexpected breakage that did not show up in dev. As I need a break from trying to debug rehash, let me write up what's known, what's new, and what went pear-shaped.
Rehash 15.05 - What's New
I want to re-state that this upgrade is by far the most invasive one we've ever done. Nearly every file and function in rehash had to be modified due to changes in the mod_perl infrastructure, and more than a few ugly hacks had to be written to emulate the original API in places. We knew going into this upgrade that it was going to be painful, but we hit a load of unexpected hiccups and headaches on top of that. Even as I write this, the site is still limping due to some of that breakage. Read on past the break for a full understanding of what has been going on.
Way back at golive, we identified quite a few goals we needed to reach if the site was to be maintainable in the long run. One of these was getting onto modern versions of Apache and Perl; slashcode (and rehash) are tightly tied to the Apache API for performance reasons, and historically only ran against Apache 1.3 and mod_perl 1. This put us in the unfortunate position of running on a codebase that had long been EOLed when we launched in 2014. We took precautions to protect the site, such as running everything through AppArmor and trying to adhere to the smallest set of permissions possible, but no matter how you looked at it, we were stuck on a dead platform. As such, this was something that *had* to get done for the sake of maintainability, security, and support.
This was further complicated by a massive API break between mod_perl 1 and 2, with many (IMHO) unnecessary changes to data structures and such, which meant the upgrade was an all-or-nothing affair. There was no way we could move the site to the new API piecemeal. We had made a few previous attempts at this port, all of them going nowhere, but over a long weekend in March, I sat down with rehash and our dev server, lithium, and got to the point where the main index could be loaded under mod_perl 2. From there, we tried to hammer down whatever bugs we could, but we were effectively maintaining two codebases: the legacy slashcode one and the newer rehash one. Due to limited development time, once rehash reached a functional state, most bug fixes landed there, shoehorned in alongside the stack of porting bugs we were already fixing. I took the opportunity to clear out as many long-standing wishlist items as possible, such as IPv6 support.
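To give a flavour of the break (a minimal sketch from memory, not actual rehash code), here is the same trivial handler under each API. Note the renamed modules, the changed constant mechanism, and the removal of send_http_header entirely:

```perl
# mod_perl 1 (Apache 1.3):
package My::Handler;
use Apache::Constants qw(OK);

sub handler {
    my $r = shift;
    $r->content_type('text/html');
    $r->send_http_header;            # gone entirely in mod_perl 2
    $r->print("Hello, world\n");
    return OK;
}

# mod_perl 2 (Apache 2.x) -- different module names and constants,
# and headers are now sent implicitly:
package My::Handler2;
use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Const -compile => qw(OK);

sub handler {
    my $r = shift;
    $r->content_type('text/html');
    $r->print("Hello, world\n");
    return Apache2::Const::OK;
}
```

Multiply differences like these across every file that touches the request object and you get a sense of why the port could not be done piecemeal.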
In our year and a half of dealing with slashcode, we had also identified several pain points; for example, if the database went down even for a second, the site would lock up, and httpd would hang to the point that it was necessary to kill -9 the process. Although slashcode supports the native master-slave replication built into MySQL, it has no support for failover. Furthermore, MySQL's native replication is severely lacking in reliability. Until very recently, there was no support for dynamically changing the master database in case of failure, and the manual process is exceedingly slow and error-prone. While MySQL 5.6 has improved the situation with global transaction IDs (GTIDs), it still requires code support in the application to handle failover, plus a dedicated monitoring daemon to manage the process, in effect creating a new single point of failure. It also continues to lack any functionality to heal or otherwise recover from replication failures. In my research, I found there were simply bad and worse options for handling replication and failover with vanilla MySQL. As such, I started looking seriously into MySQL Cluster, which adds multi-master replication to MySQL at the cost of some backwards compatibility.
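To illustrate the moving parts (a sketch only; hostnames and credentials are placeholders, not any real configuration of ours), a GTID-based MySQL 5.6 replication setup looks roughly like this:

```
# my.cnf, on every server participating in replication:
[mysqld]
gtid_mode                = ON
enforce_gtid_consistency = ON
log_bin                  = mysql-bin
log_slave_updates        = ON

# On a slave, GTID auto-positioning replaces the old manual
# binlog file/offset bookkeeping:
#   CHANGE MASTER TO MASTER_HOST='db-master.example.com',
#                    MASTER_USER='repl',
#                    MASTER_AUTO_POSITION=1;
#   START SLAVE;
```

Even with auto-positioning, something external still has to notice a dead master and repoint every slave at its replacement, which is exactly the extra monitoring daemon and single point of failure described above.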
I was hesitant to make such a large change to the system, but short of rewriting rehash to use a different RDBMS, there weren't a lot of options. After another weekend of hacking, dev.soylentnews.org was running on a two-system cluster, which provided the basis for further development. This required removing all the FULLTEXT indexes from the database and rewriting the entire search engine to use Sphinx Search. Unfortunately, there's no trivial way to migrate from vanilla MySQL to Cluster. To keep a long story from getting even longer: to perform the migration, the site would have to be offlined, a modified schema loaded into the database, and then the data re-imported, in two separate transactions. Furthermore, MySQL Cluster needs to know in advance how many attributes, tables, and indexes will be used in the cluster, adding another tuning step to the entire process. This quirk of Cluster caused a significant headache when it came time to import the production database.
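As a sketch of what the schema conversion involves (the table and index names here are made up for illustration, not rehash's actual schema):

```
-- NDB has no FULLTEXT support, so those indexes must be dropped first
-- (search moves to Sphinx instead):
ALTER TABLE stories DROP INDEX ft_introtext;

-- Each table is then switched over to the cluster storage engine:
ALTER TABLE stories ENGINE = NDBCLUSTER;
```

Repeat that for every table, and the sum of all columns and indexes across the schema is what has to fit under the cluster's preallocated limits.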
To understand why things went so pear-shaped on this cluster**** of an upgrade, a little information is needed on how we do upgrades. Normally, after the code has baked for a while on dev, our QA team (Bytram) gives us an ACK when he feels it's ready. If the devs also feel we're up to scratch to deploy, one person, usually me or Paul, will push the update out to production. Normally, this is a quick process: git tag/pull and then deploy. Unfortunately, due to the massive amount of infrastructure change required by this upgrade, more work than normal was needed. In preparation, I set up our old web frontend, hydrogen, which had been down for an extended period following a system break, to take the new Perl, Apache 2, etc., and loaded a copy of rehash. The upgrade would then just be a matter of moving the database over to Cluster, changing the load balancer to point to hydrogen, and then upgrading the current web frontend, fluorine.

At 20:00 EDT, I offlined the site to handle the database migration, dumping the schema and tables. Unfortunately, MaxNoOfAttributes and the other tuning variables were set too low to handle two copies of the database, and thus the initial import failed. Due to difficulty with internal configuration changes and other headaches (such as forgetting to exclude CREATE TABLE statements from the original database dump), it took nearly two hours just to begin importing the 700 MiB SQL file, and another 30 or so minutes for the import to finish. I admit I nearly gave up on the upgrade at this point, but was encouraged to soldier on. In hindsight, I could have tested this procedure better and gotten all the snags out of the way prior to the upgrade; the blame for the extended downtime lies solely with me. Once the database was updated, I quickly got the mysqld frontend on hydrogen up and running, as well as Apache 2, only to learn, as the site returned to the internet nearly three hours later, that I had more problems.
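For the curious, the migration boils down to something like the following (paths and the database name are placeholders; the --no-create-info flag is one way to do the CREATE TABLE exclusion mentioned above):

```
# Dump the schema and the data separately, so the schema can be hand-edited
# for NDB (ENGINE=NDBCLUSTER, FULLTEXT indexes removed) before reloading:
mysqldump --no-data slash > schema.sql
mysqldump --no-create-info slash > data.sql

# config.ini on the NDB management node must be sized for the full schema
# *before* the import, or it fails partway through (values illustrative):
#   [ndbd default]
#   MaxNoOfAttributes     = 25000
#   MaxNoOfOrderedIndexes = 2000

# Load the modified schema, then the ~700 MiB of data:
mysql slash < schema-ndb.sql
mysql slash < data.sql
```

Each of those steps is where a snag cost us time on the night: the tuning variables were too low, and the first data dump still contained the CREATE TABLE statements.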
What I didn't realize at the time was that hydrogen's earlier failure had not been resolved as I thought, and it gave truly abysmal performance, with 10+ second page loads. As soon as this was realized, I quickly pressed fluorine, our 'normal' frontend server, into service, and site performance went from horrific to merely bad. A review of the logs showed that some of the internal caches used by rehash were throwing errors; this wasn't an issue we had seen on dev, and it was causing excessive amounts of traffic to hit the database and Apache to hang as the system tried to keep up with the load. Two hours of debugging have yet to reveal the root cause of the failure, so I've taken a break to write this up before digging into it again.
As I write this, site performance remains fairly poor, as the frontend is hammering the database excessively. Several features that worked on dev broke when the site was rolled out to production, and I find myself feeling that I'm responsible for hosing the site. I'm going to keep working for as long as I can stay awake to fix as many issues as I can, but it may be a day or two before we're back to business as usual. I truly apologize to the community; this entire site update has gone horribly pear-shaped, and I don't like looking incompetent. All I can do now is try to pick up the pieces and get us back to where we were. I'll keep this post updated.
~ NCommander
(Score: 5, Insightful) by janrinok on Monday June 01 2015, @07:29AM
(Score: 2, Informative) by Anonymous Coward on Monday June 01 2015, @07:45AM
Agreed. This site is meant to be fun; you devs are doing it in your free time, and it's not the end of the world if the site doesn't work properly for a few days. (Also, the login system doesn't seem to work at the moment - I guess it maybe has something to do with server-side caching? - so in case this post appears as AC: I'm sudo rm -rf)