
posted by NCommander on Monday June 01 2015, @07:17AM
from the that-sucked dept.

This was by far one of the most painful upgrades we've ever done to this site, and it resulted in nearly three hours of downtime. Even as of this writing, we're still not back to 100% due to unexpected breakage that did not show up in dev. As I need a break from trying to debug rehash, let me write up what's known, what's new, and what went pear-shaped.

Rehash 15.05 - What's New

  • Rewrote large amounts of the site to migrate to Apache 2, mod_perl 2, and perl 5.20.
    • This was a massive undertaking. I did a large part of the initial work, but paulej72 and TheMightyBuzzard did a lot to fix the lingering issues. Major props to Bytram for catching many of the bugs pre-release.
  • Nexus Support (finally).
    • Currently we have the Meta and Breaking News nexii, with the possibility of adding more in the future, such as a Freshmeat replacement.
    • Nexii can be filtered in the user control panel under the Homepage tab. At the moment, this functionality is hosed due to unexpected breakage, but should be functional within the next 24-48 hours
  • IPv6 support - the AAAA record is live as we speak
  • Themes can be attached to a nexus independent of the "primary theme" setting; user choice overrides this
  • Squashed More UTF-8 Bugs
  • Migration to MySQL Cluster (more on this below)
  • Rewrote the site search engine to use Sphinx Search and (in general) be more useful
  • Long comments properly collapse now
  • Support for SSL by default (not live yet)
  • Fault tolerance; the site no longer explodes into confetti if a database or webfrontend goes down unexpectedly; allows for much easier system maintenance as we can offline things without manual migration of services
  • Improved editor functionality, including per-article note block
  • Lots of small fixes everywhere, due to the extended development cycle

I want to re-state that this upgrade is by far the most invasive one we've ever done. Nearly every file and function in rehash had to be modified due to changes in the mod_perl infrastructure, and more than a few ugly hacks had to be written to emulate the original API in places. We knew going into this upgrade that it was going to be painful, but we still hit a load of unexpected hiccups and headaches. Even as I write this, the site is still limping due to some of that breakage. Read more past the break for a full understanding of what has been going on.

Understanding The Rewrite (what makes rehash tick)

Way back at golive, we identified quite a few goals that we needed to reach if we wanted the site to be maintainable in the long run. One of these was getting to a modern version of Apache and perl; slashcode (and rehash) are tightly tied to the Apache API for performance reasons, and historically only ran against Apache 1.3 and mod_perl 1. This put us in the unfortunate position of having to run on a codebase that had long been EOLed when we launched in 2014. We took precautions to protect the site, such as running everything through AppArmor and trying to adhere to the smallest set of permissions possible, but no matter how you looked at it, we were stuck on a dead platform. As such, this was something that *had* to get done for the sake of maintainability, security, and support.

This was further complicated by a massive API break between mod_perl 1 and 2, with many (IMHO) unnecessary changes to data structures and the like, which meant the upgrade was an all-or-nothing affair. There was no way we could piecemeal upgrade the site to the new API. We had a few previous attempts at this port, all of them going nowhere, but over a long weekend in March, I sat down with rehash and our dev server, lithium, and got to the point where the main index could be loaded under mod_perl 2. From there, we tried to hammer down whatever bugs we could, but we were effectively maintaining both the legacy slashcode codebase and the newer rehash codebase. Due to limited development time, most bug fixes landed in rehash once it reached a functional state, shoehorned in alongside the stack of porting bugs we were already fixing. I took the opportunity to try and clear out as many of the long-standing wishlist bugs as possible, such as IPv6 support.
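
To give a feel for the scale of that break, below is a minimal sketch of the same trivial content handler written against both APIs. This is not actual rehash code - the package names are invented for illustration - but the namespace shuffle and the header-handling change are representative of what had to be touched all over the tree.

    # mod_perl 1 / Apache 1.3 style content handler
    package Example::OldHandler;
    use strict;
    use Apache::Constants qw(:common);

    sub handler {
        my $r = shift;                          # Apache request object
        $r->content_type('text/plain');
        $r->send_http_header;                   # explicit header send, gone in mp2
        $r->print("Hello from mod_perl 1\n");
        return OK;
    }
    1;

    # mod_perl 2 / Apache 2.x equivalent: new namespaces, no send_http_header
    package Example::NewHandler;
    use strict;
    use Apache2::RequestRec ();                 # provides $r->content_type
    use Apache2::RequestIO ();                  # provides $r->print
    use Apache2::Const -compile => qw(OK);

    sub handler {
        my $r = shift;
        $r->content_type('text/plain');         # headers are sent automatically
        $r->print("Hello from mod_perl 2\n");
        return Apache2::Const::OK;
    }
    1;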

In our year and a half of dealing with slashcode, we had also identified several pain points; for example, if the database went down even for a second, the site would lock up, and httpd would hang to the point that it was necessary to kill -9 the process. Although slashcode has support for the native master-slave replication built into MySQL, it had no support for failover. Furthermore, MySQL's native replication is extremely lacking in the area of reliability. Until very recently, there was no support for dynamically changing the master database in case of failure, and the manual process is exceedingly slow and error prone. While MySQL 5.6 has improved the situation with global transaction IDs (GTIDs), it still requires code support in the application to handle failover, and a specific monitoring daemon to manage the process, in effect creating a new single point of failure. It also continues to lack any functionality to heal or otherwise recover from replication failures. In my research, I found that there were simply bad and worse options for handling replication and failover with vanilla MySQL. As such, I started looking seriously into MySQL Cluster, which adds multi-master replication to MySQL at the cost of some backwards compatibility.
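
As a rough illustration of what "code support in the application" means in practice, here is a minimal sketch of connecting to whichever SQL node answers first. The hostnames, database name, credentials, and timeout are placeholders, and this is not the actual rehash database layer - just the shape of the problem.

    use strict;
    use warnings;
    use DBI;

    # Placeholder list of mysqld frontends (e.g. the SQL nodes of a cluster,
    # or a master and a replica that could be promoted).
    my @sql_nodes = ('db1.example.org', 'db2.example.org');

    sub connect_with_failover {
        for my $host (@sql_nodes) {
            my $dbh = DBI->connect(
                "DBI:mysql:database=soylentnews;host=$host",
                'slash', 'hunter2',
                { RaiseError => 0, PrintError => 0, mysql_connect_timeout => 2 },
            );
            return $dbh if $dbh && $dbh->ping;   # first node that answers wins
        }
        die "no SQL node reachable\n";
    }

    my $dbh = connect_with_failover();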

I was hesitant to make such a large change to the system, but short of rewriting rehash to use a different RDBMS, there weren't a lot of options. After another weekend of hacking, dev.soylentnews.org was running on a two-system cluster, which provided the basis for further development. This required removing all the FULLTEXT indexes in the database and rewriting the entire search engine to use Sphinx Search. Unfortunately, there's no trivial way to migrate from vanilla MySQL to cluster. To prevent a long story from getting even longer: to perform the migration, the site would have to be offlined, a modified schema would have to be loaded into the database, and then the data re-imported in two separate transactions. Furthermore, MySQL Cluster needs to know in advance how many attributes and such are being used in the cluster, adding another tuning step to the entire process. This quirk of cluster caused a significant headache when it came time to import the production database.
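
For the curious, querying the new search backend from Perl looks roughly like the sketch below, using the Sphinx::Search client from CPAN. The index name, host, and port are placeholders rather than our real configuration.

    use strict;
    use warnings;
    use Sphinx::Search;

    my $sph = Sphinx::Search->new();
    $sph->SetServer('localhost', 9312);         # searchd host/port (placeholder)
    $sph->SetMatchMode(SPH_MATCH_ALL);          # all keywords must match
    $sph->SetLimits(0, 20);                     # first 20 results

    # 'stories' is a hypothetical index name; document IDs map back to story IDs
    my $results = $sph->Query('mysql cluster', 'stories');
    if ($results) {
        for my $match (@{ $results->{matches} }) {
            printf "story %d (weight %d)\n", $match->{doc}, $match->{weight};
        }
    }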

Understanding Our Upgrade Process

To understand why things went so pear-shaped on this cluster**** of an upgrade, a little information is needed on how we do upgrades. Normally, after the code has baked for a while on dev, our QA team (Bytram) gives us an ACK when he feels it's ready. If the devs agree we're up to scratch, one person, usually me or Paul, will push the update out to production. Normally, this is a quick process: git tag/pull and then deploy. Unfortunately, due to the massive amount of infrastructure changes required by this upgrade, more work than normal would be needed. In preparation, I set up our old webfrontend, hydrogen, which had been down for an extended period following a system break, with the new perl, Apache 2, and so on, and loaded a copy of rehash onto it. The upgrade would then just be a matter of moving the database over to cluster, changing the load balancer to point to hydrogen, and then upgrading the current webfrontend, fluorine.

At 20:00 EDT, I offlined the site to handle the database migration, dumping the schema and tables. Unfortunately, the MaxNoOfAttributes and other tuning variables were too low to handle two copies of the database, and thus the initial import failed. Due to difficulty with internal configuration changes, and other headaches (such as forgetting to exclude CREATE TABLE statements from the original database dump), it took nearly two hours to simply begin importing the 700 MiB SQL file, and another 30 or so minutes for the import to finish. I admit I nearly gave up on the upgrade at this point, but was encouraged to soldier on. In hindsight, I could have tested this procedure better and gotten all the snags out of the way prior to the upgrade; the blame for the extended downtime lies solely with me. Once the database was updated, I quickly got the mysqld frontend on hydrogen up and running, as well as Apache 2, only to learn, as the site returned to the internet nearly three hours later, that I had more problems.
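
For what it's worth, excluding the table definitions from a dump after the fact is only a few lines of Perl; here is a minimal sketch, assuming standard mysqldump output where each CREATE TABLE block ends on a line terminated by a semicolon. (Using mysqldump's --no-create-info option up front would avoid the filtering entirely.)

    #!/usr/bin/perl
    # Strip DROP TABLE and CREATE TABLE statements from a mysqldump stream so
    # the data can be re-imported into an already-loaded (modified) schema.
    # Usage (hypothetical filenames): perl strip_ddl.pl < backup.sql | mysql soylentnews
    use strict;
    use warnings;

    my $in_create = 0;
    while (my $line = <STDIN>) {
        next if $line =~ /^DROP TABLE/i;            # the new schema already exists
        $in_create = 1 if $line =~ /^CREATE TABLE/i;
        if ($in_create) {
            $in_create = 0 if $line =~ /;\s*$/;     # end of the CREATE TABLE block
            next;
        }
        print $line;
    }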

What I didn't realize at the time was that hydrogen's earlier failure had not been resolved as I thought, and it gave truly abysmal performance, with 10+ second page loads. As soon as this was realized, I quickly pressed fluorine, our 'normal' frontend server, into service, and site performance went from horrific to merely bad. A review of the logs showed that some of the internal caches used by rehash were throwing errors; this wasn't an issue we had seen on dev, and it was sending excessive amounts of traffic to the database and causing Apache to hang as the system tried to keep up with the load. Two hours of debugging have yet to reveal the root cause of the failure, so I've taken a break to write this up before digging into it again.

The End Result

As I write this, site performance remains fairly poor, as the webfrontend is hammering the database excessively. Several features which worked on dev went snap when the site was rolled out to production, and I find myself feeling that I'm responsible for hosing the site. I'm going to keep working for as long as I can stay awake to try and fix as many issues as I can, but it may be a day or two before we're back to business as usual. I truly apologize to the community; this entire site update has gone horribly pear-shaped, and I don't like looking incompetent. All I can do now is try and pick up the pieces and get us back to where we were. I'll keep this post updated.

~ NCommander

 
  • (Score: 5, Insightful) by janrinok on Monday June 01 2015, @07:29AM

    Don't take it too hard NC. You and the rest of dev do a brilliant job and we were bound to have the occasional one not go according to plan. Get some rest.
  • (Score: 2, Informative) by Anonymous Coward on Monday June 01 2015, @07:45AM

    Agreed. This site is meant to be fun, you devs are doing it in your free time, and it's not the end of the world if the site doesn't work properly for a few days. (Also, the login system doesn't seem to work at the moment - I guess it maybe has something to do with server-side caching? - so in case this post appears as AC: I'm sudo rm -rf)