posted by NCommander on Friday June 19 2015, @12:45PM
from the hoping-not-to-break-the-site-again-in-2015 dept.

There are two things we've always strived to do well here: (1) listen to community feedback and adjust our plans and processes based on that input, and (2) communicate changes with the community. We are working our way through a rather painful upgrade from slash to rehash whereby we've gone through four point releases to get mostly back to where we started. A lot of folks questioned the necessity of such a large-scale change, and while we posted changelogs summarizing changes, I'd like to provide a broad picture of everything and how we're going to fix it.

Dissecting The Rehash Upgrade

  • Necessity of the Upgrade
  • Improving Documentation
  • Database Upgrade Framework
  • Unit Testing the Site
  • In Closing...

Check past the break for more.

Necessity of the Upgrade

Rehash was by far the largest and most invasive upgrade to the site, requiring modifications to nearly every component. To understand what went wrong, a full understanding of the context of the rehash upgrade is important, so I apologize in advance if this is a bit wordy; much of this information was present in previous upgrade posts and comments, but I want to put it all in one canonical spot for ease of reference.

For those unfamiliar with mod_perl, the original Slash codebase was coded against mod_perl 1, which in turn was tied to Apache 1.3, a release that had been unsupported for years by the time we launched this site. Thus it was known from day one that if we were serious about both continuing with the legacy Slash codebase and keeping SoylentNews itself going, this upgrade was something that would have to be done. Not if, but when.

As a stopgap, I wrote and applied AppArmor profiles to try to mitigate any potential damage, but we were in the web equivalent of Windows XP: forever vulnerable to zero-day exploits. With this stopgap in place, our first priorities were to improve site usability, sort out moderation (which of course is always a work in progress), and continue with the endless tweaks and changes we've made since go-live. During this time, multiple members of the dev team (myself included) tried to do a quick and dirty port, with no success. The API changes in mod_perl 2 were extensive; besides the obvious API calls, many of the data structures were changed, and even environment flags were changed. In short, every single line of code that interacted with Apache had to be changed: every place the codebase touched $r (the mod_perl request object) had to be modified, and elsewhere, logic had to be reworked to handle changes in behavior, such as one rather notable change in how form variables are handled. Finally, after an intensive weekend of hacking and pain, I managed to get the site to properly render under Apache 2, but it became clear that due to very limited backwards compatibility the port was all-or-nothing; there was no way to do the upgrade piecemeal.

Furthermore, over time we'd noticed other problematic aspects. As coded, Slash used MySQL replication for performance, but had no support for multi-master operation or failover. This was compounded by the fact that if the database went down, even momentarily, the entire stack would completely hang; apache and slashd would have to be manually kill -9ed and restarted for the site to be usable. This was further complicated by the fact that, in practice, MySQL replication leaves a lot to be desired; there is no consistency check to confirm that the master and slave hold 1:1 data. Unless the entire frontend was shut down and master->slave replication verified, it was trivial to lose data due to an ill-timed shutdown or crash (and as time has shown, our host, Linode, sometimes has to restart our nodes to apply security updates to their infrastructure). In practice, this meant that failing over the master was a slow and error-prone process; after a failover, replication would have to be manually re-established in the reverse direction to bring the former master up to date, then failed back by hand.

While MySQL 5.6 implemented GTID-based replication to remove some of the pain, it still failed to be a workable solution for us. Although it is possible to get multi-master replication in vanilla MySQL, it would require serious tweaks to how AUTO_INCREMENT works in the codebase, and it violates what little ACID compliance MySQL has. As an aside, I found out that the other site uses this form of multi-master replication. For any discussion-based website, the database is the most mission-critical piece of infrastructure. Thus, rehash's genesis had two explicit goals attached to it:

  • Update The Underlying Software Stack
  • Find A Solution To The MySQL Redundancy Problem
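For reference, the AUTO_INCREMENT tweak mentioned above works by interleaving the key space between masters. The two variables below are real MySQL settings; the two-master values are just illustrative:

```sql
-- On master 1: generated keys are 1, 3, 5, ...
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset    = 1;

-- On master 2: generated keys are 2, 4, 6, ...
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset    = 2;
```

This avoids key collisions between masters, but any code that assumes IDs are contiguous or monotonically ordered would have to be audited, which is the kind of codebase-wide tweak alluded to above.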

The MySQL redundancy problem proved especially difficult, and short of porting the site wholesale to another DB engine, the only solution I could find that preserved the ACID compliance of our database was MySQL Cluster. In a cluster configuration, the data itself is stored by a backend daemon known as ndbd, and instances of mysqld act as frontends to NDB. In effect, the mysql daemon becomes a frontend to the underlying datastore, which in turn keeps everything consistent. Unfortunately, MySQL Cluster is not 100% compatible with vanilla MySQL: FULLTEXT indexes aren't supported under cluster, and some types of queries involving JOINs take considerably longer to execute if they cannot be parallelized.
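To make the incompatibility concrete, here is a sketch (the table and column names are hypothetical, not rehash's actual schema): a table must use the NDB engine to live in the cluster, and a FULLTEXT index that worked under vanilla MySQL is simply rejected there.

```sql
CREATE TABLE stories (
    id    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    body  TEXT
) ENGINE=NDBCLUSTER;   -- data is stored by ndbd; mysqld is just a frontend

-- This fails under cluster: FULLTEXT indexes are not supported by the
-- NDB storage engine, which is why the site search engine broke.
ALTER TABLE stories ADD FULLTEXT INDEX ft_body (body);
```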

As an experiment, I initialized a new cluster instance and moved the development database onto it to see if the idea was even practical. Surprisingly, at first glance everything except the site search engine appeared to be more or less functional under cluster. As such, this provided the basis for the site upgrade.

One limitation of our dev environment is that we do not have the infrastructure to properly load test changes before deployment. We knew there were going to be bugs and issues with the site upgrade, but we were also getting to the point that if we didn't deploy rehash soon, there was a good chance we never would. I subscribe to the notion of release early, release often. We believed that the site would be mostly functional post-upgrade and that any issues encountered would be relatively minor. Unfortunately, we were wrong, and it took four site upgrades to get us back to normality, which entailed rewriting many queries, performing active debugging on production, and a lot of luck. All things considered, not a great situation to be in.

Because of this, I want to work out a fundamental plan that prevents a repeat of such a painful upgrade, and keeps the site stable even if we make additional large-scale changes.

Improving Documentation

On the most basic level, good documentation goes a long way towards keeping things both maintainable and usable. Unfortunately, a large part of the technical documentation on the site is over 14 years old. As such, I've made an effort to go through and update the PODs, including, but not limited to, a revised README, updated INSTALL instructions, notes on some of the quirkier parts of the site, and so forth. While I don't expect a huge uptake of people running rehash for their personal sites, making it easy to run a development instance will hopefully lower the barrier to drive-by contributions. As of right now, I've implemented a "make build-environment" target which automatically downloads and installs local copies of Apache, Perl, and mod_perl, plus all CPAN modules required by rehash. This both makes it easier for us to update the site and lets us roll in security fixes from upstream.

With luck, we can get to the point where someone can simply read the docs and have a full understanding of how rehash fits together; understanding is always the first step towards succeeding at anything.

Database Upgrade Framework

One thing that came out of the aftermath of the rehash upgrade is that the underlying schema and configuration tables on production and dev have deviated from each other. The reason is fairly obvious: our method of doing database upgrades is crude at best. rehash has no automated way of updating the database; instead, the queries to be executed are written to sql/mysql/upgrades and run by hand during a site upgrade, while new installs get the current schema directly. The practical end result is that the installation scripts, the dev database, and production all have slightly different layouts due to human error. Wherever possible, we should limit the amount of manual effort required to manage and administer SoylentNews. If anyone knows of a good pre-existing framework we can use to do database upgrades, I'm all ears. Otherwise, I'll be looking at building one from scratch and integrating it into our development cycle.
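As a sketch of what such a framework could look like (the names and layout here are my own invention, and it is demonstrated against SQLite purely so the example is self-contained), the core idea is a schema_version table plus numbered upgrade files that get applied in order exactly once, so every instance converges on the same layout:

```python
import re
import sqlite3

def applied_versions(conn):
    """Return the set of upgrade versions already applied to this database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)")
    return {row[0] for row in conn.execute("SELECT version FROM schema_version")}

def run_migrations(conn, migrations):
    """Apply numbered migrations (name -> SQL text) in order, skipping any
    version already recorded in schema_version. Returns versions applied now."""
    done = applied_versions(conn)
    ran = []
    for name in sorted(migrations):
        version = int(re.match(r"(\d+)", name).group(1))
        if version in done:
            continue  # already applied on this instance
        conn.executescript(migrations[name])
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        ran.append(version)
    conn.commit()
    return ran

# Hypothetical upgrade files, keyed by filename as they would sit on disk.
migrations = {
    "0001-create-users.sql":
        "CREATE TABLE users (uid INTEGER PRIMARY KEY, nick TEXT);",
    "0002-add-karma.sql":
        "ALTER TABLE users ADD COLUMN karma INTEGER DEFAULT 0;",
}

conn = sqlite3.connect(":memory:")
print(run_migrations(conn, migrations))  # first run applies both: [1, 2]
print(run_migrations(conn, migrations))  # second run is a no-op: []
```

Because applying the same set of migrations twice is a no-op, the same tool can be pointed at dev, production, and a fresh install, which is exactly the property our hand-run SQL files lack.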

Unit Testing the Site

For anyone who has worked on a large project before, unit testing can be a developer's best friend. It lets you know that your API is doing what you want and behaving as expected. In normal circumstances, unit testing a web application ranges from difficult to impossible, as much of the logic is not exposed in a way that makes testing easy, requiring tools like Windmill to simulate page inputs and outputs. Based on previous projects I've done, I'd normally say this represents more effort than is warranted, since you frequently have to update the tests even for minor UI changes. In our case, we have a more realistic option. A quirk of rehash's heritage is that approximately 95% of it lives in global Perl modules that are installed either in the site_perl directory or in the plugins/ directory. As such, rehash adheres fairly strongly to the Model-View-Controller design and methodology.

This gives us a clear and (partially) documented API to code against, which allows us to write simple tests and confirm the contents of the returned data structures instead of trying to parse HTML to know if something is good or bad. Such a test suite would have made porting the site to mod_perl 2 much simpler, and will come in useful if we ever change database engines or operating system platforms. As such, I've made it a high priority to at least get the core libraries covered by unit tests to ensure consistent behavior across any updates we make. This is going to be a considerable amount of effort, but I strongly suspect it will reduce our QA workload and make our upgrades close to a non-event.
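To make the shape of such a test concrete, here is a sketch in Python rather than Perl, against a made-up comment-scoring function (the function, its clamping rules, and the test names are all invented for illustration); the point is that asserting on returned data structures is trivial, while asserting on rendered HTML is not:

```python
def score_comment(base_score, moderations, karma_bonus=False):
    """Hypothetical model-layer function: fold a list of (reason, delta)
    moderations into a final score, clamped to a -1..5 range."""
    score = base_score + sum(delta for _reason, delta in moderations)
    if karma_bonus:
        score += 1
    return max(-1, min(5, score))

def test_moderations_are_summed():
    assert score_comment(1, [("Insightful", 1), ("Redundant", -1)]) == 1

def test_score_is_clamped_to_five():
    assert score_comment(2, [("Informative", 1)] * 10) == 5

def test_karma_bonus_adds_one():
    assert score_comment(1, [], karma_bonus=True) == 2

# Run the tests; any failure raises AssertionError with the failing case.
for t in (test_moderations_are_summed, test_score_is_clamped_to_five,
          test_karma_bonus_adds_one):
    t()
print("all model-layer tests passed")
```

A suite of tests like this against the real site_perl modules is what would let us swap out Apache, the database engine, or the OS underneath and know immediately what broke.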

In Closing...

The rehash upgrade was a wakeup call that we need to improve our processes and methodology, as well as automate aspects of the upgrade process. Even though we're all volunteers operating on a best-effort basis, destabilizing the site for a week is not something I personally consider acceptable, and I accept full responsibility, as I was the one who both pushed for the upgrade and deployed it to production. As a community, you've been incredibly tolerant, but I have no desire to test your collective patience. As such, our next development cycle will be very low key as we work to build the systems outlined in this document and further tighten up and polish rehash. To all, keep being awesome, and we'll do our best to meet your expectations.

~ NCommander

  • (Score: 1, Redundant) by tibman on Friday June 19 2015, @02:36PM

    by tibman (134) Subscriber Badge on Friday June 19 2015, @02:36PM (#198247)

    I've yet to use a database that allowed transactional schema changes : /

    SN won't survive on lurkers alone. Write comments.
  • (Score: 2) by NCommander on Friday June 19 2015, @03:47PM

    by NCommander (2) Subscriber Badge on Friday June 19 2015, @03:47PM (#198290) Homepage Journal

    And people think I'm mad when I want to port the site to PostgreSQL.

    Still always moving
    • (Score: 2) by goodie on Friday June 19 2015, @06:46PM

      by goodie (1877) on Friday June 19 2015, @06:46PM (#198358) Journal

      Mmhhh, I may be wrong here, but DDL is auto-commit; there is no rollback possible on pretty much every DBMS. But look at it this way: instead of a rollback you just do a restore, debug, and redo the whole DB upgrade process. Ideally you would have an environment where you could test the full upgrade beforehand.

      • (Score: 2) by tibman on Friday June 19 2015, @07:01PM

        by tibman (134) Subscriber Badge on Friday June 19 2015, @07:01PM (#198366)

        Restoring the DB is heavy handed but almost every "it failed" arrow on the deployment flowchart points to it. Unfortunately.

        I'm sure someone could chime in on why very few databases allow it. But since user tables are just records in a master table it seems strange that you can't put a transaction on the data that represents your table changes.

        SN won't survive on lurkers alone. Write comments.
        • (Score: 2) by goodie on Friday June 19 2015, @07:34PM

          by goodie (1877) on Friday June 19 2015, @07:34PM (#198385) Journal

          That would be my advice too. Partial rollbacks etc. are a pain to debug. When a problem arises you probably want to restore, so that eventually you can have a process that is 100% successful. But if you have an environment where you can do a dump of prod, upgrade, and run unit tests, that may already improve your chances of not having any issues during the real deal.

      • (Score: 3, Informative) by choose another one on Friday June 19 2015, @07:38PM

        by choose another one (515) on Friday June 19 2015, @07:38PM (#198390)

        SQL Server will roll back DDL within a transaction, which confuses some people.

        It is Snapshot Isolation that causes problems (if you have it turned on), because metadata is not versioned, so you can't have one process reading one version of a table and another (in an as-yet-uncommitted transaction) reading a different version. So you can't use some DDL within transactions under snapshot isolation; that is documented somewhere.

        Oracle supports it too, according to a page I found.

        • (Score: 2) by goodie on Friday June 19 2015, @08:20PM

          by goodie (1877) on Friday June 19 2015, @08:20PM (#198407) Journal

          I'm gonna have to try this on my mssql setup at home and specify the isolation level.

          More generally, I think the idea here is that the upgrade should either work 100% or fail entirely and result in a restore. If you have something that fails halfway through and causes a rollback on that transaction, you still have to roll back the other 49% of the stuff, which is the equivalent of a restore. You either want it fully functional or back to square one. I've seen enough stuff upgraded halfway that had to be debugged by hand... It costs so much time and effort to debug...

          • (Score: 2) by choose another one on Saturday June 20 2015, @01:14PM

            by choose another one (515) on Saturday June 20 2015, @01:14PM (#198652)

            We generally went with "any script that is not transactional must be re-entrant" - i.e. if something goes wrong you can (fix it and) try again to complete the upgrade. Mostly we aimed for transactional.

            But in the end, a restore is always the final rollback process for a failed upgrade.

      • (Score: 1, Insightful) by Anonymous Coward on Saturday June 20 2015, @02:26AM

        by Anonymous Coward on Saturday June 20 2015, @02:26AM (#198524)

        Mmhhh i may be wrong here but ddl is auto commit. There is no rollback possible on pretty much every dbms.

        DDL can be rolled back on just about every commercial-quality database. This means PostgreSQL can roll back table changes, but MySQL cannot.

        MySQL cannot roll back DDL due to a structural design decision (or defect, depending on how you look at it). Per table, you can choose a storage engine; this is a MySQL-specific "feature". Most other databases have just one storage technology. That means every table join in MySQL is the equivalent of communicating across databases in other database software. This is why foreign keys suck in MySQL, and why DDL rollback is not possible: DDL must be visible across the different storage engines, and anything that is not MVCC cannot roll back a table change, so table changes have no point of return. This is a MySQL-specific hell, and it makes DDL upgrades unnecessarily risky.

        • (Score: 2) by goodie on Saturday June 20 2015, @12:13PM

          by goodie (1877) on Saturday June 20 2015, @12:13PM (#198630) Journal

          Cool, I was under the impression that it was not possible (and that's after years of doing MSSQL work, so I feel kinda ashamed here; thanks for that ;). Back in 2000 I could have sworn this was not doable. The isolation level must be selected properly, though, like with other types of transactions anyway. And the default/custom settings in tools like SSMS must then be selected accordingly too. But thanks for that tip, I feel a little less stupid now :).

          I think though that the main point of the DB upgrade (and it certainly is the way I've experienced things in the past) is that overall, if something fails halfway through, you just want to restart the process, not try to revert/pick up where it failed. This is especially true if the upgrade is somewhat complex (e.g, lots of changes). Same goes with source code. If the upgrade fails, you want to redo everything and not necessarily try to figure out which files worked out and which ones did not.

          The other reason is that depending on your data files and logging options you may see your log grow substantially during the upgrade. Doing a rollback to restart the process will just take a long time for nothing. At that point, restore is a better option.

  • (Score: 0) by Anonymous Coward on Friday June 19 2015, @09:05PM

    by Anonymous Coward on Friday June 19 2015, @09:05PM (#198426)

    It seems like a lot of people deploying virtualized servers forget about tools like LVM. It's still useful, even inside virtual machines. I don't deal with databases much, but using LVM snapshots inside VMs has saved me some headaches.

    If, during DB upgrades, the site can function with a single DB master running with no slaves, why not have the DB master use a layer of LVM for its storage? Then the upgrade might become:

    • Stop all DB replication and slaves
    • Temporarily stop the master DB (just so all its file system data is consistent)
    • Snapshot the DB file system
    • Restart the master DB. So far this process should only take seconds.
    • Perform all DB upgrades/updates
    • If any piece fails:
      - informational logs can be copied off for review
      - stop DB
      - roll back the snapshot
      - restart DB
      - die
    • If everything succeeds, the slaves and replication are re-enabled
    • Eventually the snapshot is deleted

    Like I said, I'm a database newbie, but I have used such a process on other types of servers, and it's worked for me.

  • (Score: 0) by Anonymous Coward on Saturday June 20 2015, @02:16AM

    by Anonymous Coward on Saturday June 20 2015, @02:16AM (#198521)

    So you are telling us that you have only used MySQL?