Stories
Slash Boxes
Comments

SoylentNews is people

posted by NCommander on Thursday June 04 2015, @11:38AM   Printer-friendly
from the input-wanted dept.

Moving on from frontend stuff, I'm getting to the point that I want to dig in deep and rewrite most of the database layer of SoylentNews/rehash. For those who are unfamiliar with our codebase, its primarily written in perl, with fairly strict adherence to the MVC model, going as far as installing system modules for code shared between the frontend and backend daemons. Slash was written with database portability in mind, and at least historically, a port to postgreSQL existed in the late 90s/early 2000s, and there was some legacy Oracle code authored by VA Linux as well. This code has bitrotted to the point of unusability, leaving the MySQL backend the only functional mode; I deleted the legacy code about a year ago from our git repo.

However, migrating from MySQL has remaining on my personal TODO list for a long time, due to unreliability, data corruption, and configuration pain. The obvious choice from where I'm sitting is postgreSQL. For those who aren't super familiar with the pro/cons with MySQL, this article by Elnur Abdurrakhimov has a pretty good summary and a list of links explaining in-depth why MySQL is not a good solution for any large site. We've hit a lot of pain in the nearly 1.5 years SN has been up due to limitations in the database layer as well, forcing us to use a clustering solution to provide any sort of real redundancy for our backend. Although I'm familiar with database programming, I'm not a DBA by trade, so I'm hoping to tap into the collective knowledge of the SoylentNews community and work out a reasonable migration plan and design.

[More after the break...]

The "Why" Factor

Beside my personal dislike of MySQL, there's a more important reason to migrate from MySQL. MySQL's support for stored procedures is incredibly poor, which means raw SQL has to be written in the application layer. rehash reduces the danger of injection by providing a set of wrapper functions such as select/insert/update which take four arguments: table, from clause, where clause, and anything extra if necessary; these parameters are assembled into a full query which is in turn properly escaped to prevent most obvious attacks from working. Extensive whitelists are used for sanitizing parameters, but by design, rehash uses a single namespace, with a single user account which has full SELECT/INSERT/UPDATE/DELETE permissions across the board. If any single point is compromised, the entire database is toast. Furthermore, because of poor support for views in MySQL, massive JOINs litter the codebase, making some queries reaching across 5-6 tables (with the most horrific example I can think of being the modbomb SELECT which reaches across almost every user and comment table in the database). This makes debugging and optimizing anything a slow and *painful experience.

The End Goal

What I want to do is remove as much code out of the application layer, and move it down the stack into the database. Each function in Slash/DB/MySQL/MySQL.pm should be replicated with a stored procedure which at a minimum executes the query, and if possible, relocate as much of query processing logic into pg/Perl modules. This should be relatively straightforward to implement, and allow high code reusability due to the fact that almost all of rehash's methods exist in perl modules, and not in individual .pl scripts. The upshot of this is that the only permission the DB account requires is EXECUTE to run the stored procedures; if possible, I'd also like to localize which perl function can call which pgSQL procedure; i.e., the getStories() function can only call procedures relating to that, vs. having access to all stored procedures in the database.

This would greatly reduce the reach of any SQL injection attacks, as well as hardening the site against possible compromise; unrestricted access to the database would require breaching one of the DB servers directly instead of getting lucky via a bug in rehash. As I've stated before, no security can be perfect, but I would love to get this site to the point that only a dedicated, targeted attack would even stand a chance of succeeding. That being said, while it sounds good on paper, I'm not 100% sure this type of design is reasonable. Very few FOSS projects seem to take advantage of stored procedures, triggers, views and other such functionality and I'm wondering if others have tried and failed to implement this level of functionality.

My Plan to Implementation

So, knowing what you want to do is good, but knowing how to do it is just as important. What I think the first step needs to be is a basic straight-port of the site from MySQL to postgreSQL, and implement a better schema upgrade system. As of right now, our upgrade "system" is writing queries in a text file, and just executing them when a site upgrade has to be done. Very low-tech, and error prone for a large number of queries. I don't have a lot of experience in managing very large schemas, so one thing on which I'd like advice is if there's a good, pre-existing framework that could be used to simplify our database upgrades. By far, the ideal scenario would be if we could run a single script which can intelligently upgrade the database from release to release.

Once both these pieces are in play, a slow but steady migration of code from the application layer to the database layer would allow us to handle the transition in a sane manner. Database auditing can be used to keep track of the most frequently used queries and function calls, and keep an eye on our total progress towards reaching an EXECUTE-only world.

That's everything in a nutshell. I want to know what you guys think, and as always, I'll be reading and replying to comments down below!

~ NCommander

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by TheLink on Thursday June 04 2015, @11:53AM

    by TheLink (332) on Thursday June 04 2015, @11:53AM (#192009) Journal

    The obvious choice from where I'm sitting is postgreSQL.

    Yes I'd prefer Postgresql to MySQL.

    What I want to do is remove as much code out of the application layer, and move it down the stack into the database

    That's nice and a good idea if you don't need to scale up/out the DB.

    A bad idea if you need to scale. PostgreSQL can be clustered but it's usually harder to scale PostgreSQL to say 100 nodes than it is to scale web/app nodes to 100 nodes.

    If you ever need to scale shoving everything into stored procs and having the DB do the all the work may not be so good as doing less at the DB level and the rest at identical app/webapp servers, with some caching stuff for performance.

    Starting Score:    1  point
    Karma-Bonus Modifier   +1  

    Total Score:   2  
  • (Score: 3, Informative) by TheLink on Thursday June 04 2015, @01:17PM

    by TheLink (332) on Thursday June 04 2015, @01:17PM (#192043) Journal
    Elaborating on at least one scenario where stored procs can be a good idea (in case some get the wrong idea from my first post) - in an environment with many different apps, users and developers using the same DB and you want to ensure consistency in the DB - so you have stored procs to do each standard/exposed task in the DB. All the apps and most people are supposed to use that those stored procs to do those stuff. That way if stuff is changed, in more cases you can just change the stored procs without affecting all the different apps.

    In contrast if you have 100 different apps and the apps did all the SQL directly, if you want to change the way things are done a bit or say extend stuff (e.g. adding a new user/item now should insert extra rows in new other tables), you need to update ALL the different apps that are affected. Now imagine the pain especially if some of the apps are no longer practically maintainable (company has gone out of business, nobody wants to dig through the code and put their necks on the line etc). This sort of thing is common in "Enterprise" environments. The alternative to stored procs is to have another layer around the DB - have an app and "API" that everything and everyone is supposed to use to do stuff to the DB, however this often takes a higher level of effort, discipline and commitment than found in most organizations ;).

    In the case of SoylentNews - you have a lot fewer apps and you can more easily update those apps if the schema or other stuff changes. Also certain sorts of dynamic queries are harder to do with stored procs (e.g. a custom report that allows the user to select which columns are wanted and how those columns should be displayed ).

    As for the SQL injection stuff, use the standard method of bind vars and prepared statements. That way if there's an exploitable bug (some unicode corner case etc) it's really the DB developer or DB lib developer's problem to fix, support and document; not yours. Whereas if you reinvent and reimplement all that stuff yourself, it's all on you when stuff happens.
    • (Score: 4, Insightful) by bryan on Thursday June 04 2015, @04:58PM

      by bryan (29) <bryan@pipedot.org> on Thursday June 04 2015, @04:58PM (#192161) Homepage Journal

      As for the SQL injection stuff, use the standard method of bind vars and prepared statements.

      Defiantly agree on the prepared statements. If you were to change anything related to the database, do this first. Relying on string escaping functions is never a good idea for security reasons.

      As for the stored procedures, I'd prefer most of the logic to stay in the code where it is version controlled and easier to upgrade. Stored procedures are also not quite as universal as standard SQL syntax. By adopting Postgres' version, you may be hurting your ability to switch to a different DBMS down the line.

      • (Score: 2) by NCommander on Thursday June 04 2015, @07:33PM

        by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Thursday June 04 2015, @07:33PM (#192234) Homepage Journal

        Honestly, database portability is not a huge concern. We don't want to rip out the abstraction layer, so if ever becomes necessary again, it can be done, but I don't see us supporting multiple congruent database types for longer than is necessary to transition.

        --
        Still always moving
      • (Score: 4, Funny) by bryan on Thursday June 04 2015, @07:55PM

        by bryan (29) <bryan@pipedot.org> on Thursday June 04 2015, @07:55PM (#192247) Homepage Journal

        s/Defiantly/Definitely/g

        Drat! An "edit comment" button would be more useful than a "multi-tiered database cluster farm with replication and hot-failover"