Stories
Slash Boxes
Comments

SoylentNews is people

posted by NCommander on Thursday June 04 2015, @11:38AM   Printer-friendly
from the input-wanted dept.

Moving on from frontend stuff, I'm getting to the point that I want to dig in deep and rewrite most of the database layer of SoylentNews/rehash. For those who are unfamiliar with our codebase, its primarily written in perl, with fairly strict adherence to the MVC model, going as far as installing system modules for code shared between the frontend and backend daemons. Slash was written with database portability in mind, and at least historically, a port to postgreSQL existed in the late 90s/early 2000s, and there was some legacy Oracle code authored by VA Linux as well. This code has bitrotted to the point of unusability, leaving the MySQL backend the only functional mode; I deleted the legacy code about a year ago from our git repo.

However, migrating from MySQL has remaining on my personal TODO list for a long time, due to unreliability, data corruption, and configuration pain. The obvious choice from where I'm sitting is postgreSQL. For those who aren't super familiar with the pro/cons with MySQL, this article by Elnur Abdurrakhimov has a pretty good summary and a list of links explaining in-depth why MySQL is not a good solution for any large site. We've hit a lot of pain in the nearly 1.5 years SN has been up due to limitations in the database layer as well, forcing us to use a clustering solution to provide any sort of real redundancy for our backend. Although I'm familiar with database programming, I'm not a DBA by trade, so I'm hoping to tap into the collective knowledge of the SoylentNews community and work out a reasonable migration plan and design.

[More after the break...]

The "Why" Factor

Beside my personal dislike of MySQL, there's a more important reason to migrate from MySQL. MySQL's support for stored procedures is incredibly poor, which means raw SQL has to be written in the application layer. rehash reduces the danger of injection by providing a set of wrapper functions such as select/insert/update which take four arguments: table, from clause, where clause, and anything extra if necessary; these parameters are assembled into a full query which is in turn properly escaped to prevent most obvious attacks from working. Extensive whitelists are used for sanitizing parameters, but by design, rehash uses a single namespace, with a single user account which has full SELECT/INSERT/UPDATE/DELETE permissions across the board. If any single point is compromised, the entire database is toast. Furthermore, because of poor support for views in MySQL, massive JOINs litter the codebase, making some queries reaching across 5-6 tables (with the most horrific example I can think of being the modbomb SELECT which reaches across almost every user and comment table in the database). This makes debugging and optimizing anything a slow and *painful experience.

The End Goal

What I want to do is remove as much code out of the application layer, and move it down the stack into the database. Each function in Slash/DB/MySQL/MySQL.pm should be replicated with a stored procedure which at a minimum executes the query, and if possible, relocate as much of query processing logic into pg/Perl modules. This should be relatively straightforward to implement, and allow high code reusability due to the fact that almost all of rehash's methods exist in perl modules, and not in individual .pl scripts. The upshot of this is that the only permission the DB account requires is EXECUTE to run the stored procedures; if possible, I'd also like to localize which perl function can call which pgSQL procedure; i.e., the getStories() function can only call procedures relating to that, vs. having access to all stored procedures in the database.

This would greatly reduce the reach of any SQL injection attacks, as well as hardening the site against possible compromise; unrestricted access to the database would require breaching one of the DB servers directly instead of getting lucky via a bug in rehash. As I've stated before, no security can be perfect, but I would love to get this site to the point that only a dedicated, targeted attack would even stand a chance of succeeding. That being said, while it sounds good on paper, I'm not 100% sure this type of design is reasonable. Very few FOSS projects seem to take advantage of stored procedures, triggers, views and other such functionality and I'm wondering if others have tried and failed to implement this level of functionality.

My Plan to Implementation

So, knowing what you want to do is good, but knowing how to do it is just as important. What I think the first step needs to be is a basic straight-port of the site from MySQL to postgreSQL, and implement a better schema upgrade system. As of right now, our upgrade "system" is writing queries in a text file, and just executing them when a site upgrade has to be done. Very low-tech, and error prone for a large number of queries. I don't have a lot of experience in managing very large schemas, so one thing on which I'd like advice is if there's a good, pre-existing framework that could be used to simplify our database upgrades. By far, the ideal scenario would be if we could run a single script which can intelligently upgrade the database from release to release.

Once both these pieces are in play, a slow but steady migration of code from the application layer to the database layer would allow us to handle the transition in a sane manner. Database auditing can be used to keep track of the most frequently used queries and function calls, and keep an eye on our total progress towards reaching an EXECUTE-only world.

That's everything in a nutshell. I want to know what you guys think, and as always, I'll be reading and replying to comments down below!

~ NCommander

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 0) by Anonymous Coward on Thursday June 04 2015, @08:40PM

    by Anonymous Coward on Thursday June 04 2015, @08:40PM (#192262)

    Its the fact that you have to establish or think of relations at all. Properly the data should be nested so your calls aren't confusing long strings pulling separate 'tables' together by their relations it just comes out as one piece.

    There are caveats to that as data grows larger, but the amount of code you need to write is practically nothing compared to php with sql. This makes it more stable, understandable, maintainable, and therefore more usable.

    There is currently a wash of old databases, and they suck. There's no getting around it, they were awesome at the time, but times change and so should our data storage/retrieval. It behooves us to do the job the best we can do it, not the way that makes people playing office politics happy (I'm amazed how much power non-techs are given over tech processes, the logical fallacy and resultant unnecessary project crippling in search of marketing power words is staggering "It's a PDO connection, it's the best (that I think you want to hear), I swear! (please feed my children for another day)").

  • (Score: 2) by mtrycz on Friday June 05 2015, @07:57AM

    by mtrycz (60) on Friday June 05 2015, @07:57AM (#192412)

    The old databases don't suck, what are you talking about?

    Nosql is a tool that has it's usecases, and other usecases where it's not a good idea. There are specific applications (most of them, I'd argue), where you can't get around a relational integrity constraint.

    Also, are you arguing that sql/rdbms is hip and a buzzword for the management? Yeah, I just got trolled.

    --
    In capitalist America, ads view YOU!
    • (Score: 0) by Anonymous Coward on Friday June 05 2015, @01:02PM

      by Anonymous Coward on Friday June 05 2015, @01:02PM (#192489)

      I'd say it's definitely a buzzword, having your ideas torn apart into separate 1 dimensional tables instead of just nesting the data properly creates a spaghetti mess. At the moment sql etc are buzz words, people know them, they like them, mongodb or nedb are not buzz words, they are unknown. However in comparison these newer object oriented databases are faster, simpler, easier to use, and more mnemonic.

      for insance, if I want all the clients I'd do
      database.find({},function(err,allClientObjects){ //do something with client objects });

      that doesn't seem like much at all eh? Now try that with sql, you'll pull from the client table, the accounts table, the user table, the red white blue and purple hickery dragon table etc. Your query string eventually will be massive, and it will be prone to problems whenever you want to change it.

      Sql is a shit database, and it's relations via 1 dimensional tables and associations is utter garbage.

      • (Score: 2) by mtrycz on Friday June 05 2015, @02:19PM

        by mtrycz (60) on Friday June 05 2015, @02:19PM (#192540)

        will be prone to problems whenever you want to change it.

        Yup, that's the essence of it. Nosql is ust magically immune from this, actually it handles datamodel changes gracefully. lol

        Bye bye, troll.

        --
        In capitalist America, ads view YOU!