
posted by NCommander on Thursday June 04 2015, @11:38AM   Printer-friendly
from the input-wanted dept.

Moving on from frontend stuff, I'm getting to the point where I want to dig in deep and rewrite most of the database layer of SoylentNews/rehash. For those who are unfamiliar with our codebase, it's primarily written in Perl, with fairly strict adherence to the MVC model, going as far as installing system modules for code shared between the frontend and backend daemons. Slash was written with database portability in mind: a port to PostgreSQL existed in the late 90s/early 2000s, and there was some legacy Oracle code authored by VA Linux as well. That code bitrotted to the point of unusability, leaving the MySQL backend as the only functional mode; I deleted the legacy code from our git repo about a year ago.

However, migrating away from MySQL has remained on my personal TODO list for a long time, due to unreliability, data corruption, and configuration pain. The obvious choice from where I'm sitting is PostgreSQL. For those who aren't super familiar with the pros and cons of MySQL, this article by Elnur Abdurrakhimov has a pretty good summary and a list of links explaining in depth why MySQL is not a good solution for any large site. We've also hit a lot of pain in the nearly 1.5 years SN has been up due to limitations in the database layer, forcing us to use a clustering solution to provide any sort of real redundancy for our backend. Although I'm familiar with database programming, I'm not a DBA by trade, so I'm hoping to tap into the collective knowledge of the SoylentNews community and work out a reasonable migration plan and design.

[More after the break...]

The "Why" Factor

Besides my personal dislike of MySQL, there's a more important reason to migrate: MySQL's support for stored procedures is incredibly poor, which means raw SQL has to be written in the application layer. rehash reduces the danger of injection by providing a set of wrapper functions such as select/insert/update, which take four arguments (table, from clause, where clause, and anything extra if necessary) and assemble them into a full query, which is in turn properly escaped to prevent the most obvious attacks from working. Extensive whitelists are used for sanitizing parameters, but by design, rehash uses a single namespace, with a single user account that has full SELECT/INSERT/UPDATE/DELETE permissions across the board. If any single point is compromised, the entire database is toast. Furthermore, because of MySQL's poor support for views, massive JOINs litter the codebase, with some queries reaching across 5-6 tables (the most horrific example I can think of being the modbomb SELECT, which reaches across almost every user and comment table in the database). This makes debugging and optimizing anything a slow and painful experience.
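
To make the pattern concrete, here's a simplified sketch of what such a wrapper looks like; this is an illustration in the spirit of the real code, not a copy of it, and the call site is invented:

    sub sqlSelect {
        my ($self, $columns, $table, $where, $other) = @_;
        # The query is assembled from caller-supplied string fragments, so
        # safety depends entirely on every call site whitelisting its inputs.
        my $sql = "SELECT $columns FROM $table";
        $sql .= " WHERE $where" if $where;
        $sql .= " $other"       if $other;   # ORDER BY, LIMIT, etc.
        my $sth = $self->{_dbh}->prepare($sql);
        $sth->execute;
        my @row = $sth->fetchrow_array;
        $sth->finish;
        return @row;
    }

    # A typical call site: the WHERE clause is raw string interpolation,
    # and it runs with full table privileges.
    my ($nick) = $slashdb->sqlSelect('nickname', 'users', "uid = $uid");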

The End Goal

What I want to do is move as much code as possible out of the application layer and down the stack into the database. Each function in Slash/DB/MySQL/MySQL.pm should be replicated with a stored procedure which, at a minimum, executes the query; where possible, as much of the query-processing logic as practical should move into PL/Perl functions. This should be relatively straightforward to implement, and allow high code reusability, because almost all of rehash's methods live in Perl modules rather than in individual .pl scripts. The upshot is that the only permission the DB account then requires is EXECUTE to run the stored procedures. If possible, I'd also like to localize which Perl function can call which PL/pgSQL procedure; i.e., the getStories() function could only call the procedures related to fetching stories, rather than having access to every stored procedure in the database.
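
As a rough sketch of the shape this could take (the table, column, and function names below are invented for illustration, not rehash's actual schema):

    use DBI;
    my $dbh = DBI->connect('dbi:Pg:dbname=rehash', 'rehash_app',
                           $ENV{PGPASSWORD}, { RaiseError => 1 });
    my $tid = 42;    # example topic id

    # Deployed once as part of the schema, not at runtime. A plain SQL
    # function here; heavier logic could use PL/pgSQL or PL/Perl instead.
    $dbh->do(q{
        CREATE OR REPLACE FUNCTION get_stories_by_topic(p_tid integer)
        RETURNS TABLE (stoid integer, title text, time_posted timestamptz)
        AS $$
            SELECT s.stoid, s.title, s.time_posted
              FROM stories s
             WHERE s.tid = p_tid
             ORDER BY s.time_posted DESC;
        $$ LANGUAGE sql STABLE;
    });

    # The application side shrinks to a parameterized call; the connecting
    # account needs EXECUTE on the function, not SELECT on the tables.
    my $stories = $dbh->selectall_arrayref(
        'SELECT * FROM get_stories_by_topic(?)',
        { Slice => {} }, $tid,
    );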

This would greatly reduce the reach of any SQL injection attack, as well as harden the site against possible compromise; gaining unrestricted access to the database would require breaching one of the DB servers directly instead of getting lucky via a bug in rehash. As I've stated before, no security can be perfect, but I would love to get this site to the point where only a dedicated, targeted attack would even stand a chance of succeeding. That said, while it sounds good on paper, I'm not 100% sure this type of design is reasonable. Very few FOSS projects seem to take advantage of stored procedures, triggers, views, and other such functionality, and I'm wondering if others have tried and failed to implement this level of functionality.
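
For reference, the lockdown itself is only a handful of statements on the PostgreSQL side, continuing the hypothetical example above (the role and function names are made up):

    # Run once as the schema owner; 'rehash_app' is the hypothetical
    # account the frontend connects as.
    $dbh->do('REVOKE ALL ON ALL TABLES IN SCHEMA public FROM rehash_app');

    # Functions are executable by PUBLIC by default, so pull that back,
    # then grant back only what this account is meant to call.
    $dbh->do('REVOKE EXECUTE ON ALL FUNCTIONS IN SCHEMA public FROM PUBLIC');
    $dbh->do('GRANT EXECUTE ON FUNCTION get_stories_by_topic(integer)
              TO rehash_app');

Splitting rehash across several such roles, one per module, is what would make the per-function restriction described above enforceable in the database rather than in Perl.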

My Plan for Implementation

Knowing what you want to do is good, but knowing how to do it is just as important. I think the first step needs to be a basic straight port of the site from MySQL to PostgreSQL, together with a better schema-upgrade system. Right now, our upgrade "system" is writing queries in a text file and executing them by hand when a site upgrade has to be done: very low-tech, and error-prone for a large number of queries. I don't have a lot of experience managing very large schemas, so one thing I'd like advice on is whether there's a good, pre-existing framework that could simplify our database upgrades. The ideal scenario, by far, would be a single script that can intelligently upgrade the database from release to release.
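
In case we end up rolling our own, here is a minimal sketch of the sort of runner I have in mind, assuming numbered files like migrations/0001-create-users.sql (names and layout invented for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=rehash', 'admin', $ENV{PGPASSWORD},
                           { RaiseError => 1, AutoCommit => 1 });

    # Track which migrations have already been applied.
    $dbh->do(q{CREATE TABLE IF NOT EXISTS schema_version (
                   version    integer PRIMARY KEY,
                   applied_at timestamptz NOT NULL DEFAULT now())});
    my ($current) = $dbh->selectrow_array(
        q{SELECT COALESCE(MAX(version), 0) FROM schema_version});

    # Zero-padded filenames sort correctly; apply anything newer in order,
    # each migration in its own transaction.
    for my $file (sort glob 'migrations/*.sql') {
        my ($ver) = $file =~ m{(\d+)[^/]*\.sql$} or next;
        next if $ver <= $current;
        my $sql = do { local $/; open my $fh, '<', $file or die $!; <$fh> };
        $dbh->begin_work;
        eval {
            $dbh->do($sql);
            $dbh->do('INSERT INTO schema_version (version) VALUES (?)',
                     undef, $ver);
            $dbh->commit;
            1;
        } or do {
            $dbh->rollback;
            die "Migration $file failed: $@";
        };
    }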

Once both of these pieces are in place, a slow but steady migration of code from the application layer to the database layer would let us handle the transition in a sane manner. Database auditing can be used to keep track of the most frequently used queries and function calls, and to keep an eye on our total progress towards reaching an EXECUTE-only world.
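
On the PostgreSQL side, one low-effort way to get that visibility would be the pg_stat_statements extension (this assumes it has been added to shared_preload_libraries and created in the database; the exact columns available vary between versions):

    # Top 20 normalized statements by call count, using the $dbh handle
    # from the earlier sketches.
    my $hot = $dbh->selectall_arrayref(
        'SELECT query, calls FROM pg_stat_statements
          ORDER BY calls DESC LIMIT 20',
        { Slice => {} },
    );
    printf "%10d  %.80s\n", $_->{calls}, $_->{query} for @$hot;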

That's everything in a nutshell. I want to know what you guys think, and as always, I'll be reading and replying to comments down below!

~ NCommander

 
  • (Score: 2) by goodie (1877) on Friday June 05 2015, @05:19AM (#192378) Journal

    A few disclaimers are in order before I go on with my 2 cents here:
    - I have very little experience with MySQL and PostgreSQL. My area of expertise is MSSQL (10+ years) including some migration projects between Oracle and MSSQL. All that within the context of application development (legacy, J2EE etc.). So my views here are influenced (tainted if you prefer ;) by this.
    - I did not read the linked module code nor am I familiar with the software stack for SN so some of my points may be moot.

    A couple of important points stand out to me:

    - As some have mentioned, the MySQL vs PostgreSQL question does not seem to be very clear-cut here... Every DBMS has its flaws; the point is to pick one that works best and scales out best for you. My main point is the following: beyond the clustering/failover system, some have alluded to performance issues (mainly scaling, from what I gather). What exactly are those issues when taking into account the capacity that your DB server gives you? As in, is the DB server very/too busy? What is the actual average load on the DB server, and how is that an issue? This could just be a coding issue, whether on the DB or the app side. To me, that may be the real problem when you refer to queries having to span 5-6 tables: views, SPs, etc. will not help with this in any way. On the plus side, 5 or 6 tables may not be an issue if the query is not run too frequently or if the tables are small. As far as reliability and corruption are concerned, it could be the application code rather than the DBMS that is to blame. And MySQL offers ACID transactions like other DBMSs if you pick the right engine. After that, if you're coding in Perl, I am assuming that you have to handle COMMIT/ROLLBACK in your code rather than have a "container" type of thing do it for you.
    - Again, as some have mentioned, prepared statements should be preferred (see the DBI sketch after this list). This does not in any way entail moving the logic into SPs; I would actually not do that unless you need to package multiple statements into one self-contained SQL object, much like you would inside an app module. Prepared statements and SPs will give you cached, reusable execution plans (something that ad hoc SQL usually will not). This is on top of removing the need to sanitize the input for injection (although it does not in itself remove every possibility of injection if you take strings as parameters, for example).
    - As far as the EXECUTE permission vs CRUD permission argument is concerned, I don't think it's a huge issue personally. And having restrictions for accounts based on the type of operation you are calling and the operation you are trying to perform will be rather bothersome to implement and maintain as the DB and the application code are loosely coupled. You could design a policy where you say that READ is one account and INSERT/UPDATE another while DELETE is not permitted if you implement soft deletes. That's about the extent I'd go to not to hinder the development.
    - When you want to have a multi DB server setup, is it for failover or load distribution? Depending on what you are trying to achieve, you need to consider whether you will want to implement something like a 2PC (costs more on writes but provides point in time failover) or an incremental log shipping strategy (e.g., 15 minutes lost at most) for the standby server.
    - As far as DB upgrades are concerned, a simple setup wherein every DB change is contained in one numbered script should be enough (I did that for many years at my old job; it's a discipline issue, not a technical one). DB upgrades can be automated by running the scripts in sequence from the last one run. Add to that an automated backup before each DB upgrade and you should be good. But DB upgrades are always trickier than code upgrades because you're not just pushing binaries or anything like that.
    - Also, you want to keep the business logic in one place rather than have it spread across multiple places without any clear rules or guidelines. One thing we did at my work was use triggers to archive data automatically as soon as the application considered it "outside of its scope", i.e., data that is only reported on rather than used for any live operations (for example, archival of threads that have had no activity for X days).
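
    To make the prepared-statement point concrete, here is what that lookup looks like in Perl's DBI, reusing the hypothetical users table from the sqlSelect sketch earlier in the article:

        # The SQL text contains only placeholders; bind values travel
        # separately from the query, so nothing needs to be escaped or
        # sanitized to block injection at this level.
        my $sth = $dbh->prepare('SELECT nickname FROM users WHERE uid = ?');
        $sth->execute($uid);
        my ($nick) = $sth->fetchrow_array;

        # With DBD::Pg, pg_server_prepare (on by default) turns this into
        # a server-side prepared statement, so the plan can be reused.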

    So my very superficial conclusion is the following: if your main objective is to sanitize the code base, keep MySQL and implement prepared statements with some SPs. If you're looking to switch, then switch and after that implement prepared statements and some SPs. Sanitization and improvements to the DB schema and application code can be tackled during slower dev periods and be done progressively. I tend to favor this over a long blitz which will inevitably unveil lots of bugs especially if you are switching platform at the same time.

    But again, this is based solely on your original post. I'd be happy to lend a hand if needed.

  • (Score: 2) by goodie (1877) on Friday June 05 2015, @05:22AM (#192379) Journal

    One thing I forgot regarding triggers...

    While they may work great, you have to make sure they save you work rather than force you to reload data that has been updated by a trigger unbeknownst to the application. I've seen cases where using triggers saved 1 SQL statement but required re-loading a bunch of objects... Obviously that was a poor choice, but for maintenance, pure DB stuff that the application does not touch (like in my example above), it could be worth it.
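
    As a concrete sketch of that archival pattern (with entirely hypothetical table and column names), a PostgreSQL version might look like:

        # An AFTER UPDATE trigger copies a row into the archive table the
        # moment the application flips it out of live scope; the data move
        # itself never touches application code.
        $dbh->do(q{
            CREATE OR REPLACE FUNCTION archive_discussion()
            RETURNS trigger AS $$
            BEGIN
                IF NEW.status = 'archived' AND OLD.status <> 'archived' THEN
                    INSERT INTO discussions_archive SELECT OLD.*;
                END IF;
                RETURN NEW;
            END;
            $$ LANGUAGE plpgsql;
        });
        $dbh->do(q{
            CREATE TRIGGER discussions_archive_trg
                AFTER UPDATE OF status ON discussions
                FOR EACH ROW EXECUTE PROCEDURE archive_discussion();
        });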