Slash Boxes

SoylentNews is people

posted by NCommander on Thursday June 04 2015, @11:38AM   Printer-friendly
from the input-wanted dept.

Moving on from frontend stuff, I'm getting to the point that I want to dig in deep and rewrite most of the database layer of SoylentNews/rehash. For those who are unfamiliar with our codebase, its primarily written in perl, with fairly strict adherence to the MVC model, going as far as installing system modules for code shared between the frontend and backend daemons. Slash was written with database portability in mind, and at least historically, a port to postgreSQL existed in the late 90s/early 2000s, and there was some legacy Oracle code authored by VA Linux as well. This code has bitrotted to the point of unusability, leaving the MySQL backend the only functional mode; I deleted the legacy code about a year ago from our git repo.

However, migrating from MySQL has remaining on my personal TODO list for a long time, due to unreliability, data corruption, and configuration pain. The obvious choice from where I'm sitting is postgreSQL. For those who aren't super familiar with the pro/cons with MySQL, this article by Elnur Abdurrakhimov has a pretty good summary and a list of links explaining in-depth why MySQL is not a good solution for any large site. We've hit a lot of pain in the nearly 1.5 years SN has been up due to limitations in the database layer as well, forcing us to use a clustering solution to provide any sort of real redundancy for our backend. Although I'm familiar with database programming, I'm not a DBA by trade, so I'm hoping to tap into the collective knowledge of the SoylentNews community and work out a reasonable migration plan and design.

[More after the break...]

The "Why" Factor

Beside my personal dislike of MySQL, there's a more important reason to migrate from MySQL. MySQL's support for stored procedures is incredibly poor, which means raw SQL has to be written in the application layer. rehash reduces the danger of injection by providing a set of wrapper functions such as select/insert/update which take four arguments: table, from clause, where clause, and anything extra if necessary; these parameters are assembled into a full query which is in turn properly escaped to prevent most obvious attacks from working. Extensive whitelists are used for sanitizing parameters, but by design, rehash uses a single namespace, with a single user account which has full SELECT/INSERT/UPDATE/DELETE permissions across the board. If any single point is compromised, the entire database is toast. Furthermore, because of poor support for views in MySQL, massive JOINs litter the codebase, making some queries reaching across 5-6 tables (with the most horrific example I can think of being the modbomb SELECT which reaches across almost every user and comment table in the database). This makes debugging and optimizing anything a slow and *painful experience.

The End Goal

What I want to do is remove as much code out of the application layer, and move it down the stack into the database. Each function in Slash/DB/MySQL/ should be replicated with a stored procedure which at a minimum executes the query, and if possible, relocate as much of query processing logic into pg/Perl modules. This should be relatively straightforward to implement, and allow high code reusability due to the fact that almost all of rehash's methods exist in perl modules, and not in individual .pl scripts. The upshot of this is that the only permission the DB account requires is EXECUTE to run the stored procedures; if possible, I'd also like to localize which perl function can call which pgSQL procedure; i.e., the getStories() function can only call procedures relating to that, vs. having access to all stored procedures in the database.

This would greatly reduce the reach of any SQL injection attacks, as well as hardening the site against possible compromise; unrestricted access to the database would require breaching one of the DB servers directly instead of getting lucky via a bug in rehash. As I've stated before, no security can be perfect, but I would love to get this site to the point that only a dedicated, targeted attack would even stand a chance of succeeding. That being said, while it sounds good on paper, I'm not 100% sure this type of design is reasonable. Very few FOSS projects seem to take advantage of stored procedures, triggers, views and other such functionality and I'm wondering if others have tried and failed to implement this level of functionality.

My Plan to Implementation

So, knowing what you want to do is good, but knowing how to do it is just as important. What I think the first step needs to be is a basic straight-port of the site from MySQL to postgreSQL, and implement a better schema upgrade system. As of right now, our upgrade "system" is writing queries in a text file, and just executing them when a site upgrade has to be done. Very low-tech, and error prone for a large number of queries. I don't have a lot of experience in managing very large schemas, so one thing on which I'd like advice is if there's a good, pre-existing framework that could be used to simplify our database upgrades. By far, the ideal scenario would be if we could run a single script which can intelligently upgrade the database from release to release.

Once both these pieces are in play, a slow but steady migration of code from the application layer to the database layer would allow us to handle the transition in a sane manner. Database auditing can be used to keep track of the most frequently used queries and function calls, and keep an eye on our total progress towards reaching an EXECUTE-only world.

That's everything in a nutshell. I want to know what you guys think, and as always, I'll be reading and replying to comments down below!

~ NCommander

This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 0) by Anonymous Coward on Thursday June 04 2015, @04:30PM

    by Anonymous Coward on Thursday June 04 2015, @04:30PM (#192144)

    Software quality is difficult to do a cost/benefit analysis on, but it is possible with sophisticated methods, and some are very straight forward. Most sites that use MySQL successfully either heavily patch it, do not mind a small amount of hard to quantify data going missing, or use it as a dumb key value store (a lot of NoSQL proponents were disgruntled MySQL users with little experience with more reliable database technologies). PostgreSQL does more to protect the user from technical and functional surprises, such as designs that cannot loose data in edge cases and feature behaviors that are less likely to cause developers to go WTF‽ when they are almost done with implementing them. MySQL is a soup of WTF‽ if you are paying attention.

    Regarding VARCHAR, people misuse those all the time. In PostgreSQL, VARCHAR and TEXT perform the same, so just use TEXT and put a constraint on it if you need it. There are very few reasons why you want the maximum length to be a part of the data type & interface.