posted by NCommander on Friday March 14 2014, @06:44AM
from the timebombs-are-exciting dept.
We had an hour or so of downtime today. After debugging, we traced the root cause to the SSL certificates we use to establish the database connection from the webserver to the actual DB. As a prelude to go-live, we migrated from unencrypted to encrypted connections, since the traffic has to cross the Linode internal LAN. To improve data security, we generated a set of SSL certificates and used those to encrypt the MySQL connections. In the flurry of go-live, no one thought to check the expiry date on said certificates. Out of the box, OpenSSL generates certificates with a one-month expiry unless manually changed.

As you might expect, one month later the certificates expired and the database stopped accepting remote connections. We generated new certificates with a ten-year expiration, and we continue to work towards better documenting our internal processes on the wiki to prevent this sort of thing from happening again. Apache and slashd are running again, and we appear to be back to the status quo in terms of site operation.
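
For anyone wanting to catch this kind of timebomb before it goes off, here is a rough sketch (not what we actually run) of a check that could be dropped into cron. It asks the openssl command-line tool for a certificate's end date and works out how many days are left. The function names, the 14-day warning threshold, and the certificate path in the usage note are all made up for illustration.

```python
import ssl
import subprocess
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now):
    """Days between `now` (epoch seconds) and a notAfter string
    such as 'Mar 14 06:44:00 2014 GMT'. Negative means already expired."""
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def read_not_after(cert_path):
    """Ask the openssl CLI for the certificate's end date.
    Requires the openssl binary to be on PATH."""
    out = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # Output looks like: notAfter=Mar 14 06:44:00 2014 GMT
    return out.split("=", 1)[1]


def expiring_soon(cert_path, warn_days=14):
    """True if the certificate expires within `warn_days` (an assumed threshold)."""
    remaining = days_until_expiry(read_not_after(cert_path), time.time())
    return remaining < warn_days
```

Calling expiring_soon("/path/to/server-cert.pem") from a daily cron job and mailing the admins when it returns True would have flagged this outage about two weeks ahead of time. The path is a placeholder; use wherever your MySQL certificates actually live.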

A full incident report will be written up and posted to the wiki in the next few days.
  • (Score: 2) by juggs on Friday March 14 2014, @08:31AM

    by juggs (63) on Friday March 14 2014, @08:31AM (#16229) Journal

    In short - teething pains.

    I'm sure you guys will outgrow them. It's been a very fast journey down a very rough road, and you've done well to get to where you have so soon. Applaud yourselves for your successes so far rather than dwell on the negatives; just put in place methods to prevent them happening again and move on.

    Obligatory car analogy:-
    You're put into the driving seat of a WRC (World Rally Championship) car at the starting line of a 30 km gravel stage, having never driven on a loose surface or anything so feisty as a WRC car. The countdown is already at 1 second, and your co-driver is shouting something incoherent into your earpiece along the lines of "Go! Go! Go! And in 60 5 left then over crest 4 right then 20 2 right through gate then 100 4 left opening to 6 left 400 CAUTION jump into 1 right 20 and into 1 right over crest to 4 left"

    Well if you survived that without hitting a tree you did well as that was just 20 seconds into the stage. Reality is you already hit a tree, lots of them.

    I think the lack of trees hit so far is laudable.

  • (Score: 2) by NCommander on Friday March 14 2014, @08:33AM

    by NCommander (2) on Friday March 14 2014, @08:33AM (#16230) Homepage Journal

    To be honest, it was nice to have a crisis right now that was completely technical rather than more of the recent drama. It really says something about the state of recent events that I can say that with a straight face.

    • (Score: 2) by Reziac on Saturday March 15 2014, @05:04AM

      by Reziac (2489) on Saturday March 15 2014, @05:04AM (#16751) Homepage

      Is this why yesterday I got the "503 guru meditation varnish cache" gibberish?

      Very glad too that it was just a technical glitch and not anything Dreadful.

      • (Score: 2) by NCommander on Saturday March 15 2014, @12:01PM

        by NCommander (2) on Saturday March 15 2014, @12:01PM (#16810) Homepage Journal

        Yeah. Apache (due to mod_perl) shat itself when the database went away, so Varnish started complaining about guru meditation due to ENOBACKEND.

