
Instapaper Needs One Week to Restore Full Service After 31-Hour Downtime

posted by martyb on Monday February 13 2017, @03:21PM   Printer-friendly
from the how-long-would-it-take-for-YOU-to-restore-a-backup? dept.

Link bookmarking service Instapaper came back online today following a critical database issue that forced it offline for 31 hours over the past two days. According to two blog posts [1, 2] detailing what happened, on February 8, 2017, at around 21:00 GMT, Instapaper's main database filled up without anyone noticing and stopped allowing users to save new links to their accounts.

Instapaper developers said that neither its staff nor its cloud provider noticed that the database was nearing full capacity, so nobody took steps to migrate the Instapaper database to a larger server beforehand. When the database filled up, the service was left with a single option: export all Instapaper content and move it to a new server. Both operations were extremely slow, as database migrations generally are.

Instapaper came back online earlier today, on February 10, 2017, at around 3:00 GMT, after a massive and embarrassing 31-hour downtime. Nonetheless, the service isn't fully restored yet. Instapaper staff say they have only imported a small fraction of the user data into the new database. "In the interest of coming back up as soon as possible, this [database] instance only has the last six weeks of articles," Instapaper staff wrote. "For now, anything you've saved since December 20, 2016 is accessible."

The service expects to restore all data by February 17, next week, a whopping nine days after service went down.

Source:
  https://www.bleepingcomputer.com/news/software/instapaper-needs-one-week-to-restore-full-service-after-31-hour-downtime/


Original Submission

 
  • (Score: 1, Redundant) by VLM on Monday February 13 2017, @04:13PM

    by VLM (445) Subscriber Badge on Monday February 13 2017, @04:13PM (#466634)

    The nice part about the cloud is the professionals don't have to worry about the hardware; the bad part is they ignore everything...

    I'd cry bogus on the story. Nobody doesn't monitor anything; even the most fly-by-night wantrepreneurs do better work than that. More mysterious is the whole implication that they have one DB and one DB only, which probably doesn't trigger the normies but I'm WTF about that.
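
    For illustration, a minimal free-space check along those lines in Python (a sketch only; the data directory and the 80% threshold are made up, and this is not anything Instapaper is known to have run):

        import shutil

        DATA_DIR = "/var/lib/mysql"   # hypothetical database data directory
        ALERT_AT = 0.80               # warn when the volume is 80% full

        def check_disk(path=DATA_DIR, threshold=ALERT_AT):
            """Return the fraction of the volume used; complain if it crosses the threshold."""
            usage = shutil.disk_usage(path)           # named tuple: total, used, free (bytes)
            fraction_used = usage.used / usage.total
            if fraction_used >= threshold:
                # A real deployment would page someone (Zabbix trigger, email, etc.)
                print(f"WARNING: {path} is {fraction_used:.0%} full, "
                      f"{usage.free / 2**30:.1f} GiB free")
            return fraction_used

        if __name__ == "__main__":
            check_disk()

    Run from cron every few minutes, even something this crude would likely have flagged the problem well before the disk actually filled.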

    If it happened to me on OpenStack I'd shut down cleanly (assuming it didn't crash), expand the filesystem, reboot, and call it good.

    The one time I'd have to restore from backups onto a fresh server is if I got pwned and they needed the old image for forensics, since you can't trust a compromised box.

  • (Score: 2) by dyingtolive on Monday February 13 2017, @04:30PM

    by dyingtolive (952) on Monday February 13 2017, @04:30PM (#466640)

    This week-long process to return to full functionality is odd to me too. I gotta wonder how large their database was and WTF is going on that it's taking a week to manage this. I've imported ~50GB Sybase DBs in hours, and that was on hardware dating back to about the P4 era. Obviously they've got a larger DB than that, but they've also gotta be working with more modern hardware, and probably something a little less shitty than Sybase.

    --
    Don't blame me, I voted for moose wang!
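
    As a rough back-of-envelope on those numbers in Python (the "hours" for the Sybase import is assumed to be about three; every figure here is an estimate):

        GIB = 2**30

        sybase_bytes = 50 * GIB            # the ~50 GB Sybase import mentioned above
        sybase_hours = 3                   # "in hours"; assume roughly three
        rate = sybase_bytes / (sybase_hours * 3600)   # bytes per second

        print(f"old import rate: ~{rate / 2**20:.0f} MiB/s")

        # Even at that modest P4-era rate, a week of wall-clock time moves a lot of data:
        week = 7 * 24 * 3600
        print(f"one week at that rate: ~{rate * week / 2**40:.1f} TiB")

    Unless the dataset is in the many-terabyte range, that supports the point: a week-long restore says more about the pipeline than about the amount of data.
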
    • (Score: 1, Redundant) by VLM on Monday February 13 2017, @04:54PM

      by VLM (445) Subscriber Badge on Monday February 13 2017, @04:54PM (#466651)

      they've also gotta be working with more modern hardware

      I remember when I started working with multi-rack private NAS as a little corporate user; everything seemed so huge and fast.

      I'm getting the feeling they "lost it all" and are restoring from offline tape. Even that should be faster, however. Hmm.

  • (Score: 2) by Desler on Monday February 13 2017, @05:07PM

    by Desler (880) on Monday February 13 2017, @05:07PM (#466660)

    Nobody doesn't monitor anything; even the most fly-by-night wantrepreneurs do better work than that.

    I'm hoping you're veing sarcastic. [theregister.co.uk]

    • (Score: 2) by Desler on Monday February 13 2017, @05:43PM

      by Desler (880) on Monday February 13 2017, @05:43PM (#466677)

      Being of course.

    • (Score: 1, Redundant) by VLM on Monday February 13 2017, @06:04PM

      by VLM (445) Subscriber Badge on Monday February 13 2017, @06:04PM (#466685)

      That's a double fail, in that the last time I ran a DB access without checking and appropriately responding to error codes was a couple of presidents ago, after an unfortunate incident. So no error-handling code AND no database monitoring. Not bad.
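
      For illustration, a sketch of what checking that error path looks like with Python's built-in sqlite3 module (the table, column names, and helper are invented; any DB-API driver would look much the same):

          import sqlite3

          def save_bookmark(conn, user_id, url):
              """Insert a saved link and report failure instead of swallowing it."""
              try:
                  with conn:  # commits on success, rolls back on exception
                      conn.execute(
                          "INSERT INTO bookmarks (user_id, url) VALUES (?, ?)",
                          (user_id, url),
                      )
                  return True
              except sqlite3.OperationalError as exc:
                  # e.g. "database or disk is full", the failure mode in the story;
                  # log it and surface it rather than losing the save
                  print(f"save failed for user {user_id}: {exc}")
                  return False

      The details don't matter; the point is simply that a full-disk error gets noticed and reported the moment it happens.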

      I think I'm beginning to understand why I sleep through the night and have plenty of time, when the stereotype on HN and other places is that devs are always up at 2am debugging stuff and nothing ever gets done on time.

      I got flymake on my emacs so I "can't" make syntax errors, and jenkins, and unit testing, and a complete ELK stack, and zabbix, and gitlab, and ... frankly because of all that I got it pretty easy. Stuff just works.

      • (Score: 2) by bzipitidoo on Monday February 13 2017, @10:40PM

        by bzipitidoo (4388) on Monday February 13 2017, @10:40PM (#466764) Journal

        It's almost certainly a management fail. They may have decided to save money by not having a DB admin, pushing that work onto one overworked sysadmin who was steadily falling behind, barely able to tamp down one fire before he had to fight two more. When he warned them that the DB was about to run out of room, they may have blown him off. I have seen management that was that bad. They couldn't be bothered to understand the situation, choosing to view the risk of swerving off a mountainside road as about the same as swerving off any other road, no matter how much their knowledgeable experts tried to tell them otherwise.

        If the disaster is down to the extreme incompetence of their technical people-- and the incompetence required to muff a simple problem of running out of room is extremely extreme-- then it is still management responsibility. Have to ask why they had such bad help? They were too cheap to pay prevailing salaries? Maybe they indulged in nepotism? They didn't check their hires to see if they were lying about the technical work they could do?

  • (Score: 1, Insightful) by Anonymous Coward on Monday February 13 2017, @06:09PM

    by Anonymous Coward on Monday February 13 2017, @06:09PM (#466688)

    I have seen the same thing with owned servers. Didn't even know it was cloud until you pointed it out.

    When you hire college kids and H1Bs with no experience, you get what you get; they learn a lesson and you go out of business. The older dudes have already been burned and know 'uh, you *really* want to do xyz'.