
posted by martyb on Monday February 13 2017, @03:21PM   Printer-friendly
from the how-long-would-it-take-for-YOU-to-restore-a-backup? dept.

Link bookmarking service Instapaper came back online today following a critical database issue that forced it offline for 31 hours over the past two days. According to two blog posts [1, 2] detailing what happened, on February 8, 2017, at around 21:00 GMT, Instapaper's main database mindbogglingly filled up without anyone noticing and stopped allowing users to save new links to their accounts.

Instapaper's developers said that neither its staff nor its cloud provider noticed that the database was nearing full capacity, so nobody took precautions to migrate the Instapaper database to a larger server beforehand. Once the database filled up, the service was left with only one option: export all Instapaper content and move it to a new server. Both operations were extremely slow, as large database migrations generally are.
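Neither blog post spells out what monitoring was or wasn't in place, so the following is only a hypothetical sketch of the kind of scheduled capacity check that catches this failure mode before writes start bouncing. It assumes a MySQL backend and invents the connection details, the capacity figure, the 80% threshold, and the alert hook purely for illustration; none of it is Instapaper's actual setup.

    # Hypothetical capacity check for a MySQL database (not Instapaper's actual setup).
    # Run it from cron and wire alert() into whatever paging system is in use.
    import pymysql

    THRESHOLD = 0.80               # warn when estimated usage exceeds 80% of capacity
    CAPACITY_BYTES = 2 * 1024**4   # assumed capacity limit for this instance (2 TiB)

    def current_size_bytes(conn):
        """Sum data + index size across all schemas, as reported by information_schema."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT COALESCE(SUM(data_length + index_length), 0) "
                "FROM information_schema.tables"
            )
            (size,) = cur.fetchone()
            return int(size)

    def alert(message):
        # Placeholder: send to a pager, chat channel, or email instead of stdout.
        print(f"ALERT: {message}")

    def main():
        conn = pymysql.connect(host="db.example.internal", user="monitor", password="...")
        try:
            used = current_size_bytes(conn)
            ratio = used / CAPACITY_BYTES
            if ratio >= THRESHOLD:
                alert(f"database at {ratio:.0%} of capacity ({used} bytes); plan a migration now")
        finally:
            conn.close()

    if __name__ == "__main__":
        main()

Any equivalent check, whether a cron script like this, a Zabbix item, or a cloud provider's storage alarm, would likely have flagged the problem weeks before writes started failing and left time for a planned migration.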

Instapaper came back online earlier today, on February 10, 2017, at around 3:00 GMT, after a massive and embarrassing 31-hour downtime. Nonetheless, the service isn't fully restored yet. Instapaper's staff say they have imported only a small fraction of the user data into the new database. "In the interest of coming back up as soon as possible, this [database] instance only has the last six weeks of articles," Instapaper staff wrote. "For now, anything you've saved since December 20, 2016 is accessible."

The service expects to restore all data by February 17, next week, a whopping nine days after the service went down.

Source:
  https://www.bleepingcomputer.com/news/software/instapaper-needs-one-week-to-restore-full-service-after-31-hour-downtime/


Original Submission

 
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2) by Desler on Monday February 13 2017, @05:07PM

    by Desler (880) on Monday February 13 2017, @05:07PM (#466660)

    Nobody doesn't monitor anything, even the most fly-by-night wantrepreneurs do better work than that.

    I'm hoping you're being sarcastic. [theregister.co.uk]

  • (Score: 1, Redundant) by VLM on Monday February 13 2017, @06:04PM

    by VLM (445) Subscriber Badge on Monday February 13 2017, @06:04PM (#466685)

    That's a double fail, in that the last time I ran a DB access without checking and appropriately responding to error codes was a couple of presidents ago, after an unfortunate incident. So no error handling code AND no database monitoring; not bad. (A minimal sketch of that kind of error handling follows this comment.)

    I think I'm beginning to understand why I sleep through the night and have plenty of time, when the stereotype on HN and other places is that devs are always up at 2am debugging stuff and nothing ever gets done on time.

    I've got flymake on my emacs so I "can't" make syntax errors, and jenkins, and unit testing, and a complete ELK stack, and zabbix, and gitlab, and ... frankly, because of all that I've got it pretty easy. Stuff just works.
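    As a concrete illustration of the error handling described above, here is a hypothetical sketch: check the result of every database write and escalate loudly instead of pretending it worked. The bookmarks table, the save_link function, and the logging setup are invented for the example and are not taken from Instapaper's code.

        # Minimal sketch: fail loudly when a DB write is rejected, e.g. because the
        # table has hit its size limit (MySQL error 1114, "The table ... is full").
        import logging

        import pymysql

        log = logging.getLogger("bookmarks")

        def save_link(conn, user_id, url):
            """Insert a saved link; alert and re-raise on failure rather than swallow it."""
            try:
                with conn.cursor() as cur:
                    cur.execute(
                        "INSERT INTO bookmarks (user_id, url) VALUES (%s, %s)",
                        (user_id, url),
                    )
                conn.commit()
            except pymysql.err.OperationalError as exc:
                code = exc.args[0] if exc.args else None
                log.error("write failed for user %s (MySQL error %s): %s", user_id, code, exc)
                conn.rollback()
                # Re-raise so the caller returns an error and monitoring notices,
                # instead of the save silently disappearing.
                raise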

    • (Score: 2) by bzipitidoo on Monday February 13 2017, @10:40PM

      by bzipitidoo (4388) on Monday February 13 2017, @10:40PM (#466764) Journal

      It's almost certainly a management fail. They may have decided to save money by not having a DB admin, pushing that work onto one overworked sysadmin who was steadily falling behind, barely able to tamp down one fire before he had to fight two more. When he warned them that the DB was about to run out of room, they may have blown him off. I have seen management that was that bad. They couldn't be bothered to understand the situation, choosing to view the risk of swerving off a mountainside road as about the same as swerving off any other road, no matter how much their knowledgeable experts tried to tell them otherwise.

      If the disaster is down to the extreme incompetence of their technical people -- and the incompetence required to muff a simple problem of running out of room is extremely extreme -- then it is still management's responsibility. You have to ask why they had such bad help. Were they too cheap to pay prevailing salaries? Did they indulge in nepotism? Did they fail to check whether their hires were lying about the technical work they could do?