Link bookmarking service Instapaper came back online today following a critical database issue that forced it offline for 31 hours over the past two days. According to two blog posts [1, 2] detailing the incident, on February 8, 2017, at around 21:00 GMT, Instapaper's main database filled up, astonishingly without anyone noticing, and stopped allowing users to save new links to their accounts.
Instapaper developers said that neither its staff nor its cloud provider noticed that the database was nearing full capacity, so nobody took precautions to migrate the Instapaper database to a larger server beforehand. When it filled up, the service was left with a single option: export all Instapaper content and move it to a new server. Both operations were extremely slow, as database migrations of this size generally are.
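A capacity alert of the kind that would have flagged this is trivial to script. Here is a minimal sketch using only Python's standard library; the mount path and threshold are hypothetical examples, not Instapaper's actual setup:

```python
import shutil

def usage_fraction(used, total):
    """Fraction of capacity consumed; a zero-sized volume counts as full."""
    return used / total if total else 1.0

def volume_near_capacity(path="/", threshold=0.85):
    """True when the filesystem holding `path` is at or past the threshold."""
    u = shutil.disk_usage(path)
    return usage_fraction(u.used, u.total) >= threshold
```

Run from cron against the database volume and wired to an alert on a True result, a check like this turns "the database silently filled up" into a warning issued weeks in advance.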
Instapaper came back online earlier today, on February 10, 2017, at around 3:00 GMT, after a massive and embarrassing 31-hour outage. However, the service isn't fully restored yet. Instapaper staff say they imported only a small fraction of user data into the new database. "In the interest of coming back up as soon as possible, this [database] instance only has the last six weeks of articles," Instapaper staff wrote. "For now, anything you've saved since December 20, 2016 is accessible."
The service expects to restore all data by February 17, next week, a whopping nine days after the outage began.
(Score: 2) by Desler on Monday February 13 2017, @05:07PM
Nobody skips monitoring entirely; even the most fly-by-night wantrepreneurs do better work than that.
I'm hoping you're being sarcastic. [theregister.co.uk]
(Score: 2) by Desler on Monday February 13 2017, @05:43PM
I was, of course.
(Score: 1, Redundant) by VLM on Monday February 13 2017, @06:04PM
That's a double fail, in that the last time I ran a DB access without checking and appropriately responding to error codes was a couple of presidents ago, after an unfortunate incident. So no error-handling code AND no database monitoring. Not bad.
I think I'm beginning to understand why I sleep through the night and have plenty of time, when the stereotype on HN and other places is devs always up at 2am debugging stuff and nothing ever getting done on time.
I've got flymake on my emacs so I "can't" make syntax errors, and jenkins, and unit testing, and a complete ELK stack, and zabbix, and gitlab, and... frankly, because of all that, I've got it pretty easy. Stuff just works.
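For the "no error handling" half of the double fail: a write that can fail for lack of disk space should surface that failure instead of swallowing it. A hypothetical sketch using Python's built-in sqlite3 module; the `links` table and the print-based logging are illustrative only, not Instapaper's actual stack:

```python
import sqlite3

def save_link(conn, url):
    """Attempt a write and report failure rather than silently dropping it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO links (url) VALUES (?)", (url,))
        return True
    except sqlite3.OperationalError as exc:
        # "database or disk is full" lands here; log and alert, don't swallow it
        print(f"save failed: {exc}")
        return False
```

The point isn't the library; it's that the caller gets a definitive success/failure answer, so a full disk shows up in the logs on the first failed save rather than days later.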
(Score: 2) by bzipitidoo on Monday February 13 2017, @10:40PM
It's almost certainly a management fail. They may have decided to save money by not having a DB admin, pushing that work onto one overworked sysadmin who was steadily falling behind, barely able to tamp down one fire before having to fight two more. When he warned them that the db was about to run out of room, they may have blown him off. I have seen management that bad. They couldn't be bothered to understand the situation, choosing to view the risk of swerving off a mountainside road as about the same as swerving off any other road, no matter how hard their knowledgeable experts tried to tell them otherwise.
If the disaster is down to the extreme incompetence of their technical people -- and the incompetence required to muff a problem as simple as running out of room is extreme indeed -- then it is still management's responsibility. You have to ask why they had such bad help. Were they too cheap to pay prevailing salaries? Did they indulge in nepotism? Did they fail to check whether their hires were lying about the technical work they could do?