El Reg published an article [theregister.co.uk] today that goes into detail about why AWS went tits up (as El Reg likes to put it) on Sunday (reported here [soylentnews.org] on Monday):
Today, it's emerged the mega-outage was caused by vital systems in one part of AWS taking too long to send information to another part that was needed by customers.
Picture a steakhouse in which the cooks are taking so long to prepare the food, the side dishes have gone cold by the time the waiters and waitresses take the plates from the chef to the hungry diners. The orders have to be started again from scratch, the whole operation is overwhelmed, the chef walks out, and ultimately customers aren't getting fed. A typical Gordon Ramsay kitchen nightmare.
In technical terms, the internal metadata servers in AWS's DynamoDB database service were not answering queries from the storage systems within a particular time limit.
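In plainer terms: the storage servers ask DynamoDB's internal metadata service for the information they need, give up when the answer doesn't arrive within the deadline, and start the request again from scratch, which only piles more load onto the already-slow metadata fleet. Here's a minimal Python sketch of that general timeout-and-retry pattern (purely illustrative; the function names, deadline, and retry count are assumptions, not Amazon's actual code or values):

```python
import time

# Purely illustrative sketch -- not Amazon's code. It shows how a hard deadline
# on metadata lookups turns slow answers into failures, and failures into
# retries that add yet more load on the metadata servers.

METADATA_DEADLINE_S = 0.5   # assumed time a storage server waits for an answer
MAX_RETRIES = 3             # assumed retry budget before giving up entirely

def fetch_metadata(simulated_latency_s):
    """Pretend metadata query: only succeeds if the reply beats the deadline."""
    time.sleep(min(simulated_latency_s, METADATA_DEADLINE_S))
    if simulated_latency_s > METADATA_DEADLINE_S:
        raise TimeoutError("metadata service did not answer within the deadline")
    return {"table_partitions": "..."}

def storage_server_request(simulated_latency_s):
    """Each timeout triggers a retry, multiplying load on the metadata fleet."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fetch_metadata(simulated_latency_s), attempt
        except TimeoutError:
            continue  # start the whole order again from scratch
    raise RuntimeError("metadata never arrived; request fails outright")

if __name__ == "__main__":
    for latency in (0.2, 0.6):   # healthy vs. degraded metadata service
        try:
            _, attempts = storage_server_request(latency)
            print(f"latency {latency}s: served after {attempts} attempt(s)")
        except RuntimeError as err:
            print(f"latency {latency}s: {err} ({MAX_RETRIES} timed-out calls)")
```

Once the simulated latency crosses the deadline, every request turns into a full set of failed calls, which is roughly the "orders started again from scratch" problem the steakhouse analogy describes.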
It gets worse, however, as it seems to be happening again:
As your humble hack hammers away at the keyboard, the Amazon DynamoDB service in the US-East-1 region is suffering from "increased error rates", which started at 0633 PT today. Hours into the disruption, the team is battling to improve the situation.
El Reg was kind enough to provide a link [amazon.com] to a page where Amazon's engineers describe the cause of the problem in great detail.