Amazon has published a detailed postmortem explaining how a critical fault in DynamoDB's DNS management system cascaded into a day-long outage that disrupted major websites and services across multiple brands – with damage estimates potentially reaching hundreds of billions of dollars.
The incident began at 11:48 PM PDT on October 19 (7.48 UTC on October 20), when customers reported increased DynamoDB API error rates in the Northern Virginia US-EAST-1 Region. The root cause was a race condition in DynamoDB's automated DNS management system that left an empty DNS record for the service's regional endpoint.
The DNS management system comprises two independent components (for availability reasons): a DNS Planner that monitors load balancer health and creates DNS plans, and a DNS Enactor that