Amplitude Dashboard Outage: Post Mortem
On Monday, January 4, 2016, from 8:22 PM PST to 11:37 PM PST, we experienced an outage that prevented our customers from accessing their data on Amplitude. Following the outage, data on Amplitude remained stale until 3:23 PM PST on Monday, January 11, and several important features on Amplitude were inaccessible. We know many of our customers rely on Amplitude being available and up to date for their businesses, and we let you down. We’d like to take this opportunity to explain what happened, how we responded, and the steps we are taking to prevent outages like this from happening again.
On Monday, January 4, at 8:22 PM PST, an engineer erroneously ran a script in the production environment that was meant to run on a development environment. The script deleted four tables on DynamoDB that contained metadata used for processing events and querying data. Specifically, these tables contained the following information:
- Internal configuration of services
- File metadata used by the query engine
- Metadata pertaining to all device IDs we have seen
- Metadata pertaining to all Amplitude IDs we have assigned
When these tables were deleted, the web reporting dashboards on Amplitude became inaccessible. In addition, our processing pipeline halted, as it could not proceed without the ID information. Event data from clients was still being collected and stored in a queue that could be processed later.
Our immediate priority was to make the dashboards accessible again. Our query engine uses the internal configuration to determine which partitions to query and the file metadata to determine where the data physically lives. We were able to recover the internal configuration and file metadata from backups within a few hours.
At 11:37 PM PST, customers were able to access most of the dashboards. Since processing was still paused, the dashboards reflected data collected prior to 8:22 PM PST. Real-time activity, user timelines, Microscope, cohort recomputation and downloads all relied on information in the tables we hadn’t recovered yet, so these features remained unavailable.
The next step was to recover the two ID tables. Unfortunately, we did not have backups for these tables. We did, however, have all the historical events, which we could use to recreate the data in those tables. At 1 AM PST on Tuesday, January 5, we began developing and testing a sequence of MapReduce jobs to reconstruct and then repopulate the data. At 1 PM PST, we started the job to reconstruct the data; it took about 14 hours to complete.
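To illustrate the general idea of rebuilding an ID table from historical events, here is a minimal in-memory sketch of a map step and a reduce step. It is not our actual job: the event fields (device_id, amplitude_id, timestamp), the output columns, and the rule of keeping the ID from the earliest event are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape; the real schema is not described in this post.
Event = Dict[str, Any]


def map_event(event: Event) -> Iterator[Tuple[str, Tuple[int, int]]]:
    """Map step: emit (device_id, (timestamp, amplitude_id)) for each event."""
    yield str(event["device_id"]), (int(event["timestamp"]), int(event["amplitude_id"]))


def reduce_device(device_id: str, values: Iterable[Tuple[int, int]]) -> Dict[str, Any]:
    """Reduce step: fold every observation of one device into one metadata row."""
    ordered = sorted(values)  # tuples sort by timestamp first
    return {
        "device_id": device_id,
        "first_seen": ordered[0][0],
        "last_seen": ordered[-1][0],
        # Assumed rule: keep the Amplitude ID from the earliest event.
        "amplitude_id": ordered[0][1],
    }


def rebuild_device_table(events: Iterable[Event]) -> List[Dict[str, Any]]:
    """Run the map, shuffle (group by key), and reduce phases in memory."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for event in events:
        for key, value in map_event(event):
            grouped[key].append(value)
    return [reduce_device(key, grouped[key]) for key in sorted(grouped)]
```

In a real MapReduce framework the grouping is done by the shuffle phase across many machines; the dictionary here only stands in for that step.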
On Wednesday, January 6 at 4:30 AM PST, we began repopulating the ID tables. We kicked off the final MapReduce job at 3:30 PM PST and began validating the repopulated dataset in parallel. The jobs and validation completed on Thursday, January 7 at 1:30 PM PST. At this point, the dashboards were fully functional for data collected prior to 8:22 PM PST on January 4. We then resumed data processing on the event backlog.
We originally anticipated that processing the backlog would take 1-2 days, but we had to push back the estimate by several days. In typical operation, our collection servers throttle devices that send data volumes many orders of magnitude greater than is realistic, using information supplied by our processing pipeline. During the outage, this throttling was inactive, so we collected significantly more data than usual. This caused the backlog to take longer than expected to process.
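As a rough illustration of the kind of per-device throttling described above, the sketch below uses a simple fixed-window counter. It is an assumption-laden example, not our collection server code: the class name, the limit, and the window length are all hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class DeviceThrottle:
    """Fixed-window counter that rejects events from devices over a per-window limit."""

    def __init__(self, max_events_per_window: int, window_seconds: int = 3600) -> None:
        # Both values are hypothetical; a real limit would be tuned from observed traffic.
        self.max_events = max_events_per_window
        self.window_seconds = window_seconds
        # (device_id, window index) -> number of events accepted in that window
        self._counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, device_id: str, timestamp: float) -> bool:
        """Return True if the event should be accepted, False if it is throttled."""
        window = int(timestamp // self.window_seconds)
        key = (device_id, window)
        if self._counts[key] >= self.max_events:
            return False
        self._counts[key] += 1
        return True
```

This version never discards old windows, so a long-running service would also need to expire stale counters.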
On Monday, January 11 at 9:30 AM PST, we completed processing the backlog and began validating the data on the dashboards. After extensive testing, at 3:23 PM PST, we confirmed that all data had been correctly processed and resumed normal operation. Throughout the incident, data collection remained fully operational.
Why did it happen?
This incident and the length of the subsequent recovery resulted from a combination of factors.
We unfortunately did not have sufficient protection against a script running on the production environment that could delete operationally critical tables. The recovery was made difficult because we did not have usable backups for some of our tables in DynamoDB, which forced us to reconstruct a large amount of state from historical data. Even for tables with backups, their recovery was delayed because we did not have procedures in place to recover data from those backups in an efficient manner.
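One common form of protection against this class of mistake is a guard that destructive scripts call before touching any table. The sketch below shows the idea only; the function names, the APP_ENV variable, and the retype-the-table-name confirmation are hypothetical and do not describe our tooling.

```python
import os


class ProductionGuardError(RuntimeError):
    """Raised when a destructive script is pointed at production without confirmation."""


def current_environment() -> str:
    # APP_ENV is a hypothetical variable name. Defaulting to "production"
    # means a missing setting fails closed rather than open.
    return os.environ.get("APP_ENV", "production")


def assert_safe_to_delete(environment: str, table_name: str, confirm: str = "") -> None:
    """Refuse a delete on production unless the caller retypes the table name."""
    is_production = environment.strip().lower() in {"prod", "production"}
    if is_production and confirm != table_name:
        raise ProductionGuardError(
            f"refusing to delete {table_name!r} in {environment!r} without confirmation"
        )
```

A script would call assert_safe_to_delete(current_environment(), table_name) before issuing any delete, so an unset or production environment stops the run by default.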
The engineering team worked to resolve the problem throughout Monday night, but did not notify the rest of the organization until the following day. We did not have a clear process for escalation in place, which caused our initial response and communication to customers to be significantly delayed.
Once the incident was escalated properly, we notified all customers via email with an explanation of the situation and our best estimate of when we would be fully recovered. However, we underestimated how long it would take to get back to a fully recovered state and thus presented estimates that were incorrect and had to be pushed back.
What are we doing to prevent it from happening again?
There’s a lot to learn and improve on from this incident.
We’ve already taken steps to restrict AWS accounts from having delete access to critical data, and will be using finer-grained permissions on our AWS accounts. We will reevaluate the permissions we grant to each account and role and make sure those permissions are the minimum necessary.
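As an example of what a finer-grained permission can look like, the sketch below builds an IAM policy document that explicitly denies the dynamodb:DeleteTable action on a list of tables. The function name, the statement ID, and the ARNs in the usage line are placeholders for illustration, not our actual policies.

```python
import json
from typing import List


def deny_delete_policy(table_arns: List[str]) -> str:
    """Build an IAM policy document that denies DeleteTable on the given tables."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDeleteOfCriticalTables",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": ["dynamodb:DeleteTable"],
                "Resource": table_arns,
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Placeholder account ID and table name, for illustration only.
print(deny_delete_policy(["arn:aws:dynamodb:us-east-1:123456789012:table/example-table"]))
```

In IAM, an explicit Deny takes precedence over any Allow, so a statement like this holds even if a broader policy grants DynamoDB access.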
We are setting up automated backups for the few remaining databases that currently do not have backups. In addition, we plan to develop and rehearse methods of quickly recovering from the backups.
Additionally, we’ll be performing a comprehensive review of our system over the next few months to identify weak points and ensure that we’re not vulnerable to an incident like this in the future. We plan to share the results from the review on this blog a few months from now.
Lastly, we are also putting in place policies and procedures for incident response, to reduce the time it takes for customers to be notified about outages and the time it takes for services to come back online.
Thank you for being patient with us throughout the outage. We sincerely apologize for the downtime and understand that our customers rely on our service being available for their businesses. We will do everything we can to improve our processes to ensure that you can rely on Amplitude in the future.
Thank you for your support.
Last Updated: 05/03/18