We started off the new year with a failure. On January 4, 2016, one of our engineers mistakenly deleted several tables containing Amplitude metadata critical for processing and visualizing event data. Although we were able to fully recover all data, the deletion resulted in a week-long outage that affected all of our users worldwide. This was our first major outage as a company. We made a mistake – one that was costly to us, and more importantly, costly to our customers. We failed to maintain the quality of service that we had committed ourselves to providing.
The days following the outage were a frenzy of figuring out how best to remedy our mistake and how to communicate what had happened to our customers. I want to provide insight into our decision-making process during the outage, what we’ve learned, and what we’re doing going forward.