We started off the new year with a failure. On January 4, 2016, one of our engineers mistakenly deleted several tables containing Amplitude metadata critical for processing and visualizing event data. Although we were able to fully recover all data, the deletion resulted in a week-long outage that affected all of our users worldwide. This was our first major outage as a company. We made a mistake - one that was costly to us and, more importantly, costly to our customers. We failed to maintain the quality of service that we committed to providing.
The days following the outage were a frenzy of figuring out how to best remedy our mistake and how to communicate what had happened to our customers. I wanted to provide insight into our decision-making process during the outage, what we’ve learned, and what we’re doing going forward.
A few years ago, Curtis and I decided to work on something that would have a positive impact on businesses all over the world. Today, Amplitude has 34 people who work together every day to enable app developers to build better products. We’ve worked hard to create a company culture that feels like a family and to define values that guide how we run the company. During our outage, we turned to our values to guide our decision-making:
- Velocity: Move as quickly as possible.
- Transparency: Be totally honest and open within the company and with our customers.
- Ownership: Be accountable for your mistakes and strive for continuous improvement.
It’s nice to have a long list of values, but they’re useless unless you live by them. This outage would put them to the test. Our competitors will use everything they can against us; some have already shared the details of our outage with prospects (and they will undoubtedly share the link to this post, too). But we needed to stick to our values and do the right thing for our customers.
The Cost of Velocity
We’re a high velocity company. Every company in Silicon Valley tries to iterate as fast as it can to build the best possible product or service, but what people don’t always realize is how fine the line is between high-velocity innovation and risk - a line entrepreneurs in the Valley walk every day.
We failed to perform the necessary due diligence to identify high-risk areas of our system and made a mistake that resulted in critical metadata tables being permanently deleted. Data processing had to be paused immediately for all users. The mistake made our dashboards unavailable, and once we were able to bring them back up, they showed only data from before the incident.
Thankfully, the issue was independent of data collection. Though data processing was paused for several days, absolutely no customer data was lost during the incident, and we continued to collect data the entire time.
But it was clear that we had moved too fast and, worst of all, we weren’t as prepared to avoid and recover from an outage as I thought we were. The outage cautioned us to temper our velocity. While we still want to keep moving fast, it’s clear that we need to put appropriate safeguards in place.
Transparency in Our Plan of Attack
This incident was significantly worse than anything else we have experienced as a company. We knew we’d measure this outage in days, rather than the minutes we were used to.
We first created a war room where the majority of our engineering team could work uninterrupted. The team worked in shifts to provide 24-hour coverage.
Once our technical strategy was in place, we were faced with another set of difficult questions: How do we communicate this to our customers? How would we be perceived for making an error that should have never been possible to make? How would our competitors use this against us? We relied on our values to guide our thinking and made the decision to be fully transparent and to own up to the gravity of our mistake. Yes, our competitors used our transparency as an opportunity for themselves. At the end of the day, though, we care about our customers more; we view all of our customers as long-term partners and being transparent was the right thing to do.
We established an ad hoc incident communication team made up of marketing, customer success, and engineering. We wanted to ensure that the concerns of all of our key stakeholders - our customers, the general public, and engineers - would be appropriately addressed in each communication.
After doing some research into best practices for incident communication, we came up with a plan that let us maintain a consistent stream of communication with our customers.
- Send a detailed email communication to all customers every 1-2 days.
- Update our status page every two hours.
- Add a prominent status bar to the website that’s updated along with the status page.
- Have our Customer Success Managers reach out to enterprise customers almost daily to provide updates and address concerns.
Owning Our Mistake
Determining the best way to make things right for our customers tested our sense of ownership. The outage disrupted how our customers do business. Several spent a significant amount of time and resources performing ad hoc analysis in legacy systems to gain visibility into how their apps were performing. We were responsible for their lost time and for forcing them to make decisions without data. In order to demonstrate our long-term commitment to our customers, we decided to compensate them as best we could, even if it meant incurring financial losses. We decided to provide:
- 50% off monthly service for Enterprise customers
- 5 million additional events for free customers
We couldn’t replace our customers’ lost time, but compensating them financially or with additional value was the best way we could make up for the damage we caused.
We’ve grown as a company in the past few weeks. For the most part, customers were understanding and appreciated our transparency. We received lots of supportive messages from customers saying that our transparency during the outage validated their decision to choose us as their analytics provider (thank you to all our customers who sent me or our team a kind note - it meant a lot to us during the outage). Other customers were upset, and rightly so.
We learned that velocity cannot come at the expense of the quality and reliability of our services. In the coming months, we will be making the necessary investments in risk assessment across the organization to prevent any future disruptions that can be avoided with proper preparation. These will include auditing delete permissions on production data, automating backups, and creating an incident response plan. Our engineering team will share the details of these changes in a future blog post.
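Of those safeguards, backup automation is the easiest to make concrete. As a purely illustrative sketch - the directory layout, file naming, and 26-hour freshness window here are assumptions, not a description of Amplitude's actual infrastructure - a maintenance script might refuse to run destructive operations unless a recent backup exists:

```python
import datetime
import pathlib

# Illustrative policy: one daily backup, plus a couple hours of slack.
MAX_BACKUP_AGE = datetime.timedelta(hours=26)

def latest_backup_age(backup_dir: pathlib.Path):
    """Return the age of the newest backup file, or None if none exist."""
    backups = sorted(backup_dir.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not backups:
        return None
    newest = datetime.datetime.fromtimestamp(backups[-1].stat().st_mtime)
    return datetime.datetime.now() - newest

def safe_to_modify(backup_dir: pathlib.Path) -> bool:
    """Gate destructive maintenance on the existence of a recent backup."""
    age = latest_backup_age(backup_dir)
    return age is not None and age <= MAX_BACKUP_AGE
```

The point of a check like this is that a broken backup job halts risky work instead of silently increasing exposure - the same class of safeguard that would have limited the blast radius of an accidental table deletion.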
Failures happen in the course of innovation, but building a company on a foundation of strong values can help you navigate those difficult times. This was a serious failure on our part. We’ve learned to take a step back and carefully assess risk, instead of letting velocity run unchecked. We took ownership of our mistake and committed ourselves to being transparent with our customers during the outage. Most importantly, we began establishing the necessary processes to prevent something similar from happening in the future.
We’ve done our best to learn from our mistake and are excited to move forward and make a positive impact on businesses all over the world.