On Monday, January 4, 2016, from 8:22 PM PST to 11:37 PM PST, we experienced an outage that prevented our customers from accessing their data on Amplitude. Following the outage, data on Amplitude remained stale until 3:23 PM PST on Monday, January 11, and several important features on Amplitude were inaccessible. We know many of our customers rely on Amplitude being available and up-to-date for their businesses, and we let you down. We’d like to take this opportunity to explain what happened, how we responded, and the steps we are taking to prevent an outage like this from happening again.
Laying the foundation with pre-aggregation and lambda architecture
Three weeks ago, we announced that we are giving away a compelling list of analytics features for free for up to 10 million events per month. That’s an order of magnitude more data than any comparable service, and we’re hoping it enables many more companies to start thinking about how they can leverage behavioral analytics to improve their products. How can we scale so efficiently? It comes down to understanding the nature of analytics queries and engineering the system for specific usage patterns. We’re excited to share what we learned while scaling to hundreds of billions of events and two of the key design choices of our system: pre-aggregation and lambda architecture.
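To make the pre-aggregation idea concrete, here is a minimal sketch (names and structure are illustrative, not Amplitude's actual implementation): counters are updated as events arrive, so a common query like "how many signups happened on day D?" becomes a single lookup instead of a scan over raw events.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Pre-aggregated store: (event_type, day) -> count.
# Updated at ingest time so reads never touch raw events.
daily_counts = defaultdict(int)

def ingest(event_type, timestamp):
    """Fold one incoming event into the pre-aggregated counters."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
    daily_counts[(event_type, day)] += 1

def count_on_day(event_type, day):
    """Answer 'how many events of this type on day D?' with one lookup."""
    return daily_counts[(event_type, day)]
```

In a lambda architecture, a batch layer would periodically recompute aggregates like these from the raw event log (correcting any drift), while a speed layer keeps them current for recent data.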
Amazon Redshift has served us very well at Amplitude. Redshift is a cloud-based, managed data warehousing solution that we use to give our customers direct access to their raw data (you can read more about why we chose it over the alternatives in a post from a couple of months ago). This allows them to write SQL queries to answer ad hoc questions about user behavior in their apps.
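For a sense of what those ad hoc questions look like, here is a hypothetical example (the table and column names are invented for illustration, and SQLite stands in for Redshift so the snippet is self-contained): counting distinct purchasers per day from a raw events table.

```python
import sqlite3

# A toy stand-in for a raw-events table; real Redshift schemas will differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_type TEXT, event_day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", "purchase", "2016-01-04"),
     ("u2", "purchase", "2016-01-04"),
     ("u1", "purchase", "2016-01-05"),
     ("u1", "open_app", "2016-01-05")],
)

# Ad hoc question: how many distinct users purchased each day?
rows = conn.execute(
    """SELECT event_day, COUNT(DISTINCT user_id)
       FROM events
       WHERE event_type = 'purchase'
       GROUP BY event_day
       ORDER BY event_day"""
).fetchall()
```

The appeal of this setup is that customers are not limited to pre-built charts: any question expressible in SQL over the raw data is fair game.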
But as we scaled the number of customers and the amount of data stored, issues began emerging in our original schema. Namely, some of our customers’ queries took a long time to complete, and we started getting support tickets like this:
It was clearly time for an overhaul.
A couple of months ago, our co-founder and CEO Spenser had the pleasure of giving a tech talk hosted by our good friends at KeepSafe. He went over some of the key data challenges we face when simultaneously collecting data from hundreds of millions of devices, including:
How can you create a bucketing algorithm for an arbitrary dataset you don’t know in advance?
We get a constant stream of numerical data from our customers. We’re not sure what the range of this data might be or how many orders of magnitude it may span. The distribution of the data is constantly changing as new data is added.
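One common way to bucket values whose range is unknown in advance (offered here as a general technique, not necessarily the algorithm Amplitude settled on) is to use logarithmic buckets: each value maps to the power-of-ten interval containing it, so the scheme handles data spanning many orders of magnitude without any prior knowledge of the distribution.

```python
import math

def log10_bucket(value):
    """Map a positive number to its power-of-ten bucket [lo, hi).

    Works for values across arbitrary orders of magnitude, since the
    bucket boundaries are derived from the value itself.
    """
    exp = math.floor(math.log10(value))
    return (10 ** exp, 10 ** (exp + 1))
```

Log-scale buckets stay stable as the distribution shifts; finer-grained schemes typically subdivide each decade further once enough data has accumulated.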
Analytics and the inaccuracy of phone-reported timestamps
When your analytics code runs on tens of millions of phones, strange things can happen.
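A standard mitigation for unreliable device clocks (described here as a general approach, not a claim about Amplitude's internals) is to estimate each device's clock skew from the difference between the client-reported upload time and the server's receipt time, then shift that device's event timestamps by the estimate.

```python
def corrected_timestamps(event_times, client_upload_time, server_receive_time):
    """Shift client-reported event times by the estimated clock skew.

    skew > 0 means the device clock runs ahead of the server clock.
    Network latency is ignored in this sketch; a production version
    would account for it and smooth the estimate across uploads.
    """
    skew = client_upload_time - server_receive_time
    return [t - skew for t in event_times]
```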