Here at Amplitude we’ve chosen to build and deploy with Docker and SaltStack. Docker allows us to minimize the configuration and customization required to deploy our services. SaltStack is a powerful systems and configuration management tool that is fast, scales well, and is highly extensible for solving just about any infrastructure automation or orchestration problem.
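As a rough sketch of how the two fit together (the service name, image, and ports below are illustrative, not our actual configuration), a Salt state can declare that a Docker container should be running on a host:

```yaml
# Illustrative Salt state only: "example-service" and its image/ports
# are hypothetical stand-ins, not Amplitude's real deployment config.
example-service:
  docker_container.running:
    - image: example/service:1.0
    - port_bindings:
        - 8080:8080
    - restart_policy: always
```

Applying a state like this across a fleet is what lets the configuration live in one place while Docker keeps the per-host setup minimal.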
It’s impossible to design the perfect product from day one.
At some point, you realize that the type of user you care about has changed, or that you built something designed for 1999, or that the designer you hired has better ideas than you (the engineer) did.
It’s all too easy to forget to build for the future, especially when you think you know exactly where your product is headed.
A lot has changed design-wise and product-wise over the years at Amplitude, which got us thinking again about our front-end architecture. It’s been a long time coming, but we’ve designed a new front-end architecture, which we’re calling Lightning, that can handle all the changes we have on the horizon without sacrificing how fast we can build.
This post is an initial look at our motivations for coming up with a new front-end architecture and the basic approach we took. We’ll be talking more specifically about what we’ve learned along the way in our upcoming posts.
On Monday, January 4, 2016, from 8:22 PM PST to 11:37 PM PST, we experienced an outage that prevented our customers from accessing their data on Amplitude. Following the outage, data on Amplitude remained stale until 3:23 PM PST on Monday, January 11, and several important features on Amplitude were inaccessible. We know many of our customers rely on Amplitude being available and up-to-date for their businesses, and we let you down. We’d like to take this opportunity to explain what happened, how we responded, and the steps we are taking to prevent outages like this from happening again.
Update: In May 2016 we updated our analytics architecture to NOVA. Read the article here.
Laying the foundation with pre-aggregation and lambda architecture
Three weeks ago, we announced that we are giving away a compelling list of analytics features for free for up to 10 million events per month. That’s an order of magnitude more data than any comparable service, and we’re hoping it enables many more companies to start thinking about how they can leverage behavioral analytics to improve their products. How can we scale so efficiently? It comes down to understanding the nature of analytics queries and engineering the system for specific usage patterns. We’re excited to share what we learned while scaling to hundreds of billions of events and two of the key design choices of our system: pre-aggregation and lambda architecture.
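To give a feel for the pre-aggregation idea in miniature (the event names and data here are made up for illustration), the trick is to roll raw events up into compact aggregates at write time, so that common queries become cheap lookups instead of scans over every raw event:

```python
from collections import defaultdict
from datetime import date

# Raw events: (user_id, event_type, day). In a real pipeline these would
# arrive as a stream; here they are hard-coded for illustration.
events = [
    ("u1", "play_song", date(2016, 1, 4)),
    ("u2", "play_song", date(2016, 1, 4)),
    ("u1", "play_song", date(2016, 1, 5)),
    ("u1", "share",     date(2016, 1, 5)),
]

# Pre-aggregation: maintain per-(event_type, day) totals and unique-user
# sets as events are ingested, rather than recomputing them per query.
counts = defaultdict(int)
uniques = defaultdict(set)
for user_id, event_type, day in events:
    counts[(event_type, day)] += 1
    uniques[(event_type, day)].add(user_id)

# "How many play_song events on Jan 4?" is now a dictionary lookup.
print(counts[("play_song", date(2016, 1, 4))])        # 2
print(len(uniques[("play_song", date(2016, 1, 4))]))  # 2 unique users
```

In a lambda architecture, a batch layer would recompute aggregates like these over the full history while a speed layer keeps the most recent data incrementally up to date, and queries merge the two views.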
Amazon Redshift has served us very well at Amplitude. Redshift is a cloud-based, managed data warehousing solution that we use to give our customers direct access to their raw data (you can read more about why we chose it over the alternatives in another post from a couple of months ago). This allows them to write SQL queries to answer ad hoc questions about user behavior in their apps.
But as we scaled the number of customers and the amount of data stored, issues began emerging in our original schema. Namely, some of our customers’ queries took a long time to complete, and we started getting support tickets like this:
It was clearly time for an overhaul.
A couple of months ago, our co-founder and CEO Spenser had the pleasure of giving a tech talk hosted by our good friends at KeepSafe. He went over some of the key data challenges that we face when we’re simultaneously collecting data from hundreds of millions of devices, including: