In 2014, Amazon made VPC (virtual private cloud) the standard deployment environment for all new applications created on Amazon Web Services. If you started your AWS account before then, and you haven’t migrated, it’s likely that your application still uses EC2-Classic (elastic compute cloud).
There are a lot of great security, price, and architectural reasons to migrate your existing EC2-Classic architecture over to VPC, if you haven’t yet.
The problem is, if you run 1,400 separate EC2 instances and 50+ services like Amplitude does—many of which have either strict uptime requirements or the potential for data corruption—migrating to VPC can be quite the experience.
It took us over a year to complete the full end-to-end migration, involving 30,000 lines of code in our devops repo alone, but we did it, and we managed to build in various other upgrades along the way. Here are four of the most significant lessons we learned.
1. Don’t use a weighted DNS round-robin to progressively roll out a new version of a customer-facing service
When you’re rolling out a new version of a customer-facing service in AWS, there’s a bit of faux DNS magic that is tempting to use but ultimately dangerous: the weighted round robin.
Here’s how it works: each of the resource sets that a DNS record points to receives traffic based on a weight factor you specify. That means that if you’re rolling out a new service, changing those weight factors per cluster can seem like a magically convenient way of doing things. For example, you can send just 10% of your traffic to a sprightly new test cluster while the remaining 90% keeps going to your curmudgeonly, battle-hardened veteran cluster.
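To make the weighting concrete, here is a minimal Python sketch of the 90/10 split from the example above. It assumes Route 53 is the DNS provider (the post does not name one), and the record name, set identifiers, and target hostnames are all hypothetical. The dictionary follows the shape of the `ChangeBatch` argument accepted by boto3's Route 53 `change_resource_record_sets` call; the sketch only builds that structure and never talks to AWS.

```python
from typing import Dict


def weighted_record(name: str, set_id: str, target: str, weight: int, ttl: int = 60) -> dict:
    """Build one UPSERT change for a weighted CNAME record set."""
    # Route 53 weights are integers from 0 to 255, not percentages.
    if not 0 <= weight <= 255:
        raise ValueError("weight must be between 0 and 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_id,  # distinguishes records that share a name
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        },
    }


def expected_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of DNS answers per cluster: weight / sum of weights."""
    total = sum(weights.values())
    if total <= 0:
        # The all-zero case has provider-specific behavior and is not modeled here.
        raise ValueError("at least one weight must be positive")
    return {cluster: weight / total for cluster, weight in weights.items()}


# Hypothetical clusters and hostnames, for illustration only.
WEIGHTS = {"veteran": 90, "test": 10}
TARGETS = {"veteran": "veteran-lb.example.com", "test": "test-lb.example.com"}

change_batch = {
    "Comment": "progressive rollout via weighted records (illustration)",
    "Changes": [
        weighted_record("api.example.com.", cluster, TARGETS[cluster], weight)
        for cluster, weight in WEIGHTS.items()
    ],
}

print(expected_shares(WEIGHTS))
```

Note that the shares describe DNS answers handed out, not requests served. That gap is exactly where the trouble described next comes from.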
It can also appear great when the sprightly test cluster decides it hates you and your rate of HTTP 500 errors shoots up by 10x. When that happens, you can simply set the weight of the test cluster to 0, and after a brief TTL period, traffic seems to magically stop flowing to it.
But, even if you have an appropriate TTL on your DNS record, there are actually ISPs that will completely ignore your TTLs (and even in some cases individual customers who have their own overly aggressive caching). In the worst case scenario, 100% of your customers might happen to use an ISP that ignores your TTL, resulting in 100% of your customers continuing to hit the test cluster with some percentage of their requests for some amount of time that is completely outside your control.
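A rough way to reason about this failure mode is to bound how much traffic can still reach the test cluster after you zero its weight. The Python sketch below is a deliberately simplified model, not a measurement: it assumes traffic is spread evenly across resolvers, that cached answers were handed out in proportion to the old weights, and that misbehaving resolvers hold a stale answer for some fixed time. All parameter names are hypothetical, and in practice the hold time is precisely the number you do not get to know.

```python
def worst_case_test_share(
    seconds_since_cutover: float,
    ttl: float,
    test_share: float,
    ignoring_fraction: float,
    ignoring_hold: float,
) -> float:
    """Upper bound on the share of requests still reaching the test cluster.

    test_share        -- share of DNS answers the test cluster had before cutover
    ignoring_fraction -- fraction of traffic behind resolvers that ignore the TTL
    ignoring_hold     -- how long those resolvers keep a stale answer (assumed)
    """
    if not 0.0 <= test_share <= 1.0 or not 0.0 <= ignoring_fraction <= 1.0:
        raise ValueError("shares and fractions must be between 0 and 1")
    if ignoring_hold < ttl:
        raise ValueError("ignoring_hold is expected to be at least the TTL")

    if seconds_since_cutover < ttl:
        # Even well-behaved caches may still hold the old answer.
        return test_share
    if seconds_since_cutover < ignoring_hold:
        # Only TTL-ignoring resolvers still point some clients at the test cluster.
        return ignoring_fraction * test_share
    return 0.0


# Hypothetical numbers: 10% test weight, 60 s TTL, 30% of traffic behind
# resolvers that hold answers for an hour regardless of TTL.
for t in (30, 120, 4000):
    print(t, worst_case_test_share(t, ttl=60, test_share=0.10,
                                   ignoring_fraction=0.30, ignoring_hold=3600))
```

Setting `ignoring_fraction` to 1.0 reproduces the worst case described above: the full pre-cutover share keeps hitting the broken cluster until the stale answers finally expire.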
This pitfall may actually be fine for some folks – it all depends on the context. However, given that there are easier and more reliable alternatives available, I’d suggest just generally avoiding weighted round robin as a progressive rollout mechanism.
There are certainly less finicky ways of progressively rolling out a test cluster that will give you a lot more control. One of the simplest options in AWS is to use two Auto Scaling groups, assuming you’re dealing with a service that is appropriate for Auto Scaling. All you have to do is manage the scaling of the two groups independently. For example, you can arrange the first cluster to be fixed at 10 nodes, while the second cluster scales between 50 and 100 nodes. Then, once you’re more comfortable with the new cluster, you can tweak the scaling.
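As a back-of-the-envelope illustration of the two-group approach, the Python sketch below estimates what share of traffic the new cluster receives for the node counts in the example. It assumes that the fixed 10-node group is the new cluster, that both groups sit behind the same load balancer, and that requests are spread evenly across healthy nodes. None of these details are stated in the post, and the function name is hypothetical.

```python
def new_cluster_share(new_nodes: int, old_nodes: int) -> float:
    """Approximate share of requests served by the new cluster.

    Assumes every healthy node receives an equal slice of traffic.
    """
    if new_nodes < 0 or old_nodes < 0:
        raise ValueError("node counts must not be negative")
    total = new_nodes + old_nodes
    if total == 0:
        raise ValueError("at least one node is required")
    return new_nodes / total


# New cluster pinned at 10 nodes; old cluster scales between 50 and 100.
NEW_NODES = 10
for old_nodes in (50, 75, 100):
    share = new_cluster_share(NEW_NODES, old_nodes)
    print(f"old={old_nodes:3d} new={NEW_NODES} -> new cluster share {share:.1%}")
```

Under these assumptions the new cluster serves roughly 17% of requests when the old group is at 50 nodes and roughly 9% when it is at 100. Unlike DNS weights, changing the split takes effect as soon as nodes are added or removed, with no client-side caching in the way.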
Whatever option you choose, ideally it will allow you to control the traffic flow as reliably as possible. Otherwise, you may find yourself in a nightmare situation when you discover that you’re dealing with an ISP whose company slogan is “we might connect you to something, but only if we feel like it.”
2. Failing to plan is planning to fail, but trying to plan everything is failing before you’ve even started.
For a migration of this size, an iterative, gung-ho approach actually gives you a higher likelihood of project completion than trying to have a perfect plan. This is because it allows you to experiment early and often, and it enables you to communicate incremental progress to the rest of the company much more effectively.
For anyone who is passionate about their work, it’s easy to get sucked into trying to plan every detail of how things are going to work. Spoiler alert: you’ll probably miss most of the things that actually matter. See, once upon a time at another company, I attempted a project that was very similar to the Amplitude VPC migration. We spent a year planning out the migration, trying to anticipate every step we’d likely encounter along the way, and never really got around to making concrete progress.
At Amplitude, we didn’t hesitate, and I believe that was absolutely crucial to the eventual completion of the VPC migration. By the end of the first week of the project, I had a production service in the new VPC, and it grew organically from there. We began by building out the core VPC networking architecture, then we figured out how to make our configuration management systems behave differently depending on whether a server was in EC2-Classic or VPC, and we made some first-pass decisions on how to organize our Terraform code. After that, we were pretty much off to the races.
Nobody in the world can effectively manage a software project with a deadline that is one year in the future. However, any somewhat decent engineer can manage a project with a deadline that is two weeks away. So, if you have a long-term project, break it up into a series of projects that each add incremental value.
This micro-project approach tends to work far, far better from a project management perspective, as well as from a business perspective. Every new feature you add is in some sense an experiment and a dialogue with your customers. You add the feature, you receive feedback from your customers, and you adjust as needed based on what you learn. So, if your process is full of experiments that take an entire year just to set up, then by the time you start learning anything from your customers, there’s probably a competitor who has already learned all of those same lessons and is miles ahead of you.
In the context of the VPC migration at Amplitude, breaking the yearlong project down into smaller ones essentially amounted to knowing what value each individual service would gain as it moved to VPC. For example, when we moved our frontend load balancer to VPC, we knew that this would give us an easy way to leverage HTTP/2 because Amazon’s VPC-only ALB service supports it out of the box. This is independently valuable – even if we had aborted the rest of the VPC project at that point, we still would have kept the additional value. As another example, when we moved our main query engine to VPC, we also switched the instance type to make use of the NVMe instance stores that come with i3 instances, which gave us some nice performance gains. If we had just decided to quit and work on something else at that point, those performance gains would still have been worthwhile and valuable.
3. Beware of the shiny new AWS instance type—do your own testing, and don’t commit too early.
While on the subject of the i3s, I do have to issue a strong warning about experimenting with instance types. Do your research. Investigate errors others have encountered, read the fine print, and test under workloads that resemble your production workload as closely as possible.
In our case, we first tried out the i3 instances on one of our Kafka clusters. They promised better performance for less money than our previous setup, because we had been using large EBS attachments instead of free instance stores. I wrote a custom Kafka migration script to handle moving the huge amounts of data in our system without being too disruptive to anything upstream or downstream, complete with a goofy slackbot that would ping everyone with a deliberately obnoxious emoji whenever there was progress.
Victory was at hand, until Kafka did what it does best and reminded us all that we’re merely foolish mortals and Trix are for kids. In dramatic fashion at 3 AM one morning, the majority of the hosts in that Kafka cluster began crashing and spewing scary-looking messages into syslog like this one:
blk_update_request: I/O error, dev nvme0n1, sector 123290040
Every time a Kafka process crashed, we would go and restart it, only to see it crash again a few minutes later, and always with those menacing NVMe errors piling up.
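When errors like this start piling up, it helps to count them per device rather than eyeballing the log. The Python sketch below is one possible way to do that and is not the tooling we used at the time. The regular expression is modeled only on the single syslog message quoted above, so other kernel versions may format the line differently, and the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern based on the one message quoted in the post; treat it as an assumption.
IO_ERROR = re.compile(
    r"blk_update_request: I/O error, dev (?P<dev>[^,\s]+), sector (?P<sector>\d+)"
)


def count_io_errors(lines: Iterable[str]) -> Counter:
    """Count block-layer I/O error lines per device name."""
    counts: Counter = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


sample = [
    "blk_update_request: I/O error, dev nvme0n1, sector 123290040",  # quoted in the post
    "an unrelated log line",  # hypothetical filler
]
print(count_io_errors(sample))
```

To scan a real host, pass an open log file to `count_io_errors`, since the function accepts any iterable of lines. The location of the kernel log varies by distribution, so the path is left to the reader.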
Google seemingly came to the rescue, as we discovered a bug thread related to buffer I/O errors on i3 NVMe devices. After scrolling through the thread long enough for my doubts to simmer and thicken into a demi-glace of suck, I eventually found a concrete recommendation that seemed too crazy to be made up, which also required a reboot of all the impacted hosts.
The recommendation turned out to just buy a bit of time at best. The next day, all of the hosts crashed again for the same reason. We brought back the custom Kafka migration script, naturally still with the deliberately obnoxious emoji notifications, and reluctantly moved all the data back to the original cluster so that we could regroup in peace. After a series of AWS support messages, we eventually realized i3 instances are only officially recommended for certain operating systems – and ours was not on the list.
However, the exact quote from the i3 launch announcement was “In order to benefit from the performance made possible by the NVMe storage, you must run one of the following operating systems…”, which doesn’t sound quite like “if you don’t use one of the following operating systems, i3 will go Jack the Ripper on your poor unsuspecting town of peaceful machines.” That second thing would have been nice to know. We also discovered that the NVMe errors typically only happened under heavy load, and that we could actually trigger them on a test i3 box simply by running a free benchmarking tool called Bonnie++.
In true Amplitude fashion, what we actually ended up doing here was recreating all of the progress in the VPC migration up to that point and re-migrating everything onto Ubuntu hosts so we could use the i3s safely, which only took a week.
As a relevant aside, this feels like a good spot to point out one of the elements of Amplitude’s culture I’m most proud to be a part of. This is not a company that wavers in the face of setback. When the going gets tough and the proverbial fan is being pelted, that’s actually when I feel like we’re at our best. Redoing the entire VPC migration at 10x the pace to unblock the performance benefits of i3s was something I was directly a part of, but it’s certainly not the only time in Amplitude’s history when we’ve pivoted to come back better than ever. I see it as a key part of what makes Amplitude Amplitude.
As a result of all this, my personal policy is to never use any new AWS instance type unless a few conditions are met:
- We have used Bonnie++ on a sample instance and have not been able to break it.
- The rollback procedure that will be employed at the first sign of trouble has been tested and is straightforward.
- We have properly researched the problems that others are running into with the new instance type.
- We have carefully read all of the fine print in the release announcement for the instance type and have made pessimistic assumptions about anything that is unclear.
4. In the absence of trust, a yearlong project isn’t going to work.
To pull off an engineering project of this scale, you need to wholeheartedly trust your team.
I know from personal experience that I probably would have failed to complete this VPC migration project at many other companies. I’ve learned a lot since the first time I attempted something of this sort, but I actually don’t think that can account for the dramatically different results on its own; what it really boils down to is that Amplitude has invested in an incredible engineering culture over the years, and it shows.
Create a culture that empowers engineers to make the decisions they need to make to drive things forward instead of setting up roadblocks, and allows them to learn from their mistakes. While maintaining a culture like this requires an enormous amount of trust at every level of the company, I think every Amplitude engineer appreciates it tremendously, and it is a key part of what enables our success.