Amplitude Recommend: Behind The Algorithm

William Pentney

Staff Software Engineer

5-minute read

Posted on May 10, 2021

Learn how Amplitude Recommend leverages machine learning and k-means clustering to power digital personalization at scale.

“Any sufficiently advanced technology is indistinguishable from magic.”
– Arthur C. Clarke, Profiles of the Future

The Challenges of Personalization

Marketers everywhere understand the value of personalization. The ability to recognize what your users want before they even ask for it can truly feel like a power indistinguishable from magic. Here at Amplitude, our new Recommend product offers customers a convenient, off-the-shelf service to provide personalized recommendations based on common business goals.

Creating a quality recommendation system can be out of reach for many businesses. Amplitude Recommend fills this gap with an off-the-shelf system that gives customers recommendations for individual users from a predefined catalog of items, based on prior interactions with, and knowledge about, those users. The Recommend team has put extensive effort into building and iteratively improving this system for general quality, without requiring excessive input or advanced domain knowledge from customers. So what lies under the hood of Recommend? Here we take a tour of the technology that powers Amplitude’s approach to personalization.

Modeling Users in Recommend

For our recommendation system, we frame the problem in terms of a conversion event: given a user to whom we will present a recommended content item, which item do we think is most likely to inspire that user to convert? To model users’ preferences, we must first select an appropriate representation of a user as a set of features capturing meaningful information about that user. Our customers provide us with two valuable sources of data: a stream of events from user activity that we can mine for these features, and a set of properties associated with each user, such as their tech platform and locale.

For each customer, we identify the most common events within a given time period and compute statistics about those events for each user, such as how many times the user performed each event and when it first and last occurred. Between these statistics and the set of user properties, we collect a large set of features for user representation.
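As a rough illustration of this kind of per-user aggregation, here is a minimal PySpark-style sketch; the table name, column names, and event limit are hypothetical, not Amplitude’s actual schema or pipeline.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("user-features").getOrCreate()

# events: one row per (user_id, event_type, event_time); hypothetical table name
events = spark.table("raw_events")

# Restrict to the most common event types in the lookback window
top_events = (events.groupBy("event_type").count()
              .orderBy(F.desc("count")).limit(100)
              .select("event_type"))

# Per-user, per-event statistics: count, first occurrence, last occurrence
user_features = (events.join(top_events, "event_type")
                 .groupBy("user_id", "event_type")
                 .agg(F.count("*").alias("event_count"),
                      F.min("event_time").alias("first_seen"),
                      F.max("event_time").alias("last_seen")))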

Our feature engineering uses Amplitude’s Nova AutoML system to process event data, maintained in our Nova Query datastore, into relevant features for modeling. For Recommend, we have added functionality to AutoML that provides user-level data for funnel analysis, which lets us better identify the connection between a specific content item and the resulting conversion event. This helps us ensure, for example, that when a customer shops for products X, Y, and Z but then only buys Y, we credit the conversion to the interaction with Y in particular.
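To make the attribution idea concrete, here is a tiny Python sketch of item-level conversion credit; the event shapes and field names are invented for illustration and are not Amplitude’s actual funnel logic.

# Only the item present in the conversion event gets credit, even if the
# user interacted with several items beforehand (hypothetical event format).
def attribute_conversions(events):
    credited = []
    for e in events:
        if e["type"] == "purchase":
            credited.append(e["item_id"])  # credit Y, not X or Z
    return credited

# Example: the user views X, Y, and Z but only buys Y, so only Y is credited.
stream = [{"type": "view_item", "item_id": "X"},
          {"type": "view_item", "item_id": "Y"},
          {"type": "view_item", "item_id": "Z"},
          {"type": "purchase", "item_id": "Y"}]
assert attribute_conversions(stream) == ["Y"]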

Feature Matrix

Machine Learning in Amplitude Recommend

Recommend makes use of unsupervised learning techniques, which learn relationships among a set of users without requiring labeled training data. The most common unsupervised learning technique is clustering, in which examples are separated into distinct groups based on similarity of their features. In our context, clustering lets us group users together and recognize notable segments of a customer’s user base.

User Clustering

Specifically, Amplitude Recommend employs a technique known as k-means clustering. As described above, our customers provide us with a stream of event data from their users, along with per-user properties such as locale and tech platform, and we process this data into a set of user features. K-means clustering then groups together users with similar values for these features, e.g. users with similar frequencies of certain events and similar properties.
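As a minimal illustration of the technique (not our production code, which runs on Spark, EMR, and SageMaker), clustering a user-feature matrix with k-means might look like the following scikit-learn sketch.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# X: one row per user, one column per feature (event counts, recency, properties).
# A random matrix stands in for real user features here.
rng = np.random.default_rng(0)
X = rng.random((10_000, 40))

scaler = StandardScaler().fit(X)          # put features on comparable scales
X_scaled = scaler.transform(X)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_scaled)
cluster_ids = kmeans.labels_              # cluster assignment for each user

# New or sparse users can be assigned to the nearest existing cluster.
new_users = rng.random((5, 40))
new_assignments = kmeans.predict(scaler.transform(new_users))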

We select the number of clusters to use—the “k” in k-means refers to this number of clusters—by creating clusterings over a range of different numbers for k, and then selecting the clustering which provides us with the “tightest” clusters. Tightness here is defined by a ratio of similarity of users within the clusters to similarity of users between clusters—ideally, the similarity within clusters is high, and the similarity between clusters is low. The best clustering by this measure is chosen as our final clustering.
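The exact tightness measure is not spelled out here; as a stand-in, the sketch below uses the silhouette score, a standard ratio-style measure that compares within-cluster distances to between-cluster distances, to pick k over a candidate range.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X_scaled, candidate_ks=range(2, 16)):
    """Fit k-means for each candidate k and keep the tightest clustering."""
    best_k, best_score, best_model = None, -1.0, None
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
        score = silhouette_score(X_scaled, model.labels_)
        if score > best_score:
            best_k, best_score, best_model = k, score, model
    return best_k, best_model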

Using these clusters, we can then recognize content items that are particularly popular among users within each cluster, and thus provide a unique ranking of content items for that cluster’s users. Importantly, we can perform a similar analysis for new users, or users who have not interacted with many content items, inferring what these users might like based on what “similar” users have previously shown an affinity for.

One common issue that clustering techniques face is noisy features. A feature set can be a random mishmash of statistics—some containing important signals about the user and their preferences, and some of which are unrelated to the problem and distracting. Adding noisy, irrelevant features to our model not only wastes computing resources, but it can cause users to be clustered together on similarities that are meaningless and merely coincidental.

To avoid problems with noisy features in clustering, we identify a subset of potentially meaningful features, namely those whose correlation with the goal event meets a minimum threshold, and filter out the rest. This threshold is set low enough to retain weaker associations, such as those seen in smaller cohorts, but high enough to remove features that are likely irrelevant.
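A hypothetical version of this filter, with an invented threshold and column layout, might look like the following pandas sketch.

import pandas as pd

def filter_features(features: pd.DataFrame, converted: pd.Series,
                    threshold: float = 0.02) -> pd.DataFrame:
    """Keep features whose absolute correlation with the goal event
    (converted: 0/1 per user) meets a minimum threshold."""
    corr = features.corrwith(converted).abs()
    keep = corr[corr >= threshold].index
    return features[keep]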

Our feature filtering and clustering are done using custom Apache Spark code for feature engineering, run on Amazon EMR for efficiency. This Spark code also performs post-processing to make the feature data more manageable, including feature scaling and outlier removal to reduce the effect of unusual feature values on clustering quality.
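A rough PySpark sketch of what that post-processing could look like is below; the column names, the 99th-percentile clipping, and the scaler settings are assumptions for illustration, not our actual production code.

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.sql import functions as F

# user_features: a per-user DataFrame with numeric feature columns (hypothetical)
feature_cols = ["event_count", "days_since_first_seen", "days_since_last_seen"]

# Clip extreme values at the 99th percentile to blunt the effect of outliers.
quantiles = user_features.approxQuantile(feature_cols, [0.99], 0.01)
clipped = user_features
for col, (p99,) in zip(feature_cols, quantiles):
    clipped = clipped.withColumn(col, F.least(F.col(col), F.lit(p99)))

# Assemble and standardize the features before clustering.
assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="raw_features").transform(clipped)
scaled = (StandardScaler(inputCol="raw_features", outputCol="features",
                         withMean=True, withStd=True)
          .fit(assembled).transform(assembled))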

Clusters to Rankings

Having selected an optimal clustering of users, we then produce for each cluster a ranking of items according to their relative popularity within that cluster, as measured by existing interactions with those items. The primary ranking factor is a modified conditional conversion rate for each cluster (the probability of conversion given interaction with the item). We modify the conversion rate slightly to reflect our degree of confidence in the result: some items may have had very few interactions in the cluster (e.g. because they were added relatively recently), so we express lower confidence in those items. This is done by adding a prior model that assumes some likelihood of “average” behavior for content items we have less data for.
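As a concrete illustration of this kind of prior (the prior strength and rates here are invented, and the exact formulation we use may differ), a simple Beta-style smoothing pulls items with few interactions toward an average conversion rate:

def smoothed_conversion_rate(conversions, interactions,
                             avg_rate, prior_strength=20.0):
    """P(convert | interact), shrunk toward an average rate when an item
    has few interactions (Beta-prior-style smoothing)."""
    alpha = avg_rate * prior_strength          # pseudo-conversions
    beta = (1.0 - avg_rate) * prior_strength   # pseudo-non-conversions
    return (conversions + alpha) / (interactions + alpha + beta)

# A new item with 2 conversions out of 3 interactions is not ranked above a
# well-established item with 400 out of 1,000, given an average rate of 0.1:
new_item = smoothed_conversion_rate(2, 3, avg_rate=0.1)      # ~0.17
popular = smoothed_conversion_rate(400, 1000, avg_rate=0.1)  # ~0.39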

Our final ranking assignment is done using Python code and Google’s TensorFlow. We run this code on Amazon SageMaker, a cloud platform dedicated to managing machine learning tasks.

Personalization, Piping Hot

We run our pipeline to generate new rankings for a customer-selected cohort of users every two hours. To let the different components of the pipeline interact seamlessly, we use Apache Airflow for pipeline management, which allows us to pull together computing tasks on different platforms in an optimized fashion. Each run, we regenerate the features and model, collect statistics about the rankings and the distribution of items, and refresh the rankings in a key-value store, from which they are efficiently served through an API easily accessible to our customers. The API serves JSON data providing the top N ranked items per user, for a selected value of N.
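A hypothetical Airflow DAG wiring these steps together on a two-hour schedule might look like the sketch below; the task names and operators are illustrative, not our actual DAG.

from datetime import timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id="recommend_ranking_pipeline",
    schedule_interval=timedelta(hours=2),   # regenerate rankings every two hours
    start_date=days_ago(1),
    catchup=False,
) as dag:
    build_features = PythonOperator(task_id="build_features",
                                    python_callable=lambda: None)  # Spark/EMR feature job
    train_clusters = PythonOperator(task_id="train_clusters",
                                    python_callable=lambda: None)  # SageMaker clustering job
    rank_items = PythonOperator(task_id="rank_items",
                                python_callable=lambda: None)      # per-cluster rankings
    publish = PythonOperator(task_id="publish_rankings",
                             python_callable=lambda: None)         # refresh key-value store

    build_features >> train_clusters >> rank_items >> publish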

As of the publication of this post, we are serving our database of more than a quarter of a billion personalized recommendations to dozens of customers daily through Recommend. Some of our currently planned enhancements for Recommend include:

  • The use of supervised learning techniques, such as neural networks, to improve personalization even further;
  • Content-based item features, which will give us improved ability to recognize relationships between content items, as well as find appropriate audiences for brand new content items, with little or no associated event data; and
  • Smart item retrieval to allow Recommend to scale to very large sets of content (>1000 items) while still maximizing conversion.

We continue to iterate every day on improving recommendation quality and bringing better personalization to our customers. Learn more about Amplitude Recommend today.

William Pentney

William Pentney is a staff software engineer at Amplitude. He focuses on providing insightful, efficient, and inclusive machine learning solutions for business analytics.
