Don’t Trust Client Data

When your analytics code runs on tens of millions of phones, strange things can happen.

July 22, 2014
Spenser Skates
CEO and Co-founder

Analytics and the inaccuracy of phone-reported timestamps

One of the biggest issues is with timestamps. Phones are often offline, so an analytics SDK needs to cache data locally before uploading. Once a phone regains internet connectivity, it can upload the cached event data to the server. This upload can happen anywhere from immediately after an event is logged to days later. As you can imagine, a number of problems can arise during the event logging and uploading process.
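To make the caching step concrete, here is a minimal sketch of a local event queue. The names `Event`, `EventQueue`, `log`, and `flush` are hypothetical and chosen for illustration; they are not taken from any real SDK. The point is only that each event is stamped with the phone's own clock at logging time and may sit in the cache for an arbitrary period before upload.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    name: str
    client_event_time: float  # epoch seconds, as reported by the phone's clock


@dataclass
class EventQueue:
    """Hypothetical local cache that holds events until the phone is online."""

    clock: Callable[[], float] = time.time
    pending: list = field(default_factory=list)

    def log(self, name: str) -> None:
        # Stamp the event with the phone's own (possibly wrong) clock.
        self.pending.append(Event(name, self.clock()))

    def flush(self, is_online: bool) -> list:
        # Offline: keep everything cached and upload nothing.
        if not is_online:
            return []
        # Online: hand back the cached batch and empty the cache.
        batch, self.pending = self.pending, []
        return batch
```

Because `clock` is injectable, a test can simulate a phone whose clock is wrong without touching the system time.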

What happens if the phone’s clock is incorrect?

For example, here’s the number of events per day for a mobile app that first included analytics instrumentation on April 25th:

[Figure: events per day around the date instrumentation was added]

A non-zero number of events is logged in the days leading up to April 24th. In other words, according to this graph, phones logged data before instrumentation was put in place. Going back further in time for the same app, here’s a graph of events logged from January 1970 up to 1 month before instrumentation:

[Figure: events per day from January 1970 up to 1 month before instrumentation]

From the examples above, it’s clear that some people’s clocks are significantly off, in some cases by years. It’s not possible that clients were logging data before they instrumented Amplitude!

So how do we get around this problem? We could timestamp all events with the time when our servers receive the event (the server upload time), but this would be inaccurate as well: around 5% of data is uploaded at least 24 hours after the event actually happened.
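The check behind those graphs can be sketched roughly as follows. This is an illustration under assumptions, not the analysis that produced the figures: the function names `events_per_day` and `impossible_days` are hypothetical, timestamps are assumed to be epoch seconds, and days are bucketed in UTC.

```python
from collections import Counter
from datetime import date, datetime, timezone


def events_per_day(client_timestamps):
    """Count events per UTC calendar day from phone-reported epoch seconds."""
    return Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).date()
        for ts in client_timestamps
    )


def impossible_days(counts, instrumented_on):
    """Return the days that have events dated before instrumentation shipped."""
    return sorted(day for day in counts if day < instrumented_on)
```

Any day returned by `impossible_days` can only come from a phone whose clock was wrong when the event was logged.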

What if we tracked the phone’s timestamp when it uploaded the data?

For each event, let’s compare the difference between the phone event timestamp and the server upload time versus the difference between the phone upload timestamp and the server upload time:

[Figure: phone event timestamp difference versus phone upload timestamp difference, linear axes]

Difficult to read! Let’s put both axes on a log scale:

[Figure: the same comparison with both axes on a log scale]

There are virtually no data points above the y=x line. This makes sense, since the time the event is logged on the phone shouldn’t be later than the time the event is uploaded to our servers.

This suggests a very straightforward solution to our timestamp problem: for each event, take the difference between the phone’s upload timestamp and the server’s upload time, and subtract that difference from the phone’s event timestamp. This should account for how much the phone’s clock was off when the event was first saved.

So there you have it: our simple solution for keeping track of when your events happened. Hope this helps you get more accurate timestamp data from all those phones!
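The correction described above comes down to one subtraction. Here is a minimal sketch, assuming all three times are epoch seconds; the function and parameter names are hypothetical, and this is not Amplitude's actual implementation.

```python
def corrected_event_time(client_event_time, client_upload_time, server_upload_time):
    """Shift a phone-reported event time onto the server's clock.

    All arguments are epoch seconds. Assumes the phone's clock was off by
    the same amount when the event was logged and when it was uploaded.
    """
    # How far ahead of (positive) or behind (negative) the server the phone is.
    clock_offset = client_upload_time - server_upload_time
    return client_event_time - clock_offset
```

Equivalently, the corrected time is the server upload time minus the delay the phone itself measured between logging and uploading. That delay is reliable even when the clock is wrong, as long as the clock did not change in between.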

A few caveats (for those interested):

Of course, this doesn’t account for latency between when the upload was sent and when it reached our server. However, in practice this latency ends up being much smaller than the amount that a phone’s clock is off.

Additionally, this solution assumes the phone’s clock is off by a constant amount. It’s possible for the phone’s clock to change between when an event was logged and when it was uploaded. If this happens, the correction could actually make the timestamp less accurate. Unfortunately, there’s no easy way to tell if a phone’s clock has changed.

We’ve found that setting a cutoff window, where an event is discarded if its adjusted time is greater than 60 days, is a reasonable compromise. We’ve also found that using the current server time for all timestamps from the “future” is a reasonable guess.
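The two safeguards above could be sketched like this. The text doesn’t say what the 60 days are measured against, so this sketch assumes the cutoff means "adjusted time more than 60 days before the server upload time"; that reading, along with the names `finalize_event_time` and `MAX_AGE_DAYS`, is an assumption made for illustration.

```python
SECONDS_PER_DAY = 86_400
MAX_AGE_DAYS = 60  # cutoff from the text; measured against server upload time (assumption)


def finalize_event_time(adjusted_time, server_upload_time, max_age_days=MAX_AGE_DAYS):
    """Return the timestamp to store, or None if the event should be discarded.

    Both times are epoch seconds; adjusted_time is the clock-corrected event time.
    """
    if adjusted_time > server_upload_time:
        # Timestamp from the "future": fall back to the current server time.
        return server_upload_time
    if server_upload_time - adjusted_time > max_age_days * SECONDS_PER_DAY:
        # Outside the cutoff window: discard the event.
        return None
    return adjusted_time
```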

About the Author
Spenser is the CEO and Co-founder of Amplitude. He experienced the need for a better product analytics solution firsthand while developing Sonalight, a text-to-voice app. Out of that need, Spenser created Amplitude so that everyone can learn from user behavior to build better products.