The world of analytics is full of red herrings and false paths.

When there’s so much data to work with, it’s easy to get careless and assume that the numbers right under your nose are always telling you the truth:

  1. A new piece of code breaks your homepage, but after looking at your analytics you see that users are spending more time on it than ever—and therefore must be more engaged.
  2. You collect the birthday of every user that signs up for your service. What you find shocks you—nearly 5% of all your users were born on January 1st.
  3. Your web analytics team comes to you with a surprising revelation. Your e-commerce business, usually active 24 hours a day, 365 days a year, shows no sales, no visitors, no nothing for an entire hour between 2am and 3am on March 12th, 2017.

Each one of these brief anecdotes is an illustration of what’s known as Twyman’s Law, which simply states that results that appear extreme or out of the ordinary are usually not genuine findings. They’re usually wrong.

We all know the feeling of seeing data that’s too good to be true (or too bad to be true). Each of the cases above has a perfectly innocuous explanation:

  1. Visitors are spending more time on your homepage, but only because it’s broken and it’s taking longer for them to do what they want to do.
  2. The fastest way to fill out your mandatory birthday collection form is to simply pick January 1st from the dropdown menu.
  3. In spring, many countries turn clocks one hour forward in a tradition known as Daylight Saving Time, hence the lack of any sales (or any activity, for that matter) during an hour that simply doesn’t exist.
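
The birthday case lends itself to a quick sanity check. The sketch below is only an illustration, not a method from the article: it assumes birthdays would be spread roughly evenly across the year and flags any date whose share is several times that baseline. The function name `suspicious_birthdays`, the `ratio_threshold` parameter, and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def suspicious_birthdays(birthdays, ratio_threshold=5.0):
    """Return {(month, day): share} for dates far above a uniform baseline."""
    counts = Counter((d.month, d.day) for d in birthdays)
    total = len(birthdays)
    baseline = 1 / 365.25  # rough share per date if birthdays were evenly spread
    flagged = {}
    for month_day, n in counts.items():
        share = n / total
        if share > ratio_threshold * baseline:
            flagged[month_day] = share
    return flagged


# Hypothetical sample: 50 of 1,000 users "born" on January 1st,
# the other 950 spread evenly over January 2nd through December 31st.
sample = [date(1990, 1, 1)] * 50 + [
    date(1990, 1, 2) + timedelta(days=i % 364) for i in range(950)
]

flagged = suspicious_birthdays(sample)
print(flagged)
```

A date holding 5% of all birthdays is roughly 18 times the even-spread share of about 0.27%, which points at a form default rather than a demographic quirk.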

In startup analytics, where the feedback cycles are short and the pressure to launch is great, it’s especially easy to fall prey to Twyman’s Law and make sloppy statistical mistakes.

Twyman’s Law and its audience research origins

Tony Twyman is regarded as one of the pioneers of audience research. In a career that spanned from the 1950s to the early years of the 21st century, Twyman contributed to the technical and methodological development of the field for both TV and radio measurement in the UK.

One of his most famous contributions to the field is the law named after him, which states:

Any piece of data or evidence that looks interesting or unusual is probably wrong!

The practical implication of this for anyone in product management or analytics is that every time a test bears results that are unexpected and cannot be explained by an obvious factor, there’s a high probability that they are wrong.

Prof. Richard De Veaux of Williams College has further defined two corollaries to Twyman’s Law that apply to anyone working on developing software products:

  • “If it’s perfect, it’s wrong.”
  • “If it isn’t wrong, you probably knew it already.”

Beyond the theory and the rules of statistics, we should have a look at what Twyman’s Law looks like in practice. Two examples from the team working on Microsoft’s search engine Bing give us ample evidence.

Twyman’s Law as user experience trap

There are many ways in which Twyman’s Law can manifest itself in product analytics. The fast-paced and demanding environment in which product managers operate makes them especially susceptible to the law.

The team at the search engine Bing is used to running thousands of experiments in which even a small change in performance can have an impact on revenue measured in millions of dollars. Obtaining reliable results from those tests is, therefore, extremely important to their work. In a paper authored by members of the team, they outline a number of unexpected test outcomes that can be attributed to Twyman’s Law.

Lower quality of search results led to better performance in key metrics

A bug in one of those experiments caused users to be shown very poor results in the so-called “10 blue links”, the main results shown to users in a search. This led to an increase in queries per user of 10% and in average revenue per user of 30%.

Investigating deeper, the team found that users had to make more searches before they found what they were looking for and, as a result, clicked on more paid results, leading to higher overall revenue per user.

If Microsoft were prioritizing only metrics like queries per user and average revenue, they might have reached the conclusion that deliberately lowering the quality of search results is the way to go. Obviously, such a tactic would work only in the short term. As users find themselves constantly annoyed by the results their searches yield, they would be more likely to switch to an alternative search engine.

In this case, Bing’s team understood that the aim that actually aligns with their long-term goals is to lower the average number of queries per user.

Small change in code leading to a sharp rise in search result clicks

Another example comes from an experiment in which an extra piece of JavaScript code was added to search result pages so that the destination was recorded before the browser was allowed to proceed to it.

This resulted in a spike in the number of users recorded as having successfully clicked on search results.

In this case, the differences came down to technological aspects of the JavaScript code:

“Chrome, Firefox, and Safari are aggressive about terminating requests on navigation away from the current page and a non-negligible percentage of click beacons never make it to the server. This is especially true for the Safari browser, where losses are sometimes over 50%. Adding even a small delay gives the beacon more time, and hence more click request beacons reach the server. We have seen multiple experiments where added delays made an experiment look better artificially.”

Clearly, the success in this case was not due to better performance but to a difference in instrumentation. Because the team was aware of the technical aspects of how browsers work, they knew something was wrong the minute they saw a sharp increase attributable to non-IE browsers, and they were able to catch the issue quickly.
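
To see how a pure measurement change can masquerade as a product win, consider a minimal sketch. All numbers here are hypothetical and chosen only for illustration: both variants have exactly the same real click behaviour, and the only difference is the fraction of click beacons that reach the server.

```python
def observed_clicks(true_clicks, beacon_delivery_rate):
    """Clicks the server logs when only a fraction of click beacons arrive."""
    return true_clicks * beacon_delivery_rate


# Assumption: identical real user behaviour in both variants.
true_clicks = 10_000

# Hypothetical delivery rates: the control loses half of its beacons,
# while a small added delay in the treatment lets more of them through.
control = observed_clicks(true_clicks, 0.50)
treatment = observed_clicks(true_clicks, 0.80)

apparent_lift = treatment / control - 1
print(f"Apparent lift in logged clicks: {apparent_lift:.0%}")
```

Under these assumed rates the logs show a large lift even though no user clicked more often, which is why a jump concentrated in particular browsers deserves a second look before it is celebrated.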

Many people who engage in analytics and testing don’t have the same level of understanding and could easily fall victim to Twyman’s Law. That doesn’t mean they have to gain an expert understanding of computer networking to be able to avoid it. A basic understanding of one of the main concepts of statistics would suffice.

Statistical Significance 101: How to run better experiments and avoid Twyman’s Law

The way to evade the curse of Twyman’s Law is by grounding your experiments in the rules of statistics and making sure each result is statistically significant before you take it at face value. Here are several ways in which you can achieve this.

Pick the right metrics

Going back to the first example from Bing’s experience, we saw that having a solid understanding of what moves their business forward was essential.

Choosing metrics that represent progress towards your business goals, rather than specific “feature” metrics, should be your first concern. Feature metrics are especially easy to improve, but they rarely lead to significant improvement in overall business results.

As the authors of “Seven Rules of Thumb for Web Site Experimenters” point out:

“[…] When building a feature, it is easy to significantly increase clicks to that feature (a feature metric) by highlighting it, or making it larger, but improving the overall page clickthrough-rate, or the overall experience is what really matters. Many times all the feature is doing is shifting clicks around and cannibalizing other areas of the page.”

Moreover, when you’re measuring the effect of an experiment or change that affects only a segment of your audience, the metrics that you use should be diluted by the size of that segment:

“That 10% improvement to a 1% segment has an overall impact of approximately 0.1% (approximate because if the segment metrics are different than the average, the impact will be different).”
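
The dilution arithmetic in the quote can be written out as a small helper. This is a sketch under simple assumptions: the function name and parameters are hypothetical, and the adjustment for segments that differ from the average is modelled as a plain ratio of per-user metric values.

```python
def overall_impact(segment_share, segment_lift,
                   segment_metric=1.0, overall_metric=1.0):
    """Approximate relative lift on the overall metric.

    segment_share:  fraction of all users who are in the segment
    segment_lift:   relative improvement measured inside the segment
    segment_metric: per-user metric value inside the segment
    overall_metric: per-user metric value across all users
    """
    return segment_share * segment_lift * (segment_metric / overall_metric)


# The case from the quote: a 10% improvement to a 1% segment.
diluted = overall_impact(0.01, 0.10)
print(f"{diluted:.2%}")

# Hypothetical variation: the segment's per-user metric is twice the average,
# so the same lift carries twice the overall weight.
heavy_segment = overall_impact(0.01, 0.10, segment_metric=2.0, overall_metric=1.0)
print(f"{heavy_segment:.2%}")
```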

Figuring out the right set of metrics and developing a sound framework to track them goes a long way in preventing costly mistakes down the road.

Limit the impact of false positives

With iterative improvement, teams who move quickly to build, test, and ship run a significant risk of getting a false positive: a favorable change in an observed metric that’s the result of chance rather than real improvement.

As the number of iterations tested and treatments in each experiment rises, so does the probability of getting a false positive. For example, at a significance threshold of 0.05, a single treatment compared against a control stands only about a 2.5% chance of showing a statistically significant positive lift by chance alone, while a test with six iterations of 5 treatments each has a >50% chance of producing at least one such lift.
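
One way to arrive at figures of this kind is to treat every comparison as an independent trial with a 2.5% chance of a falsely significant positive lift when the change has no real effect. Both the independence and the 2.5% one-sided rate are assumptions of this sketch, and the function name is hypothetical.

```python
def prob_any_false_positive(comparisons, one_sided_alpha=0.025):
    """Chance that at least one of several no-effect comparisons looks like a win.

    Assumes the comparisons are independent and that none has a real effect.
    """
    return 1 - (1 - one_sided_alpha) ** comparisons


single = prob_any_false_positive(1)       # one treatment against a control
many = prob_any_false_positive(6 * 5)     # six iterations of five treatments

print(f"1 comparison:   {single:.1%}")
print(f"30 comparisons: {many:.1%}")      # roughly 53%
```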

To counter the effect of this, you can use two mechanisms that will make your testing more robust:

  • Use a lower p-value threshold in order to require a higher level of statistical significance before you accept the result of a test. If you’re currently using a threshold of 0.05, a change with no real effect still has a 5% chance of producing a statistically significant result. Lowering the threshold to 0.01 cuts that false positive rate to 1%.
  • Replicate test results: While testing multiple variants of a single feature (or set of features) is always a good idea, running a final confirmation experiment once the funnel of options narrows down is optimal. Doing this provides an additional level of scrutiny, which should save you from falling victim to Twyman’s Law.
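
To make the effect of the threshold concrete, here is a minimal sketch of a two-sided, two-proportion z-test using only the standard library. It is one common way to compare conversion rates, not necessarily the test your analytics tool runs. The function name and the conversion counts are hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical data: 500 of 10,000 control users convert, 570 of 10,000 treated.
p_value = two_proportion_p_value(500, 10_000, 570, 10_000)

print(f"p-value: {p_value:.4f}")
print("significant at 0.05:", p_value < 0.05)
print("significant at 0.01:", p_value < 0.01)
```

With these made-up counts the result clears the 0.05 threshold but not the stricter 0.01 one, which makes it a natural candidate for a replication run before anyone acts on it.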

Avoid statistical interactions

When you are testing multiple elements at the same time, you run the risk of causing a statistical interaction. It happens when the combined result of two changes does not equal the sum of the change each would cause on its own.

Interactions are a problem because the main assumption when running tests is that each one is done in isolation, so that we can treat its result solely as the product of the changes made for each treatment. When you have an interaction, you tend to get skewed results for all experiments involved.

In organizations that run multiple tests daily, interactions are also dangerous because they can trigger unexpected bugs that cause bad user experience.

Preventing statistical interactions altogether is hard, and even impossible for large organizations that run hundreds of tests simultaneously. The best way to keep them from happening is by adding constraints when running tests: for example, making sure that one subject (i.e. a site visitor) does not participate in two tests at the same time.
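
A simple way to enforce that constraint is to hash each visitor into exactly one experiment. The sketch below is only an illustration of the idea: the experiment names, the `salt` value, and the function name are hypothetical, and real experimentation platforms typically use more elaborate layering schemes.

```python
import hashlib

# Hypothetical experiments that must never share a visitor.
EXPERIMENTS = ["new_checkout", "search_ranking", "homepage_banner"]


def assign_experiment(user_id, experiments=EXPERIMENTS, salt="layer-1"):
    """Deterministically place a visitor in exactly one experiment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(experiments)
    return experiments[bucket]


# The same visitor always lands in the same single experiment.
first = assign_experiment("user-42")
second = assign_experiment("user-42")
print(first, first == second)
```

The trade-off is traffic: each experiment sees only a share of visitors, so tests take longer to reach significance.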

The persistent and recklessly critical quest for truth

Product people can be naturally inclined to take positive test results at face value and move forward without putting too much thought into validating their findings. A startup isn’t a laboratory—time is always the #1 limiting factor on your survival.

Being aware of Twyman’s Law and its implications, however, is bound to improve the way you analyze, experiment with, and improve your product.

Karl Popper wrote that science, at its heart, is about the “persistent and recklessly critical quest for truth.”

Similarly, the key to mastering experimental analytics is not to identify all the possible pitfalls and traps out there, but to get a solid understanding of the foundations of running and analyzing experiments. Once you have that, avoiding Twyman’s Law is simply about staying diligent—and always checking twice on any numbers that look especially out of the ordinary.