How to Rescue a Failing A/B Test

André Cohen | 2019-04-04


Every year at GDC there are numerous talks about successful A/B tests. The reality, however, is that the likelihood of your A/B test succeeding is less than 20%. A successful test requires a good hypothesis, a large sample size, sufficient time to gather data, and assurance that no external factors affect the measurements. Almost always, a failed A/B test can be traced back to missing one of those requirements. Even concluding that an A/B test will fail can be time-consuming, because you are searching for a needle in a haystack.

Not all hope is lost when a test is struggling to reach statistical significance, though. Most A/B tests are salvageable, provided the hypothesis is bold enough to generate a measurable effect.

Narrow the Observations

Often the lack of statistical significance is due to a large and varied test/control population. In mobile games localized to different markets with different demographics, player groups can cancel each other out. Other times one player group can dramatically overpower smaller groups. For example, US/Canada is the largest market for Western games, which makes it hard to conduct A/B tests that also include Latin America and South Asia: the US/Canada preferences will dominate the results.

When a test shows no difference between the control and experimental groups despite a sufficient number of data points, consider narrowing the population. Conduct the test in a single region only. In the mobile games industry, consider regions such as US/Canada, Latin America, or Western Europe. In an average mobile game, this might reduce the population in the experiment by 50%, but that's a small price to pay for a statistically significant A/B test.
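As a rough illustration, here is a minimal Python sketch of what narrowing the observations can look like in practice: the same two-proportion z-test run first on the full population and then on US/Canada only. The players table and its region, group, and converted columns are hypothetical placeholders for whatever your analytics pipeline produces, and the synthetic data exists only so the snippet runs end to end.

```python
# A minimal sketch, assuming a per-player table with hypothetical columns
# "region", "group" ("control"/"variant"), and "converted" (0 or 1).
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Synthetic data, purely illustrative: the simulated variant only lifts
# conversion in US/CA, so the pooled test mixes regions with different behavior.
rng = np.random.default_rng(0)
players = pd.DataFrame({
    "region": rng.choice(["US/CA", "LATAM", "S_ASIA"], size=30000),
    "group": rng.choice(["control", "variant"], size=30000),
})
lift = (players["group"] == "variant") & (players["region"] == "US/CA")
players["converted"] = rng.binomial(1, 0.05 + 0.02 * lift)

def conversion_p_value(df: pd.DataFrame) -> float:
    """Two-proportion z-test between control and variant conversion rates."""
    counts = df.groupby("group")["converted"].agg(["sum", "count"])
    successes = counts.loc[["control", "variant"], "sum"].to_numpy()
    trials = counts.loc[["control", "variant"], "count"].to_numpy()
    _, p_value = proportions_ztest(successes, trials)
    return p_value

print("p-value, all regions:", round(conversion_p_value(players), 3))
print("p-value, US/CA only:",
      round(conversion_p_value(players[players["region"] == "US/CA"]), 3))
```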

Go Bigger

The number one problem I see in A/B tests is the fear of unintended consequences. To reduce that fear, the test is scaled back to a minimum number of players over a short time window, often in markets that are not uniform. Picking South Asia as the market to run a new paywall redesign is unlikely to work: the poorer the country, the larger the discrepancy between rich and poor, which means that while the average revenue per user (ARPU) will be lower than in other markets, the whales are disproportionately more valuable.

Remember, while running a small test you are forgoing the possible benefits of the new paywall and incurring a small cost of running the experiment. Extending the length of the experiment increases that cost. If you can't afford to use a larger, more representative population for the test, your experiment is probably not worth the effort. So, when a test is showing little to no results, check whether the population can be expanded to include additional markets.
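Before expanding the test, it helps to estimate how large the population needed to be in the first place. Below is a rough power-calculation sketch using statsmodels; the 5% baseline conversion rate and the 6% target are placeholder assumptions, not numbers from this article, so substitute your own.

```python
# Rough power calculation: how many players per group are needed before the
# test even has a chance of reaching significance? Baseline and target
# conversion rates are assumed placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05   # current paywall conversion rate (assumed)
target = 0.06     # conversion rate the redesign hopes to reach (assumed)

effect_size = proportion_effectsize(target, baseline)  # Cohen's h
players_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,    # significance level
    power=0.8,     # 80% chance of detecting the lift if it is real
    ratio=1.0,     # equally sized control and variant groups
)
print(f"Players needed per group: {players_per_group:,.0f}")
```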

Switch Testing Methodology

A/B tests are a rigorous and formal way of testing a hypothesis. There are, however, lightweight alternatives that are still good ways of measuring the results of an experiment. It is easy to switch from an A/B test to measuring the difference-in-differences (DID) of the experiment.

Explaining DID in depth is outside the scope of this article, but the benefits are twofold. First, for DID to work you need to verify that, prior to the experiment, the control and experimental groups showed similar trends. This check is done simply by looking at the trendlines and making sure they roughly match up before the start of the experiment. If this check fails, DID won't work. I've encountered this situation multiple times, and I took it as good news because it also explained why the A/B test was not working. Second, DID estimates the improvement coming from the experiment (if any exists). While it comes with no statistical significance, DID is a widely used method in social science experiments because it overcomes the barriers to running a successful A/B test.
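For the curious, here is a minimal sketch of the two DID steps described above: checking that the two groups trended together before the experiment, and then computing the difference-in-differences estimate itself. The daily table with its date, group, and arpu columns, the dates, and the ARPU numbers are all hypothetical placeholders standing in for your own per-day aggregates.

```python
# A minimal difference-in-differences sketch over per-day ARPU aggregates.
import numpy as np
import pandas as pd

def parallel_trends_gap(daily: pd.DataFrame, start: pd.Timestamp) -> pd.Series:
    """Per-day gap between the groups before the experiment.
    It should hover around a constant; if it drifts, DID is not valid."""
    pre = daily[daily["date"] < start]
    by_day = pre.pivot_table(index="date", columns="group", values="arpu")
    return by_day["variant"] - by_day["control"]

def did_estimate(daily: pd.DataFrame, start: pd.Timestamp) -> float:
    """DID = (variant post - variant pre) - (control post - control pre)."""
    pre = daily[daily["date"] < start].groupby("group")["arpu"].mean()
    post = daily[daily["date"] >= start].groupby("group")["arpu"].mean()
    return (post["variant"] - pre["variant"]) - (post["control"] - pre["control"])

# Synthetic example so the snippet runs: the simulated variant gains
# $0.05 ARPU after the experiment starts.
rng = np.random.default_rng(1)
dates = pd.date_range("2019-03-01", periods=28, freq="D")
start = pd.Timestamp("2019-03-15")
daily = pd.concat([
    pd.DataFrame({"date": dates, "group": "control",
                  "arpu": 0.50 + rng.normal(0, 0.02, len(dates))}),
    pd.DataFrame({"date": dates, "group": "variant",
                  "arpu": 0.52 + rng.normal(0, 0.02, len(dates))
                          + 0.05 * (dates >= start)}),
])
print("pre-period gap (std):", round(float(parallel_trends_gap(daily, start).std()), 3))
print("DID estimate:", round(float(did_estimate(daily, start)), 3))
```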

Conclusion

When an A/B test is not developing the way you expect, it's important to keep track of time. You could spend weeks analyzing data, trying to find that needle in the haystack called statistical significance. Instead, consider narrowing the test down to a specific population if possible, or, if you started with a very small user group, expand your population. Finally, have a plan B ready before starting the test. Yes, the goal is a successful A/B test with rock-solid results, but an alternative method can salvage the experiment and still be valuable to the game and the company.