Why Most A/B Tests in F2P are Doomed

Niklas Herriger

In virtually every online industry, product managers (PMs) rely heavily on multivariate testing for product development and marketing campaigns. A/B testing in particular has had a long run as the go-to method for improving live products that are used concurrently by millions of people, including free-to-play (F2P) games. The reason is the law of large numbers (LLN): the average of many independent observations of a random quantity converges to its expected value, so the more trials a test collects, the more stable and reliable its results become.

Given the above, an F2P game with millions of monthly active users (MAU) may appear to be the perfect use case for A/B testing. However, A/B testing can be problematic, even detrimental, when used to optimize revenue in F2P games. There are several reasons why.

The first and probably most important difference is that in F2P games, unlike in almost any other product, only a tiny fraction of the user base pays. In other industries, customers pay a specific dollar amount on a pay-per-use basis (food delivery, ride services) or a recurring monthly basis (Internet service, gym membership). In F2P gaming, it is entirely common for 99% of the user base never to make a single in-app purchase (IAP) or microtransaction and therefore to spend zero dollars in their lifetime. The remaining 1% generate all the revenue. So, in a game with one million MAU, you have only about 10,000 paying players per month. Within that group of 10,000 spenders, monthly spending probably varies dramatically. About 90% of the paying players likely spend between $1 and $10, while the remaining 10% (0.1% of the player base, or 1,000 players) spend anywhere between $10 and $10,000 per month. The spending curve falls off exponentially, and spenders with monthly budgets of $500 or more are very rare.
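The arithmetic above can be written out as a small sketch. The function name and parameters are hypothetical and introduced only for illustration; the rates are the rough figures quoted in the paragraph, not measured data.

```python
def spender_breakdown(mau, payer_rate=0.01, high_spender_share=0.10):
    """Split a monthly player base into non-payers, low spenders, and high spenders.

    payer_rate and high_spender_share are the article's rough figures
    (1% of players pay; 10% of payers spend more than $10 per month).
    """
    payers = round(mau * payer_rate)
    high_spenders = round(payers * high_spender_share)
    return {
        "non_payers": mau - payers,
        "payers": payers,
        "low_spenders": payers - high_spenders,   # roughly $1 to $10 per month
        "high_spenders": high_spenders,           # roughly $10 to $10,000 per month
    }


breakdown = spender_breakdown(1_000_000)
print(breakdown)
```

For one million MAU this gives 10,000 payers, of whom 9,000 are low spenders and 1,000 are high spenders.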

For A/B testing, this means the following: for anything that is not directly monetization-related, such as content updates, new game modes, or additional characters, the product manager can rely on a vast user base to take full advantage of the LLN and can expect reliable results in a reasonable amount of time. However, the majority of tests aimed at sustaining or increasing revenue depend on the behavior of paying players. This changes the sample size profoundly. The pool of suitable players shrinks to about 10,000 spenders per month, and to even fewer if only recurring spenders or whales are the subject of the test. This small segment is then divided further into at least two groups: a control group (A) and one or more test groups (B).
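To get a feel for how noisy such a small, skewed sample can be, here is a minimal simulation sketch of an "A/A test": 10,000 simulated payers are split into two groups that receive exactly the same treatment, and the gap in average spend between the groups is recorded. The spend distribution is an assumption made for illustration (90% of payers uniform between $1 and $10, the rest log-uniform between $10 and $10,000); it is not data from a real game, and all function names are hypothetical.

```python
import random
import statistics


def simulate_payer_spend(n_payers, rng):
    """Draw one month of spend per payer from an assumed, heavily skewed distribution."""
    spends = []
    for _ in range(n_payers):
        if rng.random() < 0.90:
            spends.append(rng.uniform(1, 10))       # low spenders: $1 to $10
        else:
            spends.append(10 ** rng.uniform(1, 4))  # high spenders: $10 to $10,000, log-uniform
    return spends


def aa_split_gaps(n_payers=10_000, n_trials=100, seed=7):
    """Relative gap in mean spend between two identically treated groups, per trial."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        spends = simulate_payer_spend(n_payers, rng)
        half = n_payers // 2
        # Draws are independent, so first half vs. second half is a random split.
        mean_a = statistics.fmean(spends[:half])
        mean_b = statistics.fmean(spends[half:])
        gaps.append(abs(mean_a - mean_b) / ((mean_a + mean_b) / 2))
    return gaps


gaps = aa_split_gaps()
print(f"median A/A gap:  {statistics.median(gaps):.1%}")
print(f"largest A/A gap: {max(gaps):.1%}")
```

Any gap printed here is pure noise, because both groups were treated identically. A real test would have to detect an effect that is clearly larger than this baseline.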

Generating statistically significant results from an A/B test with only two groups requires an extended testing period in an unchanging environment (three or more player groups require even more time or players). Modern F2P games rely heavily on live operations, events, bi-weekly tournaments, and one-off special content sales. Anywhere between 30% and 50% of revenue from IAPs and microtransactions comes from limited-time offers. Monthly A/B test cycles are therefore not suitable for a “Christmas Special,” a “Tournament Deal,” or any other offer or promotion that is available to players for 7 days or less. This reduces the testable fraction of paying players even further and calls the whole concept of “micro-segmentation” into question in any F2P monetization context. And because A/B tests by definition expose at least one group to a sub-optimal experience, there is even more time pressure to conclude the test as quickly as possible.
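A rough way to see why so much time is needed is the textbook normal-approximation formula for comparing two group means, n = 2(z_{1-α/2} + z_{power})² σ² / δ² players per group. The sketch below applies it with purely assumed inputs: a standard deviation of $150 in monthly spend per payer and a minimum detectable difference of $2 (for example, a 10% lift on an assumed $20 average). None of these figures come from a real game, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def payers_per_group(sd_spend, min_detectable_diff, alpha=0.05, power=0.80):
    """Approximate payers needed per group for a two-sided, two-sample test of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * (z_alpha + z_power) ** 2 * sd_spend ** 2 / min_detectable_diff ** 2
    return math.ceil(n)


# Assumed inputs for illustration only.
needed = payers_per_group(sd_spend=150.0, min_detectable_diff=2.0)
available = 10_000 // 2  # 10,000 monthly payers split into two groups

print(f"payers needed per group:    {needed:,}")
print(f"payers available per group: {available:,}")
```

With these assumed inputs, the required group size comes out far above the 5,000 payers per group that a month of traffic provides. The formula also assumes roughly normal sample means, which heavily skewed spending only approaches slowly, so real requirements may be higher still.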

Additionally, F2P as an industry is a constantly unstable environment. Rarely will the gameplay experience of a specific game remain unchanged for an entire month. Moreover, the competitive landscape and other games in the same genre are subject to constant change. Player spending behavior in your game is affected if your main competitor releases a new multiplayer mode, adds newly licensed cars to its lineup, or runs a major sales event. The effect is even more extreme if an entirely new game or sequel enters the marketplace.

In conclusion, testing monetization features in F2P games can lead to reliable results only if the test has a) a sufficient number of players, b) a stable environment, and c) enough time to reach statistical significance. Most monetization A/B tests fail at least one of these requirements, and often several. Still, many critical design and business decisions are based on A/B test setups that are doomed to be unreliable from the get-go.

There is, however, an alternative to A/B testing: Multi-Armed Bandit (MAB) testing. At Gondola, we're turning to MAB for our In-Game Offer and Video Ad Optimization Platform. In our next article, we'll tell you why.