From the Trenches: Reinforcement Learning from Lab to Real World

André Cohen


Reinforcement learning (RL) is a popular machine learning framework that speaks to people who love video games. In fact, I discovered RL through video games, specifically Atari’s Pitfall. But that’s a story for another day. RL is popular because it offers a framework for modeling problems, agents, and outcomes in a way that is easily understood by humans. It fits the human mold of learning, where we expect to reach a goal through a series of actions and learn from environmental feedback. Pretty much how babies learn to walk, and adults learn to play a new video game. The downside of RL is that taking an agent from the lab to the real world is never a straightforward affair. Unlike video games, where every action has a reaction and a clear goal, the real world often lacks the same determinism (a single action can result in a multitude of reactions), and goals can be elusive.

Because this is the first RL article on the blog, it’s almost mandatory to describe what RL is and how it works. RL is built on the Markov Decision Process (MDP): an agent observes the environment, and that observation is called a state. At every tick of a clock the agent observes the environment and takes a single action. In a video game this could mean reading every pixel on the screen (the state would be a large matrix), and the action would be pressing a button on the joypad. Rewards are associated with actions and guide the agent in the direction of the goal. An MDP is thus concerned with the actions that move the agent from one state to the next until it reaches the goal. Any one of the terms I just described could be an unknown that needs to be learned. For this article we are mostly concerned with the rewards associated with the actions the agent takes.
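The observe-act-reward loop described above can be sketched in a few lines of Python. Everything here is a toy stand-in for illustration (`ToyEnv`, `RandomAgent`, and the five-state world are assumptions, not any real library's API): the agent observes a state at each tick, takes one action, and receives a reward that guides it toward the goal.

```python
import random

random.seed(0)  # make the random walk reproducible

class ToyEnv:
    """A tiny MDP: states 0..4 on a line, with the goal at state 4."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is +1 (move right) or -1 (move left), clipped to [0, 4]
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0  # reward only at the goal
        done = self.state == 4
        return self.state, reward, done

class RandomAgent:
    """An agent with no policy yet: it just picks a random action."""
    def act(self, state):
        return random.choice([-1, 1])

env, agent = ToyEnv(), RandomAgent()
state, done, total = 0, False, 0.0
while not done:                        # one "tick of the clock" per iteration
    action = agent.act(state)          # observe the state, take one action
    state, reward, done = env.step(action)
    total += reward                    # the reward signal guides learning

print(total)  # 1.0: the single reward collected on reaching the goal
```

A real agent would replace `RandomAgent` with a policy that is updated from the observed rewards, but the loop itself looks the same.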

Long-Term Feedback

Training an agent when the feedback is delayed is difficult. For example, any agent focused on optimizing 30-day retention (or even 7-day retention) will have a hard time with the delayed feedback. This is because learning will be very slow, many other factors may affect the results, and in real-world scenarios, it’s possible that the agent will need to take action before the reward from the previous action has been observed.

For these reasons, long-term feedback must be formulated as the goal of the agent, with the long-term reward defined by local rewards that happen in the short term. In the example of retention, which is a finite problem capped at 30 days, the number of days the player has been seen could be a valid reward function. What’s important is that the local reward functions are capped and do not grow infinitely.

One way to adjust the retention example is to observe if the player returns the next day. This is observable on a daily basis and functions as a sort of breadcrumb leading towards the big 30-day retention objective.
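The breadcrumb idea can be sketched as a capped local reward. The function names and the set-of-days representation below are illustrative assumptions: each day the agent receives 1.0 if the player returned, and the sum over the 30-day horizon is bounded, so the local rewards cannot grow infinitely.

```python
def daily_return_reward(seen_days, day):
    """Local reward, observable daily: 1.0 if the player was seen on `day`."""
    return 1.0 if day in seen_days else 0.0

def episode_return(seen_days, horizon=30):
    # The sum of local rewards is capped at `horizon` (here 30), so it is a
    # bounded breadcrumb trail leading toward the 30-day retention objective.
    return sum(daily_return_reward(seen_days, d) for d in range(1, horizon + 1))

# A player seen on days 1, 2, 3, 7, and 14 yields five breadcrumb rewards.
print(episode_return({1, 2, 3, 7, 14}))  # 5.0
```

The agent can now learn from a signal it sees every day instead of waiting 30 days for a single delayed outcome.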


RL is a fun field of research because of the simplifications that make problem-solving enjoyable and satisfying. One standard simplification is that all observations are without noise or bias. For example, imagine an agent that is responsible for showing a single offer to the player every time the game loads. The action is the offer to show, and there is a reward associated with the outcome (1 = player makes a purchase, 0 = player doesn’t buy, -1 = player quits the game). By default, RL assumes the world behaves the same way every time, without bias and noise: every time the action takes place, the reward would be exactly the same. This, of course, is not true. Every offer shown will be purchased by someone at some point, and the observed rewards will all be very similar, because 99% of players will neither purchase the offer nor quit the game. The noise will be as large as the reward we are trying to measure.

Solving this problem is a field of research in itself, because noise can be added to an RL model in many ways (check out partially observable RL and adversarial RL for some of the more unique ones). It has been shown that for the example above the No Free Lunch theorem holds: without making some assumptions about the noise present in the observed rewards, any agent can be misled by the noise. The most common way of solving this problem is to model noise as part of the rewards we are trying to learn. This is where an assumption is needed, because you have to specify what the distribution of the noise looks like. Is the noise constant? Meaning, is the conversion rate for the IAP 1.2% with a constant error of .3%? Or is it more like a normal distribution, where the error peaks at .3% but is sometimes smaller or larger? In the end, the RL problem becomes much harder, because you have to learn not only the reward for each action but also the noise.
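One common way to encode a noise assumption is a simple Bayesian estimate: instead of a point value for each offer's reward, keep a distribution over it. The sketch below assumes Bernoulli purchase noise and maintains a Beta posterior over an offer's conversion rate; the `BetaEstimate` class, the 1.2% true rate, and the Thompson-style `sample` method are all illustrative assumptions, not the author's implementation.

```python
import random

class BetaEstimate:
    """Posterior over a conversion rate, assuming Bernoulli (purchase/no
    purchase) noise on each observation."""
    def __init__(self):
        self.successes, self.failures = 1, 1  # uniform Beta(1, 1) prior

    def update(self, purchased):
        if purchased:
            self.successes += 1
        else:
            self.failures += 1

    def sample(self):
        # Thompson-style draw: a plausible rate given the noise seen so far
        return random.betavariate(self.successes, self.failures)

    def mean(self):
        return self.successes / (self.successes + self.failures)

random.seed(0)
offer = BetaEstimate()
# Simulate 1000 impressions of an offer whose true conversion rate is 1.2%
for _ in range(1000):
    offer.update(random.random() < 0.012)

# The posterior mean hovers near the true 1.2%, and sample() exposes the
# remaining uncertainty instead of pretending the rate is known exactly.
print(round(offer.mean(), 4))
```

With an estimate like this per offer, the agent can pick actions by sampling from each posterior rather than trusting a noisy point estimate, which is exactly the kind of noise assumption the paragraph above calls for.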


Anytime an algorithm goes from the lab to the real world, its developers shouldn’t expect it to work quite as well, and they should expect it will require tweaks. In RL this is definitely the case. Unlike in video games, in the real world actions don’t always have a predictable consequence, and observing the consequence may be delayed and with error. The good news is that machine learning has been chipping away at this problem for decades, and many battle-tested solutions exist.