# Gym Experiments: CartPole with DQN

To introduce ourselves to reinforcement learning with Deep-Q Networks (DQN), we’ll visit a standard OpenAI Gym problem, CartPole. We’ll cover deeper RL theory in a later post, but let’s get our hands dirty first, to build some intuition!

The complete series can be found on the bottom of this post and the latest version of the GitHub repo can be found **here**. Be sure to get **set up** before you begin.

# The CartPole Experiment

The **CartPole** gym environment is a simple introductory RL problem. The problem is described as:

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart’s velocity.

More simply put, there’s a wobbly pole on a cart and the goal is to keep it upright.

In RL, this problem can be described as a *fully-observable, deterministic, continuous state space,* with a *discrete action space* and *frequent rewards*. This is important to recognize before attacking the problem, because different configurations can make for very different kinds of RL problems to solve.

**Fully-observable:**All aspects of the problem state are always fully available to us.**Deterministic:**Taking a specific action in a specific state will always yield the same result.**Continuous state:**The state space is defined by real, floating-point values, rather than discrete points. Because we have an (infinite) continuous state space, we’ll need to use a neural network (DQN) to solve the problem, rather than use a simpler solution, such as to solve a lookup table.**Discrete actions:**The action space is defined by discrete choices. DQN is generally a better solution for discrete action spaces than it is for continuous actions spaces.**Frequent rewards:**Every step through the environment provides a significant amount of feedback. This means that we generally don’t have to play to the end of an episode in order to determine if we’re doing the right thing.

Deep-Q Networks can be described as *model-free, value-based, off-policy* methods. I’ll save the interpretation of that for the later theory post. For now, just keep in mind that this is a solution configuration that is useful for particular problems, such as the CartPole environment. More specifically, they are a function approximation (using neural networks) of Q-Learning methodology. We’ll save most of that for the theory post, but we’ll touch on that a bit later on. There are a number of different variations on DQN’s, but today I’ll focus on three: The “vanilla” (basic) DQN, adding experience replay, and fixed-Q targets.

For the remainder of this post, I’m going to touch on a lot of the algorithmic details very lightly. This is because I believe that people learn better by starting with *concrete examples* before moving on to *abstract intuition* (e.g., the **“Monad Tutorial Fallacy”**). My eyes certainly glaze over when the first thing I’m shown is a new equation – what do all these strange new symbols mean?!? This post is meant as more of a set of code to demonstrate the results; the later theory post will better explain why we’re doing what we’re doing.

Before we dive into DQN’s, it’s always good practice to start by visualizing the problem. And just to get this out of the way, here are the imports we’ll be using:

from agents.dqn import DQNAgent from algorithms.egreedy import EpsilonGreedyExploration from algorithms.experience import ExperienceReplay from algorithms.fixed_q_target import FixedQTarget from algorithms.schedule import ExponentialSchedule from collections import deque import gym from helpers import data from helpers.env_wrapper import EnvironmentWrapper from helpers.model import ModelWrapper from keras.layers import Dense from keras.models import Sequential from keras.optimizers import Adam import matplotlib.pyplot as plt import numpy as np import sys

# Data Exploration

Let’s start by creating the environment and running a random walk through it for 500 episodes, to see what kind of data pops out. A CartPole episode is defined as ending once the pole falls past a certain angle, or 200 “steps” (actions) through the environment have been taken.

Here’s what the code looks like:

def data_exploration(env, n_episodes): # Random exploration to establish a baseline exp_returns = data.random(env, n_episodes=n_episodes) return exp_returns env = EnvironmentWrapper(gym.make('CartPole-v1')) n_episodes = 500 baseline_returns = data_exploration(env, n_episodes) data.report([(baseline_returns, 'b', 'Baseline')], title='Random Walk')

Running this, you should see a plot similar to the following:

As you can see, a random walk will roughly average a final score of about 25 through an episode. This is helpful to know, as it establishes a useful baseline for us to compare our DQN results against. If we can’t beat a random walk, we’re either doing something wrong or applying the wrong solution to the problem!

# Exploring Exploration

Before we leap into the actual DQN algorithm, let’s talk for a moment about one of the constant worries for RL algorithms: Exploration vs. exploitation. Unlike with supervised learning, RL doesn’t typically have a “teacher” to tell an algorithm whether it’s doing a good job or not; the algorithm simply has to discover it’s environment as it goes along. The way it “learns” is typically by maximizing the reward it is able to achieve in the problem environment. However, to avoid converging to a sub-optimal solution, the algorithm requires some amount of *exploration* before it can fully *exploit* the environment. There are a number of ways in which exploration can occur, but I’ll save most of that discussion for a later post. For now, we’ll just use epsilon-greedy exploration [Sutton98].

The way that ε-greedy exploration works is a simple idea: You start at some high probability value (which we call ε) and decay it over time to a low probability. We’ll use that ε to randomly choose whether to take a random action in the environment, or to use the most “greedy” known action. When ε is a high probability, we have a high chance of taking random actions to explore our environment. When ε is a low probability, we’re more focused on exploiting the environment, to take advantage of what we know.

This is pretty straightforward to implement:

import numpy as np import random class EpsilonGreedyExploration: def __init__(self, epsilon_start, epsilon_min, epsilon_decay): self._epsilon = epsilon_start self._epsilon_min = epsilon_min self._epsilon_decay = epsilon_decay @property def epsilon(self): return self._epsilon def act(self, model, state): if np.random.rand() <= self._epsilon: return random.randrange(model.action_size) # predict() returns a matrix tensor (even for a single state prediction), # but the action values are always a vector, so grab the first (and only) row return np.argmax(model.predict(state)[0]) def step(self): self._epsilon = max(self._epsilon * self._epsilon_decay, self._epsilon_min)

Here, we're using a particular variant that uses exponential decay. You can do this in other ways (e.g., using a linear decay).

We'll see later on how this is used, but we basically ask the instance for an action to perform against our model (given some state) and at a later point, we decay ε.

# The Basic DQN Algorithm

Now, we can discuss the basic DQN algorithm. The basic idea is this (again, I’ll save the theory for a later post):

- We’re given a neural network that acts as a
*function approximation*of the*value*of a particular state input (this is why Q-Learning is referred to as a*value-based method)*. - As we step through the environment, we collect immediate rewards for actions taken.
- We also make an estimate of the
*discounted future reward*that we can attain by taking continuing to follow an optimal policy from this point on into the future. (This is one of the things that is unique to reinforcement learning!) - Based on that immediate reward and our estimate of the future reward (the combination of which we call the
*return*), we train the neural network that acts as our value function approximation. - Over time, as more states are visited and rewards are gathered, our network should converge (in theory!) to a true approximation of the value of taking a particular action in a given state.
- By greedily taking the highest-value (given by our network) action at every state, we can find an optimal
*policy*for interacting with the environment.

As pseudocode, the algorithm looks like this:

In [Mnih15], the Q-Learning algorithm was extended to deep neural networks and DQN was born! As code, this gives us the following:

class DQNAgent: def __init__(self, env, model, gamma, exploration): self._env = env self._model = model self._gamma = gamma self._exploration = exploration @property def exploration(self): return self._exploration def _get_predictions(self, samples): states, actions, rewards, next_states, dones = samples predictions = np.zeros((len(states), self._model.action_size)) action_returns = self._model.predict(states) next_action_returns = self._get_next_action_returns(next_states) for idx in range(len(states)): action, reward, done, action_return = actions[idx], rewards[idx], dones[idx], action_returns[idx] policy_action = np.argmax(next_action_returns[idx]) discounted_return = self._gamma * next_action_returns[idx][policy_action] * (not done) action_return[action] = reward + discounted_return predictions[idx] = action_return return predictions def _get_next_action_returns(self, next_states): # Get the next action returns from the on-policy model return self._model.predict(next_states) def _sample_experience(self, state, action, reward, next_state, done): return np.array([state]), np.array([action]), np.array([reward]), np.array([next_state]), np.array([done]) def train(self, render=False, debug_func=None): state = self._env.reset() total_reward = 0 done = False n_steps = 0 start_time = time.time() losses = [] while not done: if render: self._env.render() action = self._exploration.act(self._model, np.array([state])) next_state, reward, done, _ = self._env.step(action) samples = _sample_experience(self, state, action, reward, next_state, done) states = samples[0] predictions = self._get_predictions(samples) history = self._model.fit(states, predictions) losses.extend(history.history['loss']) state = next_state total_reward += reward n_steps += 1 self._exploration.step() # Allow the chance to examine the model for debugging if debug_func is not None: debug_func(self._model) elapsed_time = time.time() - start_time return total_reward, n_steps, elapsed_time, np.mean(losses)

You can see how we’re using ε-greedy exploration to explore, while also making calculated future estimates of the state/action pair values. The `_get_predictions()`

function handles the internals of the Q update step in the pseudocode.

To see what this looks like in practice, first let’s create a network and run an experiment against our environment:

def build_network(env, verbose=True): model = Sequential() model.add(Dense(24, input_dim=env.observation_space.shape[0], activation='relu')) model.add(Dense(24, activation='relu')) model.add(Dense(env.action_space.n, activation='linear')) model.compile(loss='mse', optimizer=Adam(lr=0.001)) if verbose: model.summary() return ModelWrapper(model) def train_dqn(agent, n_episodes=None, debug=False): # Experiment described by: https://github.com/openai/gym/wiki/CartPole-v0 # CartPole-v1 defines "solving" as getting average reward of 195.0 over 100 consecutive trials. # This environment corresponds to the version of the cart-pole problem described by # Barto, Sutton, and Anderson [Barto83]. exp_returns = [] training_complete = False e = 0 action_vals = [] # Arbitrary maximum at 2000 episodes, in case of divergent training while not training_complete and e = 195 print('Training complete after {} episodes'.format(e)) return exp_returns def basic_dqn(env, n_episodes): # Basic DQN with ε-greedy exploration model = build_network(env) exploration = EpsilonGreedyExploration(epsilon_start=1.0, epsilon_min=0.01, epsilon_decay=0.99) agent = DQNAgent(env, model, gamma=0.99, exploration=exploration) # Perform the training return train_dqn(agent, n_episodes) basic_dqn_returns = basic_dqn(env, n_episodes) data.report([(basic_dqn_returns, 'b', 'Basic DQN'), (baseline_returns, 'r', 'Baseline')], title='Vanilla DQN')

Here, we’re creating a simple network and using it with classes I’ve discussed previously to train the model using our DQN. We also plot the results against our random walk baseline for comparison. You will see an output similar to this:

As you can see, the basic DQN only *just* starts to improve upon a random walk in 500 episodes, so let’s discuss some ways to improve the algorithm!

# Reducing Correlation

The first problem we can quickly identify is with sample correlation. As we step through each episode, we’re training a neural network against each step in the episode. However, these samples are *not* independent and identically distributed (i.i.d.)! This can be clearly understood in the context of CartPole: Every step through the environment is very closely related to the step that came before. And, as we know, neural networks that are trained against correlated data tend to badly overfit to the correlations.

To reduce these correlations, [Mnih15] introduced *experience replay*. The idea is that, rather than immediately training upon every step through an episode, the data from that step is instead stored in a memory buffer. Then, batches from the buffer are *randomly sampled* in order to break the correlations.

The code is pretty simple, so I’ll just show it below. We just store all of the different bits of data from each environment step. The last thing to note is that we store the samples in a *deque* object, so that older (more out-of-date) samples will be dropped as we gather new experiences.

from collections import deque import numpy as np import random class ExperienceReplay: def __init__(self, maxlen, sample_batch_size, min_size_to_sample): self._states = deque(maxlen=maxlen) self._actions = deque(maxlen=maxlen) self._rewards = deque(maxlen=maxlen) self._next_states = deque(maxlen=maxlen) self._dones = deque(maxlen=maxlen) self._sample_batch_size = sample_batch_size self._min_size_to_sample = min_size_to_sample def add(self, state, action, reward, next_state, done): self._states.append(state) self._actions.append(action) self._rewards.append(reward) self._next_states.append(next_state) self._dones.append(done) def bootstrap(self, env): print('Bootstrapping experience samples...') while not self.can_sample(): state = env.reset() done = False while not done: action = np.random.randint(low=0, high=env.action_space.n) next_state, reward, done, _ = env.step(action) self.add(state, action, reward, next_state, done) def can_sample(self): return len(self) >= self._min_size_to_sample def sample(self): mem_size = len(self) indices = random.sample(range(mem_size), min(mem_size, self._sample_batch_size)) states = np.array([self._states[idx] for idx in indices]) actions = np.array([self._actions[idx] for idx in indices]) rewards = np.array([self._rewards[idx] for idx in indices]) next_states = np.array([self._next_states[idx] for idx in indices]) dones = np.array([self._dones[idx] for idx in indices]) return states, actions, rewards, next_states, dones

We’re just going to make a couple of changes to our DQN code to support this. First, we’re going to pass in an instance of our experience class into the DQN constructor:

class DQNAgent: def __init__(self, env, model, gamma, exploration, experience=None): ... self._experience = experience ...

Then, we’re going to update our `_sample_experience()`

function:

def _sample_experience(self, state, action, reward, next_state, done): if self._experience is not None: self._experience.add(state, action, reward, next_state, done) return self._experience.sample() else: # This is a "vanilla" DQN return np.array([state]), np.array([action]), np.array([reward]), np.array([next_state]), np.array([done])

And that’s it! Our changes are easy to test, we just have to create an `ExperienceReplay`

object and pass it into our `DQNAgent`

. It can be helpful to bootstrap the experience buffer with initial samples. This can prevent the first experiences from being over-sampled, causing the network to overfit to these samples.

def dqn_with_experience(env, n_episodes): # DQN with e-greedy exploration and experience replay model = build_network(env) experience = ExperienceReplay(maxlen=2000, sample_batch_size=32, min_size_to_sample=100) exploration = EpsilonGreedyExploration(epsilon_start=1.0, epsilon_min=0.01, epsilon_decay=0.99) agent = DQNAgent(env, model, gamma=0.99, exploration=exploration, experience=experience) # Pre-load samples in experience replay. # This can also be done implicitly during regular training episodes, # but the early training may overfit to early samples. experience.bootstrap(env) # Perform the training return train_dqn(agent, n_episodes) dqn_w_exp_returns = dqn_with_experience(env, n_episodes) data.report([(dqn_w_exp_returns, 'b', 'DQN w/ ER'), (baseline_returns, 'r', 'Baseline')], title='Experience Replay')

Running this, we can see that we start to get a bit more positive results, but it’s not a huge improvement yet. However, we’ll keep this and layer on more improvements.

# Reducing Bias

As the network trains, it can be thought of as chasing a moving target: A Q-value will improve, which will cause an update, which can cause the target to move again, etc. This introduces bias into the network and it may eventually converge to a sub-optimal policy.

To help reduce this bias, the second major improvement made by [Mnih15] was to reduce bias by introducing *fixed-Q targets*. Two networks are used, with the “target” network merely being a copy of the training network. The training network will have its Q-values updated, but will use the Q-value of the target network when estimating future discounted return (i.e., it will use the “old” network to generate estimates). The target network will periodically be updated to match the training network, so that these estimates don’t drift too far apart.

**Note:** Much of the research refers to fixed-Q targets as Double DQN. This is in reference to Double Q-Learning, by [vanHasselt10]. However, Mnih originally referred to this as fixed-Q targets and van Hasselt’s version has some additional nuances that were not used by Mnih (which also have theoretical differences). Therefore, I prefer to simply call these fixed-Q targets.

This can be implemented in a few different ways: [Mnih15] used “hard” fixed-Q targets which would periodically be reset to the current trained network, but [Lillicrap15] used “soft” fixed-Q targets which would merely slowly trail the training network. This has the advantage of being a smoother adjustment, but in practice both methods have similar effectiveness.

The code for a fixed-Q behavior is pretty simple. The only tricky part is in the `step()`

function. You can see here that we support both the “soft” update version (smoothly updating the target network towards the training network in smaller increments), as well as the “hard” update version (which simply copies the training network every *N* steps).

import numpy as np class FixedQTarget: def __init__(self, target_model, target_update_step, use_soft_targets=False): self._target_model = target_model self._target_update_step = target_update_step self._use_soft_targets = use_soft_targets self._tau = 1.0 / self._target_update_step self._n_steps = 0 def predict(self, states): return self._target_model.predict(states) def reset(self, policy_model): self._target_model.set_weights(policy_model.get_weights()) self._n_steps = 0 def step(self, policy_model): if self._use_soft_targets: # Soft update fixed-Q targets weights_model = policy_model.get_weights() weights_target = self._target_model.get_weights() new_weights = [] for i in range(len(weights_model)): new_weights.append(self._tau * weights_model[i] + (1. - self._tau) * weights_target[i]) self._target_model.set_weights(new_weights) else: if self._n_steps % self._target_update_step == 0: self._target_model.set_weights(policy_model.get_weights())

We also just need a few slight tweaks to support this in our `DQNAgent`

. We’ll again update the constructor:

class DQNAgent: def __init__(self, env, model, gamma, exploration, experience=None, fixed_q_target=None): ... self._fixed_q_target = fixed_q_target if self._fixed_q_target is not None: self._fixed_q_target.reset(self._model) ...

We’ll also update the `_get_next_action_returns()`

function to generate estimates from the target network, if it is available:

def _get_next_action_returns(self, next_states): if self._fixed_q_target is not None: # Fixed-Q targets use next action returns from the target policy (off-policy) return self._fixed_q_target.predict(next_states) else: # Get the next action returns from the on-policy model return self._model.predict(next_states)

We also need to make *one* small change to the `train()`

function, to update the target network by calling `self._fixed_q_target.step()`

:

def train(self, render=False, debug_func=None): ... action = self._exploration.act(self._model, np.array([state])) next_state, reward, done, _ = self._env.step(action) samples = _sample_experience(self, state, action, reward, next_state, done) # This part is new: if self._fixed_q_target is not None: self._fixed_q_target.step(self._model) states = samples[0] predictions = self._get_predictions(samples) history = self._model.fit(states, predictions) losses.extend(history.history['loss']) ...

Now, we can easily test this out, like we’ve done before:

def dqn_with_fixed_targets(env, n_episodes=None): # DQN with e-greedy exploration, experience replay, and fixed-Q targets model = build_network(env) target_model = build_network(env) experience = ExperienceReplay(maxlen=2000, sample_batch_size=32, min_size_to_sample=100) exploration = EpsilonGreedyExploration(epsilon_start=1.0, epsilon_min=0.01, epsilon_decay=0.99) fixed_target = FixedQTarget(target_model, target_update_step=500, use_soft_targets=True) agent = DQNAgent(env, model, gamma=0.99, exploration=exploration, experience=experience, fixed_q_target=fixed_target) # Pre-load samples in experience replay. # This can also be done implicitly during regular training episodes, # but the early training may overfit to early samples. experience.bootstrap(env) # Perform the training return train_dqn(agent, n_episodes, debug=n_episodes is None) dqn_w_fixed_targets_returns = dqn_with_fixed_targets(env, n_episodes) data.report([(dqn_w_fixed_targets_returns, 'b', 'DQN w/ Fixed-Q'), (baseline_returns, 'r', 'Baseline')], title='Fixed-Q Targets')

And we should now start to see a result that looks *much* more stable:

# Variance in DQN’s

Now, after all these changes, you may still have seen some inconsistent results running your DQN’s. This is sometimes because network initialization can have a big impact on the effectiveness of a DQN, other times because of the particular random samples that are drawn from the environment. One thing I’ve learned is never to trust a single test run of a DQN; a good testing procedure should run multiple iterations (once you’ve tuned your hyperparameters!) to generate a better expectation of results.

For example, here is a comparison of the four variations we’ve seen so far (the random walk, plus the three DQN variations), each run 10 times. The solid lines are the mean values, while the shaded regions show one standard deviation for each.

As you can see, experience replay *on average* does better than what we had generated earlier; that had simply been a particularly less effective training run. But, using fixed-Q targets along with experience replay yields a much smoother progression (and is much less susceptible to “flat-lines”, in general).

This brings up a particular pain point with building RL algorithms: Debugging.

# Debugging

Debugging RL can be tricky. There can be many hyperparameters, some quite sensitive to change, and sometimes it takes a particular combination of hyperparameters to even just beat a random walk. As mentioned earlier, even just the *variance* between test runs can mislead you into thinking you have an effective solution when you actually just got lucky, or the opposite with an unlucky training run.

At these times, it can be very helpful to visualize the results that your network is actually generating. Using **TensorBoard** is one solution, but I find that it works better for supervised learning than for RL. You may have noticed some `debug_func`

references in the `DQNAgent`

code. We can use this to log particular bits of information about our network as training progresses. For example, we can change our `train_dqn()`

function to this:

def train_dqn(agent, n_episodes=None, debug=False): # Experiment described by: https://github.com/openai/gym/wiki/CartPole-v0 # CartPole-v1 defines "solving" as getting average reward of 195.0 over 100 consecutive trials. # This environment corresponds to the version of the cart-pole problem described by # Barto, Sutton, and Anderson [Barto83]. exp_returns = [] training_complete = False e = 0 action_vals = [] def debug_func(model): # Just an arbitrary first state/action pair from a new episode of a fully trained model state = np.array([[0.3604471, 0.21131558, 5.13830467, 0.07171951]]) action = 0 x = model.predict(state)[0][action] action_vals.append(x) # Arbitrary maximum at 2000 episodes, in case of divergent training while not training_complete and e = 195 print('Training complete after {} episodes'.format(e)) if debug: plt.plot(exp_returns, color='b', label='Rewards') plt.plot(action_vals, color='r', label='Q-value') plt.legend(loc='upper left') plt.show() return exp_returns

We’ve defined the internal `debug_func()`

to be something to track the predicted Q-value of a particular state/action pair. This can sometimes give us a better indicator about how our network is actually training than the loss or even reward at times.

If we visualize this output from the “vanilla” DQN, it might look something like this:

Here, we can see how the predicted Q-value is changing based on particular rewards being received. While neural networks are a great way to get a grasp on continuous state spaces, they have been known for a long time ([Thrun93]) to have a particular sensitivity to overestimation of Q-values. This can be seen in the chart above, where spikes in reward estimation can cause the Q-value to spike up as well. This particular behavior is often followed by the “flat-lining” that we’ve seen previously; basically, the network weights will become misaligned by too big of a gradient adjustment, causing the whole function approximator to fail miserably for a while.

In contrast, let’s look at the same result when applied to the `DQNAgent`

that used experience replay and fixed-Q targets:

This is clearly much better behaved during gradient updates! Rewards don’t swing back and forth nearly as much, causing a much more stable learning procedure.

This is just one example of how to debug RL training. I also highly recommend **this post** by Jaromír Janisch for visualizing policy behaviors.

# Conclusion

We’ve taken our first steps into the world of DQN’s and it’s been a handful! But, now that we’ve gotten some experience with how different tweaks can change the algorithm, we’ll be able to build a better intuition for how the theory behind it all works. Stay tuned for a deep dive into the mathematics!

If you have any questions or comments, please reach out to me! And if you see a bug or if I’ve incorrectly described something, let me know and I’ll update the post!

# OpenAI Gym Experiments Series

**Setting Up**

CartPole with DQN (you are here)

# References

[Sutton98] Sutton, R. S. & Barto, A. G. 1998 **Reinforcement learning: an introduction**. Cambridge, MA: MIT Press.

[Mnih15] Mnih, V., et al. 2015 **Human-level control through deep reinforcement learning**. Nature 518 (7540), 529-533.

[Lillicrap15] Lillicrap, T., et al. 2015 **Continuous control with deep reinforcement learning**. arXiv:1509.02971 [cs.LG]

[vanHasselt10] van Hasselt, H. 2010 **Double Q-Learning**. Advances in Neural Information Processing Systems 23 (NIPS 2010).

[Thrun93] Thrun, S., Schwartz, A. 1993 **Issues in Using Function Approximation for Reinforcement Learning**. IN PROCEEDINGS OF THE FOURTH CONNECTIONIST MODELS SUMMER SCHOOL.