Gym Experiments: CartPole with DQN

To introduce ourselves to reinforcement learning with Deep Q-Networks (DQN), we'll visit a standard OpenAI Gym problem: CartPole. We'll cover deeper RL theory in a later post, but let's get our hands dirty first to build some intuition!

The complete series can be found at the bottom of this post and the latest version of the GitHub repo can be found here. Be sure to get set up before you begin.

The CartPole Experiment

cartpole

The CartPole gym environment is a simple introductory RL problem. The problem is described as:

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart’s velocity.

More simply put, there’s a wobbly pole on a cart and the goal is to keep it upright.

In RL terms, this problem has a fully-observable, deterministic, continuous state space, a discrete action space, and frequent rewards. This is important to recognize before attacking the problem, because different configurations can make for very different kinds of RL problems to solve.

  • Fully-observable: All aspects of the problem state are always fully available to us.
  • Deterministic: Taking a specific action in a specific state will always yield the same result.
  • Continuous state: The state space is defined by real, floating-point values, rather than discrete points. Because we have an (infinite) continuous state space, we’ll need to use a neural network (DQN) to solve the problem, rather than a simpler solution such as a lookup table.
  • Discrete actions: The action space is defined by discrete choices. DQN is generally a better solution for discrete action spaces than it is for continuous action spaces (see the snippet after this list).
  • Frequent rewards: Every step through the environment provides a significant amount of feedback. This means that we generally don’t have to play to the end of an episode in order to determine if we’re doing the right thing.
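
To make the "continuous state, discrete actions" point concrete, here's a quick sketch (assuming a standard Gym install with CartPole-v1 registered) that prints both spaces:

import gym

env = gym.make('CartPole-v1')
print(env.observation_space)  # a Box of 4 floats: cart position, cart velocity, pole angle, pole velocity
print(env.action_space)       # Discrete(2): push the cart left or push it right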

Deep Q-Networks can be described as model-free, value-based, off-policy methods. I’ll save the interpretation of that for the later theory post. For now, just keep in mind that this is a solution configuration that is useful for particular problems, such as the CartPole environment. More specifically, DQNs are a function approximation (using neural networks) of the Q-Learning methodology. We’ll save most of that for the theory post, but we’ll touch on it a bit later on. There are a number of different variations on DQNs, but today I’ll focus on three: the “vanilla” (basic) DQN, adding experience replay, and fixed-Q targets.

For the remainder of this post, I’m going to touch on a lot of the algorithmic details very lightly. This is because I believe that people learn better by starting with concrete examples before moving on to abstract intuition (e.g., the “Monad Tutorial Fallacy”). My eyes certainly glaze over when the first thing I’m shown is a new equation – what do all these strange new symbols mean?!? This post is meant more as a set of code to demonstrate the results; the later theory post will better explain why we’re doing what we’re doing.

Before we dive into DQNs, it’s always good practice to start by visualizing the problem. And just to get this out of the way, here are the imports we’ll be using:

from agents.dqn import DQNAgent
from algorithms.egreedy import EpsilonGreedyExploration
from algorithms.experience import ExperienceReplay
from algorithms.fixed_q_target import FixedQTarget
from algorithms.schedule import ExponentialSchedule
from collections import deque
import gym
from helpers import data
from helpers.env_wrapper import EnvironmentWrapper
from helpers.model import ModelWrapper
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam
import matplotlib.pyplot as plt
import numpy as np
import sys

Data Exploration

Let’s start by creating the environment and running a random walk through it for 500 episodes, to see what kind of data pops out. A CartPole episode is defined as ending once the pole falls past a certain angle, or 200 “steps” (actions) through the environment have been taken.

Here’s what the code looks like:

def data_exploration(env, n_episodes):
    # Random exploration to establish a baseline
    exp_returns = data.random(env, n_episodes=n_episodes)
    return exp_returns

env = EnvironmentWrapper(gym.make('CartPole-v1'))
n_episodes = 500
baseline_returns = data_exploration(env, n_episodes)
data.report([(baseline_returns, 'b', 'Baseline')], title='Random Walk')

Running this, you should see a plot similar to the following:

cartpole_single_random_walk

As you can see, a random walk averages a final score of about 25 per episode. This is helpful to know, as it establishes a useful baseline to compare our DQN results against. If we can’t beat a random walk, we’re either doing something wrong or applying the wrong solution to the problem!

Exploring Exploration

Before we leap into the actual DQN algorithm, let’s talk for a moment about one of the constant worries for RL algorithms: exploration vs. exploitation. Unlike with supervised learning, RL doesn’t typically have a “teacher” to tell an algorithm whether it’s doing a good job or not; the algorithm simply has to discover its environment as it goes along. The way it “learns” is typically by maximizing the reward it is able to achieve in the problem environment. However, to avoid converging to a sub-optimal solution, the algorithm requires some amount of exploration before it can fully exploit the environment. There are a number of ways in which exploration can occur, but I’ll save most of that discussion for a later post. For now, we’ll just use epsilon-greedy exploration [Sutton98].

ε-greedy exploration is a simple idea: you start at some high probability value (which we call ε) and decay it over time to a low probability. We use ε to randomly choose whether to take a random action in the environment or the most “greedy” known action. When ε is high, we have a high chance of taking random actions to explore our environment. When ε is low, we’re more focused on exploiting the environment, taking advantage of what we know.

This is pretty straightforward to implement:

import numpy as np
import random

class EpsilonGreedyExploration:
    def __init__(self, epsilon_start, epsilon_min, epsilon_decay):
        self._epsilon = epsilon_start
        self._epsilon_min = epsilon_min
        self._epsilon_decay = epsilon_decay

    @property
    def epsilon(self):
        return self._epsilon

    def act(self, model, state):
        if np.random.rand() <= self._epsilon:
            return random.randrange(model.action_size)

        # predict() returns a matrix tensor (even for a single state prediction),
        # but the action values are always a vector, so grab the first (and only) row
        return np.argmax(model.predict(state)[0])

    def step(self):
        self._epsilon = max(self._epsilon * self._epsilon_decay, self._epsilon_min)

Here, we're using a variant with exponential decay; you can also decay ε in other ways (e.g., linearly), as sketched below.
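
For comparison, a linear-decay variant might look like the following. This is a hypothetical sketch (it isn't part of the repo) that keeps the same act()/step() interface by subclassing the class above:

class LinearEpsilonGreedyExploration(EpsilonGreedyExploration):
    def __init__(self, epsilon_start, epsilon_min, decay_per_step):
        # epsilon_decay is unused here; the decay is handled by step() below
        super().__init__(epsilon_start, epsilon_min, epsilon_decay=1.0)
        self._decay_per_step = decay_per_step

    def step(self):
        # Subtract a fixed amount each step instead of multiplying by a factor
        self._epsilon = max(self._epsilon - self._decay_per_step, self._epsilon_min)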

We'll see later on how this is used, but we basically ask the instance for an action to perform against our model (given some state) and at a later point, we decay ε.

The Basic DQN Algorithm

Now, we can discuss the basic DQN algorithm. The basic idea is this (again, I’ll save the theory for a later post):

  • We’re given a neural network that acts as a function approximation of the value of each action for a particular state input (this is why Q-Learning is referred to as a value-based method).
  • As we step through the environment, we collect immediate rewards for actions taken.
  • We also make an estimate of the discounted future reward that we can attain by continuing to follow an optimal policy from this point on into the future. (This is one of the things that is unique to reinforcement learning!)
  • Based on that immediate reward and our estimate of the future reward (the combination of which we call the return), we train the neural network that acts as our value function approximation.
  • Over time, as more states are visited and rewards are gathered, our network should converge (in theory!) to a true approximation of the value of taking a particular action in a given state.
  • By greedily taking the highest-value (given by our network) action at every state, we can find an optimal policy for interacting with the environment.

As pseudocode, the algorithm looks like this:

q_learning_pseudocode
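
The heart of that pseudocode is the familiar tabular Q-Learning update, where α is the learning rate, γ the discount factor, and s′ the next state:

    Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

DQN keeps the same target, r + γ · max_a′ Q(s′, a′), but replaces the lookup table with a neural network and trains the network towards that target.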

In [Mnih15], the Q-Learning algorithm was extended to deep neural networks and DQN was born! As code, this gives us the following:

import numpy as np
import time

class DQNAgent:
    def __init__(self, env, model, gamma, exploration):
        self._env = env
        self._model = model
        self._gamma = gamma
        self._exploration = exploration

    @property
    def exploration(self):
        return self._exploration

    def _get_predictions(self, samples):
        states, actions, rewards, next_states, dones = samples
        predictions = np.zeros((len(states), self._model.action_size))

        action_returns = self._model.predict(states)
        next_action_returns = self._get_next_action_returns(next_states)

        for idx in range(len(states)):
            action, reward, done, action_return = actions[idx], rewards[idx], dones[idx], action_returns[idx]
            policy_action = np.argmax(next_action_returns[idx])
            discounted_return = self._gamma * next_action_returns[idx][policy_action] * (not done)
            action_return[action] = reward + discounted_return
            predictions[idx] = action_return

        return predictions

    def _get_next_action_returns(self, next_states):
        # Get the next action returns from the on-policy model
        return self._model.predict(next_states)

    def _sample_experience(self, state, action, reward, next_state, done):
        return np.array([state]), np.array([action]), np.array([reward]), np.array([next_state]), np.array([done])

    def train(self, render=False, debug_func=None):
        state = self._env.reset()
        total_reward = 0
        done = False
        n_steps = 0
        start_time = time.time()
        losses = []

        while not done:
            if render:
                self._env.render()

            action = self._exploration.act(self._model, np.array([state]))
            next_state, reward, done, _ = self._env.step(action)
            samples = self._sample_experience(state, action, reward, next_state, done)

            states = samples[0]
            predictions = self._get_predictions(samples)
            history = self._model.fit(states, predictions)
            losses.extend(history.history['loss'])

            state = next_state
            total_reward += reward
            n_steps += 1

        self._exploration.step()

        # Allow the chance to examine the model for debugging
        if debug_func is not None:
            debug_func(self._model)

        elapsed_time = time.time() - start_time
        return total_reward, n_steps, elapsed_time, np.mean(losses)

You can see how we’re using ε-greedy exploration to explore, while also making calculated future estimates of the state/action pair values. The _get_predictions() function handles the internals of the Q update step in the pseudocode.
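
To make that update concrete, here's a tiny numeric sketch (the values are hypothetical) of the target that _get_predictions() builds for a single sample:

gamma = 0.99
reward = 1.0                                # immediate reward for the action taken
next_action_returns = np.array([0.5, 2.0])  # Q-value estimates for the next state
done = False

# Same computation as in _get_predictions():
discounted_return = gamma * np.max(next_action_returns) * (not done)
target = reward + discounted_return         # 1.0 + 0.99 * 2.0 = 2.98
# If this step ended the episode (done == True), the future term drops out
# and the target is just the immediate reward.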

To see what this looks like in practice, first let’s create a network and run an experiment against our environment:

def build_network(env, verbose=True):
    model = Sequential()
    model.add(Dense(24, input_dim=env.observation_space.shape[0], activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(env.action_space.n, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(lr=0.001))

    if verbose:
        model.summary()

    return ModelWrapper(model)

def train_dqn(agent, n_episodes=None, debug=False):
    # Experiment described by: https://github.com/openai/gym/wiki/CartPole-v0
    # CartPole-v0 defines "solving" as getting an average reward of 195.0 over 100 consecutive trials; we use the same threshold.
    # This environment corresponds to the version of the cart-pole problem described by
    # Barto, Sutton, and Anderson [Barto83].
    exp_returns = []
    training_complete = False
    e = 0
    action_vals = []

    # Arbitrary maximum at 2000 episodes, in case of divergent training
    max_episodes = n_episodes if n_episodes is not None else 2000

    while not training_complete and e < max_episodes:
        total_reward, n_steps, elapsed_time, mean_loss = agent.train()
        exp_returns.append(total_reward)
        e += 1

        # "Solved" once the average return over the last 100 episodes reaches 195
        training_complete = len(exp_returns) >= 100 and np.mean(exp_returns[-100:]) >= 195

    print('Training complete after {} episodes'.format(e))
    return exp_returns

def basic_dqn(env, n_episodes):
    # Basic DQN with ε-greedy exploration
    model = build_network(env)
    exploration = EpsilonGreedyExploration(epsilon_start=1.0, epsilon_min=0.01, epsilon_decay=0.99)
    agent = DQNAgent(env, model, gamma=0.99, exploration=exploration)

    # Perform the training
    return train_dqn(agent, n_episodes)

basic_dqn_returns = basic_dqn(env, n_episodes)
data.report([(basic_dqn_returns, 'b', 'Basic DQN'),
             (baseline_returns, 'r', 'Baseline')], title='Vanilla DQN')

Here, we’re creating a simple network and using it with the classes discussed previously to train our DQN agent. We also plot the results against our random walk baseline for comparison. You should see output similar to this:

cartpole_single_vanilla_dqn

As you can see, the basic DQN only just starts to improve upon a random walk in 500 episodes, so let’s discuss some ways to improve the algorithm!

Reducing Correlation

The first problem we can quickly identify is with sample correlation. As we step through each episode, we’re training a neural network against each step in the episode. However, these samples are not independent and identically distributed (i.i.d.)! This can be clearly understood in the context of CartPole: Every step through the environment is very closely related to the step that came before. And, as we know, neural networks that are trained against correlated data tend to badly overfit to the correlations.
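
As a toy illustration of that correlation (a sketch, not from the post's code), we can measure how strongly consecutive pole angles in a random episode track each other:

probe_env = gym.make('CartPole-v1')
probe_env.reset()
angles = []
done = False
while not done:
    state, _, done, _ = probe_env.step(probe_env.action_space.sample())
    angles.append(state[2])  # index 2 is the pole angle

angles = np.array(angles)
# Correlation between each pole angle and the next one: typically very close to 1
print(np.corrcoef(angles[:-1], angles[1:])[0, 1])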

To reduce these correlations, [Mnih15] introduced experience replay. The idea is that, rather than immediately training upon every step through an episode, the data from that step is instead stored in a memory buffer. Then, batches from the buffer are randomly sampled in order to break the correlations.

The code is pretty simple, so I’ll just show it below. We store all of the data from each environment step. The one thing to note is that the samples live in deque objects, so older (more out-of-date) samples are dropped as we gather new experiences.

from collections import deque
import numpy as np
import random

class ExperienceReplay:
    def __init__(self, maxlen, sample_batch_size, min_size_to_sample):
        self._states = deque(maxlen=maxlen)
        self._actions = deque(maxlen=maxlen)
        self._rewards = deque(maxlen=maxlen)
        self._next_states = deque(maxlen=maxlen)
        self._dones = deque(maxlen=maxlen)
        self._sample_batch_size = sample_batch_size
        self._min_size_to_sample = min_size_to_sample

    def add(self, state, action, reward, next_state, done):
        self._states.append(state)
        self._actions.append(action)
        self._rewards.append(reward)
        self._next_states.append(next_state)
        self._dones.append(done)

    def bootstrap(self, env):
        print('Bootstrapping experience samples...')

        while not self.can_sample():
            state = env.reset()
            done = False

            while not done:
                action = np.random.randint(low=0, high=env.action_space.n)
                next_state, reward, done, _ = env.step(action)
                self.add(state, action, reward, next_state, done)

    def __len__(self):
        # can_sample() and sample() both rely on len(self)
        return len(self._states)

    def can_sample(self):
        return len(self) >= self._min_size_to_sample

    def sample(self):
        mem_size = len(self)
        indices = random.sample(range(mem_size), min(mem_size, self._sample_batch_size))
        states = np.array([self._states[idx] for idx in indices])
        actions = np.array([self._actions[idx] for idx in indices])
        rewards = np.array([self._rewards[idx] for idx in indices])
        next_states = np.array([self._next_states[idx] for idx in indices])
        dones = np.array([self._dones[idx] for idx in indices])
        return states, actions, rewards, next_states, dones
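
As a quick aside, the deque eviction behaviour that the buffer relies on looks like this:

from collections import deque

d = deque(maxlen=3)
for i in range(5):
    d.append(i)
print(list(d))  # [2, 3, 4] -- the two oldest entries were silently dropped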

We’re just going to make a couple of changes to our DQN code to support this. First, we’ll pass an instance of our experience class into the DQN constructor:

class DQNAgent:
    def __init__(self, env, model, gamma, exploration, experience=None):
        ...
        self._experience = experience
        ...

Then, we’re going to update our _sample_experience() function:

    def _sample_experience(self, state, action, reward, next_state, done):
        if self._experience is not None:
            self._experience.add(state, action, reward, next_state, done)
            return self._experience.sample()
        else:
            # This is a "vanilla" DQN
            return np.array([state]), np.array([action]), np.array([reward]), np.array([next_state]), np.array([done])

And that’s it! Our changes are easy to test: we just create an ExperienceReplay object and pass it into our DQNAgent. It can also be helpful to bootstrap the experience buffer with initial samples; this prevents the first experiences from being over-sampled, which would cause the network to overfit to them.

def dqn_with_experience(env, n_episodes):
    # DQN with e-greedy exploration and experience replay
    model = build_network(env)
    experience = ExperienceReplay(maxlen=2000, sample_batch_size=32, min_size_to_sample=100)
    exploration = EpsilonGreedyExploration(epsilon_start=1.0, epsilon_min=0.01, epsilon_decay=0.99)
    agent = DQNAgent(env, model, gamma=0.99, exploration=exploration, experience=experience)

    # Pre-load samples in experience replay.
    # This can also be done implicitly during regular training episodes,
    # but the early training may overfit to early samples.
    experience.bootstrap(env)

    # Perform the training
    return train_dqn(agent, n_episodes)

dqn_w_exp_returns = dqn_with_experience(env, n_episodes)
data.report([(dqn_w_exp_returns, 'b', 'DQN w/ ER'),
             (baseline_returns, 'r', 'Baseline')], title='Experience Replay')

Running this, we can see that we start to get somewhat better results, but it’s not a huge improvement yet. However, we’ll keep this and layer on more improvements.

cartpole_single_er_dqn

Reducing Bias

As the network trains, it can be thought of as chasing a moving target: A Q-value will improve, which will cause an update, which can cause the target to move again, etc. This introduces bias into the network and it may eventually converge to a sub-optimal policy.

To help with this, the second major improvement made by [Mnih15] was to introduce fixed-Q targets. Two networks are used, with the “target” network merely being a copy of the training network. The training network will have its Q-values updated, but will use the Q-values of the target network when estimating the future discounted return (i.e., it will use the “old” network to generate estimates). The target network is periodically updated to match the training network, so that these estimates don’t drift too far apart.

Note: Much of the research refers to fixed-Q targets as Double DQN. This is in reference to Double Q-Learning, by [vanHasselt10]. However, Mnih originally referred to this as fixed-Q targets and van Hasselt’s version has some additional nuances that were not used by Mnih (which also have theoretical differences). Therefore, I prefer to simply call these fixed-Q targets.

This can be implemented in a few different ways: [Mnih15] used “hard” fixed-Q targets, which are periodically reset to match the current training network, while [Lillicrap15] used “soft” fixed-Q targets, which slowly trail the training network. The soft version has the advantage of being a smoother adjustment, but in practice both methods have similar effectiveness.

The code for a fixed-Q behavior is pretty simple. The only tricky part is in the step() function. You can see here that we support both the “soft” update version (smoothly updating the target network towards the training network in smaller increments), as well as the “hard” update version (which simply copies the training network every N steps).

import numpy as np

class FixedQTarget:
    def __init__(self, target_model, target_update_step, use_soft_targets=False):
        self._target_model = target_model
        self._target_update_step = target_update_step
        self._use_soft_targets = use_soft_targets
        self._tau = 1.0 / self._target_update_step
        self._n_steps = 0

    def predict(self, states):
        return self._target_model.predict(states)

    def reset(self, policy_model):
        self._target_model.set_weights(policy_model.get_weights())
        self._n_steps = 0

    def step(self, policy_model):
        if self._use_soft_targets:
            # Soft update fixed-Q targets
            weights_model = policy_model.get_weights()
            weights_target = self._target_model.get_weights()
            new_weights = []

            for i in range(len(weights_model)):
                new_weights.append(self._tau * weights_model[i] + (1. - self._tau) * weights_target[i])

            self._target_model.set_weights(new_weights)
        else:
            # Hard update: copy the policy network weights every N steps
            if self._n_steps % self._target_update_step == 0:
                self._target_model.set_weights(policy_model.get_weights())

        self._n_steps += 1
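
To get a feel for the soft-update rate, here's a small numeric sketch using the target_update_step=500 we'll pass in below (so tau = 1/500 = 0.002):

tau = 1.0 / 500
w_policy, w_target = 1.0, 0.0   # pretend each network is a single scalar weight, for illustration

for _ in range(500):
    w_target = tau * w_policy + (1.0 - tau) * w_target

print(round(w_target, 3))  # ~0.632: after 500 steps the target has closed about 63% of the gap

In other words, the soft target trails the policy network smoothly, rather than snapping to it every 500 steps like the hard version.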

We also just need a few slight tweaks to support this in our DQNAgent. We’ll again update the constructor:

class DQNAgent:
    def __init__(self, env, model, gamma, exploration, experience=None, fixed_q_target=None):
        ...
        self._fixed_q_target = fixed_q_target

        if self._fixed_q_target is not None:
            self._fixed_q_target.reset(self._model)
        ...

We’ll also update the _get_next_action_returns() function to generate estimates from the target network, if it is available:

    def _get_next_action_returns(self, next_states):
        if self._fixed_q_target is not None:
            # Fixed-Q targets use next action returns from the target policy (off-policy)
            return self._fixed_q_target.predict(next_states)
        else:
            # Get the next action returns from the on-policy model
            return self._model.predict(next_states)

We also need to make one small change to the train() function, to update the target network by calling self._fixed_q_target.step():

    def train(self, render=False, debug_func=None):
            ...

            action = self._exploration.act(self._model, np.array([state]))
            next_state, reward, done, _ = self._env.step(action)
            samples = self._sample_experience(state, action, reward, next_state, done)

            # This part is new:
            if self._fixed_q_target is not None:
                self._fixed_q_target.step(self._model)

            states = samples[0]
            predictions = self._get_predictions(samples)
            history = self._model.fit(states, predictions)
            losses.extend(history.history['loss'])

            ...

Now, we can easily test this out, like we’ve done before:

def dqn_with_fixed_targets(env, n_episodes=None):
    # DQN with e-greedy exploration, experience replay, and fixed-Q targets
    model = build_network(env)
    target_model = build_network(env)
    experience = ExperienceReplay(maxlen=2000, sample_batch_size=32, min_size_to_sample=100)
    exploration = EpsilonGreedyExploration(epsilon_start=1.0, epsilon_min=0.01, epsilon_decay=0.99)
    fixed_target = FixedQTarget(target_model, target_update_step=500, use_soft_targets=True)
    agent = DQNAgent(env, model, gamma=0.99, exploration=exploration, experience=experience, fixed_q_target=fixed_target)

    # Pre-load samples in experience replay.
    # This can also be done implicitly during regular training episodes,
    # but the early training may overfit to early samples.
    experience.bootstrap(env)

    # Perform the training
    return train_dqn(agent, n_episodes, debug=n_episodes is None)

dqn_w_fixed_targets_returns = dqn_with_fixed_targets(env, n_episodes)
data.report([(dqn_w_fixed_targets_returns, 'b', 'DQN w/ Fixed-Q'),
             (baseline_returns, 'r', 'Baseline')], title='Fixed-Q Targets')

And we should now start to see a result that looks much more stable:

cartpole_single_fixedq_dqn

Variance in DQNs

Now, after all these changes, you may still have seen some inconsistent results running your DQNs. This is sometimes because network initialization can have a big impact on the effectiveness of a DQN, and other times because of the particular random samples that are drawn from the environment. One thing I’ve learned is never to trust a single test run of a DQN; a good testing procedure should run multiple iterations (once you’ve tuned your hyperparameters!) to generate a better expectation of results.
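
A minimal sketch of what such a test harness might look like (the helper here is illustrative, not part of the repo, and it assumes every run produces the same number of episodes):

def run_repeated(experiment, n_runs=10):
    # experiment is a zero-argument callable returning a list of per-episode returns
    runs = np.array([experiment() for _ in range(n_runs)])
    return runs.mean(axis=0), runs.std(axis=0)

mean, std = run_repeated(lambda: basic_dqn(env, n_episodes))
episodes = np.arange(len(mean))
plt.plot(episodes, mean, color='b', label='Basic DQN (mean of 10 runs)')
plt.fill_between(episodes, mean - std, mean + std, color='b', alpha=0.2)
plt.legend(loc='upper left')
plt.show()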

For example, here is a comparison of the four variations we’ve seen so far (the random walk, plus the three DQN variations), each run 10 times. The solid lines are the mean values, while the shaded regions show one standard deviation for each.

cartpole_multi_all_dqn

As you can see, experience replay does better on average than the result we generated earlier; that was simply a particularly poor training run. But using fixed-Q targets along with experience replay yields a much smoother progression (and is much less susceptible to “flat-lines” in general).

This brings up a particular pain point with building RL algorithms: Debugging.

Debugging

Debugging RL can be tricky. There can be many hyperparameters, some quite sensitive to change, and sometimes it takes a particular combination of hyperparameters to even just beat a random walk. As mentioned earlier, even just the variance between test runs can mislead you into thinking you have an effective solution when you actually just got lucky, or the opposite with an unlucky training run.

At these times, it can be very helpful to visualize the results that your network is actually generating. Using TensorBoard is one solution, but I find that it works better for supervised learning than for RL. You may have noticed some debug_func references in the DQNAgent code. We can use this to log particular bits of information about our network as training progresses. For example, we can change our train_dqn() function to this:

def train_dqn(agent, n_episodes=None, debug=False):
    # Experiment described by: https://github.com/openai/gym/wiki/CartPole-v0
    # CartPole-v0 defines "solving" as getting an average reward of 195.0 over 100 consecutive trials; we use the same threshold.
    # This environment corresponds to the version of the cart-pole problem described by
    # Barto, Sutton, and Anderson [Barto83].
    exp_returns = []
    training_complete = False
    e = 0
    action_vals = []

    def debug_func(model):
        # Just an arbitrary first state/action pair from a new episode of a fully trained model
        state = np.array([[0.3604471, 0.21131558, 5.13830467, 0.07171951]])
        action = 0
        x = model.predict(state)[0][action]
        action_vals.append(x)

    # Arbitrary maximum at 2000 episodes, in case of divergent training
    max_episodes = n_episodes if n_episodes is not None else 2000

    while not training_complete and e < max_episodes:
        total_reward, n_steps, elapsed_time, mean_loss = agent.train(debug_func=debug_func if debug else None)
        exp_returns.append(total_reward)
        e += 1

        # "Solved" once the average return over the last 100 episodes reaches 195
        training_complete = len(exp_returns) >= 100 and np.mean(exp_returns[-100:]) >= 195

    print('Training complete after {} episodes'.format(e))

    if debug:
        plt.plot(exp_returns, color='b', label='Rewards')
        plt.plot(action_vals, color='r', label='Q-value')
        plt.legend(loc='upper left')
        plt.show()

    return exp_returns

We’ve defined the internal debug_func() to track the predicted Q-value of a particular state/action pair. This can sometimes give us a better indicator of how our network is actually training than the loss, or even the reward.

If we visualize this output from the “vanilla” DQN, it might look something like this:

cartpole_solve_qvalues_vanilla_dqn

Here, we can see how the predicted Q-value is changing based on particular rewards being received. While neural networks are a great way to get a grasp on continuous state spaces, they have been known for a long time ([Thrun93]) to have a particular sensitivity to overestimation of Q-values. This can be seen in the chart above, where spikes in reward estimation can cause the Q-value to spike up as well. This particular behavior is often followed by the “flat-lining” that we’ve seen previously; basically, the network weights will become misaligned by too big of a gradient adjustment, causing the whole function approximator to fail miserably for a while.
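
A quick way to see where that overestimation comes from (a toy sketch, unrelated to the post's code): even if every action's true value is zero, taking a max over noisy estimates is biased upwards.

np.random.seed(0)
noisy_estimates = np.random.randn(10000, 2)   # noisy estimates of two actions whose true value is 0
print(noisy_estimates.max(axis=1).mean())     # ~0.56 > 0: the max of noisy estimates is biased high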

In contrast, let’s look at the same result when applied to the DQNAgent that used experience replay and fixed-Q targets:

cartpole_solve_qvalues_fixedq_dqn

This is clearly much better behaved during gradient updates! Rewards don’t swing back and forth nearly as much, causing a much more stable learning procedure.

This is just one example of how to debug RL training. I also highly recommend this post by Jaromír Janisch for visualizing policy behaviors.

Conclusion

We’ve taken our first steps into the world of DQNs and it’s been a handful! But, now that we’ve gotten some experience with how different tweaks can change the algorithm, we’ll be able to build a better intuition for how the theory behind it all works. Stay tuned for a deep dive into the mathematics!

If you have any questions or comments, please reach out to me! And if you see a bug or if I’ve incorrectly described something, let me know and I’ll update the post!

OpenAI Gym Experiments Series

GitHub repo

Setting Up
CartPole with DQN (you are here)

References

[Sutton98] Sutton, R. S. & Barto, A. G. 1998 Reinforcement learning: an introduction. Cambridge, MA: MIT Press.

[Mnih15] Mnih, V., et al. 2015 Human-level control through deep reinforcement learning. Nature 518 (7540), 529-533.

[Lillicrap15] Lillicrap, T., et al. 2015 Continuous control with deep reinforcement learning. arXiv:1509.02971 [cs.LG]

[vanHasselt10] van Hasselt, H. 2010 Double Q-Learning. Advances in Neural Information Processing Systems 23 (NIPS 2010).

[Thrun93] Thrun, S., Schwartz, A. 1993 Issues in Using Function Approximation for Reinforcement Learning. IN PROCEEDINGS OF THE FOURTH CONNECTIONIST MODELS SUMMER SCHOOL.