Gym Experiments: Setting Up

Kicking off a series of posts on OpenAI’s Gym environments, I’ll cover some light bootstrapping to get us up and running more quickly. I promise, it will be short and sweet! I’ll be referring back to this from later posts in the series.

The complete series can be found at the bottom of this post, and the latest version of the GitHub repo can be found here.

Setting Up Your Environment

The first thing you’ll want to do is set up a virtual environment for experimenting with the code. On OS X, I use virtualenv, while on Windows I use Anaconda. YMMV, so use whatever you’re comfortable with. I’ll focus on Python 3, and I’ll be using Keras for the machine learning parts. If you prefer a different ML library, feel free to use it! I’ll try to keep the code as agnostic as possible as we go along. Just implement similar function interfaces (such as a model’s .fit() and .predict() functions) and you should be able to swap in a different library.

After activating your virtual environment of choice, install the packages we’ll be using:

pip install keras matplotlib gym Box2D box2d-py

Note: Some of the Gym environment dependencies can be a bit tricky when installing (and may not all work gracefully on different platforms). For now, we’ll just be using the simpler environments, but I’ll update this post as needed when we run into the more complex setups.

Note: You may need to install SWIG before Box2D will install. This can be tricky depending on your platform. On Mac, a simple brew install swig should do the trick, but on Windows it can be a more involved process.
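
If you want a quick sanity check that the basics are working, something like the following should create an environment and take a single random step without errors (I’m assuming the CartPole-v1 environment here, which ships with Gym):

import gym

# Create a simple environment and take one random step to verify the install.
env = gym.make('CartPole-v1')
state = env.reset()
state, reward, done, info = env.step(env.action_space.sample())
env.close()
print('Gym looks good; state shape:', state.shape)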

Now, on to the custom code! This can all be found in the helpers/ directory of the repo.

Wrapping the Environment

We’ll be working with an environment wrapper around the existing Gym environments. This is primarily for standardizing the state values our environments may give to us. This is intended as a drop-in replacement for the expected environment interfaces. If you don’t want to use the wrapper, you can just ignore this.

Let’s create env_wrapper.py and define the EnvironmentWrapper class:

import numpy as np
import random

class EnvironmentWrapper:
    def __init__(self, env, n_bootstrap_episodes=10000, verbose=1):
        self._env = env
        self._n_samples = 0
        self._mean = None
        self._std = None
        self._verbose = verbose

        if n_bootstrap_episodes is not None:
            self._bootstrap(n_bootstrap_episodes)

    @property
    def action_space(self):
        return self._env.action_space

    @property
    def observation_space(self):
        return self._env.observation_space

    def render(self):
        self._env.render()

    def reset(self):
        state = self._env.reset()
        self._update_env_stats(state)
        return self._standardize(state)

    def step(self, action):
        state, reward, done, info = self._env.step(action)
        self._update_env_stats(state)
        return self._standardize(state), reward, done, info

As you can see, this is a pretty simple wrapper around an existing Gym environment. The only other thing we’re doing is some standardization as we retrieve each state from the environment. This has the implicit assumption that our state data is normally distributed, which may be a bad assumption! But in practice, it tends to work well with neural networks. This might be something you want to experiment with in your own problem domain.

There are a few missing functions we still need to define.

We’ll use incremental calculations for the mean and standard deviation of the state data. This allows us to avoid having to store all of the sample data in memory.

We’ll also bootstrap the initial mean/standard deviation from sample data when we instantiate the class. We could punt on this until we’re actually training our model, but that can leave the model with poorly standardized states early on, before the mean/standard deviation converge to more stable values.

Here’s the code for these functions:

    def _bootstrap(self, n_bootstrap_episodes):
        self._mean = None
        self._std = None

        if self._verbose > 0:
            print('Bootstrapping environment stats over {} random episodes...'.format(n_bootstrap_episodes))

        for _ in range(n_bootstrap_episodes):
            done = False
            _ = self.reset()

            while not done:
                action = random.randrange(self._env.action_space.n)
                _, _, done, _ = self.step(action)

        if self._verbose > 0:
            print('Bootstrapping complete; mean {}, std {}'.format(self._mean, self._std))

    def _update_env_stats(self, sample):
        # Incremental mean/standard deviation; avoids storing every sample in memory
        self._n_samples += 1

        if self._mean is None:
            # First sample: copy it as the initial mean and start with unit deviation
            self._std = np.repeat(1.0, len(sample))
            self._mean = np.array(sample, dtype=float)
        else:
            # Update the running variance first, then store its square root as the deviation
            variance = (self._n_samples - 2) / (self._n_samples - 1) * np.square(self._std) + \
                       (1 / self._n_samples) * np.square(sample - self._mean)
            self._std = np.sqrt(variance)
            self._mean += (sample - self._mean) / self._n_samples

    def _standardize(self, state):
        if self._mean is None or self._std is None:
            return state

        return (state - self._mean) / self._std
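
Here’s a rough sketch of how the wrapper gets used, assuming CartPole-v1 (any Gym environment with a discrete action space should work); the small bootstrap count is just to keep the example quick:

import gym
from env_wrapper import EnvironmentWrapper

# Wrap a Gym environment; a small bootstrap run keeps this example fast.
env = EnvironmentWrapper(gym.make('CartPole-v1'), n_bootstrap_episodes=100)

state = env.reset()  # states now come back standardized
done = False
while not done:
    state, reward, done, info = env.step(env.action_space.sample())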

Wrapping the Model

We’re also going to make a small wrapper around our ML model. This is mainly to avoid sprinkling magic numbers and other boilerplate throughout our algorithms. Again, this uses Keras, so if you use a different ML package, you may need to tweak it.

Here’s model.py (note that it needs no imports):

class ModelWrapper:
    def __init__(self, model, fit_batch_size=32):
        self._model = model
        self._fit_batch_size = fit_batch_size

    @property
    def action_size(self):
        return self._model.layers[-1].output_shape[1]

    def fit(self, states, predictions, sample_weight=None):
        return self._model.fit(states, predictions, epochs=1, verbose=0, batch_size=self._fit_batch_size, sample_weight=sample_weight)

    def get_weights(self):
        return self._model.get_weights()

    def predict(self, state):
        return self._model.predict(state)

    def set_weights(self, weights):
        self._model.set_weights(weights)

The action_size property simply interprets the size of the model’s final layer as the number of actions that can be taken.

You’ll notice that fit() passes epochs=1. If you’re coming from other areas of machine learning (ML), this might look inefficient, but in reinforcement learning (RL) we’re often constrained to single samples of the environment. This is because we’re trying to build estimates of the environment (I’ll go into this more in later posts); fitting the model for multiple epochs on the same samples would skew those estimates. Of course, RL has many open areas for research, so if you think you have an idea for how to learn better from samples, go explore it!

You may also note that the default batch size for learning from samples is 32. This follows similar reasoning: smaller batches may be slower, but we don’t want to build over-estimates of the network gradient updates (if anything, constraining these gradient updates is an active area of research!). Traditionally, much of the RL research literature has used a batch size of 32.
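
To make the interface concrete, here’s a minimal sketch of building a small Keras network for CartPole (4 state inputs, 2 actions) and wrapping it. The layer sizes and optimizer here are placeholder choices, not recommendations:

from keras.models import Sequential
from keras.layers import Dense
from model import ModelWrapper

# A tiny fully-connected network: 4 state inputs -> 2 action outputs.
network = Sequential([
    Dense(24, activation='relu', input_shape=(4,)),
    Dense(24, activation='relu'),
    Dense(2, activation='linear'),
])
network.compile(optimizer='adam', loss='mse')

model = ModelWrapper(network)
print(model.action_size)  # 2, read from the final layer's output shape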

Data Reporting

Lastly, we have data.py, which we’ll use for reporting data statistics.

import matplotlib.pyplot as plt
import numpy as np

def _smooth_returns(returns, window=10):
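    # Pad the first `window` entries with NaN so the smoothed curve lines up with the raw returns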
    output = [np.nan] * window

    for i in range(window, len(returns)):
        output.append(np.mean(returns[i-window:i]))

    return output

def _plot_series(series, color, label, smooth_window=10):
    series = np.array(series)

    if series.ndim == 1:
        plt.plot(series, color=color, linewidth=0.5)
        plt.plot(_smooth_returns(series, window=smooth_window), color=color, label=label, linewidth=2)
    else:
        mean = series.mean(axis=0)
        plt.plot(mean, color=color, linewidth=1, label=label)
        plt.fill_between(range(series.shape[1]),
                         mean + series.std(axis=0), mean - series.std(axis=0),
                         color=color, alpha=0.2)

def random(env, n_episodes=1000):
    returns = []

    for _ in range(n_episodes):
        _ = env.reset()
        done = False
        total_reward = 0

        while not done:
            action = np.random.randint(low=0, high=env.action_space.n)
            _, reward, done, _ = env.step(action)
            total_reward += reward

        returns.append(total_reward)

    return returns

def report(returns, render=True, title=None, legend_loc='upper right', smooth_window=10):
    for i in range(len(returns)):
        series, color, label = returns[i]

        if i == 0:
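            # Summary stats are printed only for the first series (treated as the main experiment)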
            print('Experiment stats for {}:'.format(label))
            print('  Mean reward: {}'.format(np.mean(series)))
            print('  Median reward: {}'.format(np.median(series)))
            print('  Std reward: {}'.format(np.std(series)))
            print('  Max reward: {}'.format(np.max(series)))
            print('  Min reward: {}'.format(np.min(series)))

        if not render:
            continue

        _plot_series(series, color=color, label=label, smooth_window=smooth_window)

    if not render:
        return

    if title is not None:
        plt.title(title)

    plt.legend(loc=legend_loc)
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.show()

The random() function generates results from a random walk through our environment. We’ll often come back to this as a baseline for comparison in our later experiments.

The report() function outputs some general data statistics and can also generate charts. There are a couple things going on here that you may find useful:

  • You can send multiple series to the function for plotting. This can be handy for comparing different series. Just pass a list with each element as a 3-tuple, defined as: (series_data, chart_color, chart_label)
  • If a 1D series of data is provided for plotting, a tunable moving average will be plotted over the data. This is because RL data is often quite noisy.
  • If a 2D series is provided, the series will be interpreted as multiple trial runs. The mean of the data at each iteration will be plotted, as well as a +/- 1 standard deviation band.
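
Putting the pieces together, here’s a rough sketch of comparing an agent against the random baseline. The agent_returns variable is just a placeholder for the episode returns you’d collect from your own training loop:

import gym

import data
from env_wrapper import EnvironmentWrapper

env = EnvironmentWrapper(gym.make('CartPole-v1'), n_bootstrap_episodes=100)

# Random-walk baseline returns.
baseline_returns = data.random(env, n_episodes=200)

# Placeholder: substitute the returns from your own training loop here.
agent_returns = baseline_returns

data.report([
    (agent_returns, 'blue', 'My agent'),
    (baseline_returns, 'gray', 'Random baseline'),
], title='CartPole-v1')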

Concluding

That’s all we need to dig into some experiments! This boilerplate might seem unnecessary, but with RL it can often be difficult to get good results. Having some simple code for reporting, standardization, and baseline comparisons can be a big help in making progress on an experiment.

OpenAI Gym Experiments Series

GitHub repo

Setting Up (you are here)
CartPole with DQN