I have an open question about Prioritized Experience Replay from [Schaul15]. From my experiments, it seems that an equation in the publication is wrong, but maybe I’m overlooking something. I’d appreciate input.
I won’t give an intro to the research, because I’m asking for help from someone more familiar with it than I am. The crux is that I’m stuck on the Importance Sampling (IS) step. The algorithm calls for a (-beta) exponent, but in all of my tests so far this causes divergence, while a (+beta) exponent yields the expected result.
Here’s the algorithm, as presented in [Schaul15]. Line 10 is where I’m seeing problems.
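For readers without the paper at hand, line 10 of Algorithm 1 computes the importance-sampling weight (transcribed as I read it, so please double-check against the publication):

    w_j = (N \cdot P(j))^{-\beta} \,/\, \max_i w_i

where N is the replay-buffer size and P(j) is the sampling probability of transition j.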
At first, I thought I was crazy, because every implementation I’ve found reflects the algorithm above (as do the equations in tutorial videos). However, I can’t reproduce good results with my own code.
Here’s a stripped-down example in plain Python that demonstrates what I perceive to be the issue (you can run it in a REPL):
```python
import numpy as np
from collections import deque

d = deque(maxlen=10)
d.append(1)
d.append(2)
d.append(3)

# Test variations of alpha...
alpha = 0.0
dist = np.array(list(d)) ** alpha
# We can see alpha=0 correctly makes everything uniform
print(dist)  # Outputs: array([1., 1., 1.])

alpha = 1.0
dist = np.array(list(d)) ** alpha
# We can see alpha=1 preserves the priorities
print(dist)  # Outputs: array([1., 2., 3.])

alpha = 0.5
dist = np.array(list(d)) ** alpha
print(dist)  # Outputs: array([1.        , 1.41421356, 1.73205081])
dist /= dist.sum()
print(dist)  # Outputs: array([0.24118095, 0.34108138, 0.41773767])
num_samples = len(d)
print(num_samples * dist)  # Outputs: array([0.72354286, 1.02324413, 1.253213  ])

# Test variations of beta...
beta = 0.0
# We can see beta=0 correctly makes everything uniform
print((num_samples * dist) ** (-beta))  # Outputs: array([1., 1., 1.])

beta = 1.0
# It seems that beta=1 with the negative exponent INCORRECTLY
# downweights the prioritized samples
importances = np.array((num_samples * dist) ** (-beta))
print(importances / importances.max())  # Outputs: array([1.        , 0.70710678, 0.57735027])

# It seems that beta=1 with the positive exponent correctly
# weights according to priority
importances = np.array((num_samples * dist) ** (beta))
print(importances / importances.max())  # Outputs: array([0.57735027, 0.81649658, 1.        ])
```
As this simple Python example shows, keeping a positive exponent on beta appears to re-weight the priorities correctly.
Both within [Schaul15] and in other references, I have seen examples that show an equivalent representation of line 10 in a different form (with a positive exponent on beta, but with the other parameters inverted).
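To rule out the alternative form as the source of my confusion, here’s a quick sketch (the priority vector `P` is a made-up example) confirming that the two representations are algebraically identical, since (N · P(i))^(-beta) = ((1/N) · (1/P(i)))^beta:

```python
import numpy as np

# Hypothetical normalized priority distribution over N = 3 samples.
P = np.array([0.2, 0.3, 0.5])
N = len(P)
beta = 0.4

# Form from Algorithm 1, line 10: negative exponent on beta.
w_neg = (N * P) ** (-beta)

# Equivalent form seen elsewhere: positive exponent, inverted base.
w_pos = ((1.0 / N) * (1.0 / P)) ** beta

print(np.allclose(w_neg, w_pos))  # prints True
```

So the two forms cannot produce different weights; the sign of the exponent only matters relative to the base.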
Perhaps some confusion between these two representations has led others astray? Or have I (much more likely) completely overlooked something that is ruining my own results? I welcome any and all feedback.
[Schaul15] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized Experience Replay. arXiv:1511.05952 [cs.LG].