I have an open question about Prioritized Experience Replay from [Schaul15]. From my experiments, it seems that an equation in the publication is wrong, but maybe I’m overlooking something. I’d appreciate input.
I won’t give an intro to the research, because I’m asking for help from someone more familiar with it than I am. The crux is that I’m stuck on the Importance Sampling (IS) step. The algorithm calls for a (-beta) exponent, but in all of my tests so far this causes divergence, while a (+beta) exponent yields the expected result.
Here’s the algorithm, as presented in [Schaul15]. Line 10 is where I’m seeing problems.
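For readers without the paper at hand, line 10 of Algorithm 1 computes the importance-sampling weight (transcribed as I read it, so please double-check against the publication):

    w_j = (N \cdot P(j))^{-\beta} \,/\, \max_i w_i

where N is the replay-buffer size and P(j) is the sampling probability of transition j.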
At first, I thought I was crazy, because every implementation I’ve found reflects the algorithm above (as do the equations in tutorial videos). However, I can’t reproduce good results with my own code.
Here’s a stripped-down example in plain Python that demonstrates what I perceive to be the issue (you can run it in a REPL):
```python
import numpy as np
from collections import deque

d = deque(maxlen=10)
d.append(1)
d.append(2)
d.append(3)

# Test variations of alpha...
alpha = 0.0
dist = np.array(list(d)) ** alpha
# We can see alpha=0 correctly makes everything uniform
print(dist)  # Outputs: array([1., 1., 1.])

alpha = 1.0
dist = np.array(list(d)) ** alpha
# We can see alpha=1 preserves the priorities
print(dist)  # Outputs: array([1., 2., 3.])

alpha = 0.5
dist = np.array(list(d)) ** alpha
print(dist)  # Outputs: array([1.        , 1.41421356, 1.73205081])
dist /= dist.sum()
print(dist)  # Outputs: array([0.24118095, 0.34108138, 0.41773767])
num_samples = len(d)
print(num_samples * dist)  # Outputs: array([0.72354286, 1.02324413, 1.253213  ])

# Test variations of beta...
beta = 0.0
# We can see beta=0 correctly makes everything uniform
print((num_samples * dist) ** (-beta))  # Outputs: array([1., 1., 1.])

beta = 1.0
# It seems that beta=1 with the negative exponent INCORRECTLY
# downweights the prioritized samples
importances = np.array((num_samples * dist) ** (-beta))
print(importances / importances.max())  # Outputs: array([1.        , 0.70710678, 0.57735027])

# It seems that beta=1 with the positive exponent correctly
# weights according to priority
importances = np.array((num_samples * dist) ** (beta))
print(importances / importances.max())  # Outputs: array([0.57735027, 0.81649658, 1.        ])
```
As this simple Python example shows, keeping a positive exponent on beta appears to re-weight the priorities correctly.
Both within [Schaul15] and in other references, I have seen examples that show an equivalent representation of line 10 in a different form (with a positive exponent on beta, but with the other parameters inverted).
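To rule out the alternative form as the source of my confusion, here’s a quick sketch (the priority vector `P` is a made-up example) confirming that the two representations are algebraically identical, since (N · P(i))^(-beta) = ((1/N) · (1/P(i)))^beta:

```python
import numpy as np

# Hypothetical normalized priority distribution over N = 3 samples.
P = np.array([0.2, 0.3, 0.5])
N = len(P)
beta = 0.4

# Form from Algorithm 1, line 10: negative exponent on beta.
w_neg = (N * P) ** (-beta)

# Equivalent form seen elsewhere: positive exponent, inverted base.
w_pos = ((1.0 / N) * (1.0 / P)) ** beta

print(np.allclose(w_neg, w_pos))  # prints True
```

So the two forms cannot produce different weights; the sign of the exponent only matters relative to the base.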
Perhaps some confusion between these two representations has led others astray? Or have I (much more likely) completely overlooked something that is ruining my own results? I welcome any and all feedback.
[Schaul15] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized Experience Replay. arXiv:1511.05952 [cs.LG].