Abstract:
Episode-based reinforcement learning (ERL) algorithms treat reinforcement
learning (RL) as a black-box optimization problem in which we learn to select
the parameter vector of a controller, often represented as a movement primitive,
for a given task descriptor called a context. ERL offers several distinct benefits
over step-based RL. It generates smooth control trajectories, can handle
non-Markovian reward definitions, and its exploration in parameter space is well
suited to sparse-reward settings. Yet the high dimensionality of the movement
primitive parameters has so far hampered the effective use of deep RL methods.
In this paper, we present a new algorithm for deep ERL. It is based on
differentiable trust region layers, which underlie a successful on-policy deep RL
algorithm. These layers allow us to specify trust regions for the policy update
that are solved exactly for each state using convex optimization, which enables
learning policies with the high precision required for ERL. We compare our ERL
algorithm to state-of-the-art step-based algorithms on many complex simulated
robotic control tasks. In doing so, we investigate different reward formulations:
dense, sparse, and non-Markovian. While step-based algorithms perform well only
on dense rewards, ERL performs favorably on sparse and non-Markovian rewards.
Moreover, our results show that sparse and non-Markovian rewards are also often
better suited to defining the desired behavior, allowing us to obtain policies of
considerably higher quality than with step-based RL.
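To make the episode-based setting concrete, the following minimal sketch shows the loop the abstract describes: a context-conditioned Gaussian search distribution over movement-primitive weights is sampled once per episode, the whole trajectory is executed, and a single episodic return (which may be sparse or non-Markovian) drives the update. All names, dimensions, the toy reward, and the exponentiated-return weighted refit are illustrative assumptions; the refit merely stands in for the paper's trust-region-constrained policy update, which is solved exactly per context using the differentiable trust region layers.

import numpy as np

# Assumed toy dimensions: a 2-D task descriptor (context) and a 20-D
# movement-primitive weight vector controlling one full episode.
CONTEXT_DIM, PARAM_DIM = 2, 20
rng = np.random.default_rng(0)

# Linear-Gaussian search distribution pi(theta | c): the mean is a linear
# map of the context; exploration happens in parameter space.
W = np.zeros((PARAM_DIM, CONTEXT_DIM))   # mean weights
b = np.zeros(PARAM_DIM)                  # mean bias
log_std = np.zeros(PARAM_DIM)            # per-dimension exploration noise

def rollout_return(context, theta):
    # Placeholder for executing the full movement-primitive trajectory and
    # returning ONE episodic return; in the real tasks this may be sparse or
    # non-Markovian. Here: a toy quadratic reward around a context target.
    target = np.concatenate([context, np.zeros(PARAM_DIM - CONTEXT_DIM)])
    return -np.sum((theta - target) ** 2)

for iteration in range(100):
    contexts = rng.uniform(-1.0, 1.0, size=(16, CONTEXT_DIM))
    means = contexts @ W.T + b                       # context-conditioned mean
    thetas = means + np.exp(log_std) * rng.standard_normal(means.shape)
    returns = np.array([rollout_return(c, th) for c, th in zip(contexts, thetas)])

    # Stand-in for the paper's trust-region-constrained update: weight samples
    # by exponentiated normalized return and refit the linear mean map with
    # weighted least squares.
    weights = np.exp((returns - returns.max()) / (returns.std() + 1e-8))
    weights /= weights.sum()
    X = np.hstack([contexts, np.ones((len(contexts), 1))])   # add bias column
    sw = np.sqrt(weights)[:, None]
    sol, *_ = np.linalg.lstsq(X * sw, thetas * sw, rcond=None)
    W, b = sol[:-1].T, sol[-1]

Note that the entire inner loop consumes only one scalar return per sampled parameter vector, which is why sparse and non-Markovian reward definitions pose no special difficulty in this setting, in contrast to step-based RL.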