Using Neural Networks for Synthesizing Importance Sampler Database in Reinforcement Learning

Zvezdan Lončarević 1,2, Andrej Gams 1,2
1 Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
zvezdan.loncarevic@ijs.si

Abstract

Reinforcement learning is a widely used method for acquiring new skills in robotics. However, it is usually rather slow and many learning iterations are needed before the robot successfully learns the skill. During the learning attempts, the parameters of the actions together with the corresponding rewards are stored and used in the following update. In this paper, we present the possibility of using neural networks for expanding the database containing the knowledge from previous learning iterations. Results on throwing examples show that this can lead to accelerated robot learning with fewer iterations and real-world repetitions.

1 Introduction

In order for robots to move into unstructured environments, adaptation to the current state of the world is critical. One of the approaches to adaptation is robot learning, i.e., the process of task performance improvement over the course of several repetitions [1]. However, robot learning can be complicated and can take a long time. It might also not be safe for the robot or its vicinity. Many approaches have been proposed to improve the speed and safety of robot learning. The literature states that a good starting point for learning is very important [2]. For example, the initial task execution or policy can be acquired by demonstration [3] or by generalization [4]. Another key approach is the reduction of the search space, for example with principal component analysis or autoencoder neural networks [5].

Learning approaches themselves can also have a profound effect on the required number of task iterations. Sample efficiency is a known problem of deep reinforcement learning approaches [6]. However, even different reinforcement learning methods require different amounts of samples. For example, gradient-based methods, such as Reinforce, eNAC and CMA-ES, require several roll-outs for one iteration (update of the policy parameters) [7]. Non-gradient based algorithms, such as PoWER and PI2, do not need to first calculate the gradient, but can update the policy based on a few best attempts and, for example, random noise. The few best attempts in reinforcement learning can be classified as those attempts that collect the most reward.

Figure 1: PA10 robot in MuJoCo dynamic simulation (left) and 2-DoF planar robot in Matlab (right)

For example, the rewards of all task executions are compared, and only the ones with the highest rewards are then used to generate the new iteration. Classifying the attempts, especially when no model is available, can also be time consuming.

In this paper we propose a method that reduces the number of required attempts by training a neural network that maps between policy parameters and the expected reward. Before generating a new set of policy parameters, the algorithm predicts their reward, and then uses this virtually acquired reward, together with the rewards of previous attempts, as the input into the importance sampler. The results show that the approach reduces both the required number of iterations and the highest number of iterations until the task is learned.
We used simulated throwing at a target as the demonstration task. The simulated set-ups, MuJoCo [8] for the dynamic simulation and a planar kinematic simulation in Matlab, are shown in Fig. 1.

2 Search Space Reduction

In order for RL to be applied in robotics, it has to learn fast enough. As neural networks have a fixed number of inputs, we need to represent trajectories so that each example has the same number of parameters. For that purpose, we used Dynamic Movement Primitives and autoencoder networks.

2.1 Dynamic Movement Primitives

The basic idea of Dynamic Movement Primitives (DMPs) [9] is to represent the trajectory with a mass on a spring-damper system, with a learned external acceleration, the forcing term. For each joint space coordinate y, the DMP is based on the following second order differential equation:

\tau^2 \ddot{y} = \alpha_z (\beta_z (g - y) - \tau \dot{y}) + f(x),    (1)

where \tau is the time constant used for time scaling, \alpha_z and \beta_z are damping constants (\beta_z = \alpha_z / 4) that make the system critically damped, and x is the phase variable. The forcing term f(x) encodes the shape of the trajectory from the initial position y_0 to the final configuration g. It is given by:

f(x) = \frac{\sum_{i=1}^{N} \psi_i(x) w_i}{\sum_{i=1}^{N} \psi_i(x)} x,    (2)

\psi_i(x) = \exp\left(-\frac{1}{2\sigma_i^2} (x - c_i)^2\right),    (3)

where c_i are the centers of the radial basis functions \psi_i(x) distributed along the trajectory and \sigma_i their widths. The phase x makes the forcing term f(x) vanish when the goal is reached, as it exponentially converges to 0. Its dynamics are given by

\tau \dot{x} = -\alpha_x x,    (4)

where \alpha_x is a positive constant and x starts at 1 and converges to 0 as the goal is reached. Multiple DoFs are realized by maintaining separate sets of (1)-(3), while a single canonical system given by (4) is used to synchronize them.

2.2 Autoencoder networks

Autoencoder networks are neural networks that are often used for dimensionality reduction. Autoencoders with nonlinear layers are capable of extracting the most relevant features of robotic movements. They are composed of two parts: the encoder and the decoder (Fig. 2). An autoencoder neural network is trained so that the output data \tilde{\Theta}_{DMP} matches the input data \Theta_{DMP} as closely as possible. The encoder part pushes the data through the bottleneck of the neural network, called the latent space, and the decoder part recreates the original data, therefore F_d \approx F_e^{-1}. An autoencoder is trained on a large set of executable kinematic trajectories represented with DMP parameters \Theta_{DMP,i}, i = 1, ..., m, by optimizing the following criterion:

\theta^{\star} = \arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \left\| \Theta_{DMP,i} - F_d(F_e(\Theta_{DMP,i})) \right\|,    (5)

where \theta^{\star} are the autoencoder parameters (weights and biases of the neurons in the AE network). Once the network is fully trained, the latent space parameters can be computed by applying the encoder part of the network:

\Theta_{AE} = F_e(\Theta_{DMP}),    (6)

and the decoder part can map latent space values back into DMP parameters:

\tilde{\Theta}_{DMP} = F_d(\Theta_{AE}).    (7)

Figure 2: Example of a simple autoencoder network
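To make the dimensionality reduction step concrete, the following is a minimal sketch of an autoencoder of the kind described by Eqs. (5)-(7). The 67-15-10-3-10-15-67 layer sizes follow the network used later in Sec. 4.2; the choice of PyTorch, the tanh activations, the Adam optimizer and the training schedule are our assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn

class DMPAutoencoder(nn.Module):
    """Minimal autoencoder: the encoder implements F_e (Eq. 6),
    the decoder implements F_d (Eq. 7)."""
    def __init__(self, dmp_dim=67, latent_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(            # F_e
            nn.Linear(dmp_dim, 15), nn.Tanh(),
            nn.Linear(15, 10), nn.Tanh(),
            nn.Linear(10, latent_dim),
        )
        self.decoder = nn.Sequential(            # F_d
            nn.Linear(latent_dim, 10), nn.Tanh(),
            nn.Linear(10, 15), nn.Tanh(),
            nn.Linear(15, dmp_dim),
        )

    def forward(self, theta_dmp):
        return self.decoder(self.encoder(theta_dmp))

def train_autoencoder(theta_dmp_dataset, epochs=500, lr=1e-3):
    """Optimize the reconstruction criterion of Eq. (5) over a dataset
    of DMP parameter vectors (torch.Tensor of shape (m, dmp_dim))."""
    model = DMPAutoencoder(dmp_dim=theta_dmp_dataset.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        # || Theta_DMP - F_d(F_e(Theta_DMP)) ||
        loss = loss_fn(model(theta_dmp_dataset), theta_dmp_dataset)
        loss.backward()
        opt.step()
    return model
```

Once trained, `model.encoder` plays the role of F_e in Eq. (6) and `model.decoder` the role of F_d in Eq. (7).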
3 Reward Weighted Policy Learning with NN Extended Importance Sampler

As the RL algorithm of choice we used reward-weighted policy learning with importance sampling (RWPL), which is a simplified variant of the Policy Learning by Weighting Exploration with the Returns (PoWER) method [10]. It uses a parametrized skill policy and a reward function to maximize the expected return of skill performance trials.

Under the assumption that there is only a terminal reward and that only a single basis function is active at any given time (note that this is only approximately true for DMPs), the policy parameters \theta_n are updated using:

\theta_{n+1} = \theta_n + \frac{\langle (\tilde{\theta}_n - \theta_n) R(\tilde{\theta}_n) \rangle_{w(\tilde{\theta}_k)}}{\langle R(\tilde{\theta}_n) \rangle_{w(\tilde{\theta}_k)}},    (8)

where \tilde{\theta}_k denotes the policy parameters of the k-th iteration, \tilde{\Theta}_n = \{\tilde{\theta}_k\}_{k=1}^{n} denotes the set of all policy parameters \tilde{\theta}_k executed until the n-th iteration, and R > 0 is the terminal reward received at the end of each rollout. \langle \cdot \rangle_{w(\tilde{\theta}_k)} denotes importance sampling [11]. Its role is to select a predefined number of best trials to compute the update, in order to reduce the number of iterations until the optimal policy is learned.

In order to increase the convergence rate, we used a neural network (NN) to artificially augment our dataset of parameter-reward pairs. It was trained to take policy parameters as input and output the approximated reward. It was retrained after each iteration, and the training dataset consisted of the n already executed trajectory parameters \tilde{\Theta}_n and the corresponding rewards R(\tilde{\Theta}_n). After the network was trained, we chose p new random sets of parameters and used the NN to approximate their rewards. This way, we generated an extended dataset of parameters \Theta'_n and corresponding rewards R(\Theta'_n). The update rule specified by Eq. (8), applied to the augmented dataset, is equivalent to

\theta_{n+1} = \frac{\sum_{i=1}^{m} R_{in(n',i)} \, \tilde{\theta}_{in(n',i)}}{\sum_{i=1}^{m} R_{in(n',i)}},    (9)

where the function in(n', i) selects the trial with the i-th highest reward from the extended trial set \{\tilde{\theta}_k, R_k\}, m is the number of best trials selected by the importance sampler, and the exploration parameters are computed by adding exploration noise to the current estimate \theta_n,

\tilde{\theta}_n = \theta_n + \varepsilon_n.    (10)

Here \varepsilon_n is zero-mean Gaussian noise. Its variance is usually a diagonal matrix and needs to be specified by the user. In general, a high variance results in a more thorough exploration of the parameter space but may cause oscillations around the solution, while a small variance can get stuck in a local minimum.
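As an illustration of Eqs. (9)-(10) with the NN-extended importance sampler, here is a minimal numpy sketch of one update. It is written under stated assumptions: the reward-prediction network is abstracted as a generic `predict_reward` callable, and the synthetic parameter sets are drawn from a Gaussian around the current estimate, since the paper only states that p random sets are chosen. Function and variable names are illustrative.

```python
import numpy as np

def rwpl_update(theta_n, executed_params, executed_rewards,
                predict_reward, m=5, p=10000, sigma=0.05, rng=None):
    """One RWPL iteration with a NN-extended importance sampler.

    theta_n          : (d,)  current policy parameter estimate
    executed_params  : (n, d) already executed policy parameters
    executed_rewards : (n,)   their terminal rewards R > 0
    predict_reward   : callable mapping (p, d) parameters -> (p,) rewards (the NN)
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta_n.shape[0]

    # Synthesize p virtual parameter sets and let the NN predict their rewards
    # (sampling distribution is an assumption, not specified in the paper).
    virtual_params = theta_n + rng.normal(scale=5 * sigma, size=(p, d))
    virtual_rewards = predict_reward(virtual_params)

    # Extended dataset: real rollouts plus NN-scored samples.
    all_params = np.vstack([executed_params, virtual_params])
    all_rewards = np.concatenate([executed_rewards, virtual_rewards])

    # Importance sampler: keep only the m trials with the highest reward.
    best = np.argsort(all_rewards)[-m:]
    R, Th = all_rewards[best], all_params[best]

    # Reward-weighted average of the best trials, Eq. (9).
    theta_next = (R[:, None] * Th).sum(axis=0) / R.sum()

    # Exploration noise for the next rollout, Eq. (10).
    theta_explore = theta_next + rng.normal(scale=sigma, size=d)
    return theta_next, theta_explore
```

In an actual learning loop, `predict_reward` would be retrained on all executed rollouts before every call, and `theta_explore` would be executed on the robot (or in simulation) to obtain the next real reward.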
4 Experimental Evaluation

As the use-case scenario, we used learning of robotic throwing of a ball into a basket. Throwing was chosen because it has already been widely studied in RL settings and because the reward is easy to define. Our method for accelerating RL algorithms was evaluated in two different experimental setups. In Matlab we used a kinematic model of a 2-DoF planar robot that was throwing a ball at the target (Fig. 1, right). In this simulation, air drag and friction were neglected. With this model, we tested the increase in performance of the RL algorithm when a NN is used to approximate the reward of randomly generated DMP parameters and thus increase the number of examples. As learning in a more reduced space is faster [12], for learning in the latent space of the neural network we used the MuJoCo simulation with complete robot and ball dynamics (Fig. 1, left). In this case, the NN approximated the reward for the corresponding latent space values. Experiments in the dynamic simulation were repeated for 30 different randomly chosen targets, and the results together with the spread of the data are reported.

4.1 Extending the database in DMP space

For accelerating the learning in DMP space, in each iteration we used a NN with 45-50-10-1 neurons in its layers. The size of the NN was determined empirically. The parameters of the input layer were the weights describing the shape of the trajectory for the two joints of the robot. This sets the input of the neural network to:

\theta'_n = \left\{ \{\mathbf{w}_j, y_{0,j}, g_j\}_{j=1}^{l}, \tau \right\},    (11)

where l = 2 is the number of active joints, \tau is the time scaling factor, \mathbf{w}_j = \{w_i\}_{i=1}^{N} with N = 20 are the weights describing the trajectory profile, and y_{0,j}, g_j are the initial and final points of the trajectory for each joint. The output of the neural network was the predicted normalized reward of the throw, based on the distance of the ball's landing spot from the desired target. After each trial, the NN was used to approximate p = 10000 new examples and add them to the dataset of executed trajectories. The length of the importance sampler was set to m = 5. In this experiment, we were throwing at the same randomly chosen target with our approach, which uses the extended database for the importance sampler, and with the regular RWPL algorithm.

4.2 Extending the database in latent space

For accelerating the learning in the latent space of the AE, we used a NN with 3-10-7-5-1 neurons in its layers. In order to obtain the latent space values, a large dataset of executable trajectories was calculated with respect to the kinematic properties of the used Mitsubishi PA10 robot, as described in [5], and an autoencoder with 67-15-10-3-10-15-67 neurons was used. The input and output of the AE network were the DMP parameters for the 3 active DoF of the robot. By setting l = 3 in Eq. (11), 67 input/output values (\Theta_{DMP}) were obtained and the AE network was trained using Eq. (5). Using Eq. (6) we obtained the latent space of the AE (\Theta_{AE}). These parameters were used as the input to our neural network for dataset augmentation. Since in our case there were only 3 latent space values, the input parameters to this neural network were:

\theta'_n = \{\theta_i\}_{i=1}^{3},    (12)

and the output was the predicted normalized reward of the throw with these parameters. As in the experiment in DMP space, the importance sampler length was set to m = 5 and the NN approximated p = 10000 examples. Because learning is much faster in the reduced (latent) space, in this experiment we were able to use the MuJoCo dynamic simulation. For this experiment we chose 30 (same) random targets and learned throwing with our approach and with the regular RWPL algorithm.
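The latent-space augmentation step of Sec. 4.2 could look roughly like the sketch below, which assumes scikit-learn's MLPRegressor as the 3-10-7-5-1 reward network and uniform sampling of the candidate latent values within the range of the already executed ones; the sampling distribution and the training settings are our assumptions, not taken from the paper. The returned extended database is what the importance sampler of Eq. (9) would then operate on, taking the place of the `predict_reward` step in the earlier sketch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def augment_latent_database(latent_params, rewards, p=10000, rng=None):
    """Retrain the 3-10-7-5-1 reward network on all executed rollouts and
    synthesize p additional (latent parameters, predicted reward) pairs.

    latent_params : (n, 3) latent space values of executed rollouts
    rewards       : (n,)   corresponding normalized rewards
    """
    rng = np.random.default_rng() if rng is None else rng

    # Reward model: 3 latent inputs -> hidden layers 10-7-5 -> 1 predicted reward.
    reward_net = MLPRegressor(hidden_layer_sizes=(10, 7, 5), max_iter=2000)
    reward_net.fit(latent_params, rewards)

    # Sample candidate latent values; here uniformly inside the executed range
    # (an assumption -- the paper does not specify the sampling distribution).
    lo, hi = latent_params.min(axis=0), latent_params.max(axis=0)
    candidates = rng.uniform(lo, hi, size=(p, latent_params.shape[1]))
    predicted = reward_net.predict(candidates)

    # Extended database for the importance sampler: real + synthesized pairs.
    ext_params = np.vstack([latent_params, candidates])
    ext_rewards = np.concatenate([rewards, predicted])
    return ext_params, ext_rewards
```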
5 Results

Figure 3 shows the results of throwing in the kinematic simulation. The blue line shows the convergence of the error and the reward for the regular RWPL algorithm, while the red line shows the convergence when the NN approximating the reward from the DMP parameters was used.

Figure 3: Error and reward convergence for the kinematic simulation with neural network synthesized examples (red line) and without neural network synthesized examples (blue line). The neural network connects DMP parameters with the corresponding reward.

The average convergence of the error and the reward, together with the spread of the data, for the case when the NN was used to connect the latent space parameters with the corresponding reward is shown in Fig. 4. The blue line shows the results for the regular RWPL and the red line shows the results of our approach. Since learning in the latent space is already significantly faster than learning in DMP space, the results show a smaller difference in this case. However, Fig. 5 shows that our approach still achieved some increase in the convergence rate, since both the average and the highest number of required iterations were reduced compared with the regular RWPL approach.

Figure 4: Mean error and reward convergence for the dynamic simulation with neural network synthesized examples (red line) and without neural network synthesized examples (blue line). The neural network connects latent space values with the corresponding reward. The shaded area shows the spread of the data among 30 different targets.

Figure 5: Average (left) and maximal (right) number of iterations needed to hit the target with and without neural network synthesized examples. Blue bars present the results of the regular RWPL algorithm and red bars present the results of our approach.

6 Conclusion

The results show that neural networks can accelerate RL by artificially expanding the dataset of known trajectories. The increase in the convergence rate in both DMP and AE space shows that our approach has the potential to be applied in robotic tasks, where every saved learning iteration saves time and reduces wear of the equipment. A much bigger increase in performance is observed in the less reduced DMP space, where the algorithm needs to find more parameters. This was expected, since the AE search space reduction already significantly increases RL performance. However, it should be noted that the AE reduction can only be used when a large dataset of trajectories is available, which is usually not the case. Another benefit of our approach is that fewer parameters need to be tuned for RL, because the NN is retrained after each trial and there is less possibility of the RL algorithm getting stuck in a local minimum.

In the future, we plan to combine this approach with another RL algorithm and use an additional neural network to predict the parameters of the initial trajectory and the reward for each following iteration of learning. We also plan to implement this algorithm on a physical robot instead of only in simulation, and to check whether the trajectories generated with our approach are smoother and safer for execution than the ones found by using only the RL algorithm.

References

[1] F. Stulp, E. A. Theodorou, and S. Schaal, "Reinforcement learning with sequences of motion primitives for robust manipulation," IEEE Transactions on Robotics, vol. 28, no. 6, pp. 1360–1370, 2012.
[2] B. Siciliano and O. Khatib, Springer Handbook of Robotics. Berlin, Heidelberg: Springer-Verlag, 2007.
[3] M. Deniša, A. Gams, A. Ude, and T. Petrič, "Learning compliant movement primitives through demonstration and statistical generalization," IEEE/ASME Transactions on Mechatronics, vol. 21, no. 5, pp. 2581–2594, 2016.
[4] Z. Lončarević, R. Pahič, A. Ude, and A. Gams, "Generalization-based acquisition of training data for motor primitive learning by neural networks," Applied Sciences, vol. 11, p. 1013, 2021.
[5] R. Pahič, Z. Lončarević, A. Gams, and A. Ude, "Robot skill learning in latent space of a deep autoencoder neural network," Robotics and Autonomous Systems, vol. 135, p. 103690, 2021.
[6] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus, "Improving sample efficiency in model-free reinforcement learning from images," 2020.
[7] Z. Lončarević, A. Gams, S. Reberšek, B. Nemec, J. Škrabar, J. Skvarč, and A. Ude, "Specifying and optimizing robotic motion for visual quality inspection," Robotics and Computer-Integrated Manufacturing, vol. 72, p. 102200, 2021.
[8] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
[9] A. Ijspeert, J. Nakanishi, and S. Schaal, "Movement imitation with nonlinear dynamical systems in humanoid robots," in Proceedings 2002 IEEE International Conference on Robotics and Automation, vol. 2, pp. 1398–1403, 2002.
[10] J. Kober and J. Peters, "Policy search for motor primitives in robotics," Machine Learning, vol. 84, pp. 171–203, July 2011.
[11] P. Kormushev, S. Calinon, R. Saegusa, and G. Metta, "Learning the skill of archery by a humanoid robot iCub," in IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp. 417–423, 2010.
[12] Z. Lončarević, R. Pahič, M. Simonič, A. Ude, and A. Gams, "Reduction of trajectory encoding data using a deep autoencoder network: Robotic throwing," in Advances in Service and Industrial Robotics, (Cham), pp. 86–94, Springer International Publishing, 2020.