Using Neural Networks for Synthesizing Importance Sampler Database in Reinforcement Learning

Zvezdan Lončarević 1,2, Andrej Gams 1,2
1 Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
zvezdan.loncarevic@ijs.si

Abstract

Reinforcement learning is a widely used method for acquiring new skills in robotics. However, it is usually rather slow and many learning iterations are needed before the robot successfully learns the skill. During the learning attempts, the parameters of the actions together with the corresponding rewards are stored and used in the following update. In this paper, we present the possibility of using neural networks for expanding the database containing the knowledge from previous learning iterations. Results on throwing examples show that this can lead to accelerated robot learning with fewer iterations and real-world repetitions.

1 Introduction

In order for robots to move into unstructured environments, adaptation to the current state of the world is critical. One of the approaches to adaptation is robot learning, i.e., the process of task performance improvement over the course of several repetitions [1]. However, robot learning can be complicated and can take a long time. It might also not be safe for the robot or its vicinity. Many approaches have been proposed to improve the speed and safety of robot learning. The literature states that a good starting point for learning is very important [2]. For example, the initial task execution or policy can be acquired by demonstration [3] or by generalization [4]. Another key approach is the reduction of the search space, for example with principal component analysis or autoencoder neural networks [5].

Learning approaches themselves can also have a profound effect on the required number of task iterations. Sample efficiency is a known problem of deep reinforcement learning approaches [6]. However, even different reinforcement learning methods require different amounts of samples. For example, gradient-based methods, such as Reinforce, eNAC and CMA-ES, require several roll-outs for one iteration (update of the policy parameters) [7]. Non-gradient based algorithms, such as PoWER and PI2, do not need to first calculate the gradient, but can update the policy based on a few best attempts and, for example, random noise. The few best attempts in reinforcement learning can be classified as those attempts that collect the most reward.

Figure 1: PA10 robot in MuJoCo dynamic simulation (left) and 2-DoF planar robot in Matlab (right)

For example, the rewards of all task executions are compared, and only the ones with the highest rewards are then used to generate the new iteration. Classifying the attempts, especially when no model is available, can also be time consuming.

In this paper we propose a method that reduces the number of required attempts by training a neural network that maps between policy parameters and the expected reward. Before generating a new set of policy parameters, the algorithm predicts their reward, and then uses this virtually acquired reward, together with the rewards of previous attempts, as the input into the importance sampler. The results show that the approach reduces both the required number of iterations and the highest number of iterations until the task is learned.
We used simulated throwing at a target as the demonstration task. The simulated set-ups, MuJoCo [8] for the dynamic simulation and a planar kinematic simulation in Matlab, are shown in Fig. 1.

2 Search Space Reduction

In order for RL to be applied in robotics, it has to learn fast enough. As neural networks have a fixed number of inputs, we need to represent trajectories so that each example has the same number of parameters. For that purpose, we used Dynamic Movement Primitives and autoencoder networks.

2.1 Dynamic Movement Primitives

The basic idea of Dynamic Movement Primitives (DMPs) [9] is to represent the trajectory with a mass on a spring-damper system, with a learned external acceleration, the forcing term. For each joint space coordinate y, the DMP is based on the following second order differential equation:

\tau^2 \ddot{y} = \alpha_z (\beta_z (g - y) - \tau \dot{y}) + f(x),    (1)

where \tau is the time constant used for time scaling, \alpha_z and \beta_z are damping constants (\beta_z = \alpha_z / 4) that make the system critically damped, and x is the phase variable. The forcing term f(x) encodes the shape of the trajectory from the initial position y_0 to the final configuration g. It is given by:

f(x) = \frac{\sum_{i=1}^{N} \psi_i(x) w_i}{\sum_{i=1}^{N} \psi_i(x)} x,    (2)

\psi_i(x) = \exp\left(-\frac{1}{2\sigma_i^2} (x - c_i)^2\right),    (3)

where c_i are the centers of the radial basis functions \psi_i(x) distributed along the trajectory and \sigma_i their widths. The phase x makes the forcing term f(x) vanish when the goal is reached, as it exponentially converges to 0. Its dynamics are given by

\tau \dot{x} = -\alpha_x x,    (4)

where \alpha_x is a positive constant and x starts at 1 and converges to 0 as the goal is reached. Multiple DoFs are realized by maintaining separate sets of (1)-(3), while a single canonical system given by (4) is used to synchronize them.

2.2 Autoencoder networks

Autoencoder networks are neural networks that are often used for dimensionality reduction. Autoencoders with nonlinear layers are capable of extracting the most relevant features of robotic movements. They are composed of two parts: the encoder and the decoder (Fig. 2). An autoencoder neural network is trained so that the output data \tilde{\Theta}_{DMP} matches the input data \Theta_{DMP} as closely as possible. The encoder part pushes the data through the bottleneck of the neural network, called the latent space, and the decoder part recreates the original data, therefore F_d \approx F_e^{-1}. An autoencoder is trained on a large set of executable kinematic trajectories represented with DMP parameters \Theta_{DMP,i}, i = 1, ..., m, by optimizing the following criterion:

\theta^{\star} = \arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \left\| \Theta_{DMP,i} - F_d(F_e(\Theta_{DMP,i})) \right\|,    (5)

where \theta^{\star} are the autoencoder parameters (weights and biases of the neurons in the AE network). Once the network is fully trained, the latent space parameters can be computed by applying the encoder part of the network:

\Theta_{AE} = F_e(\Theta_{DMP}),    (6)

and the decoder part can map latent space values back into DMP parameters:

\tilde{\Theta}_{DMP} = F_d(\Theta_{AE}).    (7)

Figure 2: Example of a simple autoencoder network
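To make the dimensionality reduction step concrete, the following is a minimal sketch of an autoencoder of the kind described by Eqs. (5)-(7). The 67-15-10-3-10-15-67 layer sizes follow the network used later in Sec. 4.2; the choice of PyTorch, the tanh activations, the Adam optimizer and the training schedule are our assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn

class DMPAutoencoder(nn.Module):
    """Minimal autoencoder: the encoder implements F_e (Eq. 6),
    the decoder implements F_d (Eq. 7)."""
    def __init__(self, dmp_dim=67, latent_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(            # F_e
            nn.Linear(dmp_dim, 15), nn.Tanh(),
            nn.Linear(15, 10), nn.Tanh(),
            nn.Linear(10, latent_dim),
        )
        self.decoder = nn.Sequential(            # F_d
            nn.Linear(latent_dim, 10), nn.Tanh(),
            nn.Linear(10, 15), nn.Tanh(),
            nn.Linear(15, dmp_dim),
        )

    def forward(self, theta_dmp):
        return self.decoder(self.encoder(theta_dmp))

def train_autoencoder(theta_dmp_dataset, epochs=500, lr=1e-3):
    """Optimize the reconstruction criterion of Eq. (5) over a dataset
    of DMP parameter vectors (torch.Tensor of shape (m, dmp_dim))."""
    model = DMPAutoencoder(dmp_dim=theta_dmp_dataset.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        # || Theta_DMP - F_d(F_e(Theta_DMP)) ||
        loss = loss_fn(model(theta_dmp_dataset), theta_dmp_dataset)
        loss.backward()
        opt.step()
    return model
```

Once trained, `model.encoder` plays the role of F_e in Eq. (6) and `model.decoder` the role of F_d in Eq. (7).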
3 Reward Weighted Policy Learning with NN Extended Importance Sampler

As the RL algorithm of choice we used reward-weighted policy learning with importance sampling (RWPL), which is a simplified variant of the Policy Learning by Weighting Exploration with the Returns (PoWER) method [10]. It uses a parametrized skill policy and a reward function to maximize the expected return of skill performance trials.

Under the assumption that there is only a terminal reward and that only a single basis function is active at any given time (note that this is only approximately true for DMPs), the policy parameters \theta_n are updated using:

\theta_{n+1} = \theta_n + \frac{\langle (\tilde{\theta}_n - \theta_n) R(\tilde{\theta}_n) \rangle_{w(\tilde{\theta}_k)}}{\langle R(\tilde{\theta}_n) \rangle_{w(\tilde{\theta}_k)}},    (8)

where \tilde{\theta}_k denotes the policy parameters of the k-th iteration, \tilde{\Theta}_n = \{\tilde{\theta}_k\}_{k=1}^{n} denotes the set of all policy parameters \tilde{\theta}_k executed until the n-th iteration, and R > 0 is the terminal reward received at the end of each rollout. \langle \cdot \rangle_{w(\tilde{\theta}_k)} denotes importance sampling [11]. Its role is to select a predefined number of best trials to compute the update, in order to reduce the number of iterations until the optimal policy is learned.

In order to increase the convergence rate, we used a neural network (NN) to artificially augment our dataset of parameter-reward pairs. It was trained to take policy parameters as input and output the approximated reward. It was retrained after each iteration, and the training dataset consisted of the n already executed trajectory parameters \tilde{\Theta}_n and the corresponding rewards R(\tilde{\Theta}_n). After the network was trained, we chose p new random sets of parameters and used the NN to approximate their rewards. This way, we generated an extended dataset of parameters \Theta'_n and corresponding rewards R(\Theta'_n). The update rule specified by Eq. (8), applied to the augmented dataset, is equivalent to

\theta_{n+1} = \frac{\sum_{i=1}^{m} R_{in(n',i)} \, \tilde{\theta}_{in(n',i)}}{\sum_{i=1}^{m} R_{in(n',i)}},    (9)

where the function in(n', i) selects the trial with the i-th highest reward from the extended trial set \{\tilde{\theta}_k, R_k\}, m is the number of best trials selected by the importance sampler, and the exploration parameters are computed by adding exploration noise to the current estimate \theta_n,

\tilde{\theta}_n = \theta_n + \varepsilon_n.    (10)

Here \varepsilon_n is zero-mean Gaussian noise. Its variance is usually a diagonal matrix and needs to be specified by the user. In general, a high variance results in a more thorough exploration of the parameter space but may cause oscillations around the solution, while a small variance can get stuck in a local minimum.
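As an illustration of Eqs. (9)-(10) with the NN-extended importance sampler, here is a minimal numpy sketch of one update. It is written under stated assumptions: the reward-prediction network is abstracted as a generic `predict_reward` callable, and the synthetic parameter sets are drawn from a Gaussian around the current estimate, since the paper only states that p random sets are chosen. Function and variable names are illustrative.

```python
import numpy as np

def rwpl_update(theta_n, executed_params, executed_rewards,
                predict_reward, m=5, p=10000, sigma=0.05, rng=None):
    """One RWPL iteration with a NN-extended importance sampler.

    theta_n          : (d,)  current policy parameter estimate
    executed_params  : (n, d) already executed policy parameters
    executed_rewards : (n,)   their terminal rewards R > 0
    predict_reward   : callable mapping (p, d) parameters -> (p,) rewards (the NN)
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta_n.shape[0]

    # Synthesize p virtual parameter sets and let the NN predict their rewards
    # (sampling distribution is an assumption, not specified in the paper).
    virtual_params = theta_n + rng.normal(scale=5 * sigma, size=(p, d))
    virtual_rewards = predict_reward(virtual_params)

    # Extended dataset: real rollouts plus NN-scored samples.
    all_params = np.vstack([executed_params, virtual_params])
    all_rewards = np.concatenate([executed_rewards, virtual_rewards])

    # Importance sampler: keep only the m trials with the highest reward.
    best = np.argsort(all_rewards)[-m:]
    R, Th = all_rewards[best], all_params[best]

    # Reward-weighted average of the best trials, Eq. (9).
    theta_next = (R[:, None] * Th).sum(axis=0) / R.sum()

    # Exploration noise for the next rollout, Eq. (10).
    theta_explore = theta_next + rng.normal(scale=sigma, size=d)
    return theta_next, theta_explore
```

In an actual learning loop, `predict_reward` would be retrained on all executed rollouts before every call, and `theta_explore` would be executed on the robot (or in simulation) to obtain the next real reward.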
4 Experimental Evaluation

As the use-case scenario, we used learning of robotic throwing of a ball into a basket. Throwing was chosen because it has already been widely studied in RL settings and because the reward is easy to define. Our method for accelerating RL algorithms was evaluated in two different experimental setups. In Matlab we used a kinematic model of a 2-DoF planar robot that was throwing a ball at the target (Fig. 1, right). In this simulation, air drag and friction were neglected. With this model, we tested the increase in performance of the RL algorithm when a NN is used to approximate the reward of randomly generated DMP parameters and thus increase the number of examples. As learning in a more reduced space is faster [12], for learning in the latent space of the neural network we used the MuJoCo simulation with complete robot and ball dynamics (Fig. 1, left). In this case, the NN approximated the reward for the corresponding latent space values. Experiments in the dynamic simulation were repeated for 30 different randomly chosen targets, and the results together with the spread of the data are reported.

4.1 Extending the database in DMP space

For accelerating the learning in DMP space, in each iteration we used a NN with 45-50-10-1 neurons in its layers. The size of the NN was determined empirically. The parameters of the input layer were the weights describing the shape of the trajectory for the two joints of the robot. This sets the input of the neural network to:

\theta'_n = \left\{ \{\mathbf{w}_j, y_{0,j}, g_j\}_{j=1}^{l}, \tau \right\},    (11)

where l = 2 is the number of active joints, \tau is the time scaling factor, \mathbf{w}_j = \{w_i\}_{i=1}^{N} with N = 20 are the weights describing the trajectory profile, and y_{0,j}, g_j are the initial and final points of the trajectory for each joint. The output of the neural network was the predicted normalized reward of the throw, based on the distance of the ball's landing spot from the desired target. After each trial, the NN was used to approximate p = 10000 new examples and add them to the dataset of executed trajectories. The length of the importance sampler was set to m = 5. In this experiment, we were throwing at the same randomly chosen target with our approach, which uses the extended database for the importance sampler, and with the regular RWPL algorithm.

4.2 Extending the database in latent space

For accelerating the learning in the latent space of the AE, we used a NN with 3-10-7-5-1 neurons in its layers. In order to obtain the latent space values, a large dataset of executable trajectories was calculated with respect to the kinematic properties of the used Mitsubishi PA10 robot, as described in [5], and an autoencoder with 67-15-10-3-10-15-67 neurons was used. The input and output of the AE network were the DMP parameters for the 3 active DoF of the robot. By setting l = 3 in Eq. (11), 67 input/output values (\Theta_{DMP}) were obtained and the AE network was trained using Eq. (5). Using Eq. (6) we obtained the latent space of the AE (\Theta_{AE}). These parameters were used as the input to our neural network for dataset augmentation. Since in our case there were only 3 latent space values, the input parameters to this neural network were:

\theta'_n = \{\theta_i\}_{i=1}^{3},    (12)

and the output was the predicted normalized reward of the throw with these parameters. As in the experiment in DMP space, the importance sampler length was set to m = 5 and the NN approximated p = 10000 examples. Because learning is much faster in the reduced (latent) space, in this experiment we were able to use the MuJoCo dynamic simulation. For this experiment we chose 30 (same) random targets and learned throwing with our approach and with the regular RWPL algorithm.
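The latent-space augmentation step of Sec. 4.2 could look roughly like the sketch below, which assumes scikit-learn's MLPRegressor as the 3-10-7-5-1 reward network and uniform sampling of the candidate latent values within the range of the already executed ones; the sampling distribution and the training settings are our assumptions, not taken from the paper. The returned extended database is what the importance sampler of Eq. (9) would then operate on, taking the place of the `predict_reward` step in the earlier sketch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def augment_latent_database(latent_params, rewards, p=10000, rng=None):
    """Retrain the 3-10-7-5-1 reward network on all executed rollouts and
    synthesize p additional (latent parameters, predicted reward) pairs.

    latent_params : (n, 3) latent space values of executed rollouts
    rewards       : (n,)   corresponding normalized rewards
    """
    rng = np.random.default_rng() if rng is None else rng

    # Reward model: 3 latent inputs -> hidden layers 10-7-5 -> 1 predicted reward.
    reward_net = MLPRegressor(hidden_layer_sizes=(10, 7, 5), max_iter=2000)
    reward_net.fit(latent_params, rewards)

    # Sample candidate latent values; here uniformly inside the executed range
    # (an assumption -- the paper does not specify the sampling distribution).
    lo, hi = latent_params.min(axis=0), latent_params.max(axis=0)
    candidates = rng.uniform(lo, hi, size=(p, latent_params.shape[1]))
    predicted = reward_net.predict(candidates)

    # Extended database for the importance sampler: real + synthesized pairs.
    ext_params = np.vstack([latent_params, candidates])
    ext_rewards = np.concatenate([rewards, predicted])
    return ext_params, ext_rewards
```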
5 Results

Figure 3 shows the results of throwing in the kinematic simulation. The blue line shows the convergence of the error and the reward for the regular RWPL algorithm, while the red line shows the convergence when the NN approximating the reward from the DMP parameters was used.

Figure 3: Error and reward convergence for the kinematic simulation with neural network synthesized examples (red line) and without neural network synthesized examples (blue line). The neural network connects DMP parameters with the corresponding reward.

The average convergence of the error and the reward, together with the spread of the data, for the case when the NN was used to connect the latent space parameters with the corresponding reward is shown in Fig. 4. The blue line shows the results for the regular RWPL and the red line shows the results of our approach. Since learning in the latent space is already significantly faster than learning in DMP space, the results show a smaller difference in this case. However, Fig. 5 shows that our approach still achieved some increase in the convergence rate, since both the average and the highest number of required iterations were reduced compared with the regular RWPL approach.

Figure 4: Mean error and reward convergence for the dynamic simulation with neural network synthesized examples (red line) and without neural network synthesized examples (blue line). The neural network connects latent space values with the corresponding reward. The shaded area shows the spread of the data among 30 different targets.

Figure 5: Average (left) and maximal (right) number of iterations needed to hit the target with and without neural network synthesized examples. Blue bars present the results of the regular RWPL algorithm and red bars present the results of our approach.

6 Conclusion

The results show that neural networks can accelerate RL by artificially expanding the dataset of known trajectories. The increase in the convergence rate in both DMP and AE space shows that our approach has the potential to be applied in robotic tasks, where every saved learning iteration saves time and reduces wear of the equipment. A much bigger increase in performance is observed in the less reduced DMP space, where the algorithm needs to find more parameters. This was expected, since the AE search space reduction already significantly increases RL performance. However, it should be noted that the AE reduction can only be used when a large dataset of trajectories is available, which is usually not the case. Another benefit of our approach is that fewer parameters need to be tuned for RL, because the NN is retrained after each trial and there is less possibility of the RL algorithm getting stuck in a local minimum.

In the future, we plan to combine this approach with another RL algorithm and use an additional neural network to predict the parameters of the initial trajectory and the reward for each following iteration of learning. We also plan to implement this algorithm on a physical robot instead of only in simulation, and to check whether the trajectories generated with our approach are smoother and safer for execution than the ones found by using only the RL algorithm.

References

[1] F. Stulp, E. A. Theodorou, and S. Schaal, "Reinforcement learning with sequences of motion primitives for robust manipulation," IEEE Transactions on Robotics, vol. 28, no. 6, pp. 1360–1370, 2012.
[2] B. Siciliano and O. Khatib, Springer Handbook of Robotics. Berlin, Heidelberg: Springer-Verlag, 2007.
[3] M. Deniša, A. Gams, A. Ude, and T. Petrič, "Learning compliant movement primitives through demonstration and statistical generalization," IEEE/ASME Transactions on Mechatronics, vol. 21, no. 5, pp. 2581–2594, 2016.
[4] Z. Lončarević, R. Pahič, A. Ude, and A. Gams, "Generalization-based acquisition of training data for motor primitive learning by neural networks," Applied Sciences, vol. 11, p. 1013, 2021.
[5] R. Pahič, Z. Lončarević, A. Gams, and A. Ude, "Robot skill learning in latent space of a deep autoencoder neural network," Robotics and Autonomous Systems, vol. 135, p. 103690, 2021.
[6] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus, "Improving sample efficiency in model-free reinforcement learning from images," 2020.
[7] Z. Lončarević, A. Gams, S. Reberšek, B. Nemec, J. Škrabar, J. Skvarč, and A. Ude, "Specifying and optimizing robotic motion for visual quality inspection," Robotics and Computer-Integrated Manufacturing, vol. 72, p. 102200, 2021.
[8] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
[9] A. Ijspeert, J. Nakanishi, and S. Schaal, "Movement imitation with nonlinear dynamical systems in humanoid robots," in Proceedings 2002 IEEE International Conference on Robotics and Automation, vol. 2, pp. 1398–1403, 2002.
[10] J. Kober and J. Peters, "Policy search for motor primitives in robotics," Machine Learning, vol. 84, pp. 171–203, July 2011.
[11] P. Kormushev, S. Calinon, R. Saegusa, and G. Metta, "Learning the skill of archery by a humanoid robot iCub," in IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp. 417–423, 2010.
[12] Z. Lončarević, R. Pahič, M. Simonič, A. Ude, and A. Gams, "Reduction of trajectory encoding data using a deep autoencoder network: Robotic throwing," in Advances in Service and Industrial Robotics, (Cham), pp. 86–94, Springer International Publishing, 2020.