Experimental Evaluation of Deep Q-Learning Applied on Pendulum Balancing

Zvezdan Lončarević, Rok Pahič, Gregor Papa, Andrej Gams
All authors are with the Jožef Stefan Institute and the Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
e-mail: zvezdan.loncarevic@ijs.si

Abstract

Autonomy is one of the central issues for future robots, which are expected to operate in continuously changing environments. Reinforcement learning is one of the main approaches to learning in contemporary robotics. With the rise of neural networks in recent studies, the idea of combining neural networks with the classic Q-learning algorithm for learning policies was introduced in the form of the deep Q-learning algorithm. While supervised and unsupervised learning have become widespread within the community, deep Q-learning still remains a black box with respect to parameter tuning, neural network architecture, and training. In this paper we explore and compare training performance using different parameters and different neural network architectures on a simple use case of pendulum balancing.

1 Introduction

Reinforcement learning (RL) is a popular way of solving optimization problems in robotics through trial-and-error interaction with the environment, which relieves humans from tedious programming. Planning of actions is possible for solving decision-making problems with known and determined dynamics, as shown in [1, 2]. However, as this is not always the case, RL is applied to find solutions without a detailed description of the problem, and it is useful for systems with complex dynamics, where it is not possible to model all disturbances and external forces [3]. These model-free reinforcement learning algorithms have been successfully applied to different types of problems [4], and the expansion of neural networks has extended the variety of their applications [5, 6]. However, the architecture of the neural network, the training strategy, and the high number of parameters that need to be tuned for each specific task diminish the benefits of the theoretically reduced need for manual engineering.

In real-world domains, experience must be collected on real physical systems. By using simulations to understand the influence of parameters and training strategies, as well as the possibilities of RL algorithms, it would be possible for real-world systems to learn optimal policies in fewer iterations, thus causing minimal wear of the equipment and reducing the required time. The goal of this paper is to show the influence of parameters on the learning process, so we used a simple inverted pendulum attached to a cart (Figure 1) that is driven by discrete accelerations.

Figure 1: Simulated cart pole used as the experimental environment in MATLAB.

The paper is organized as follows: in the next section, we briefly present the deep Q-learning algorithm. In Section 3, the simulation setup and the parameters of the system are presented. Section 4 presents the obtained results. The paper concludes with a short outlook on the obtained results and suggestions for future work.

2 Deep Q-Learning

Reinforcement learning deals with control policies for agents that interact with unknown environments. Environments can be formalized as Markov Decision Processes (MDPs), described by only four elements.
At each time step the agent changes its state from the current state s_t to a new state s_{t+1} by performing an action a_t, and based on the new state it receives the reward r_t. Based on these values, the Q-learning algorithm [7] approximates the long-term reward, known as the Q-value, obtained if a particular action is performed in a given state. The values are iteratively updated by the equation:

    Q_new(s, a) = Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') − Q_old(s, a) ],    (1)

where Q_old is the approximation before and Q_new after the update, α is the learning rate, γ is the discount factor, and max_{a'} Q_old(s', a') is the maximal approximated value over all actions a' in the resulting state s'. However, this way of updating the Q-value means that actions and states need to be discretized, leading to a Q-table of size S × A, where S is the number of possible states and A is the number of possible actions. Instead, with the deep Q-learning algorithm, Q-values are approximated by a neural network (parametrized by weights and biases collectively denoted by θ). With the use of a neural network, the Q-value approximations, denoted by Q(s, a|θ), are estimated by making a forward pass with the current state of the system as input. By using a neural network, discretization of the states is not required, because the network generalizes beyond the states it was trained on. To avoid divergence and oscillations in learning [8], experienced transitions d_t = {s_t, a_t, r_t, s_{t+1}} are stored in a replay memory D and uniformly sampled into mini-batches of examples for each training pass. The Adam optimizer [9] was used to adapt the learning momentum during training. The general deep Q-learning procedure is given below in Algorithm 1.

Initialize replay memory D
Initialize the neural network approximating the Q-value with random weights and biases θ
for i ∈ [1, number of episodes] do
    Initialize state s_t
    for t ∈ [1, number of steps] do
        With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a|θ)
        Execute a_t, observe the next state s_{t+1} and the reward r_t
        Store transition d_t = (s_t, a_t, r_t, s_{t+1}) in D
        Set s_t = s_{t+1}
        Sample a mini-batch of transitions d_j = (s_j, a_j, r_j, s_{j+1}) from D
        for j ∈ [1, mini-batch size] do
            if s_{j+1} is a terminal state then
                y_j = r_j
            else
                y_j = r_j + γ max_{a'} Q(s_{j+1}, a'|θ)
            end
            Perform one step of training with (y_j − Q(s_j, a_j|θ))^2 as the cost function
        end
    end
end
Algorithm 1: Deep Q-learning algorithm [10]
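To make Algorithm 1 concrete for a four-state, two-action system such as the cart pole considered in the next section, the listing below gives a minimal sketch of the ε-greedy action selection and of one training step on a sampled mini-batch. It is an illustrative Python/PyTorch reimplementation, not the MATLAB code used in this work; the names (q_net, select_action, train_step), the ReLU activation, the replay-memory capacity, the learning rate, and the mini-batch size are assumptions made only for the example.

# Minimal sketch of Algorithm 1 for a 4-state / 2-action cart pole.
# Illustrative PyTorch code, not the authors' MATLAB implementation;
# LR, BATCH, memory size and the ReLU activation are assumed values.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2          # (x, x_dot, phi, phi_dot) -> {-10 N, +10 N}
EPS, GAMMA = 0.05, 0.8               # pair that learned fastest for Network A (Section 4)
LR, BATCH = 1e-3, 32                 # assumed optimizer and mini-batch settings

q_net = nn.Sequential(               # "Network A": 4 x 16 x 2 fully connected layers
    nn.Linear(STATE_DIM, 16), nn.ReLU(), nn.Linear(16, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=LR)   # Adam optimizer [9]
memory = deque(maxlen=10000)         # replay memory D of (s, a, r, s_next, done) tuples

def select_action(state):
    """Epsilon-greedy action selection, as in Algorithm 1."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(torch.tensor(state, dtype=torch.float32)).argmax())

def train_step():
    """One optimization step on a uniformly sampled mini-batch."""
    if len(memory) < BATCH:
        return
    batch = random.sample(memory, BATCH)
    s, a, r, s_next, done = map(
        lambda x: torch.tensor(x, dtype=torch.float32), zip(*batch))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_j = r_j for terminal transitions, r_j + gamma * max_a' Q(s_{j+1}, a') otherwise
        y = r + GAMMA * q_net(s_next).max(1).values * (1.0 - done)
    loss = ((y - q_sa) ** 2).mean()   # squared TD error as the cost function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In this sketch, transitions (s_t, a_t, r_t, s_{t+1}, done) would be appended to memory after each simulation step, and train_step then performs the squared-TD-error update from Algorithm 1 on a uniformly sampled mini-batch.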
3 Experimental evaluation

In order to test the robustness and speed of learning, we modelled the example of a cart pole with a pendulum in the MATLAB environment, as shown in Figure 1. The mass of the simulated cart was M = 1 kg, the mass of the pendulum was m = 0.1 kg, its length was 0.5 m, and it could be moved left or right by applying a force of −10 N or 10 N, respectively. For the state of the system to be fully defined, we used two generalized coordinates: the position along the x-axis and the displacement angle φ. The cart was moving along the x-axis and had to stay within the range x ∈ (−2.6 m, 2.6 m) for balancing to be counted as successful. The displacement angle of the pole φ is the second generalized coordinate and was in the range φ ∈ [−180°, 180°], as shown in Figure 1. The pendulum was set to the initial position {x, ẋ, φ, φ̇} = {0 m, 0 m/s, 0°, 1°/s}, so that the equilibrium state was disturbed in the initial position. The number of possible actions sets the size of the neural network output layer at two neurons (for −10 N and 10 N), and the number of states needed to fully describe the system (x, ẋ, φ, φ̇) sets the input layer size to four neurons. The goal of the learning algorithm was to learn how to balance the pendulum.

In order to accomplish that task, we tested three different neural networks with the architectures shown in Figure 2. With all three networks we tested different combinations of the reinforcement learning parameters (exploration rate ε and discount factor γ). After finding the combination that was able to find the balancing policy most efficiently, we added uncertainty to the angle measurement to simulate sensors in real-world environments and measured the number of iterations for which the policy successfully managed to balance the pendulum.

4 Results

To find the optimal learning strategy, we tested the learning efficiency with different combinations of the ε and γ parameters. Our results have shown that the choice of the neural network is crucial for the performance of policy learning. We tested the learning algorithm on three different networks formed of 4 × 16 × 2 (Network A), 4 × 1024 × 256 × 2 (Network B), and 4 × 16 × 32 × 16 × 8 × 2 (Network C) fully connected layers, as shown in Figure 2. The results have shown that it is crucial for the task to find the smallest possible network in order to achieve good speed of learning and resistance to external perturbations. The bars in Figure 3 show the learning episode in which the algorithm successfully managed to balance the pendulum for at least 300 steps for the shown pairs of parameters for Network A (Figure 2-left) and Network B (Figure 2-middle). The deepest network (Network C, Figure 2-right) did not manage to find any balancing policy for any pair of parameters in 10000 episodes.

Figure 2: Neural network architectures used for approximating the Q-value: a) Network A (4 × 16 × 2), b) Network B (4 × 1024 × 256 × 2), c) Network C (4 × 16 × 32 × 16 × 8 × 2).

Figure 3: Pairs of γ and ε parameters that found the control policy for networks A (left) and B (right) at the iteration shown by the bars. Only the parts of the graphs where a solution was found are shown. Network C did not manage to find the policy with any parameters.

Figure 4: Resistance based on the number of iterations in which the pendulum satisfied the stabilizing criterion, using the control policy learned by networks A (left) and B (right) with applied sensory noise.

By analyzing the results we found that the fastest learning occurred with the parameters {ε, γ} = {0.05, 0.8} for Network A (in 20 iterations) and with {ε, γ} = {0.05, 0.9} for Network B (in only 7 iterations). For the aforementioned cases, the training was stopped after the first success and the resistance of the learned policy was analyzed by adding simulated sensor noise to the reading of the angle φ. Balancing was considered successful for the angle φ ∈ [−12°, 12°], which is why the maximal allowed noise on our simulated sensors was set to the same range. We tested how the number of iterations for which the pendulum was balanced was affected by this noise in both cases (Networks A and B), and the results are shown in Figure 4. The graphs show the mean and standard deviation of the number of iterations in which balancing was successfully performed (tests were repeated 1000 times for statistics). As expected, larger noise reduced the number of successful balancing iterations. The results show that the smaller network is much more robust to wrong readings from the sensors and that it manages to find a better policy.
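The noise-robustness test described above can be summarized by the following sketch. It assumes a hypothetical environment interface (env.reset, env.step) and a uniform noise model on the φ reading, neither of which is specified here; balanced_steps and robustness_curve are illustrative names introduced only for the example.

# Sketch of the noise-robustness test: the learned greedy policy is evaluated while
# the phi reading is perturbed by simulated sensor noise. The environment interface
# (env.reset, env.step) and the uniform noise model are assumptions.
import random
import statistics

import torch

def balanced_steps(q_net, env, noise_deg, max_steps=300):
    """Run one greedy rollout and count steps until the balancing criterion fails."""
    state = env.reset()                      # state = (x, x_dot, phi, phi_dot)
    for step in range(max_steps):
        noisy = list(state)
        noisy[2] += random.uniform(-noise_deg, noise_deg)   # perturb the phi reading
        with torch.no_grad():
            action = int(q_net(torch.tensor(noisy, dtype=torch.float32)).argmax())
        state, done = env.step(action)       # done once |phi| > 12 deg or |x| > 2.6 m
        if done:
            return step
    return max_steps

def robustness_curve(q_net, env, noise_levels, trials=1000):
    """Mean and standard deviation of balanced steps per noise level (cf. Figure 4)."""
    results = {}
    for noise in noise_levels:
        counts = [balanced_steps(q_net, env, noise) for _ in range(trials)]
        results[noise] = (statistics.mean(counts), statistics.stdev(counts))
    return results

Running robustness_curve for several noise amplitudes up to 12° with the policies learned by Networks A and B would yield mean and standard-deviation curves analogous to those in Figure 4.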
5 Conclusion

The results show that the architecture of the neural network is crucial for the success of the task. The size of the neural network should be as small as possible for a solution to be found in a reasonably small number of episodes. For a simple problem such as balancing the pendulum from an initial upright position, high complexity of the neural network negatively affects the speed of the learning process. Our tests show that the performance degrades more when extending the depth than when extending the width of the network. For control problems in an almost linear region, as in the presented use case, there is no need to add large exploration noise, as shown by the cases with the fastest learning (ε ≈ 0.05 for both networks). Our tests with different parameters have shown that the best choice for γ is a value between 0.8 and 1. With success occurring in a small number of learning episodes, deep Q-learning seems to be a promising approach for building controllers for real-world systems.

In future work, we plan to test the balancing on a real-world system and with convolutional neural networks (CNNs), as used in [11, 5], and to extend the problem complexity to finding a strategy able to swing up and balance the pendulum using the same network for both problems (swing-up and balance), or to play the ball-in-a-cup game as was done with the regular Q-learning algorithm in [12]. We also want to examine possible improvements using adaptive learning-rate methods such as RMSProp [13] or ADADELTA [14]. We plan to analyze the vanishing gradient problem [15, 16] to check whether there are methods that would help deeper network architectures learn the policies. By making deep neural networks more robust with such optimization methods, we would be able to train policies for more complicated tasks, as was done with recurrent neural networks [17, 18].

Acknowledgement: The corresponding author is supported by the Public Scholarship, Development, Disability and Maintenance Fund of the Republic of Slovenia with the Ad futura Scholarship for Postgraduate Studies of Nationals of Western Balkan States for Study in the Republic of Slovenia (226. Public Call).

References

[1] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling, "DeepStack: Expert-level artificial intelligence in heads-up no-limit poker," Science, vol. 356, no. 6337, pp. 508–513, 2017.
[2] Y. Tassa, T. Erez, and E. Todorov, "Synthesis and stabilization of complex behaviors through online trajectory optimization," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913, 2012.
[3] R. Pahič, Z. Lončarević, A. Ude, B. Nemec, and A. Gams, "User feedback in latent space robotic skill learning," in 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), pp. 270–276, Nov 2018.
[4] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," in Learning Motor Skills, Springer Tracts in Advanced Robotics, vol. 97, Springer, Cham, 2013.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[6] M. J. Hausknecht and P. Stone, "Deep reinforcement learning in parameterized action space," CoRR, vol. abs/1511.04143, 2016.
[7] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[8] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.
[9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[11] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, and L. Sifre, "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[12] B. Nemec, M. Zorko, and L. Žlajpah, "Learning of a ball-in-a-cup playing robot," in 19th International Workshop on Robotics in Alpe-Adria-Danube Region (RAAD 2010), pp. 297–301, 2010.
[13] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio, "RMSProp and equilibrated adaptive learning rates for non-convex optimization," arXiv preprint, 2015.
[14] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[15] R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the exploding gradient problem," CoRR, vol. abs/1211.5063, 2012.
[16] S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen," Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.
[17] T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana, "Deep reinforcement learning for high precision assembly tasks," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 819–825, IEEE, 2017.
[18] M. J. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," in AAAI Fall Symposia, 2015.