Experimental Evaluation of Deep Q-Learning Applied on Pendulum Balancing

Zvezdan Lončarević, Rok Pahič, Gregor Papa, Andrej Gams
All authors are with the Jožef Stefan Institute and the Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
e-mail: zvezdan.loncarevic@ijs.si

Abstract

Autonomy is one of the central issues for future robots, which are expected to operate in continuously changing environments. Reinforcement learning is one of the main approaches to learning in contemporary robotics. With the rise of neural networks in recent studies, the idea of combining neural networks with the classic Q-learning algorithm for learning policies was introduced in the form of the deep Q-learning algorithm. While supervised and unsupervised learning have become widespread within the community, deep Q-learning still remains a black box with respect to parameter tuning, neural network architecture, and training. In this paper we explore and compare training performance using different parameters and different neural network architectures on a simple use case of pendulum balancing.

1 Introduction

Reinforcement learning (RL) is a popular way of solving optimization problems in robotics through trial-and-error interaction with the environment, which relieves humans from tedious programming. Planning of actions is possible for solving decision-making problems with known and determined dynamics, as shown in [1, 2]. However, as this is not always the case, RL is applied to find solutions without a detailed description of the problem, and it is useful for systems with complex dynamics, where it is not possible to model all disturbances and external forces [3]. These model-free reinforcement learning algorithms have been successfully applied to different types of problems [4], and the expansion of neural networks has extended the variety of their applications [5, 6]. However, the architecture of the neural network, the training strategy, and the high number of parameters that need to be tuned for each specific task diminish the benefits of the theoretically reduced need for manual engineering.

In real-world domains, experience must be collected on real physical systems. By using simulations to understand the influence of parameters and training strategies, as well as the possibilities of RL algorithms, it would be possible for real-world systems to learn optimal policies in fewer iterations, thus causing minimal wear of the equipment and reducing the required time. The goal of this paper is to show the influence of parameters on the learning process, so we used a simple inverted pendulum attached to a cart (Figure 1) that is driven by discrete accelerations.

Figure 1: Simulated cart pole used as the experimental environment in MATLAB.

The paper is organized as follows: in the next section, we briefly present the deep Q-learning algorithm. In Section 3, the simulation setup and the parameters of the system are presented. Section 4 presents the obtained results. The paper concludes with a short outlook on the obtained results and suggestions for future work.

2 Deep Q-Learning

Reinforcement learning deals with control policies for agents that interact with unknown environments. Environments can be formalized as Markov Decision Processes (MDPs), described by only four elements.
At each time step the agent changes its state from the current state s_t to a new state s_{t+1} by performing an action a_t, and based on the new state it receives the reward r_t. Based on these values, the Q-learning algorithm [7] approximates the long-term reward, known as the Q-value, obtained if a particular action is performed in a given state. The values are iteratively updated by the equation:

    Q_new(s, a) = Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') − Q_old(s, a) ],    (1)

where Q_old is the approximation before and Q_new after the update, α is the learning rate, γ is the discount factor, and max_{a'} Q_old(s', a') is the maximal approximated value over all actions a' in the resulting state s'. However, this way of updating the Q-value means that actions and states need to be discretized, leading to a Q-table of size S × A, where S is the number of possible states and A is the number of possible actions. Instead, with the deep Q-learning algorithm, Q-values are approximated by a neural network (parametrized by weights and biases collectively denoted by θ). With the use of a neural network, the Q-value approximations, denoted by Q(s, a|θ), are estimated by making a forward pass with the current state of the system as input. By using a neural network, discretization of the states is not required, because the network generalizes beyond the states it was trained on. To avoid divergence and oscillations in learning [8], experienced transitions d_t = {s_t, a_t, r_t, s_{t+1}} are stored in a replay memory D and uniformly sampled into mini-batches of examples for each training pass. The Adam optimizer [9] was used to adapt the learning momentum during training. The general deep Q-learning procedure is given below in Algorithm 1.

Initialize replay memory D
Initialize the neural network approximating the Q-value with random weights and biases θ
for i ∈ [1, number of episodes] do
    Initialize state s_t
    for t ∈ [1, number of steps] do
        With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a|θ)
        Execute a_t, observe the next state s_{t+1} and the reward r_t
        Store transition d_t = (s_t, a_t, r_t, s_{t+1}) in D
        Set s_t = s_{t+1}
        Sample a mini-batch of transitions d_j = (s_j, a_j, r_j, s_{j+1}) from D
        for j ∈ [1, mini-batch size] do
            if s_{j+1} is a terminal state then
                y_j = r_j
            else
                y_j = r_j + γ max_{a'} Q(s_{j+1}, a'|θ)
            end
            Perform one step of training with (y_j − Q(s_j, a_j|θ))^2 as the cost function
        end
    end
end
Algorithm 1: Deep Q-learning algorithm [10]
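To make Algorithm 1 concrete for a four-state, two-action system such as the cart pole considered in the next section, the listing below gives a minimal sketch of the ε-greedy action selection and of one training step on a sampled mini-batch. It is an illustrative Python/PyTorch reimplementation, not the MATLAB code used in this work; the names (q_net, select_action, train_step), the ReLU activation, the replay-memory capacity, the learning rate, and the mini-batch size are assumptions made only for the example.

# Minimal sketch of Algorithm 1 for a 4-state / 2-action cart pole.
# Illustrative PyTorch code, not the authors' MATLAB implementation;
# LR, BATCH, memory size and the ReLU activation are assumed values.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2          # (x, x_dot, phi, phi_dot) -> {-10 N, +10 N}
EPS, GAMMA = 0.05, 0.8               # pair that learned fastest for Network A (Section 4)
LR, BATCH = 1e-3, 32                 # assumed optimizer and mini-batch settings

q_net = nn.Sequential(               # "Network A": 4 x 16 x 2 fully connected layers
    nn.Linear(STATE_DIM, 16), nn.ReLU(), nn.Linear(16, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=LR)   # Adam optimizer [9]
memory = deque(maxlen=10000)         # replay memory D of (s, a, r, s_next, done) tuples

def select_action(state):
    """Epsilon-greedy action selection, as in Algorithm 1."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(torch.tensor(state, dtype=torch.float32)).argmax())

def train_step():
    """One optimization step on a uniformly sampled mini-batch."""
    if len(memory) < BATCH:
        return
    batch = random.sample(memory, BATCH)
    s, a, r, s_next, done = map(
        lambda x: torch.tensor(x, dtype=torch.float32), zip(*batch))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_j = r_j for terminal transitions, r_j + gamma * max_a' Q(s_{j+1}, a') otherwise
        y = r + GAMMA * q_net(s_next).max(1).values * (1.0 - done)
    loss = ((y - q_sa) ** 2).mean()   # squared TD error as the cost function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In this sketch, transitions (s_t, a_t, r_t, s_{t+1}, done) would be appended to memory after each simulation step, and train_step then performs the squared-TD-error update from Algorithm 1 on a uniformly sampled mini-batch.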
3 Experimental evaluation

In order to test the robustness and speed of learning, we modelled the example of a cart pole with a pendulum in the MATLAB environment, as shown in Figure 1. The mass of the simulated cart was M = 1 kg, the mass of the pendulum was m = 0.1 kg, its length was 0.5 m, and it could be moved left or right by applying a force of −10 N or 10 N, respectively. For the state of the system to be fully defined, we used two generalized coordinates: the position along the x-axis and the displacement angle φ. The cart was moving along the x-axis and had to stay within the range x ∈ (−2.6 m, 2.6 m) for balancing to be counted as successful. The displacement angle of the pole φ is the second generalized coordinate and was in the range φ ∈ [−180°, 180°], as shown in Figure 1. The pendulum was set to the initial position {x, ẋ, φ, φ̇} = {0 m, 0 m/s, 0°, 1°/s}, so that the equilibrium state was disturbed in the initial position. The number of possible actions sets the size of the neural network output layer at two neurons (for −10 N and 10 N), and the number of states needed to fully describe the system (x, ẋ, φ, φ̇) sets the input layer size to four neurons. The goal of the learning algorithm was to learn how to balance the pendulum.

In order to accomplish that task, we tested three different neural networks with the architectures shown in Figure 2. With all three networks we tested different combinations of the reinforcement learning parameters (exploration rate ε and discount factor γ). After finding the combination that was able to find the balancing policy most efficiently, we added uncertainty to the angle measurement to simulate sensors in real-world environments and measured the number of iterations for which the policy successfully managed to balance the pendulum.

4 Results

To find the optimal learning strategy, we tested the learning efficiency with different combinations of the ε and γ parameters. Our results have shown that the choice of the neural network is crucial for the performance of policy learning. We tested the learning algorithm on three different networks formed of 4 × 16 × 2 (Network A), 4 × 1024 × 256 × 2 (Network B), and 4 × 16 × 32 × 16 × 8 × 2 (Network C) fully connected layers, as shown in Figure 2. The results have shown that it is crucial for the task to find the smallest possible network in order to achieve good speed of learning and resistance to external perturbations. The bars in Figure 3 show the learning episode in which the algorithm successfully managed to balance the pendulum for at least 300 steps for the shown pairs of parameters for Network A (Figure 2-left) and Network B (Figure 2-middle). The deepest network (Network C, Figure 2-right) did not manage to find any balancing policy for any pair of parameters in 10000 episodes.

Figure 2: Neural network architectures used for approximating the Q-value: a) Network A (4 × 16 × 2), b) Network B (4 × 1024 × 256 × 2), c) Network C (4 × 16 × 32 × 16 × 8 × 2).

Figure 3: Pairs of γ and ε parameters that found the control policy for networks A (left) and B (right) at the iteration shown by the bars. Only the parts of the graphs where a solution was found are shown. Network C did not manage to find the policy with any parameters.

Figure 4: Resistance based on the number of iterations in which the pendulum satisfied the stabilizing criterion, using the control policy learned by networks A (left) and B (right) with applied sensory noise.

By analyzing the results we found that the fastest learning occurred with the parameters {ε, γ} = {0.05, 0.8} for Network A (in 20 iterations) and with {ε, γ} = {0.05, 0.9} for Network B (in only 7 iterations). For the aforementioned cases, the training was stopped after the first success and the resistance of the learned policy was analyzed by adding simulated sensor noise to the reading of the angle φ. Balancing was considered successful for the angle φ ∈ [−12°, 12°], which is why the maximal allowed noise on our simulated sensors was set to the same range. We tested how the number of iterations for which the pendulum was balanced was affected by this noise in both cases (Networks A and B), and the results are shown in Figure 4. The graphs show the mean and standard deviation of the number of iterations in which balancing was successfully performed (tests were repeated 1000 times for statistics). As expected, larger noise reduced the number of successful balancing iterations. The results show that the smaller network is much more robust to wrong readings from the sensors and that it manages to find a better policy.
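The noise-robustness test described above can be summarized by the following sketch. It assumes a hypothetical environment interface (env.reset, env.step) and a uniform noise model on the φ reading, neither of which is specified here; balanced_steps and robustness_curve are illustrative names introduced only for the example.

# Sketch of the noise-robustness test: the learned greedy policy is evaluated while
# the phi reading is perturbed by simulated sensor noise. The environment interface
# (env.reset, env.step) and the uniform noise model are assumptions.
import random
import statistics

import torch

def balanced_steps(q_net, env, noise_deg, max_steps=300):
    """Run one greedy rollout and count steps until the balancing criterion fails."""
    state = env.reset()                      # state = (x, x_dot, phi, phi_dot)
    for step in range(max_steps):
        noisy = list(state)
        noisy[2] += random.uniform(-noise_deg, noise_deg)   # perturb the phi reading
        with torch.no_grad():
            action = int(q_net(torch.tensor(noisy, dtype=torch.float32)).argmax())
        state, done = env.step(action)       # done once |phi| > 12 deg or |x| > 2.6 m
        if done:
            return step
    return max_steps

def robustness_curve(q_net, env, noise_levels, trials=1000):
    """Mean and standard deviation of balanced steps per noise level (cf. Figure 4)."""
    results = {}
    for noise in noise_levels:
        counts = [balanced_steps(q_net, env, noise) for _ in range(trials)]
        results[noise] = (statistics.mean(counts), statistics.stdev(counts))
    return results

Running robustness_curve for several noise amplitudes up to 12° with the policies learned by Networks A and B would yield mean and standard-deviation curves analogous to those in Figure 4.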
5 Conclusion

The results show that the architecture of the neural network is crucial for the success of the task. The size of the neural network should be as small as possible for a solution to be found in a reasonably small number of episodes. For a simple problem such as balancing the pendulum from an initial upright position, high complexity of the neural network negatively affects the speed of the learning process. Our tests show that the performance degrades more when extending the depth than when extending the width of the network. For control problems in an almost linear region, as in the presented use case, there is no need to add large exploration noise, as shown by the cases with the fastest learning (ε ≈ 0.05 for both networks). Our tests with different parameters have shown that the best choice for γ is a value between 0.8 and 1. With success occurring in a small number of learning episodes, deep Q-learning seems to be a promising approach for building controllers for real-world systems.

In future work, we plan to test the balancing on a real-world system and with convolutional neural networks (CNNs), as used in [11, 5], and to extend the problem complexity to finding a strategy able to swing up and balance the pendulum using the same network for both problems (swing-up and balance), or to play the ball-in-a-cup game as was done with the regular Q-learning algorithm in [12]. We also want to examine possible improvements using adaptive learning-rate methods such as RMSProp [13] or ADADELTA [14]. We plan to analyze the vanishing gradient problem [15, 16] to check whether there are methods that would help deeper network architectures learn the policies. By making deep neural networks more robust with such optimization methods, we would be able to train policies for more complicated tasks, as was done with recurrent neural networks [17, 18].

Acknowledgement: The corresponding author is supported by the Public Scholarship, Development, Disability and Maintenance Fund of the Republic of Slovenia with the Ad futura Scholarship for Postgraduate Studies of Nationals of Western Balkan States for Study in the Republic of Slovenia (226. Public Call).

References

[1] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling, "DeepStack: Expert-level artificial intelligence in heads-up no-limit poker," Science, vol. 356, no. 6337, pp. 508–513, 2017.
[2] Y. Tassa, T. Erez, and E. Todorov, "Synthesis and stabilization of complex behaviors through online trajectory optimization," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913, 2012.
[3] R. Pahič, Z. Lončarević, A. Ude, B. Nemec, and A. Gams, "User feedback in latent space robotic skill learning," in 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), pp. 270–276, Nov 2018.
[4] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," in Learning Motor Skills, Springer Tracts in Advanced Robotics, vol. 97, Springer, Cham, 2013.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[6] M. J. Hausknecht and P. Stone, "Deep reinforcement learning in parameterized action space," CoRR, vol. abs/1511.04143, 2016.
[7] C. J. C. H. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[8] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674–690, 1997.
[9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[11] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, and L. Sifre, "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[12] B. Nemec, M. Zorko, and L. Žlajpah, "Learning of a ball-in-a-cup playing robot," in 19th International Workshop on Robotics in Alpe-Adria-Danube Region (RAAD 2010), pp. 297–301, 2010.
[13] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio, "RMSProp and equilibrated adaptive learning rates for non-convex optimization," arXiv preprint, 2015.
[14] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[15] R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the exploding gradient problem," CoRR, vol. abs/1211.5063, 2012.
[16] S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen," Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.
[17] T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana, "Deep reinforcement learning for high precision assembly tasks," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 819–825, IEEE, 2017.
[18] M. J. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," in AAAI Fall Symposia, 2015.