Learning-based local navigation in dynamic environments

Matej Dobrevski, Danijel Skočaj
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani, Večna pot 113, 1000 Ljubljana
E-mail: matej.dobrevski@fri.uni-lj.com

Abstract

Safe navigation is a crucial capability of mobile robots. Unstructured environments with moving obstacles are especially difficult to navigate in. In this paper we train a neural-network-based reinforcement-learning model to perform local navigation and planning in environments containing moving pedestrians. We evaluate our approach against the standard Dynamic Window Approach (DWA) and show that we can outperform it in the presence of dynamic obstacles.

1 Introduction

Mobile robot navigation is an enabling technology for the development of machines that can operate autonomously in unstructured environments. As such, it has received a substantial amount of interest in the research community [6, 4, 18, 12].

Systems for planning the movement of robots in complex environments are usually organized in two stages: a global planner and a local planner. The global planner has access to a map of the whole environment and plans efficient paths to the goal location. The local planner has access to the on-board sensors and ensures that the robot follows the global path while reacting to any changes in the environment, such as new obstacles. The local planner can also serve as a control algorithm, directly generating motor movement commands.

Local planning is especially significant because for some environments maintaining an accurate map can be impractical or impossible, due to frequent changes or a lack of identifiable features. In such cases we can obtain the position of the robot and of the goal with an external localization system, such as GPS, indoor localization beacons, marker detection or another method. In such environments it is possible to navigate using a local planner.

For local planners, one of the most difficult tasks is to plan movement in the presence of dynamic obstacles. While there has been a significant amount of research on this topic, most approaches assume knowledge of the position and velocity of all the dynamic obstacles in the environment [5, 18, 2], which requires an additional perception step in the navigation method.

In this research we model the problem of navigation as a Markov Decision Process (MDP) and train a neural network to perform local navigation and planning in environments densely populated with humans, directly from sensor inputs, joining the steps of perception and navigation in a single network. The network is trained in a reinforcement learning (RL) setting using the state-of-the-art Proximal Policy Optimization (PPO) [16] algorithm. Learning is performed in a simulator with different stages which contain realistically simulated human crowds. We evaluate our method against the established Dynamic Window Approach (DWA) to local planning [6] and show that in the presence of dynamic obstacles it achieves superior results.

2 Related work

Local navigation has long been studied in robotics.
Among the most significant approaches to local navigation are the DWA [6], potential field methods [4] and Velocity Obstacles [5], which have spawned numerous extensions and modifications such as the Global Dynamic Window Approach [3], the Convergent Dynamic Window Approach [12], Harmonic Potential Functions [8], Reciprocal Velocity Obstacles [18] and many more. The approaches that consider dynamic obstacles assume knowledge of the position and velocity of all the moving obstacles in the scene, while we assume only access to the range scanner of the robot and its position relative to the goal.

In recent years learning-based approaches have become popular. In [13] the authors trained a neural network to map laser range measurements and the relative goal position to control commands. However, they use expert demonstrations to train the network in a supervised manner, which can be impractical. Furthermore, they do not address dynamic obstacles.

Reinforcement learning has been extensively applied to the problem of navigation. Until recently these approaches [1, 9, 19] have mainly relied on coarse discretizations of the environment, which can severely limit the accuracy and precision of the navigation. Recently, following improvements in the techniques for training neural networks in a reinforcement learning setting [11, 10, 15, 16], several approaches have been proposed that perform robot navigation directly from sensor readings. In [14] a variant of Q-learning was used to generate safe commands on the basis of a single image. An obstacle avoidance network was trained using the dueling and double-Q learning techniques in [20]. In [21] the authors developed a visual navigation method for known scenes. The work in [17] is most closely related to our approach; however, they do not consider dynamic obstacles in the environment.

3 Learning a navigation policy

3.1 Navigation as a RL problem

We model the problem of navigation in dynamic environments as an MDP defined by (S, A, R, T, γ), where:

- S is the set of all possible observations. In our case an observation s ∈ S consists of three consecutive laser scans of the environment (64 measurements over a 240° field of view) together with the distance and angle to the goal location, as well as the current velocity of the robot.
- A is the set of all possible actions. For our differential-drive robot A = {(v, w) | v ∈ [0, 0.5], w ∈ [−2, 2]}, where v is the translational and w is the angular speed of the robot.
- The reward function R : S → ℝ is constructed such that the robot receives a reward of −0.1 for each time step, a reward of −120 for a collision with an agent or the environment, and a reward of 100 for reaching the goal.
- The transition function T : S × A → S defines how the states evolve, and is implicitly defined by the simulator we use for learning, the robot dynamics and the sensor readings.
- The discount factor is set to γ = 0.99.

For optimization we use the state-of-the-art PPO algorithm, which is a more efficient variant of the Trust Region Policy Optimization (TRPO) [15] algorithm. Both are policy gradient algorithms which directly optimize the policy network. The main difference from vanilla policy gradient methods is that both PPO and TRPO limit the updates to the network coefficients so that the KL divergence between the old and the new policy does not surpass a predetermined limit. As a policy gradient method, PPO is also an on-policy learning method, which means that during an update we can only use transitions that have been generated by the policy we are updating.
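In the most commonly used variant of PPO this limit is enforced indirectly: instead of an explicit KL bound, the probability ratio between the new and the old policy is clipped inside the surrogate objective. The following is a minimal, generic sketch of that clipped loss (written in plain NumPy for readability, not taken from our TensorFlow training code); all variable names are illustrative.

```python
import numpy as np

def ppo_clipped_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Generic PPO clipped surrogate objective [16], returned as a loss.

    log_prob_new / log_prob_old: log pi(a_t | s_t) under the policy being
    updated and under the policy that collected the data; advantages: the
    corresponding advantage estimates (e.g. from GAE). All are 1-D arrays
    over a batch of on-policy transitions.
    """
    ratio = np.exp(log_prob_new - log_prob_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the element-wise minimum removes any incentive to move the
    # policy far away from the one that generated the transitions.
    return -np.mean(np.minimum(unclipped, clipped))
```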
3.2 Network

The policy network is shown in Figure 1. The three most recent laser observations are fed through three convolutional layers and two fully connected (dense) layers. The output of the dense layers is joined with the distance and angle to the goal, as well as the current velocity of the robot. Four further dense layers process the joined representation before the last layer, where we generate the translational and rotational speeds, which have a tanh activation.

The value network, which is used during optimization to calculate the action advantage according to the Generalized Advantage Estimation (GAE) algorithm, shares the same design, the only difference being the output layer, which has a single linear output.

Figure 1: The design of the policy network. Three convolutional layers followed by 5 fully connected ones. The inputs are three consecutive laser scans, the angle and distance to the goal, as well as the current translational and rotational speeds, which are joined with the 2nd fully connected layer.
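To make this architecture concrete, below is a minimal sketch of such a policy network written with the Keras API of TensorFlow. Only the overall structure (three convolutional layers, two dense layers, concatenation with the goal and velocity inputs, four further dense layers and a two-unit tanh output) follows the description above; the filter counts, kernel sizes, hidden-layer widths and hidden activations are placeholder assumptions, as is the rescaling of the tanh outputs to the action ranges of Section 3.1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_policy_network():
    """Sketch of the policy network in Figure 1; layer sizes are assumptions."""
    # Three consecutive 64-beam laser scans, treated as a 1-D signal with 3 channels.
    scans = layers.Input(shape=(64, 3), name="laser_scans")
    # Distance and angle to the goal + current translational and angular speed.
    goal_and_vel = layers.Input(shape=(4,), name="goal_and_velocity")

    x = layers.Conv1D(32, 5, activation="relu")(scans)   # conv layer 1 (sizes assumed)
    x = layers.Conv1D(32, 3, activation="relu")(x)       # conv layer 2
    x = layers.Conv1D(32, 3, activation="relu")(x)       # conv layer 3
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)          # dense layer 1
    x = layers.Dense(128, activation="relu")(x)          # dense layer 2

    x = layers.Concatenate()([x, goal_and_vel])          # join with goal and velocity
    for _ in range(4):                                   # four dense layers on the joined representation
        x = layers.Dense(128, activation="relu")(x)

    # Two outputs with tanh activation; mapping them to v in [0, 0.5] and
    # w in [-2, 2] is assumed to happen outside the network.
    actions = layers.Dense(2, activation="tanh", name="v_w")(x)
    return tf.keras.Model(inputs=[scans, goal_and_vel], outputs=actions)
```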
3.3 Training

To learn the navigation policy we perform a curriculum learning procedure in two phases. In the first phase the robot is placed in a polygon without any obstacles besides the walls of the room (Figure 2, left). The goal location is generated randomly [0, 3] m away from the robot's location. We train the policy for 100K steps in this environment. In the second phase (Figure 2, right) we add static obstacles as well as dynamic obstacles, which are simulated humans whose movement is modeled by the Social Forces Model [7]. In the second phase the goal location is generated randomly [2, 11] m away from the robot. We train the robot in this environment for a further 250K steps.

Figure 2: The two environments representing phase 1 and phase 2 during learning.

We find that this type of training is necessary because in the beginning the robot acts randomly, and if the goal is far away and the environment is cluttered with obstacles, it fails to reach the goal and experience a reward, and thus fails to make progress in learning. Another possibility would be to augment the reward so that the robot receives a small reward for getting closer to the goal; however, this kind of fine-tuning of the reward function can lead to unexpected behaviours and interfere with the main objective of learning.

The robot and the environment are simulated in our extension of the pedsim_ros library (https://github.com/srl-freiburg/pedsim_ros) and controlled through the exposed ROS (https://www.ros.org/) topics. The learning algorithm is implemented in Python and TensorFlow.

4 Evaluation

4.1 Evaluation procedure

We evaluate our learned policy against the performance of the DWA on the four stages shown in Figure 3. We compare against DWA because it is a very popular approach to local navigation and because it is the default navigation method implemented in the navigation stack of our Turtlebot robot. In each stage we generate 100 starting locations and goals, and then evaluate the performance of both approaches. The performance is measured by how many times the goal is reached successfully, by the length of the successful trajectories, and by the duration of the successful episodes. The four stages are:

- Stage 1: An empty room of size 10 m × 5 m. The starting and goal points are generated randomly.
- Stage 2: A room of size 10 m × 5 m with humans entering and exiting in different directions. The starting and goal locations are the same as in Stage 1.
- Stage 3: A large hall of size 20 m × 10 m with randomly generated convex obstacles scattered around. The starting location is generated randomly (so that it is not in collision) and the goal location is generated at a distance in [3, 6] m from the robot.
- Stage 4: The same hall as in Stage 3 but with humans entering and exiting in different directions. The starting and goal locations are at a distance in [1, 4] m from the robot.

Figure 3: Four stages used for evaluation. Stages 1 and 3 contain only static obstacles, while stages 2 and 4 contain both static and dynamic obstacles (humans).

4.2 Results

As can be seen in Table 1, in the first stage both algorithms navigated successfully to all 100 given goals. The distance traveled as well as the time to reach the goal are longer for our policy. In the second stage, when the moving humans are added to the scene, the number of reached goals drops dramatically for both navigation methods; however, the proposed method reaches 6 more goals than the baseline. It should be noted that a perfect score in the stages containing humans is virtually impossible, since the humans move faster than the robot and do not care if they collide with it.

Table 1: Quantitative results of the navigation evaluation. Shown are the number of goals that were reached, the average distance per goal and the average time per goal.

            goals            avg. dist. [m]     avg. time [s]
Stage     DWA   Ours         DWA    Ours        DWA     Ours
1         100   100          2.70   2.84        5.86    7.51
2          71    77          2.55   2.68        5.59    7.77
3          82    94          6.48   7.60       14.24   21.15
4          65    75          3.37   3.56        7.87   10.26

In the third stage, when the static convex obstacles are introduced, the number of reached goals again drops compared to Stage 1. In examining the trajectories we found that DWA fails when the robot tries to pass through tight openings: while avoiding the first obstacle it makes a sharp turn and cannot stop before it hits the second obstacle, which can be outside the field of view of the 240° laser. For our policy the collisions occur when the robot approaches a sharp obstacle head-on. We suspect that, because the laser is reduced to 64 readings, a sharp obstacle can alternate between the two frontal readings of the laser and an oscillatory behaviour occurs.

The fourth stage is the most demanding of all. Perfect performance in this stage is not possible because of the moving agents. As can be seen in Table 1, our policy manages to reach 10 more goals than the baseline. The trajectories are slightly longer for the RL-Policy and the time taken is somewhat longer.

In Figure 4 we visualize three executed trajectories for the same starting and goal locations. We can see that the trajectories generated by the baseline are generally smoother, while our RL-Policy sometimes generates more choppy trajectories.

In evaluating our policy among moving obstacles we can see that it has learned a few different evasion maneuvers: in the presence of a fast-moving human it generally stops until the human has passed, or turns and reduces its speed until the obstacle is gone; in the presence of a slow-moving obstacle it will try to increase its speed and overtake the human. Some of these maneuvers are depicted in Figure 5, where we drive the robot to the same goal location in a 3 m wide corridor with humans moving in different directions.
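For reference, the per-stage numbers in Table 1 amount to simple bookkeeping over the 100 episodes run in each stage. The sketch below shows one possible aggregation; the episode record format (a success flag, the path length in meters and the duration in seconds) is assumed here for illustration and is not our actual evaluation code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Episode:
    success: bool        # goal reached without a collision
    path_length: float   # length of the driven trajectory in meters
    duration: float      # episode duration in seconds

def stage_statistics(episodes: List[Episode]) -> Tuple[int, float, float]:
    """Aggregate one evaluation stage as in Table 1: the number of reached
    goals, and the average distance and time over the successful episodes."""
    reached = [e for e in episodes if e.success]
    n = len(reached)
    avg_dist = sum(e.path_length for e in reached) / n if n else float("nan")
    avg_time = sum(e.duration for e in reached) / n if n else float("nan")
    return n, avg_dist, avg_time
```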
Figure 4: Visualizations of some of the trajectories performed during evaluation. Blue is the RL-Policy and red is the DWA policy.

Figure 5: Visualizations of some of the learned collision avoidance techniques. Blue is the RL-Policy and red are the movements of the humans. Saturation represents the passage of time.

5 Conclusion

In this paper we introduced a new way of performing local navigation in dynamic environments. Our approach is to train a neural network to perform this navigation on the basis of only the readings of a range scanner, the current velocity of the robot, and the distance and angle to the goal. Our approach does not depend on knowledge of the position and velocity of the moving objects in the scene. To train this policy we use a reinforcement learning method, for which we construct our simulator and develop a reward function and a two-stage training procedure. Our evaluation shows that our navigation method can outperform the established DWA approach in the presence of moving humans.

References

[1] Abdel, M., Jaradat, K., Al-rousan, M., Quadan, L.: Reinforcement based mobile robot navigation in dynamic environment. Robotics and Computer-Integrated Manufacturing (1) (2011)
[2] van den Berg, J., Guy, S.J., Lin, M., Manocha, D.: Reciprocal n-body collision avoidance. In: Pradalier, C., Siegwart, R., Hirzinger, G. (eds.) Robotics Research, pp. 3–19. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
[3] Brock, O., Khatib, O.: High-speed navigation using the global dynamic window approach. In: IEEE ICRA (1999)
[4] Dudek, G., Jenkin, M.: Computational Principles of Mobile Robotics. Cambridge University Press, Cambridge (2000)
[5] Fiorini, P., Shiller, Z.: Motion planning in dynamic environments using velocity obstacles. International Journal of Robotics Research (1998)
[6] Fox, D., Burgard, W., Thrun, S.: The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine 4(1), 23–33 (1997)
[7] Helbing, D., Molnár, P.: Social force model for pedestrian dynamics. Phys. Rev. E 51, 4282–4286 (1995)
[8] Kim, J., Khosla, P.K.: Real-time obstacle avoidance using harmonic potential functions. IEEE Transactions on Robotics and Automation 8(3), 338–349 (1992)
[9] Konar, A., Chakraborty, I.G., Singh, S.J., Jain, L.C., Nagar, A.K.: A deterministic improved Q-learning for path planning of a mobile robot. IEEE Transactions on Systems, Man, and Cybernetics: Systems 43(5), 1141–1153 (2013)
[10] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. arXiv:1602.01783 (2016), http://arxiv.org/abs/1602.01783
[11] Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature (2015)
[12] Ogren, P., Leonard, N.E.: A convergent dynamic window approach to obstacle avoidance. IEEE Transactions on Robotics (2005)
[13] Pfeiffer, M., Schaeuble, M., Nieto, J., Siegwart, R., Cadena, C.: From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. In: IEEE ICRA (2017)
[14] Sadeghi, F., Levine, S.: (CAD)2RL: Real single-image flight without a single real image. arXiv:1611.04201 (2016), http://arxiv.org/abs/1611.04201
[15] Schulman, J., Levine, S., Jordan, M., Abbeel, P.: Trust region policy optimization. In: ICML (2015)
[16] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017)
[17] Tai, L., Paolo, G., Liu, M.: Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In: IROS (2017)
[18] van den Berg, J., Lin, M., Manocha, D.: Reciprocal velocity obstacles for real-time multi-agent navigation. In: IEEE ICRA, pp. 1928–1935 (2008)
[19] Wen, S., Chen, X., Ma, C., Lam, H.K., Hua, S.: The Q-learning obstacle avoidance algorithm based on EKF-SLAM for NAO autonomous walking under unknown environments. Robotics and Autonomous Systems (2015)
[20] Xie, L., Wang, S., Markham, A., Trigoni, N.: Towards monocular vision based obstacle avoidance through deep reinforcement learning. In: RSS 2017 Workshop on New Frontiers for Deep Learning in Robotics (2017)
[21] Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: IEEE ICRA (2017)