Advances in Production Engineering & Management
ISSN 1854-6250
Volume 20 | Number 1 | March 2025 | pp 5–17
Journal home: apem-journal.org
https://doi.org/10.14743/apem2025.1.523
Original scientific paper

Reinforcement learning for robot manipulation tasks in human-robot collaboration using the CQL/SAC algorithms

Husaković, A. a, Banjanović-Mehmedović, L. b, Gurdić-Ribić, A. c, Prljača, N. b, Karabegović, I. d
a Eacon doo, Zenica, Bosnia and Herzegovina
b University of Tuzla, Faculty of Electrical Engineering, Tuzla, Bosnia and Herzegovina
c Foundation for Innovation, Technology and Transfer of Knowledge, Tuzla, Bosnia and Herzegovina
d Academy of Sciences and Arts of Bosnia and Herzegovina, Sarajevo, Bosnia and Herzegovina

ABSTRACT
The integration of human-robot collaboration (HRC) into industrial and service environments demands efficient and adaptive robotic systems capable of executing diverse tasks, including pick-and-place operations. This paper investigates the application of Soft Actor-Critic (SAC) and Conservative Q-Learning (CQL), two deep reinforcement learning (DRL) algorithms, for the learning and optimization of pick-and-place actions within HRC scenarios. By leveraging SAC's capability to balance exploration and exploitation, the robot autonomously learns to perform pick-and-place tasks while adapting to dynamic environments and human interactions. Moreover, the integration of CQL ensures more stable learning by mitigating Q-value overestimation, which proves particularly advantageous in offline and suboptimal data scenarios. The combined use of CQL and SAC enhances policy robustness, facilitating safer and more efficient decision-making in continually evolving environments. The proposed framework combines simulation-based training with transfer learning techniques, enabling seamless deployment in real-world environments. The critical challenge of trajectory completion is addressed through a meticulously designed reward function that promotes efficiency, precision, and safety. Experimental validation demonstrates a 100 % success rate in simulation and an 80 % success rate on real hardware, confirming the practical viability of the proposed model. This work underscores the pivotal role of DRL in enhancing the functionality of collaborative robotic systems, illustrating its applicability across a range of industrial environments.

ARTICLE INFO
Keywords: Human-robot collaboration; Robot learning; Deep reinforcement learning; Soft actor-critic algorithm (SAC); Conservative Q-learning (CQL); Robot manipulation tasks
*Corresponding author: lejla.banjanovic-mehmedovic@fet.ba (Banjanović-Mehmedović, L.)
Article history: Received 10 February 2025; Revised 2 March 2025; Accepted 7 March 2025
Content from this work may be used under the terms of the Creative Commons Attribution 4.0 International Licence (CC BY 4.0). Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction
Human-robot collaboration (HRC), driven by Industry 4.0 and 5.0, has become a cornerstone of modern industrial and service robotics, where robots and humans work together to achieve common goals. The human worker is tasked with solving social challenges and making decisions, while the cobot dictates the speed and acceleration of the process [1].
These collaborative environments require robots that are not only efficient and precise but also adaptable to dynamic scenarios involving human interaction [2]. HRC systems, within the framework of Industry 4.0, contribute to improving workflow, accelerating processes, enhancing product quality, and reducing energy consumption and CO2 emissions. In the context of Industry 5.0, HRC is worker-centered, focusing on creating a positive and conflict-free working environment [3].

In the evolution of smart robotic manufacturing, the ability to adapt to uncertainties, changing environments, and dynamic tasks has become a critical necessity, often described as the emergence of program-free robots or learning-enabled robots [4]. Robot learning integrates various machine learning techniques into robotics [5], focusing on enabling robots to learn and perform actions based on environmental inputs. This paradigm addresses challenges such as optimizing policies for real-time decision-making, managing the exploration-exploitation trade-off, incorporating feedback to adapt to evolving tasks and conditions, and ensuring robust transfer of learning from simulations to real-world applications. These factors make robot learning a unique and complex area within machine learning and robotics.

Robotic manipulation is a fundamental challenge in robotics, involving tasks such as grasping, object handling, and assembly. These tasks are inherently complex due to high-dimensional state and action spaces, environmental uncertainties, and the dynamic interactions between the robot, objects, and human collaborators. Incorporating human-cobot collaboration adds another layer of complexity, as robots must effectively and safely interact with humans in shared workspaces. The safety distance between humans and machines, based on real-time changes in various scenarios, ensures smoother and safer collaboration [6]. The paper [7] emphasizes the importance of evaluating collaborative workplaces in both simulation and real-world environments, showing that such evaluations can enhance overall industrial system capacity. This requires a high level of adaptability, precision, and real-time decision-making to ensure seamless task sharing, coordination, and safety in dynamic industrial or service environments [8-10].

Learning for robot manipulation is essential for enabling robots to adapt to complex, unstructured environments and perform tasks that require interacting with diverse objects [11]. Reinforcement learning (RL) has proven effective in training robots to perform intricate manipulation and grasping tasks, as well as assembly and disassembly tasks, which is vital for advancing human-robot collaboration. By equipping robots with the ability to adapt and respond dynamically to their environment and human partners, RL enhances their role in shared workspaces across manufacturing, warehousing, healthcare, and service robotics applications [12].

Applying DRL in continual, dynamic environments is crucial for enabling robots to adapt to real-world uncertainties, optimize decision-making over time, and improve efficiency in tasks such as robotic manipulation and autonomous navigation [8].
Conservative Q-Learning (CQL) and Soft Actor-Critic (SAC) are both off-policy reinforcement learning algorithms tailored for continuous control: SAC optimizes a stochastic policy using entropy regularization to encourage exploration, while CQL focuses on learning conservative Q-values to mitigate overestimation and improve robustness in offline learning [13]. While SAC excels in dynamic environments with abundant data, CQL is particularly useful in settings with limited or suboptimal datasets because it constrains the learned Q-values to avoid risky actions [14].

Hierarchical Reinforcement Learning (HRL) is an AI approach that structures decision-making by breaking complex tasks into manageable sub-tasks, enhancing learning efficiency [15]. An HRL approach enhances human-robot interactions by decomposing complex tasks into smaller sub-tasks, allowing the robot to independently optimize each one while contributing to the overall goal. HRL improves learning efficiency by enabling the robot to solve simpler problems faster, adapts better to human inputs such as gestures or commands, and offers scalability, since new sub-tasks can be added without disrupting the system. It also decomposes complex tasks, such as assembly or assistance in dynamic environments, into manageable components, improving overall performance and flexibility.

This paper focuses on applying the CQL/SAC algorithm to develop a framework for learning and executing two sub-tasks, pick and place actions, in human-robot collaborative settings. The proposed approach addresses several critical challenges, including trajectory optimization and autonomous, safe interaction with human collaborators. By integrating reinforcement learning into the HRC domain, this paper aims to improve the robot's ability to adapt to varying conditions while maintaining task efficiency and safety.

The paper is structured as follows. Section 2 reviews related work on robot manipulation, with a focus on pick-and-place operations and reinforcement learning. Section 3 presents the proposed deep reinforcement learning methodology, detailing the CQL/SAC-based learning framework. Section 4 discusses the system description, including the collaborative robotics platform and reward function design. Results are presented in Section 5, demonstrating the effectiveness of the CQL/SAC algorithm in optimizing pick-and-place actions. Finally, Section 6 presents the conclusion of this study and discusses potential directions for future research.

2. Related work
Precise object grasping is essential for robots, especially in industrial settings. Most learning-driven grasping tasks rely on vision signals, requiring models with strong representational capabilities. Using deep convolutional neural networks and a guided policy search method, the study presented in paper [16] enables robots to learn policies that map raw visual inputs directly to motor commands. The approach is validated on real-world tasks requiring vision-control coordination, such as assembling a bottle cap.

Reinforcement learning (RL) enables robots to learn manipulation and collaboration skills through interaction with their environment and human partners. By optimizing a reward function, robots can develop policies to perform complex tasks and adapt to human inputs.
Deep reinforcement learning combined with deep neural networks such as CNNs is commonly used, showcasing the power of learning-based robot grasping. Mahler et al. used a CNN and DQN for pick-and-transport tasks with an ABB YuMi robot [12]. Mohammed et al. utilized DQN and Q-learning with RGB-D input for target position generation on UR robots [17].

Recent research has demonstrated the effectiveness of DRL in various task-specific applications [18]. Liu et al. [19] designed a robotic pick-and-place system, emphasizing reward shaping to address complex challenges. Rewards were based on the Euclidean distance between an object's current and target configurations, with additional rewards for task completion, significantly improving convergence compared to linear reward methods. The study also demonstrated that the employed learning technique outperformed both PPO and Asynchronous Advantage Actor-Critic (A3C) by directly generating manipulation signals, thereby effectively managing high task complexity. A simulated pick-and-place task with a simple block, using DDPG enhanced with hindsight experience replay (HER), is presented in [20].

In [21], a flexible framework was proposed that integrates motion planning with RL for continuous robot control in cluttered environments. By combining model-free RL with a sampling-based motion planner, this approach minimizes dependency on task-specific knowledge and enables the RL policy to determine when to plan or execute direct actions through reward maximization. Additionally, [22] introduced a framework that leverages demonstrations, unsupervised learning, and RL to efficiently learn complex tasks using only image input.

The SAC algorithm has demonstrated broad applicability in tasks like door opening and block stacking by breaking the task into fundamental steps (e.g., reaching, grasping, turning, and pulling for door opening) and training each step individually [23]. The advantage of SAC in path planning lies in entropy maximization, which encourages the robot to explore its surroundings, while the addition of HER allows the use of past experiences for improved learning and environmental adaptation [24]. In paper [25], a novel framework for sequentially learning vision-based robotic manipulation tasks in offline reinforcement learning settings was presented. Their approach leverages offline data and visual inputs to train adaptable policies for complex, sequential manipulation scenarios, demonstrating improved performance and generalization in robotic applications.

Multi-Agent Deep Reinforcement Learning (MADRL) involves multiple agents learning and interacting in the same environment, either through collaboration or competition. Each agent learns to optimize its own policy while considering the actions of others. It is widely used in robotics and industrial automation for tasks such as coordinated robot control, resource allocation, and adaptive decision-making in complex systems. The paper [26] employs a multi-agent deep reinforcement learning (DRL) approach, which is a significant advancement in the field of production scheduling. This method incorporates an attention mechanism within an advantage actor-critic framework, which is further complemented by a global reward function.

3. Deep reinforcement learning
Deep Reinforcement Learning (DRL) has emerged as a transformative approach, providing adaptive and intelligent solutions for optimizing robotics and production processes in complex industrial environments.
It enhances efficiency by enabling autonomous systems to learn optimal strategies for navigation, manipulation, and workflow automation [27].

Finding the optimal policy in reinforcement learning is influenced by whether the method is model-based or model-free. Model-based approaches estimate transition probabilities p(s′∣s,a) using prior knowledge or state-space searches, while model-free methods learn directly from rewards without modelling system dynamics. Model-free techniques are common in robotics due to the complexity of modelling continuous state spaces.

RL requires balancing exploration (random actions to discover optimal strategies) and exploitation (choosing actions to maximize rewards). This trade-off is managed using a parameter ε, which controls the likelihood of exploring non-optimal actions. On-policy learning adjusts the current policy by mixing optimal and random actions, while off-policy learning uses a target policy for optimization and a behaviour policy for exploration.

Within the RL/DRL approaches, we distinguish several key methods: value-based, policy-based, and actor-critic methods [28]. Value-based methods are generally more suitable for problems with discrete action spaces. For continuous action spaces, policy-based or actor-critic methods are typically preferred.

Value function learning leverages the value function to assess the quality of a given state when the agent follows a specific policy. Consequently, a robotic agent can refine its policy by exploring the action space to identify actions that maximize the estimated value function. Key value-based methods are Q-learning, SARSA, Deep Q-Networks (DQN), and Double Q-learning [29]. Q-learning uses the maximum Q-value of the next state to update Q-values, which can lead to overestimation in some cases. Deep Q-Network (DQN) employs deep neural networks to estimate Q-values, enabling efficient decision-making in environments with complex and high-dimensional state spaces, and it stabilizes training with experience replay and target networks. Double DQN mitigates overestimation bias by utilizing two separate Q-networks: one to determine the best action and another to evaluate its corresponding Q-value, leading to more accurate value estimations. SARSA is an on-policy method, meaning it updates the Q-values based on the actions that the agent actually takes while following its policy, as opposed to Q-learning, which is an off-policy method and updates the Q-values based on the maximum possible future reward.
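To make the value-based update concrete, the following minimal sketch (illustrative, not part of the original study) shows ε-greedy action selection and a single tabular Q-learning update; the array-based Q-table, learning rate, and discount factor are assumed for illustration only.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon):
    # With probability epsilon explore a random action, otherwise exploit argmax Q.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy TD target uses the maximum Q-value of the next state,
    # which is the source of the overestimation tendency noted above.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```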
Policy gradient methods directly optimize the policy in a model-free manner by maximizing the accumulated reward. Policy-based algorithms, compared to value-based methods, typically provide better convergence and can learn stochastic policies. Examples of policy-based methods in reinforcement learning include Proximal Policy Optimization (PPO), Deterministic Policy Gradient (DPG), and Trust Region Policy Optimization (TRPO). They focus on directly optimizing the policy to improve decision-making, rather than learning a value function.

PPO is a policy-based method that improves the stability of policy updates using a clipped objective function. PPO is designed to handle large, complex environments and is widely used due to its efficiency and simplicity. TRPO is another policy-based method that ensures policy updates remain within a trust region to prevent overly large changes that can destabilize learning. It uses a constrained optimization approach, making it more computationally expensive but stable and effective. DPG is a policy gradient method specifically designed for continuous action spaces. It learns a deterministic policy and utilizes a policy gradient approach to update the policy directly without needing a value function for actions.

Actor-Critic algorithms integrate two key components: an actor, responsible for determining actions, and a critic, which assesses the value of the chosen actions. By optimizing both networks concurrently, these algorithms enhance stability and learning efficiency in reinforcement learning. Notable Actor-Critic approaches include Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC). A summary of Actor-Critic algorithms is presented in Table 1.

The main challenges of applying reinforcement learning to manipulation tasks in robotics include the high dimensionality, the continuous action space, and the significant training time required [30]. Model-free approaches like PPO, DDPG and SAC are widely used in robot manipulation. These algorithms excel at learning continuous control policies for complex robotic arms and manipulators. In our research, we used the CQL/SAC algorithm.

Table 1 Characteristics of Actor-Critic algorithms
Algorithm | Type | Key features | Action space
A2C | Synchronous | Advantage function to reduce variance | Discrete and continuous
A3C | Asynchronous | Multiple parallel agents for stability and efficiency | Discrete and continuous
DDPG | Off-policy | Deterministic policy, used for continuous action spaces | Continuous
TD3 | Off-policy | Double Q-learning and target network smoothing to reduce bias | Continuous
SAC | Off-policy | Includes entropy regularization for exploration | Continuous

3.1 CQL/SAC algorithm
SAC is a DRL algorithm built on the Actor-Critic framework and operates as an off-policy method. It is designed to overcome the stability and efficiency limitations present in earlier approaches. SAC utilizes the maximum entropy reinforcement learning paradigm, where the actor seeks to maximize both the expected reward and exploration by maximizing entropy [31].

By employing off-policy updates, SAC allows for faster learning and improved sample efficiency through experience replay, unlike on-policy methods such as Proximal Policy Optimization (PPO), which require fresh data for each gradient update. In contrast, off-policy methods reuse past experiences, significantly enhancing learning efficiency. Unlike PPO and DDPG, SAC employs twin Q-networks alongside a separate actor network, with entropy tuning mechanisms to improve both stability and convergence.

The integration of CQL promotes more stable learning by addressing the issue of Q-value overestimation, which is especially useful when dealing with offline or suboptimal data scenarios [13]. The advantage of combining Conservative Q-Learning (CQL) with Soft Actor-Critic (SAC) lies in the synergy between robust exploration and conservative value estimation.
SAC excels at learning effective policies through its entropy-regularized framework, which promotes balanced exploration and exploitation. However, SAC can sometimes overestimate Q-values, especially in data-limited or offline scenarios. By introducing a conservative penalty, CQL mitigates this issue, resulting in improved training stability and policy robustness.

Compared to PPO and Twin Delayed Deep Deterministic Policy Gradient (TD3), the CQL/SAC model offers advantages in both training time and efficiency. While SAC requires more training time than PPO, it achieves more stable learning, and it converges faster than TD3 due to its entropy-based exploration [8], [32, 33]. Although CQL slightly increases training time, it significantly enhances learning stability [13]. Overall, CQL/SAC is more sample-efficient than PPO and better suited for real-world deployment than TD3, particularly in high-dimensional tasks and offline learning scenarios.

This CQL/SAC hybrid method leads to safer and more efficient decision-making, particularly in dynamic, continuous control tasks where real-world uncertainties and limited data are common challenges [25].

The RL objective for the SAC algorithm is:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t)\sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big] \qquad (1)$$

where $\mathcal{H}$ is the entropy:

$$\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = \mathbb{E}\big[-\log \pi(\cdot \mid s_t)\big] \qquad (2)$$

with the expectation operator $\mathbb{E}$ denoting averaging over all state-action pairs sampled from the trajectory distribution. A temperature parameter α controls the balance between exploration (higher entropy) and exploitation (higher reward); SAC can tune α automatically during training. Ultimately, this objective function aims to optimize the cumulative expected reward while simultaneously promoting exploration by maximizing policy entropy.

SAC follows an actor-critic architecture, where the actor explicitly models the policy, while the critic is solely responsible for guiding the actor's improvement. The critic's role is confined to the training phase, without directly influencing action selection during execution. Two separate critic networks Q1 and Q2 are trained to minimize the Bellman error using the target Q-value:

$$\hat{Q}_{\bar{\theta}_1,\bar{\theta}_2}(s_{t+1}, a_{t+1}) = r_t + \gamma\, \mathbb{E}_{s_{t+1}\sim D,\; a_{t+1}\sim \pi_\phi(\cdot \mid s_{t+1})}\big[\hat{Q}_{min} - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1})\big] \qquad (3)$$

where the actor network $\pi_\phi(a \mid s)$ is trained to maximize the expected return while encouraging exploration using the entropy regularization term. The target Q-value incorporates the minimum of the two Q-values (to mitigate overestimation bias) and the entropy of the policy:

$$\hat{Q}_{min}(s_{t+1}, a_{t+1}) = \min\big[\hat{Q}_{\bar{\theta}_1}(s_{t+1}, a_{t+1}),\; \hat{Q}_{\bar{\theta}_2}(s_{t+1}, a_{t+1})\big] \qquad (4)$$

For each critic network, the Q-loss is computed as the mean squared error between the predicted Q-value and the target Q-value:

$$J_Q(\theta_i) = \tfrac{1}{2}\, \mathbb{E}_{(s_t, a_t)\sim D}\Big[\big(\hat{Q}_{\bar{\theta}_1,\bar{\theta}_2}(s_{t+1}, a_{t+1}) - Q_{\theta_i}(s_t, a_t)\big)^2\Big] \qquad (5)$$

The target networks are slowly updated copies of the critic networks. The target network does not have its own loss function; it serves to provide stable target values for training the critic network.
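To make Eqs. 3 to 5 concrete, the following PyTorch-style sketch computes the twin-critic target and Bellman losses and performs the soft target-network update; the `policy.sample` interface, the replay-batch layout, and the network objects are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, policy, q1, q2, q1_target, q2_target, alpha, gamma=0.99):
    # batch holds (batch, 1)-shaped reward/done tensors sampled from the replay buffer D.
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)            # a_{t+1} ~ pi_phi(.|s_{t+1}), assumed interface
        q_min = torch.min(q1_target(s_next, a_next),
                          q2_target(s_next, a_next))          # Eq. 4: minimum of the two target critics
        target = r + gamma * (1 - done) * (q_min - alpha * logp_next)  # Eq. 3: entropy-regularized target
    # Eq. 5: mean squared Bellman error for each critic
    loss_q1 = 0.5 * F.mse_loss(q1(s, a), target)
    loss_q2 = 0.5 * F.mse_loss(q2(s, a), target)
    return loss_q1 + loss_q2

def soft_update(net, target_net, tau=0.005):
    # Polyak averaging keeps the target networks as slowly updated copies of the critics.
    for p, p_t in zip(net.parameters(), target_net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```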
The policy loss for the actor network is defined as:

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim D,\; a_t\sim \pi_\phi(\cdot \mid s_t)}\Big[\alpha \log \pi_\phi(a_t \mid s_t) - \min\big(Q_{\theta_1}(s_t, a_t^{\pi}),\; Q_{\theta_2}(s_t, a_t^{\pi})\big)\Big] \qquad (6)$$

A schematic view of the SAC algorithm is presented in Fig. 1.

Fig. 1 A schematic view of the SAC algorithm [31]

In offline reinforcement learning (offline RL), the non-Lagrange version of the Conservative Q-Learning (CQL) method is utilized, as proposed in [13]. This approach is beneficial because it seamlessly integrates with existing continuous RL algorithms like Soft Actor-Critic (SAC) by introducing a regularization loss. By incorporating the CQL loss term into Eq. 5, the total Q-loss is expressed as:

$$J_Q^{total}(\theta_i) = J_Q(\theta_i) + \alpha_{cql}\, \mathbb{E}_{s_t\sim D}\Big[\log \sum_{a_t} \exp\big(Q_{\theta_i}(s_t, a_t)\big) - \mathbb{E}_{a_t\sim D}\big[Q_{\theta_i}(s_t, a_t)\big]\Big] \qquad (7)$$

where $\alpha_{cql}$ determines the degree to which the CQL loss is applied to the Q-loss. This penalizes actions that significantly deviate from the existing dataset, ensuring that the learned policy remains conservative in terms of exploration.

The SAC algorithm is highly effective in continuous action spaces and complex tasks that require real-time adaptation, which is critical for industrial applications [9]. Due to these advantages, SAC can significantly contribute to the automation and optimization of pick-and-place operations in industry, making robotics more efficient and adaptable in dynamic production environments [8].
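As an illustration of how the conservative term in Eq. 7 can be estimated, the sketch below computes a log-sum-exp penalty over sampled actions; the uniform action proposals follow the sample count of 10 reported in Section 4.3, while the network interface and sampling scheme are assumptions, not the exact CQL variant used in the study.

```python
import torch

def cql_penalty(q_net, s, a_data, num_samples=10, action_dim=7):
    # Conservative term of Eq. 7: log-sum-exp of Q over sampled actions
    # minus the Q-value of the actions actually present in the dataset D.
    batch = s.shape[0]
    # Uniform action proposals in [-1, 1] (illustrative choice).
    rand_a = torch.empty(batch, num_samples, action_dim, device=s.device).uniform_(-1.0, 1.0)
    s_rep = s.unsqueeze(1).expand(-1, num_samples, -1).reshape(batch * num_samples, -1)
    q_rand = q_net(s_rep, rand_a.reshape(batch * num_samples, -1)).view(batch, num_samples)
    logsumexp_q = torch.logsumexp(q_rand, dim=1)          # log sum_a exp Q(s, a)
    q_data = q_net(s, a_data).squeeze(-1)                 # Q of dataset actions, E_{a~D}[Q(s, a)]
    return (logsumexp_q - q_data).mean()

# Total critic objective, Eq. 7: J_Q_total = J_Q + alpha_cql * cql_penalty(...)
```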
4. System description
The pick-and-place task requires a robotic arm to detect, grasp, and relocate an object to a target position efficiently. Key challenges include accurate perception using sensors (e.g., cameras, depth sensors), motion planning to compute smooth and collision-free trajectories for object grasping and placement, grasp stability to prevent object slippage, and dynamic adaptation to object variations.

This study focuses on enhancing the performance of RL agents in robotic pick-and-place tasks. To initiate this process, the robot interprets human gestures to trigger the intended pick-and-place actions. The block diagram of the system is presented in Fig. 2.

The cobot (myCobot320 by Elephant Robotics) is trained to perform the pick and place tasks separately. The pick range is 10 cm × 15 cm, and the place range is identical but mirrored along the y-axis. The object's initial position is within the pick range, while the place position is randomly assigned within the place range. The trained policies are then combined to execute the tasks sequentially in both simulation (PyBullet) and real-world scenarios. For real-world object detection, an Astra Pro 2 camera uses the HSV color space, with tests conducted on green objects of 2 × 2.5 cm and 5 × 5 cm. Additionally, human collaboration is incorporated, allowing the robot to pause and resume operation based on hand-state recognition using Google's MediaPipe library.

Fig. 2 A schematic view of the pick-and-place task in a human-robot collaboration setting

4.1 Simulated and real-world platform
Robotic systems, with their high-dimensional and continuous action spaces, require extensive training for agents to develop optimal policies. However, online training can be expensive due to factors such as the need for human oversight, potential risks of robot wear and damage, and limited access to training robots, making simulation-based training a practical alternative. Despite its benefits, simulation often suffers from low accuracy when applied to real-world tasks. Transferring policies from simulation to real-world robotics faces several challenges due to the sim-to-real gap, including [30, 34]:
• Model inaccuracies: Simulations often oversimplify physics, neglecting factors like friction, sensor noise, and joint flexibility.
• Perception discrepancies: Simulated sensors lack real-world noise and distortions, leading to poor generalization in tasks like vision-based navigation.
• Actuation latency: Real actuators introduce delays and non-linearities not present in simulations.
• Data distribution shift: Policies trained in simulation may not generalize well to real-world variations.
• Safety and robustness: Errors in real-world execution can damage hardware, and policies must handle unpredictable human interactions.
• Transfer learning: Simulation-trained policies often need fine-tuning in the real world, which can be costly and inefficient.
• Lack of real-world data: Collecting real-world data for training RL policies is resource-intensive.

A major challenge in RL for robotics lies in bridging the sim-to-real gap, as transferring policies learned in simulators like MuJoCo, PyBullet, or Gazebo to real-world environments remains difficult. Addressing this challenge requires not only improving the robustness of RL algorithms but also developing more accurate and reliable simulators [35]. The development of standardized benchmarks and open-source environments, such as OpenAI Gym, Robosuite, and Isaac Gym, has accelerated research in RL for robotic manipulation. These advancements provide suitable training environments for RL, enabling the creation of policies that perform effectively on physical robots.

This paper explores solutions to these challenges, focusing on leveraging RL to enable robots to autonomously learn and execute pick-and-place tasks in both a simulated environment (PyBullet), presented in Fig. 3, and a real-world environment using the myCobot320 by Elephant Robotics, presented in Fig. 4.

Fig. 3 Simulated environment (PyBullet)
Fig. 4 Hardware settings for the real environment

4.2 Human gesture recognition
Human collaboration is implemented by stopping and resuming the work of the real robot based on recognition of the hand state using Google's MediaPipe library. MediaPipe is an open-source framework developed by Google for building cross-platform, real-time perception pipelines, particularly in computer vision and machine learning applications. It utilizes graph-based processing, where modular components (calculators) efficiently handle tasks such as object detection, pose estimation, and gesture recognition.

Google's MediaPipe library can be utilized for gesture recognition by leveraging its pre-trained models for hand tracking. It detects key points on the hand, such as the position of each finger and the palm, allowing for precise gesture recognition in real time. By integrating MediaPipe into robotic systems, gestures can be used as inputs to control the robot's actions, enabling intuitive human-robot interaction. MediaPipe utilizes GPU acceleration and model optimization strategies to enable high-performance inference on edge devices, such as smartphones and embedded systems. Its architecture is designed for fast prototyping and seamless deployment of AI-powered applications across different platforms, while minimizing computational demands.

In the event of a failure (due to unexpected human interventions beyond simple gesture-based commands), the real robot may perform unusual actions based on observations that are currently unseen. Typically, in such situations, we stop the execution.
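As an illustration of how hand-state recognition with MediaPipe could drive the pause/resume logic, the sketch below tracks one hand from a webcam stream and classifies it as open or closed from fingertip landmarks; the open/closed heuristic, the camera source, and the mapping of a closed fist to pausing the cobot are assumptions, not the authors' exact rule.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def hand_is_open(landmarks):
    # Heuristic (assumption): the hand is "open" if most fingertips are
    # farther from the wrist than the corresponding PIP joints.
    wrist = landmarks[0]
    tips, pips = [8, 12, 16, 20], [6, 10, 14, 18]
    def dist(p, q):
        return ((p.x - q.x) ** 2 + (p.y - q.y) ** 2) ** 0.5
    extended = sum(dist(landmarks[t], wrist) > dist(landmarks[p], wrist)
                   for t, p in zip(tips, pips))
    return extended >= 3

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark
            robot_paused = not hand_is_open(lm)   # e.g. a closed fist pauses the cobot
        if cv2.waitKey(1) & 0xFF == 27:           # ESC to quit
            break
cap.release()
```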
4.3 RL agent
In our study, the RL agent leverages key components to learn effective object manipulation within a human-robot collaboration setting. By observing the state, taking actions, and receiving rewards, the agent refines its policy to enhance task performance, such as successful object picking and placement.

The state encapsulates essential details about the environment and the robot's condition, including the end-effector's position and orientation, the real object's position relative to the robot's gripper, the target placement position, the computed desired object position based on the end-effector and real object positions, and the gripper's state (open/closed), resulting in a state size of 22 in total.

The agent operates with continuous actions, such as executing small translations and rotations of the end-effector, adjusting joint angles for articulated control, and managing the gripper's state (open/close) to grasp or release objects, resulting in an action size of 7 in total.

The reward function is designed to encourage efficient task completion while discouraging suboptimal behaviours. Positive rewards are given for actions such as successfully grasping the object, moving it closer to the target, and correctly placing it in the target area. Negative rewards (penalties) are applied for actions like collisions with obstacles, dropping the object before reaching the target, and unnecessary movements that delay task completion.

In the context of the Soft Actor-Critic (SAC) framework, with Conservative Q-learning (CQL) applied on top of it, the policy network plays a crucial role in determining the actions the robot should take based on the current state [8]. We use two Multi-Layer Perceptrons (MLPs), one for the policy (actor) and the other for the Q-value (critic, Q1 and Q2). The structure of both networks is the same except for the input/output layers. Each MLP consists of 3 fully connected hidden layers of size 256, trained with a batch size of 32. The policy network takes the state (observations) of size 22 as input and outputs an action of size 7, representing the actions the robot should take. The Q-value network takes the state and action as an input of size 29 and outputs a Q-value of size 1.

Overall, this neural network architecture within the CQL/SAC algorithm enables the robot to learn a policy that maps states to actions in a continuous space, continuously optimizing its performance over time through reinforcement learning.
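A minimal PyTorch sketch of the described actor and critic MLPs (state size 22, action size 7, three hidden layers of 256 units) is given below; the Gaussian policy head with a clamped log-standard deviation is a standard SAC choice assumed here, since the output parameterization is not specified in the text.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Policy network: state (22) -> action (7); outputs the mean and log-std
    # of a Gaussian action distribution (standard SAC head, assumed).
    def __init__(self, state_dim=22, action_dim=7, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        return self.mean(h), self.log_std(h).clamp(-20, 2)

class Critic(nn.Module):
    # Q-value network: concatenated state and action (22 + 7 = 29) -> scalar Q-value.
    def __init__(self, state_dim=22, action_dim=7, hidden=256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.q(torch.cat([state, action], dim=-1))
```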
Hyperparameters include the state dimension for the policy (actor) set to 22 and the action dimension set to 7. The total number of training epochs is 300 for the place task and 500 for the pick task. Each epoch consists of 1000 steps, where a step represents a single interaction in which the agent takes an action, receives a reward, and moves to the next state. Other hyperparameters include a maximum of 40 steps per episode during evaluation and a training batch size of 32. Evaluation occurs after each epoch (set to 1), using 16 episodes to compute evaluation metrics. An episode starts from an initial state and ends upon reaching a goal, a time limit, or failure. CQL- and SAC-specific hyperparameters include automatic entropy tuning (set to true), CQL alpha (1), CQL temperature (1), discount factor (0.99), target smoothing coefficient tau (0.005), and entropy regularization alpha (0.2). The number of random samples for CQL loss estimation is 10, with CQL version 3. Learning rates are 1e-4 for the policy network and automatic entropy tuning, and 3e-4 for the Q-value network.
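For reference, the reported hyperparameters can be collected into a single configuration dictionary, as in the sketch below; the key names are illustrative, while the values follow the text.

```python
# Hyperparameters as reported in the text; dictionary keys are illustrative.
config = {
    "state_dim": 22,
    "action_dim": 7,
    "epochs_pick": 500,            # 500 epochs x 1000 steps = 500k steps
    "epochs_place": 300,           # 300 epochs x 1000 steps = 300k steps
    "steps_per_epoch": 1000,
    "max_eval_episode_steps": 40,
    "batch_size": 32,
    "eval_every_epochs": 1,
    "eval_episodes": 16,
    "auto_entropy_tuning": True,
    "cql_alpha": 1.0,
    "cql_temperature": 1.0,
    "cql_num_random_actions": 10,
    "cql_version": 3,
    "gamma": 0.99,                 # discount factor
    "tau": 0.005,                  # target smoothing coefficient
    "entropy_alpha": 0.2,          # entropy regularization
    "lr_policy": 1e-4,
    "lr_alpha": 1e-4,              # automatic entropy tuning learning rate
    "lr_q": 3e-4,
}
```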
4.4 Evaluation metrics
Evaluation metrics in robot manipulation are essential for assessing the performance, efficiency, and reliability of robotic systems in handling various tasks. These metrics help quantify how well a robot interacts with objects, executes movements, and completes assigned tasks under different conditions. Some key evaluation metrics include Trial Success Rate (TSR), Task Completion Time, Grasp Success Rate, Path Efficiency, etc. [36]. In this study, we used TSR, which represents the percentage of multi-step tasks completed with 100 % success [37], and the average cumulative reward over multiple episodes.

5. Experimental design, results and discussion
To evaluate CQL/SAC's performance in high-dimensional continuous control tasks, we train a cobot on a pick-and-place problem requiring accurate object detection, grasping, transportation, and precise placement. Figs. 5 and 6 show end-effector positions and object positions for four trajectories and ten trajectories, where green represents a successful trajectory/object placement and red represents a failed trajectory/object placement.

Fig. 5 End-effector positions for four trajectories
Fig. 6 End-effector positions for ten trajectories
Fig. 7 Four successful trajectories, end-effector/object positions
Fig. 8 Ten mixed (successful and failed) trajectories, end-effector/object positions

Fig. 7 and Fig. 8 present trajectories of end-effector positions (dots) and object coordinates (squares) as functions of steps for 4 and 10 trajectories, respectively.

Fig. 9 and Fig. 10 illustrate the best performance metrics of the picking and placing tasks for evaluation and training. Neural networks with these trained weights are used on the real robot for the pick and place tasks, respectively. The pick task is trained for 500k steps and the place task for 300k steps. From Fig. 9 it can be seen that the pick task reaches a 100 % success rate after 280k steps, and the mean rewards during training and evaluation remain stable. Similar results can be concluded from Fig. 10 for the place task, where after around 100k steps we obtain the best performance regarding evaluation success rate, and the training/evaluation rewards remain similar.

Figs. 11 and 12 present comparisons between different runs of the pick and place tasks, respectively. As shown in Fig. 11, we started with 100k steps of training for the pick task (run called `pick_0`). We can see that it did not achieve a successful evaluation success rate (the obtained success rate is 0), even though the other metrics looked promising. The same happened for the other runs (`pick_{1,2,3}`), where we can also see that RL is not deterministic. After expanding the training time from 100k to 500k steps, we obtained the `pick_best` run.

Fig. 9 The best performances during training and evaluation for the pick task
Fig. 10 The best performances during training and evaluation for the place task
Fig. 11 The performance comparison between different runs for the pick task
Fig. 12 The performance comparison between different runs for the place task

Similarly, for the place task, we varied the total number of steps across different runs, from 130k (`place_0` run) and 250k steps (`place_1` run) to 300k steps (`place_best` run), which can be seen in Fig. 12. We observed that although the highest success rate was achieved during evaluation, stability throughout the evaluation steps was not maintained. Therefore, we extended the training time to 300k steps. The success rate achieved was 100 % in simulation and 80 % on the real robot, evaluated over 100 episodes.

6. Conclusion
In this research, we demonstrated the effectiveness of combining CQL/SAC for robotic manipulation in human-robot collaboration settings. By leveraging the combination of CQL and SAC, the robot successfully learned optimal policies for executing intricate tasks, such as object picking and placement. CQL ensures more conservative value estimation, preventing suboptimal actions that could arise due to overestimation, while SAC facilitates efficient exploration and adaptation in uncertain environments. The results show that this hybrid approach significantly enhances the performance and reliability of robotic manipulation, enabling robots to effectively collaborate with humans in shared environments. We assessed this approach through cobot manipulation experiments, demonstrating the transferability of the learned policy from simulation to real-world settings without additional training. This was validated through real robot experiments, confirming that the integration of CQL with SAC enables safer and more reliable policy deployment in human-robot collaborative scenarios.

Currently, the framework is designed for pick-and-place tasks, but future work will focus on expanding its capabilities to a wider range of robotic manipulation tasks. Additionally, future research should explore multi-object detection, enabling the system to handle various shapes in a more dynamic industrial setting. Further advancements will also integrate diverse interaction modalities, including voice commands and text-based collaboration powered by generative AI, such as Large Language Models, to enhance human-robot communication and adaptability.
Acknowledgement
This research is supported within the scope of the projects "Research and Development of Collaborative Intelligence in Service Robots for Industrial Applications", funded by the Federal Ministry of Education and Science, Bosnia and Herzegovina, and "Smart factory enabled by artificial intelligence at the edges of a distributed network - a prerequisite for Industry 5.0", funded by the Ministry of Education and Science of TK, Bosnia and Herzegovina.

References
[1] Javernik, A., Buchmeister, B., Ojsteršek, R. (2022). Impact of Cobot parameters on the worker productivity: Optimization challenge, Advances in Production Engineering & Management, Vol. 17, No. 4, 494-504, doi: 10.14743/apem2022.4.451.
[2] Banjanovic-Mehmedovic, L., Karabegovic, I., Jahic, J., Omercic, M. (2021). Optimal path planning of a disinfection mobile robot against COVID-19 in a ROS-based research platform, Advances in Production Engineering & Management, Vol. 16, No. 4, 405-417, doi: 10.14743/apem2021.4.409.
[3] Li, Y.C., Wang, X. (2024). Human-robot collaboration assembly line balancing considering cross-station tasks and the carbon emissions, Advances in Production Engineering & Management, Vol. 19, No. 1, 31-45, doi: 10.14743/apem2024.1.491.
[4] Liu, Q., Liu, Z., Xiong, B., Xu, W., Liu, Y. (2021). Deep reinforcement learning-based safe interaction for industrial human-robot collaboration using intrinsic reward function, Advanced Engineering Informatics, Vol. 49, Article No. 101360, doi: 10.1016/j.aei.2021.101360.
[5] Peters, J., Lee, D.D., Kober, J., Nguyen-Tuong, D., Bagnell, J.A., Schaal, S. (2016). Robot learning, In: Siciliano, B., Khatib, O. (eds.), Springer handbook of robotics, Springer, Cham, Switzerland, 357-398, doi: 10.1007/978-3-319-32552-1_15.
[6] Xing, H.R. (2024). Optimizing human-machine systems in automated environments, International Journal of Simulation Modelling, Vol. 23, No. 4, 716-727, doi: 10.2507/IJSIMM23-4-CO19.
[7] Ojstersek, R., Javernik, A., Buchmeister, B. (2021). The impact of the collaborative workplace on the production system capacity: Simulation modelling vs. real-world application approach, Advances in Production Engineering & Management, Vol. 16, No. 4, 431-442, doi: 10.14743/apem2021.4.411.
[8] Haarnoja, T., Zhou, A., Abbeel, P., Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, ArXiv, doi: 10.48550/arXiv.1801.01290.
[9] Gu, S., Holly, E., Lillicrap, T., Levine, S. (2017). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, In: Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 3389-3396, doi: 10.1109/ICRA.2017.7989385.
[10] Karabegović, I., Banjanović-Mehmedović, L. (2021). Service robots: Advances and applications, Nova Science Publishers, New York, USA.
[11] Kroemer, O., Niekum, S., Konidaris, G. (2021). A review of robot learning for manipulation: Challenges, representations, and algorithms, ArXiv, doi: 10.48550/arXiv.1907.03146.
[12] Mahler, J., Matl, M., Satish, V., Danielczuk, M., DeRose, B., McKinley, S., Goldberg, K. (2019). Learning ambidextrous robot grasping policies, Science Robotics, Vol. 4, No. 26, Article No. eaau4984, doi: 10.1126/scirobotics.aau4984.
[13] Kumar, A., Zhou, A., Tucker, G., Levine, S. (2020). Conservative Q-learning for offline reinforcement learning, ArXiv, doi: 10.48550/arXiv.2006.04779.
[14] Liu, Y., Wang, C., Zhao, C., Wu, H., Wei, Y. (2024). A soft actor-critic deep reinforcement-learning-based robot navigation method using LiDAR, Remote Sensing, Vol. 16, No. 12, Article No. 2072, doi: 10.3390/rs16122072.
[15] Kosaraju, D. (2024). Hierarchical reinforcement learning: Structuring decision-making for complex AI tasks, International Journal of Science and Healthcare Research, Vol. 7, No. 2, 485-491, doi: 10.52403/ijshr.20220468.
[16] Levine, S., Finn, C., Darrell, T., Abbeel, P. (2016). End-to-end training of deep visuomotor policies, ArXiv, doi: 10.48550/arXiv.1504.00702.
[17] Mohammed, M.Q., Chung, K.L., Chyi, C.S. (2020). Pick and place objects in a cluttered scene using deep reinforcement learning, International Journal of Mechanical and Mechatronics Engineering, Vol. 20, No. 4, 50-57.
[18] Han, D., Mulyana, B., Stankovic, V., Cheng, S. (2023). A survey on deep reinforcement learning algorithms for robotic manipulation, Sensors, Vol. 23, No. 7, Article No. 3762, doi: 10.3390/s23073762.
[19] Liu, D., Wang, Z., Lu, B., Cong, M., Yu, H., Zou, Q. (2020). A reinforcement learning-based framework for robot manipulation skill acquisition, IEEE Access, Vol. 8, 108429-108437, doi: 10.1109/ACCESS.2020.3001130.
[20] Al-Selwi, H.F., Aziz, A.A., Abas, F.S., Zyada, Z. (2021). Reinforcement learning for robotic applications with vision feedback, In: Proceedings of the 2021 IEEE 17th International Colloquium on Signal Processing & Its Applications (CSPA), Langkawi, Malaysia, 81-85, doi: 10.1109/CSPA52141.2021.9377292.
[21] Yamada, J., Lee, Y., Salhotra, G., Pertsch, K., Pflueger, M., Sukhatme, G.S., Lim, J.J., Englert, P. (2020). Motion planner augmented reinforcement learning for robot manipulation in obstructed environments, ArXiv, doi: 10.48550/arXiv.2010.11940.
[22] Zhan, A., Zhao, P., Pinto, L., Abbeel, P., Laskin, M. (2020). A framework for efficient robotic manipulation, ArXiv.
[23] Kwon, G., Kim, B., Kwon, N.K. (2024). Reinforcement learning with task decomposition and task-specific reward system for automation of high-level tasks, Biomimetics, Vol. 9, No. 4, Article No. 196, doi: 10.3390/biomimetics9040196.
[24] Zhao, T., Wang, M., Zhao, Q., Zheng, X., Gao, H. (2023). A path-planning method based on improved soft actor-critic algorithm for mobile robots, Biomimetics, Vol. 8, No. 6, Article No. 481, doi: 10.3390/biomimetics8060481.
[25] Yadav, S.P., Nagar, R., Shah, S.V. (2024). Learning vision-based robotic manipulation tasks sequentially in offline reinforcement learning settings, Robotica, Vol. 42, No. 4, 1715-1730, doi: 10.1017/S0263574724000389.
[26] Liu, A.Y., Yue, D.Z., Chen, J.L., Chen, H. (2024). Deep learning for intelligent production scheduling optimization, International Journal of Simulation Modelling, Vol. 23, No. 1, 172-183, doi: 10.2507/IJSIMM23-1-CO4.
[27] Wei, Z.H., Yan, L., Yan, X. (2024). Optimizing production with deep reinforcement learning, International Journal of Simulation Modelling, Vol. 23, No. 4, 692-703, doi: 10.2507/IJSIMM23-4-CO17.
[28] Wang, Y., Friderikos, V. (2020). A survey of deep learning for data caching in edge network, Informatics, Vol. 7, No. 4, Article No. 43, doi: 10.3390/informatics7040043.
[29] Liu, Z., Liu, Q., Xu, W., Wang, L., Zhou, Z. (2022). Robot learning towards smart robotic manufacturing: A review, Robotics and Computer-Integrated Manufacturing, Vol. 77, Article No. 102360, doi: 10.1016/j.rcim.2022.102360.
[30] Lobbezoo, A., Qian, Y., Kwon, H.-J. (2021). Reinforcement learning for pick and place operations in robotics: A survey, Robotics, Vol. 10, No. 3, Article No. 105, doi: 10.3390/robotics10030105.
[31] Sharifi, N. Mathematical foundation underpinning reinforcement learning, from https://ai.gopubby.com/mathematical-foundation-underpinning-reinforcement-learning-34b304890c33, accessed January 3, 2025.
[32] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal policy optimization algorithms, ArXiv, doi: 10.48550/arXiv.1707.06347.
[33] Fujimoto, S., Hoof, H., Meger, D. (2018). Addressing function approximation error in actor-critic methods, ArXiv, doi: 10.48550/arXiv.1802.09477.
[34] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world, In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, 23-30, doi: 10.1109/IROS.2017.8202133.
[35] Zhu, W., Guo, X., Owaki, D., Kutsuzawa, K., Hayashibe, M. (2023). A survey of sim-to-real transfer techniques applied to reinforcement learning for bioinspired robots, IEEE Transactions on Neural Networks and Learning Systems, Vol. 34, No. 7, 3444-3459, doi: 10.1109/TNNLS.2021.3112718.
[36] Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection, The International Journal of Robotics Research, Vol. 37, No. 4-5, 421-436, doi: 10.1177/0278364917710318.
[37] Hundt, A., Killeen, B., Greene, N., Wu, H., Kwon, H., Paxton, C., Hager, G.D. (2020). "Good Robot!": Efficient reinforcement learning for multi-step visual tasks with sim-to-real transfer, IEEE Robotics and Automation Letters, Vol. 5, No. 4, 6724-6731, doi: 10.1109/LRA.2020.3015448.