https://doi.org/10.31449/inf.v48i5.5295 Informatica 48 (2024) 121-134 121

A Deep Reinforcement Learning Model-Based Optimization Method for Graphic Design

Qi Guo*, Zhen Wang
School of Art and Design, Henan Institute of Economics and Trade, Zhengzhou, Henan 450000, China
E-mail: guoqi390446118@126.com
*Corresponding author

Keywords: topology optimization, graphic design, buildings, deep reinforcement learning, 2D-3D structure

Received: October 16, 2023

Deep reinforcement learning plays a significant role in optimizing the graphic design and spatial framework of buildings in the worldwide big-data environment, where requirements for building layout and design are increasingly stringent and conventional layout methods are increasingly inadequate. This research puts forward a novel, geometry-based deep learning approach to topology optimization. Deep neural networks characterize the density distribution in the design domain. By employing a geometry-based deep learning representation of the density distribution function, we can avoid the checkerboard phenomenon and ensure a smooth boundary. With a deep reinforcement learning approach, the number of design variables can be drastically reduced. By adjusting the architecture of the neural networks, we can control not only the minimum length scale but also the structural complexity. The proposed model achieves an accuracy of 95% and a computation time of 61 s. The effectiveness of the suggested technique is shown by several 2-dimensional and 3-dimensional numerical examples ranging from minimum-compliance to stress-constrained problems.

Povzetek: Predlagana je nova metoda vzpodbujevalnega učenja za topološko optimizacijo v grafičnih storitvah z uporabo globokih nevronskih mrež.

1 Introduction

In both academia and industry, research on machine learning (ML) and artificial intelligence (AI) has grown significantly over the past ten years. As computer technology improved and the need to evaluate ever larger amounts of data grew, these methods, which were previously undervalued, found renewed recognition. Reinforcement learning (RL) aims to maximize a numerical reward signal by training the system to relate actions to situations. The learner must try each action to discover which is most rewarding, rather than being told which to choose. Reinforcement learning addresses the problem of how agents should learn a strategy that maximizes the cumulative reward through interaction with the environment (Tapeh & Naser, 2022). Figure 1 represents deep reinforcement learning implementation using the interior design model. The work in (Yamaguchi, Nagahama, Ichikawa, & Takadama, 2019) outlines the solution of multi-objective reinforcement learning (MORL) tasks with unknown weights and many conflicting objectives. Research on learning from demonstration continues to grow because it enables robots to quickly acquire new abilities. In inverse reinforcement learning (IRL), demonstrations can help in a number of ways by having the robot attempt to infer the objectives or reward of the human demonstrator (Das, Bechtle, Davchev, Jayaraman, Rai, & Meier, 2021).

Figure 1: Deep reinforcement learning implementation using the interior design model.

The creation of completely autonomous agents that interact with their surroundings to learn the best behaviours and perfect them over time through trial and error is a long-standing goal.
Making AI systems that are responsive and can successfully learn has long been a problem, from software-only agents that can interact with spoken language and multimedia to robots that can perceive and respond to their environment (Zhou, Lee, Diao, Shi, Balyen, & Peto, et al., 2019). RL is a mathematical framework with guiding principles for experience-driven autonomous learning. While earlier iterations of RL had some success, they were fundamentally confined to rather low-dimensional problems and lacked scalability (Cioffi, Travaglioni, Piscitelli, Petrillo, & De Felice, et al., 2020). Because of the globalized nature of the world, AI will have a profound influence on human existence in the future, and it will be a key factor in designers' decision-making processes. Artificial intelligence is fundamentally a tool, and it should exercise its four main responsibilities of anticipation, contemplation, negotiation, and reaction throughout the process of design innovation (Bichu, Hansa, Bichu, Premjani, Flores-Mir, & Vaid, et al., 2021). Each designer has a preference, and a ResNet-based artificial intelligence is suggested as a way to increase decision accuracy while also increasing the effectiveness of design selections based on individual designer preferences. To successfully prevent the negative consequences of designers' decision-making preferences, pattern recognition and decision-making problems are combined (Wang, Tang, Huang, Chen, Zhang, & Huang, 2020). The term "spatial layout design" describes the process of partitioning a given space into several small spaces, or of logically placing certain objects in the area, within the framework of objective and subjective design standards and layout conventions (Bouhamed, Ghazzai, Besbes, & Massoud, 2020). The need for efficient design nowadays cannot be addressed by conventional approaches, which is why researchers are looking into spatial layout design. Designers often use interactive modeling tools or build traditional layouts by hand (Deng & Chen, 2021). The article suggests a novel deep reinforcement learning-based topology optimization technique. The density distribution in the design region is characterized by deep neural networks. In contrast to standard density-based methods, using geometric deep learning to define the density distribution function ensures the smoothness of the boundary and successfully combats the checkerboard phenomenon (Brown, Garland, Fadel, & Li, et al., 2022).

2 Related works

Table 1: Survey of related works

Author: (Zhou, Lee, Diao, Shi, Balyen, & Peto, et al., 2019)
Proposed: In the field of ophthalmology, AI, ML, and DL have been applied to verify medical diagnoses, interpret images, map the cornea, and compute intraocular lenses.
Result: Surveys the existing DL, ML, and AI techniques and their application to glaucoma treatment and to the early identification of AMD, DR, and other eye disorders.
Limitations: One of the main issues in many nations is the shortage of retina specialists and qualified human graders. Analysis of such images can be expensive, time-consuming, and prone to human error as populations grow.

Author: (Cioffi, Travaglioni, Piscitelli, Petrillo, & De Felice, et al., 2020)
Proposed: The research was designed to conduct a thorough analysis of scientific research on the industrial applications of AI and ML.
Result: The significant outcome is the higher quantity of American-published works and the growing interest following the release of Industry 4.0.
Limitations: It is essential to emphasize that this report was generated from only two databases, namely WoS and Scopus, and that only publicly accessible materials were included.

Author: (Bichu, Hansa, Bichu, Premjani, Flores-Mir, & Vaid, et al., 2021)
Proposed: The PRISMA-ScR standards were followed in the scoping assessment of the research.
Result: The fields of diagnosis and treatment planning, development assessment, and treatment outcome evaluation were examined.
Limitations: Some AI applications could have failed to appear in PubMed because of the inclusion rules, the search terms used, or publication in a language other than English.

Author: (Wang, Tang, Huang, Chen, Zhang, & Huang, 2020)
Proposed: The study developed the DNN framework and the RL state space, action space, and multiple incentives.
Result: The northeast power grid and the 36-node China Electric Power Research Institute (CEPRI) system are utilized to verify the efficacy of the technique.
Limitations: The adjustment effect can be enhanced by raising the number of adjustment steps per sample, but doing so could extend the learning and adjustment period.

Author: (Bouhamed, Ghazzai, Besbes, & Massoud, 2020)
Proposed: The Deep Deterministic Policy Gradient (DDPG) was developed to enable the UAV to navigate over obstacles in a continuous area.
Result: The UAV uses the DDPG in a continuous movement space to navigate over obstacles and reach its designated destination.
Limitations: The limited dimensions of mobility and action space for UAVs could lower their effectiveness in dealing with everyday environments.

Author: (Deng & Chen, 2021)
Proposed: A policy-based RL model was developed in the investigation to depict the behaviour of controlling the thermostat and clothing level; an MDP was used to simulate the individuals' behaviour.
Result: The behaviour of building occupants could be predicted reasonably well using the RL framework and transfer learning.
Limitations: A limitation of the research was the prediction difference of the RL occupant behaviour model, which could only be partially justified.

Author: (Brown, Garland, Fadel, & Li, et al., 2022)
Proposed: An RL agent sequentially decides, in the specified environment, whether to create a topology by eliminating components so as to most effectively satisfy compliance minimization requirements.
Result: These results indicate that deep RL agents can acquire generalized design techniques to satisfy multi-objective design requirements.
Limitations: Testing was done on the agent using a number of standard load instances, some of which it did not see during training.

Contribution of the study

This research contributes by demonstrating an implementation of topology optimization whose effectiveness is increased by deep reinforcement learning, and by showing the field's relevance to decision-making through trial and error. The particular accomplishments of this paper are:
• The approach of interior design based on a specific learning method is evaluated.
• To support the mathematical method of topology optimization, which seeks an optimized material layout within a given design space, and to assess the effectiveness of the process, an efficient deep reinforcement learning component is suggested.

3 Application of deep learning in graphic design

The article presented an approach to electrical drive controller design that uses deep reinforcement learning techniques.
To effectively forecast the behavior of building occupants with high scalability and without the requirement for data gathering, the RL model was integrated with transfer learning (Ding & Cerpa, 2020). The article investigated how the design of 2D discretized topologies is automated by having RL agents apply optimal sequences of actions, learned from prior experience, to accomplish a goal. An RL agent may build a topology in the given environment by sequentially deciding which parts should be removed to best achieve compliance reduction goals (Zhang, Chen, Bernstein, Chintala, Graf, Jin, & Biagioni, et al., 2022). Another article applied a series of lessons learned: by using a group of environment-conditioned neural networks, it was able to learn the dynamics of the building, after which a new control technique called Model Predictive Path Integral was used. The approach was assessed on EnergyPlus models of a five-zone office complex and, according to the report, can save 8.23% more energy than the most advanced system while keeping a comparable level of thermal comfort (Zhang, Chintala, Bernstein, Graf, & Jin, et al., 2020). The study sought a tailored scanning strategy, learned using reinforcement learning (RL), to determine the angles and the dose for each selected angle for each patient. Modern deep RL techniques are used in the study to formulate the CT scanning procedure and then solve it. In addition to producing improved reconstruction outcomes, the learned tailored scanning technique also exhibits great generalizability when used in conjunction with other reconstruction algorithms (Shen, Wang, Yang, & Dong, et al., 2020). The research downplayed the significance of sampling when determining the Q-return function, ensuring that the built-in techniques are more likely to acquire high-value experiences while being more resilient (Li, Zhu, Zhou, Feng, & Feng, et al., 2022). Research enhanced the Building Information Model (BIM) system and Python development tools, enabling cross-platform collaborative deep learning on computers and further design effort. The architectural design methodology of the BIM system and the interior design research carried out on the BIM building data platform were assessed in the article using real-world examples (Luong & Pham, 2021). The demand for interior space design has risen quickly along with the rate at which people are purchasing homes, and computer science and technology have great potential in the domain of autonomous interior space design. The corresponding study suggested an automated way of designing spatial areas using convolutional neural networks (CNN) (Wu & Feng, 2022). The article investigated the CNN technique as a quick and effective approach: starting from the predicted living room, the automated arrangement of the internal spaces is finished iteratively. The paper examined several empirical interior design case studies, showing that this approach produced results similar to professional designers' interior design floor plans (Predić, Manić, Saračević, Karabašević, & Stanujkić, 2022). Research compared four different machine learning (ML) models created for river flow forecasting in the semi-arid region of Iraq. The impact of data division on the development of the ML models was investigated, and three data division scenarios (70%-30%, 80%-20%, and 90%-10%) were examined.
To evaluate how well the models perform, several statistical indicators are computed (Tao, Al-Sulttani, Salih Ameen, Ali, Al-Ansari, Salih, & Mostafa, 2020). Using a 90%-10% data division, the article demonstrated the benefits of a hybrid support vector regression model with a genetic algorithm over current machine learning forecasting models for monthly river flow predictions; it was also found to increase the accuracy of high-flow event predictions (Zhong, Zhang, Zhang, & Zhang, 2022). In the related study, the support vector regression (SVR) model's internal parameters are tuned by the optimizer, which results in a robust learning process; compared to earlier hybrid models, the article improved the ability to predict stochastic river flow behavior (Xu, Zhang, Liu, Nie, Su, Nie, & Zhang, 2019). The research compared the design of Adaptive Cruise Control (ACC) using Model Predictive Control (MPC) and Deep Reinforcement Learning (DRL) in car-following scenarios (Lin, McPhee, & Azad, 2019). The research found the DRL approach to be comparable to MPC with a large enough prediction horizon when modeling errors disappear and the testing inputs lie within the range of the training data (Zhu, Wang, Pu, Hu, Wang, & Ke, 2019). The study observed that DRL control performance declines when testing inputs are outside the training data range, which is a sign that machine learning generalization is insufficient (Chen, Tong, Zheng, Samuelson, & Norford, 2020). Focusing on constraint optimization and multi-objective optimization, the investigation provides an innovative perspective on design progress in the data age. After verifying the quality of the non-dominated solution set, optimizing convergence, uniformity, and extensiveness, analyzing the experimental process, and drawing a multi-objective conclusion, it is determined that additional optimization of the interior and spatial structure is necessary for artificial intelligence decision-making in the case of the Library of Highly Cold Lands (Ran & Dong, 2022). Research provided a layout boundary or layout space from which a layout plan is automatically generated. According to the findings, the scene redirection solution has been tested successfully, and the redirection algorithm's efficacy is shown by comparison with the outcomes of uniform scaling (Wu, 2022). The study simulated two reinforcement learning agents in a cooperative learning setting to discover the ideal 3D layout under a Markov decision process (MDP) formulation. The article reports tests on a large dataset of actual interior layouts, which includes industrial designs created by qualified designers. The numerical findings suggest the model produces layouts of superior quality compared with the most recent models (Di & Yu, 2021).

4 Materials and method

Graphic design has been around since the beginning of time. Books, periodicals, packaging, newspapers, banners, emblems, and many more things all benefit from graphic design in some way. Graphic design, topology optimization, our suggested deep reinforcement learning approach, and the performance assessment of this graphic design are the primary topics covered in this chapter.

4.1 Graphic design

According to a widely held belief, visual design is the art and skill of giving various words and graphics an orderly, practical, and appealing framework.
Both the act (verb) and the product (noun) of visual art are related concepts. Traditional graphic design is a kind of "all design" employed in the creation of different platforms. The logical and practical aesthetics that developed in conventional graphic design for media over the years are the foundation of contemporary visual graphic design, which is today employed across multiple fields such as industrial layout, information architecture, message styling, and more. Table 2 displays the types of graphic design.

Table 2: Types of graphic designs
(i) Visual identification
(ii) Promotion and marketing
(iii) Interface for users
(iv) Newspaper
(v) Packaging
(vi) Movements
(vii) Environmental
(viii) Visual compositions

Graphic design has been known by many different names over the last two centuries, including artistic works, advertising material, digital marketing, graphics, and visuals. This demonstrates how the range of methods used to convey information has broadened beyond the traditional visual arts. The 2D graphic arts include book arts, calligraphy, lithography, cinematography, printing, and typography. Applications, experience-based design, interaction methods, user-centered design, and websites are just some of the newer areas that the graphic arts have expanded to include. The number of design-related discussions is growing at an astounding rate, and there is training and schooling in graphic design all around the globe, at all levels. Figure 2 depicts the graphic model of the building structure in DRL.

Figure 2: Graphic design of building in DRL

4.2 Topology optimization

Topology optimization as a construction tool is rarely implemented in the design of buildings, since it usually requires a laborious procedure to produce results that meet the standards of a designer. Yet, that difficulty should not prevent builders from trying out these tools in building design. The density-based approach converts the material distribution into a finite-element spatial configuration: the design domain is discretized into elements of varying densities. In the well-established SIMP method, the mesh is used to represent the density spatially, yielding an optimized layout whose boundary follows the element edges. Consequently, considerable post-processing work is needed to obtain a smooth CAD model, which can reduce the geometric accuracy near the boundary. Because the mesh is employed to describe the structural topology, the number of design parameters is usually very large for 3D design, and many mature optimization strategies are not appropriate for such large-scale problems. In this section, we describe a novel approach to density representation that resolves these issues by using a feed-forward neural network. A high-fidelity feed-forward neural network can represent a complex shape while ensuring a smooth surface throughout. Thus, a deep feedforward network is a natural choice for representing the density field in the design domain. Figure 3 contrasts three feedforward neural networks, each having three hidden layers but a different number of neurons per layer, and Figure 4 displays the outcomes of the training.

Figure 3: Feed-forward neural network design structure

Figure 4: Outcome of the building training function

A properly defined density field is one in which the element densities fall within the interval [0, 1].
The density distribution in the design domain is defined by a deep feedforward network whose input is the coordinates of a point; the output is the density value at that location. The following mapping function $\mathcal{M}$ is used to keep the output density within the range [0, 1]:

$\mathcal{M}(y) = \dfrac{\tanh(\beta y) + 1}{2}, \quad \beta = 0.5$  (1)

The density field may then be expressed as

$\phi(y, x) = \mathcal{M}(\mathbb{N}(y, x, \theta))$ (2D problem)  (2)

$\phi(y, x, h) = \mathcal{M}(\mathbb{N}(y, x, h, \theta))$ (3D problem)  (3)

where $\mathbb{N}$ represents the feedforward network and $\theta$ stands for its free-form parameters. A deep layered network is composed of several discrete layers. A network with $F$ hidden layers may be represented as

$\mathbb{N}(y, x, h, \theta) = e^{(F+1)}\big(z^{(F)}(e^{(F)}(\dots z^{(1)}(e^{(1)}(y, x, h))\dots))\big)$  (4)

where $z^{(f)}$ represents the output of the corresponding hidden layer and the linear operation $e^{(f)}$ is written as

$e^{(f)}(y) = U^{(f)} y + p^{(f)}$  (5)

4.3 Minimum compliance

Topology optimization with a compliance-minimizing formulation is developed with deep reinforcement learning (DRL). In the design space, a DNN represents the density field. Hence, the topology optimization repeatedly updates the network configuration in the design domain to improve the density field until the material arrangement provides optimal stiffness performance. During optimization, the density field in the design domain is changed by adjusting the connection weights of the feedforward network. This allows us to formulate the optimization problem as

Find: $\theta$
Min: $V(w, \Phi) = \dfrac{1}{2}\int_{\Omega} \varepsilon(w)^{D}\, T(\Phi(\theta))\, \varepsilon(w)\, t\, \mathrm{d}\Omega$
s.t.: $\dfrac{1}{|\Omega|}\int_{\Omega} \Phi(\theta)\, t\, \mathrm{d}\Omega - C_{\mathrm{prescribe}} \le 0$  (6)

where $\theta$ denotes the feedforward network parameters and $V$ is the structural compliance objective function. $\Phi$ denotes the relative density in the design domain, and $C_{\mathrm{prescribe}}$ is the prescribed volume fraction that the design must conform to. The finite element framework uses the unknown displacement field $w$, the strain $\varepsilon(w)$, and the elastic matrix $T$ to represent these quantities.

4.4 The lower limit of stress compliance

When optimizing the minimum-compliance problem with a stress constraint, the von Mises stress is employed to gauge local stress and serves as a restriction on the search space. However, restricting the local stress directly is numerically costly, so a p-norm method is used here to estimate the local stress limitation. Many refined strategies for precise local stress regulation have been put forward in recent years; to keep things simple, we use a tried-and-true technique to put a cap on the local von Mises stress. In this approach, the constraint is formulated using the p-norm measure $\sigma_{PN}$. Thus, the problem presented in Section 4.3 may be restated as

Find: $\theta$
Min: $V(w, \Phi) = \dfrac{1}{2}\int_{\Omega} \varepsilon(w)^{D}\, T(\Phi(\theta))\, \varepsilon(w)\, t\, \mathrm{d}\Omega$
s.t.: $\dfrac{1}{|\Omega|}\int_{\Omega} \Phi(\theta)\, t\, \mathrm{d}\Omega - C_{\mathrm{prescribe}} \le 0$, $\quad \sigma_{PN} = \Big(\sum_{a=1}^{N} (c_a\, \sigma_a^{c})^{b}\Big)^{1/b} \le \bar{\sigma}_{PN}$  (7)

where $\sigma_a^{c}$ is the von Mises stress of element $a$, $b$ is the p-norm parameter, $\sigma_{PN}$ is the p-norm stress measure, $\bar{\sigma}_{PN}$ is the global stress bound, and $c_a$ is the solid volume of element $a$. The performance of the algorithm and the accuracy of the estimate of the maximum stress are both affected by the value chosen for $b$. All stress-constrained numerical experiments in this work use $b = 10$.
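To make the representation in Eqs. (1)-(5) concrete, the following is a minimal NumPy sketch of a coordinate-to-density network; the layer sizes, random weights, and grid resolution are illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

def density_network(coords, layers, beta=0.5):
    """Map design-domain coordinates (y, x) to densities in [0, 1].

    coords : (n, 2) array of sample points.
    layers : list of (U, p) pairs, one linear map per layer (Eq. 5).
    """
    z = coords
    for i, (U, p) in enumerate(layers):
        z = z @ U + p                        # linear map e^(f)(y) = U y + p  (Eq. 5)
        if i < len(layers) - 1:
            z = np.tanh(z)                   # hidden-layer activation
    return (np.tanh(beta * z) + 1.0) / 2.0   # squash raw output into [0, 1]  (Eq. 1)

# Illustrative 2-16-16-1 network with random weights (assumption for the sketch).
rng = np.random.default_rng(0)
sizes = [2, 16, 16, 1]
layers = [(rng.standard_normal((a, b)), np.zeros(b)) for a, b in zip(sizes[:-1], sizes[1:])]

# Evaluate the density field phi(y, x) of Eq. (2) on a 30 x 60 grid.
xv, yv = np.meshgrid(np.linspace(0, 1, 60), np.linspace(0, 1, 30))
coords = np.column_stack([yv.ravel(), xv.ravel()])
phi = density_network(coords, layers).reshape(30, 60)
print(phi.min(), phi.max())  # both values always lie inside [0, 1]
```

Because the density here is a smooth function of the coordinates rather than a per-element variable, the boundary of the resulting layout is smooth by construction, which is the property the method relies on to avoid checkerboarding.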
4.5 Sensitivity testing for layouts

Gradient-based optimization requires the sensitivity of the objective with respect to the model parameters, i.e., the weights of the feed-forward network. The chain rule is used to calculate the sensitivity of the objective function, and the sensitivity with respect to the density field can be obtained using the adjoint approach:

$\dfrac{\partial V}{\partial \phi} = \lambda^{D}\, \dfrac{\partial R}{\partial \phi}\, w$  (8)

where $R$ is the assembled stiffness matrix and $\lambda$ is the adjoint vector obtained from the adjoint equation $R\lambda = -f$. Using the chain rule, the sensitivity of the objective $V$ with respect to the network weights $u$ is

$\dfrac{\partial V}{\partial u} = \dfrac{\partial V}{\partial \phi} \cdot \dfrac{\partial \phi}{\partial u}$  (9)

where $\phi = \mathcal{M}(\mathbb{N}(\cdot))$ is the expression of the density field. The algorithmic differentiation implemented in the free program CasADi makes it simple to obtain the sensitivity of the density field with respect to the network weights $u$. In a similar vein, the following chain-rule derivation may be used to obtain the sensitivity of the p-norm stress:

$\dfrac{\partial \sigma_{PN}}{\partial u} = \dfrac{\partial \sigma_{PN}}{\partial \phi} \cdot \dfrac{\partial \phi}{\partial u}$  (10)

where $\partial \sigma_{PN} / \partial \phi$ is obtained through the adjoint sensitivity derivation.
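The sensitivities in Eqs. (8)-(10) chain the derivative of the objective with respect to the density field into derivatives with respect to the network weights. The sketch below illustrates that chaining on a toy quadratic objective (a stand-in assumption, not the paper's finite-element compliance) and verifies the analytic gradient against finite differences; the two-layer network and all sizes are assumptions made purely for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
coords = rng.uniform(size=(50, 2))                 # sample points in the design domain
U1, p1 = rng.standard_normal((2, 8)), np.zeros(8)  # fixed first layer
U2, p2 = rng.standard_normal((8, 1)), np.zeros(1)  # weights we differentiate against

def forward(u2_flat):
    """Density field phi for the second-layer weights u2_flat."""
    h = np.tanh(coords @ U1 + p1)
    raw = h @ u2_flat.reshape(8, 1) + p2
    phi = (np.tanh(0.5 * raw) + 1.0) / 2.0          # Eq. (1) mapping into [0, 1]
    return phi, h, raw

def objective(phi):
    return 0.5 * np.sum(phi ** 2)                   # toy stand-in for the compliance V

# Analytic chain rule, dV/du = (dV/dphi) * (dphi/du)  (Eq. 9)
phi, h, raw = forward(U2.ravel())
dV_dphi = phi                                       # derivative of 0.5 * sum(phi^2)
dphi_draw = 0.25 * (1.0 - np.tanh(0.5 * raw) ** 2)  # derivative of the Eq. (1) mapping
grad_analytic = h.T @ (dV_dphi * dphi_draw)         # shape (8, 1)

# Finite-difference check of the same gradient
eps, grad_fd = 1e-6, np.zeros((8, 1))
for i in range(8):
    up, dn = U2.ravel().copy(), U2.ravel().copy()
    up[i] += eps
    dn[i] -= eps
    grad_fd[i, 0] = (objective(forward(up)[0]) - objective(forward(dn)[0])) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_fd)))      # agreement to roughly 1e-9
```

In the actual method, the density-to-objective derivative would come from the adjoint finite-element solve and the density-to-weight derivative from algorithmic differentiation, as noted above.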
4.6 Deep reinforcement learning

The MDP, the central formalism in RL, has been presented, and some of the difficulties in the field have been touched on. The following discussion categorizes RL techniques into their respective groups. Both value-function-based and policy-search-based techniques may be used to address RL problems; the actor-critic method combines value functions and policy search into a single strategy. We describe these methods, along with some related tools, for addressing RL problems.

4.7 Function of value

Value-function-based approaches attempt to estimate the expected return (or another measure of value) of being in a certain state. The expected return from beginning in state $t$ and continuing to follow policy $\pi$ is denoted by the state-value function $X^{\pi}(t)$:

$X^{\pi}(t) = \mathbb{E}[Q \mid t, \pi]$  (11)

The optimal policy $\pi^{*}$ and the optimal state-value function $X^{*}(t)$ may be expressed in terms of one another:

$X^{*}(t) = \max_{\pi} X^{\pi}(t) \quad \forall t \in T$  (12)

Given $X^{*}(t)$, the best policy might be retrieved by choosing, among the available actions, the one that maximizes the expected value of the next state,

$\mathbb{E}_{t_{s+1} \sim \tau(t_{s+1} \mid t_s, b)}\left[X^{*}(t_{s+1})\right]$  (13)

The transition dynamics $\tau$ are not accessible in the RL setup. As a result, we define a different function, referred to as the state-action value or quality value $P^{\pi}(t, b)$, which is similar to $X^{\pi}$ except that the initial action $b$ is given and the policy is only followed from the subsequent state:

$P^{\pi}(t, b) = \mathbb{E}[Q \mid t, b, \pi]$  (14)

Given $P^{\pi}(t, b)$, the optimal policy may be determined by acting greedily at each step, $b = \arg\max_{b} P^{\pi}(t, b)$. Under this rule, we can also recover $X^{\pi}(t)$ by maximizing over actions: $X^{\pi}(t) = \max_{b} P^{\pi}(t, b)$.

4.8 Dynamic programming

To learn $P^{\pi}$, we make use of the Markov property and formulate the function as a Bellman equation, which has the recursive form

$P^{\pi}(t_s, b_s) = \mathbb{E}_{t_{s+1}}\left[q_{s+1} + \gamma P^{\pi}(t_{s+1}, \pi(t_{s+1}))\right]$  (15)

In other words, the current values of our approximation of $P^{\pi}$ may be used to improve it. This means $P^{\pi}$ can be improved through bootstrapping, which is the cornerstone of the SARSA algorithm and of Q-learning:

$P^{\pi}(t_s, b_s) \leftarrow P^{\pi}(t_s, b_s) + \alpha \delta$  (16)

where $\alpha$ is the learning rate and $\delta = Z - P^{\pi}(t_s, b_s)$ is the temporal-difference (TD) error; $Z$ is the target, much as in a typical regression problem. By employing transitions produced by the behavioural policy (the policy derived from $P^{\pi}$), SARSA, an on-policy training algorithm, is used to improve the approximation of $P^{\pi}$, which amounts to setting $Z = q_s + \gamma P^{\pi}(t_{s+1}, b_{s+1})$. Q-learning is off-policy, since $P^{\pi}$ is updated by transitions that are not necessarily produced by the derived policy; instead, Q-learning employs $Z = q_s + \gamma \max_{b} P^{\pi}(t_{s+1}, b)$, which directly approximates $P^{*}$. To determine $P^{*}$ from an arbitrary $P^{\pi}$, we employ generalized policy iteration, which comprises policy evaluation and policy improvement. Policy evaluation improves the estimate of the value function by minimizing TD errors on trajectories encountered while following the policy. As the estimate improves, the policy can be made more effective by making greedy decisions based on the updated value function. Generalized policy iteration allows these steps to be interleaved, rather than performed sequentially to convergence (as in policy iteration), which speeds up the process.

4.9 Sampling

Instead of bootstrapping value functions, Monte Carlo approaches use the average return from numerous policy rollouts to estimate the expected return from a state. This means that, perhaps contrary to expectation, pure Monte Carlo techniques are applicable in non-Markovian settings. Nevertheless, they are limited to episodic MDPs, since the rollout must end before the return can be determined. To get the most out of both approaches, the TD(λ) algorithm combines TD learning with Monte Carlo policy evaluation; the λ parameter interpolates between Monte Carlo evaluation and bootstrapping, in much the same way as the discount factor. Another effective value-based approach is learning the advantage function $B^{\pi}(t, b)$, which provides relative action values instead of the absolute values produced by $P^{\pi}$. Understanding relative values is similar to removing a baseline or average level from a signal; intuitively, it is simpler to learn that one course of action has better results than another than to learn the exact return of that course of action. Via the straightforward relation $B^{\pi} = P^{\pi} - X^{\pi}$, the advantage reflects the relative benefit of actions. It is also closely connected to the baseline variance-reduction approach used in gradient-based policy search methods. Several modern DRL algorithms have used the concept of advantage updates.
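Before turning to policy search, a minimal tabular sketch of the TD update in Eq. (16) may help fix ideas. The toy chain environment, its reward, and all hyperparameters below are assumptions made only for illustration, not part of the paper's method; the update line is the Q-learning variant, with the SARSA alternative noted in a comment.

```python
import numpy as np

# Tiny illustrative MDP: four states in a chain, actions 0 = left, 1 = right,
# reward 1.0 only when the last state is reached (an assumed toy problem).
n_states, n_actions, gamma, alpha, epsilon = 4, 2, 0.9, 0.1, 0.2

def step(t, b):
    t_next = min(t + 1, n_states - 1) if b == 1 else max(t - 1, 0)
    q = 1.0 if t_next == n_states - 1 else 0.0
    return t_next, q, t_next == n_states - 1

rng = np.random.default_rng(0)
P = np.zeros((n_states, n_actions))      # tabular state-action values P(t, b)

for episode in range(500):
    t = 0
    while True:
        # epsilon-greedy behavioural policy derived from P
        b = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(P[t]))
        t_next, q, done = step(t, b)
        # Q-learning target Z = q + gamma * max_b P(t', b); a SARSA variant would
        # instead use the action actually taken in t_next.
        Z = q + (0.0 if done else gamma * np.max(P[t_next]))
        P[t, b] += alpha * (Z - P[t, b])  # TD update of Eq. (16)
        t = t_next
        if done:
            break

print(np.argmax(P, axis=1))  # learned greedy policy: move right in every non-terminal state
```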
4.10 Policy search

The search for the best policy can be done without maintaining any model of the value function. To maximize the expected return $E[R \mid \theta]$, one typically chooses a parameterized policy whose parameters are optimized in either a gradient-based or a gradient-free fashion; both kinds of techniques have been used effectively to train neural network models that encode policies. While gradient-free optimization has shown promise for low-dimensional parameter spaces, most DRL techniques still favour gradient-based training, since it is more sample-efficient when dealing with policies that have many parameters.

4.11 Policy gradients

Gradients can provide an efficient learning signal for how to fine-tune a parameterized policy. However, to calculate the expected return, we need to take an average across the possible trajectories that the present policy parameterization may produce. This averaging calls for either deterministic approximations (via linearization, for example) or stochastic approximations (via sampling). Deterministic approximations can only be used in a model-based setting, where the underlying transition dynamics can be modelled. For the most part, model-free RL settings use a Monte Carlo estimate to determine the expected return. This Monte Carlo estimation presents a problem for gradient-based learning, because gradients do not propagate through random samples of a probability distribution. As a result, we use a score-function (likelihood-ratio) estimator, known as the REINFORCE rule in RL, as an estimate of the gradient. The latter name is evocative, as maximizing the log-likelihood is a common method in supervised learning that is used in conjunction with the estimator: the estimator performs gradient ascent on the log-likelihood of the sampled action, weighted by the return. Calculating the gradient, with respect to parameters $\theta$, of an expectation of a function of a random variable $V$ may be formalized using the REINFORCE rule:

$\nabla_{\theta}\, \mathbb{E}_{V}\left[e(V; \theta)\right] = \mathbb{E}_{V}\left[e(V; \theta)\, \nabla_{\theta} \log o(V)\right]$  (17)

Because this calculation is based on the actual returns of sampled trajectories, the resulting gradients have high variance. A more manageable variance may be achieved by introducing unbiased estimators with lower noise. The standard approach involves subtracting a baseline, which implies weighting updates by how much better than expected an outcome is rather than by the raw return. The most elementary baseline is the average return across several episodes, although there are numerous other possibilities.

4.12 Actor-critic methods

When value functions are combined with an explicit representation of the policy, we get actor-critic approaches. The "critic" (value function) provides the "actor" (policy) with a learning signal that helps it improve. These methods trade off the benefit of reducing the variance of the policy gradients against the bias introduced by using value-function estimates. Policy gradients in actor-critic approaches are derived as in other policy-gradient methods; the key distinction is that actor-critic approaches employ a learned value function. As a result, actor-critic techniques can be treated as a special case of policy-gradient methods.
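As a concrete illustration of the score-function estimator in Eq. (17), the sketch below applies REINFORCE with a simple running-average baseline to a two-armed bandit; the bandit, its reward model, and the learning rate are assumptions made purely for the example and are unrelated to the paper's design task.

```python
import numpy as np

# Minimal REINFORCE sketch on a two-armed bandit; arm 1 pays a higher mean reward.
rng = np.random.default_rng(0)
theta = np.zeros(2)          # policy parameters (softmax preferences)
lr, baseline = 0.1, 0.0

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = rng.normal(1.0 if action == 1 else 0.2, 0.1)   # toy reward model

    # Score-function (REINFORCE) gradient: grad of log pi(action) scaled by the
    # return; subtracting a running baseline reduces variance without adding bias.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += lr * (reward - baseline) * grad_log_pi
    baseline += 0.01 * (reward - baseline)                   # running average return

print(softmax(theta))  # probability mass concentrates on the better arm
```

An actor-critic variant, in the sense of Section 4.12, would replace the running-average baseline with a learned state-value estimate.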
5 Results and discussion

This section compares existing methods, namely MDP (Ran & Dong, 2022), VR (Wu, 2022), and AI (Di & Yu, 2021), with our recommended strategy in terms of computation time, accuracy, precision, and recall. Python 3.7 is used to implement the models, and TensorFlow 2.0.0 is used to implement the value neural network. For the simulations, we employed a GNU/Linux server equipped with a 64-bit Intel Xeon Gold CPU running at 2.10 GHz.

5.1 Computation time

A computer operation's "computation time", often known as its "running time", is the amount of time needed to finish it. The number of rule applications has an impact on how long it takes to finish a computation, which may be seen as a collection of rule applications. With a logic-gate-based quantum computer, the number of unitary transformations is directly proportional to the time required to complete a single "quantum parallel" calculation.

Figure 5: The computation time of the proposed and existing systems

Figure 5 and Table 3 show the computation time of the proposed method. The computation time is the time the DRL framework requires to analyze and produce optimal design configurations in an optimization run for graphic design. For real-time applicability and easy incorporation into a graphic design process, an efficient computation time is essential for timely and flexible design optimization. Of the standard methods, MDP and VR take 91% and 73% of the time, and AI has an 81% time utilization rate. The proposed method requires only 61% of the computing time, which is a significant reduction.

Table 3: Comparison of computation time
Methods | Computation time (%)
MDP | 91
VR | 73
AI | 81
DRL [Proposed] | 61

5.2 Accuracy

Figure 6: Accuracy of the proposed and existing methods

The accuracy of the suggested technique is shown in Figure 6. The accuracy of a system can be thought of as how closely its estimates of a quantity match the true value of that quantity:

$\mathrm{Accuracy} = \dfrac{\mathrm{True\ positives} + \mathrm{True\ negatives}}{\mathrm{True\ positives} + \mathrm{True\ negatives} + \mathrm{False\ positives} + \mathrm{False\ negatives}} = \dfrac{TP + TN}{TP + TN + FP + FN}$  (18)

Accuracy measures how well the model produces designs that meet predetermined standards, guaranteeing the efficiency of the optimization procedure. The high accuracy obtained demonstrates the model's capability to apply DRL methods to produce attractive and functionally successful graphic designs. Conventional methods such as VR and MDP yield 65% and 75% accuracy, and accuracy increases to 85% when AI is used. The proposal provides the highest accuracy rate of 95%, demonstrating its effectiveness in improving graphic design processes. Table 4 displays the accuracy of the suggested strategy.

Table 4: Comparison of accuracy
Methods | Accuracy (%)
MDP | 75
VR | 65
AI | 85
DRL [Proposed] | 95

5.3 Precision

Precision, or positive predictive value, is the percentage of relevant instances among the retrieved instances; it can be regarded as a measure of quality. Precision is the extent to which the same results are obtained from the same measurements carried out under the same conditions. Reproducibility is the variance that occurs when the same technique is applied over extended times by different instruments and operators, whereas repeatability is the variance that occurs when the same equipment and operator are used and each repetition is completed within a short amount of time.

$\mathrm{Precision} = \dfrac{\mathrm{True\ positives}}{\mathrm{True\ positives} + \mathrm{False\ positives}} = \dfrac{TP}{TP + FP}$  (19)

Figure 7: The precision of the proposed and existing methods

The precision of the suggested system is shown in Figure 7. Precision is essential for ensuring that the algorithm navigates the design space efficiently and generates visually appealing graphics. It reflects the model's ability to optimize design parameters to satisfy predetermined standards and make delicate adjustments, which increases efficiency in graphic design activities. The proposed method achieves a precision of 98%, compared with 88% for MDP, 75% for AI, and 66% for VR. These outcomes illustrate that the DRL method succeeds in obtaining higher precision. Table 5 compares the precision of the methods.

Table 5: Comparison of precision
Methods | Precision (%)
MDP | 88
VR | 66
AI | 75
DRL [Proposed] | 98
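The evaluation metrics in Eqs. (18)-(19), together with the recall defined in Section 5.4 below, all derive from the same confusion matrix. The short sketch below computes them for a pair of binary label vectors; the example labels are made up purely for illustration.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (Eqs. 18-20)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Made-up example labels, purely to show the calculation.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_metrics(y_true, y_pred))  # accuracy, precision, and recall all equal 0.8 here
```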
5.4 Recall

The ability of the model to identify every relevant sample in a data set is referred to as recall. Statistically, it is defined as the number of true positives divided by the sum of true positives and false negatives:

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$  (20)

Figure 8: Recall of the proposed and existing methods

Comparative data for the recall metric are shown in Figure 8. Recall is an important component that ensures the model retains important information and applies it to the design process, improving the efficacy and efficiency of the optimization process to produce elegant designs. VR attains a recall of 77%, MDP obtains a recall rate of 66%, and AI produces an 87% recall rate. The proposed method exceeds the other methods with a 98% recall rate, demonstrating its effectiveness in the specific research environment. Table 6 depicts the comparison of recall.

Table 6: Comparison of recall
Methods | Recall (%)
MDP | 66
VR | 77
AI | 87
DRL [Proposed] | 98

6 Discussion

Interpretability and explanation issues with DL (Zhou, Lee, Diao, Shi, Balyen, & Peto, et al., 2019) models can prevent them from being used in domains where explaining the decision-making process is essential. Their application in data-scarce areas is also limited, as they frequently require substantial volumes of labelled data for efficient training. Hand-crafted knowledge can fail to identify complex patterns in data, which is the foundation of ML (Cioffi, Travaglioni, Piscitelli, Petrillo, & De Felice, et al., 2020) methods; complex and non-linear interactions can be difficult for such models to manage, which can result in inadequate performance on tasks where deep learning techniques work efficiently. RL (Wang, Tang, Huang, Chen, Zhang, & Huang, 2020) can be computationally expensive and lengthy to train; limitations include exploration-exploitation compromises and sparse-reward scenarios that can cause RL models to fail. The performance of DDPG (Bouhamed, Ghazzai, Besbes, & Massoud, 2020) can be hindered by sensitivity to hyperparameters and training stability issues, and it can struggle with high-dimensional action spaces. When applying DDPG to intricate optimization tasks, it must be carefully tuned and its limits need to be kept in perspective in various instances. Deep reinforcement learning (DRL) enables the model to learn specific correlations between design elements and provides numerous benefits in graphic design optimization. Its capacity for iterative adaptation and optimization improves the effectiveness of the graphic design process by providing relevant information and automating complex design selections for increased innovation and efficiency.

7 Conclusion

To aid in the process of navigating graphic design files, we proposed a DRL framework. The most advanced DRL techniques are often used in artificial settings where the distribution of images does not correspond to that of natural scenes; moving toward more lifelike environments is therefore an important step. Because of the rapid proliferation of generative design tools, it is now possible to augment traditional shape-finding procedures with technological answers. Our findings highlight the potential for using topological optimization techniques in the built environment. Some key takeaways are as follows: (a) Contrasted with the conventional voxel-based optimization technique, when a neural network is used to model the density field, the number of design parameters is significantly decreased.
(b) As the topology is represented implicitly, the resulting layout does not have a jagged, element-wise border. In the long run, this paper's approach offers a fresh opportunity to combine deep learning with topology optimization. More advanced and robust deep-learning models have been presented in recent years, and the approach proposed here is a hybrid of deep learning and topology optimization. More deep learning models, such as CNNs and GANs, will be used to represent the density field in upcoming research.

References

[1] Bichu, Y. M., Hansa, I., Bichu, A. Y., Premjani, P., Flores-Mir, C., & Vaid, N. R. (2021). Applications of artificial intelligence and machine learning in orthodontics: a scoping review. Progress in Orthodontics, 22(1), 18. https://doi.org/10.1186/s40510-021-00361-9
[2] Bouhamed, O., Ghazzai, H., Besbes, H., & Massoud, Y. (2020). Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach. 2020 IEEE International Symposium on Circuits and Systems (ISCAS).
[3] Brown, N., Garland, A. P., Fadel, G. M., & Li, G. (2022). Deep reinforcement learning for engineering design through topology optimization of elementally discretized design domains. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4010395
[4] Chen, Y., Tong, Z., Zheng, Y., Samuelson, H., & Norford, L. (2020). Transfer learning with deep neural networks for model predictive control of HVAC and natural ventilation in smart buildings. Journal of Cleaner Production, 254(119866), 119866. https://doi.org/10.1016/j.jclepro.2019.119866
[5] Cioffi, R., Travaglioni, M., Piscitelli, G., Petrillo, A., & De Felice, F. (2020). Artificial intelligence and machine learning applications in smart production: Progress, trends, and directions. Sustainability, 12(2), 492. https://doi.org/10.3390/su12020492
[6] Das, N., Bechtle, S., Davchev, T., Jayaraman, D., Rai, A., & Meier, F. (2021). Model-based inverse reinforcement learning from visual demonstrations. In Conference on Robot Learning (pp. 1930-1942). PMLR.
[7] Deng, Z., & Chen, Q. (2021). Reinforcement learning of occupant behavior model for cross-building transfer learning to various HVAC control systems. Energy and Buildings, 238(110860), 110860. https://doi.org/10.1016/j.enbuild.2021.110860
[8] Di, X., & Yu, P. (2021). Multi-agent reinforcement learning of 3d furniture layout simulation in indoor graphics scenes. arXiv preprint arXiv:2102.0937.
[9] Ding, X., Du, W., & Cerpa, A. E. (2020). Mb2c: Model-based deep reinforcement learning for multi-zone building control. In Proceedings of the 7th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (pp. 50-59).
[10] Li, H., Zhu, J., Zhou, Y., Feng, Q., & Feng, D. (2022). Charging station management strategy for returns maximization via improved TD3 deep reinforcement learning. International Transactions on Electrical Energy Systems, 2022, 1-14. https://doi.org/10.1155/2022/6854620
[11] Lin, Y., McPhee, J., & Azad, N. L. (2019). Comparison of deep reinforcement learning and model predictive control for adaptive cruise control. In arXiv [eess.SY]. http://arxiv.org/abs/1910.12047
[12] Luong, M., & Pham, C. (2021). Incremental learning for autonomous navigation of mobile robots based on deep reinforcement learning. Journal of Intelligent & Robotic Systems, 101(1). https://doi.org/10.1007/s10846-020-01262-5
[13] Predić, B., Manić, D., Saračević, M., Karabašević, D., & Stanujkić, D. (2022).
Automatic image caption generation based on some machine learning algorithms. Mathematical Problems in Engineering.
[14] Ran, M., & Dong, J. (2022). A multiobjective optimization algorithm for building interior design and spatial structure optimization. Mobile Information Systems.
[15] Shen, Z., Wang, Y., Wu, D., Yang, X., & Dong, B. (2020). Learning to scan: A deep reinforcement learning approach for personalized scanning in CT imaging. In arXiv [physics.med-ph]. http://arxiv.org/abs/2006.02420
[16] Tao, H., Al-Sulttani, A. O., Salih Ameen, A. M., Ali, Z. H., Al-Ansari, N., Salih, S. Q., & Mostafa, R. R. (2020). Training and testing data division influence on hybrid machine learning model process: Application of river flow forecasting. Complexity, 2020, 1-22. https://doi.org/10.1155/2020/8844367
[17] Tapeh, A. T. G., & Naser, M. Z. (2022). Machine learning and deep learning in structural engineering: A scientometrics review of trends and best practices. Archives of Computational Methods in Engineering, 1-45.
[18] Wang, T., Tang, Y., Huang, Y., Chen, X., Zhang, S., & Huang, H. (2020). Automatic adjustment method of power flow calculation convergence for large-scale power grid based on knowledge experience and deep reinforcement learning. 2020 IEEE 4th Conference on Energy Internet and Energy System Integration (EI2).
[19] Yamaguchi, T., Nagahama, S., Ichikawa, Y., & Takadama, K. (2019). Model-based multi-objective reinforcement learning with unknown weights. In Human Interface and the Management of Information. Information in Intelligent Systems: Thematic Area, HIMI 2019, Held as Part of the 21st HCI International Conference, HCII 2019, Orlando, FL, USA, July 26-31, 2019, Proceedings, Part II 21 (pp. 311-321). Springer International Publishing. https://doi.org/10.1007/978-3-030-22649-7_25
[20] Wu, W., & Feng, Y. (2022). Interior space design and automatic layout method based on CNN. Mathematical Problems in Engineering, 2022, 1-14. https://doi.org/10.1155/2022/8006069
[21] Wu, Y. (2022). Architectural interior design and space layout optimization method based on VR and 5G technology. Journal of Sensors, 2022, 1-10. https://doi.org/10.1155/2022/7396816
[22] Xu, N., Zhang, H., Liu, A. A., Nie, W., Su, Y., Nie, J., & Zhang, Y. (2019). Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Transactions on Multimedia, 22(5), 1372-1383.
[23] Zhang, X., Chen, Y., Bernstein, A., Chintala, R., Graf, P., Jin, X., & Biagioni, D. (2022). Two-stage reinforcement learning policy search for grid-interactive building control. IEEE Transactions on Smart Grid, 13(3), 1976-1987. https://doi.org/10.1109/tsg.2022.3141625
[24] Zhang, X., Chintala, R., Bernstein, A., Graf, P., & Jin, X. (2020). Grid-interactive multi-zone building control using reinforcement learning with global-local policy search. In arXiv [eess.SY]. http://arxiv.org/abs/2010.06718
[25] Zhong, X., Zhang, Z., Zhang, R., & Zhang, C. (2022). End-to-end deep reinforcement learning control for HVAC systems in office buildings. Designs, 6(3), 52. https://doi.org/10.3390/designs6030052
[26] Zhou, Y., Lee, W. J., Diao, R., Shi, D., Balyen, L., & Peto, T. (2019). Promising artificial intelligence-machine learning-deep learning algorithms in ophthalmology. Journal of Modern Power Systems and Clean Energy, 10(5), 264-272.
[27] Zhu, M., Wang, Y., Pu, Z., Hu, J., Wang, X., & Ke, R. (2019).
Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving. In arXiv [cs.LG]. http://arxiv.org/abs/1902.00089