Volume 41 Number 3 September 2017
1977
Editorial Boards
Informatica is a journal primarily covering intelligent systems in
the European computer science, informatics and cognitive com-
munity; scientific and educational as well as technical, commer-
cial and industrial. Its basic aim is to enhance communications
between different European structures on the basis of equal rights
and international refereeing. It publishes scientific papers ac-
cepted by at least two referees outside the author’s country. In ad-
dition, it contains information about conferences, opinions, criti-
cal examinations of existing publications and news. Finally, major
practical achievements and innovations in the computer and infor-
mation industry are presented through commercial publications as
well as through independent evaluations.
Editing and refereeing are distributed. Each editor from the
Editorial Board can conduct the refereeing process by appointing
two new referees or referees from the Board of Referees or Edi-
torial Board. Referees should not be from the author’s country. If
new referees are appointed, their names will appear in the list of
referees. Each paper bears the name of the editor who appointed
the referees. Each editor can propose new members for the Edi-
torial Board or referees. Editors and referees inactive for a longer
period can be automatically replaced. Changes in the Editorial
Board are confirmed by the Executive Editors.
The coordination necessary is made through the Executive Edi-
tors who examine the reviews, sort the accepted articles and main-
tain appropriate international distribution. The Executive Board
is appointed by the Society Informatika. Informatica is partially
supported by the Slovenian Ministry of Higher Education, Sci-
ence and Technology.
Each author is guaranteed to receive the reviews of his article.
When accepted, publication in Informatica is guaranteed in less
than one year after the Executive Editors receive the corrected
version of the article.
Executive Editor – Editor in Chief
Matjaž Gams
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 251 93 85
matjaz.gams@ijs.si
http://dis.ijs.si/mezi/matjaz.html
Editor Emeritus
Anton P. Železnikar
Volaričeva 8, Ljubljana, Slovenia
s51em@lea.hamradio.si
http://lea.hamradio.si/˜s51em/
Executive Associate Editor - Deputy Managing Editor
Mitja Luštrek, Jožef Stefan Institute
mitja.lustrek@ijs.si
Executive Associate Editor - Technical Editor
Drago Torkar, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 251 93 85
drago.torkar@ijs.si
Contact Associate Editors
Europe, Africa: Matjaz Gams
N. and S. America: Shahram Rahimi
Asia, Australia: Ling Feng
Overview papers: Maria Ganzha, Wiesław Pawłowski,
Aleksander Denisiuk
Editorial Board
Juan Carlos Augusto (Argentina)
Vladimir Batagelj (Slovenia)
Francesco Bergadano (Italy)
Marco Botta (Italy)
Pavel Brazdil (Portugal)
Andrej Brodnik (Slovenia)
Ivan Bruha (Canada)
Wray Buntine (Finland)
Zhihua Cui (China)
Aleksander Denisiuk (Poland)
Hubert L. Dreyfus (USA)
Jozo Dujmović (USA)
Johann Eder (Austria)
George Eleftherakis (Greece)
Ling Feng (China)
Vladimir A. Fomichov (Russia)
Maria Ganzha (Poland)
Sumit Goyal (India)
Marjan Gušev (Macedonia)
N. Jaisankar (India)
Dariusz Jacek Jakóbczak (Poland)
Dimitris Kanellopoulos (Greece)
Samee Ullah Khan (USA)
Hiroaki Kitano (Japan)
Igor Kononenko (Slovenia)
Miroslav Kubat (USA)
Ante Lauc (Croatia)
Jadran Lenarčič (Slovenia)
Shiguo Lian (China)
Suzana Loskovska (Macedonia)
Ramon L. de Mantaras (Spain)
Natividad Martínez Madrid (Germany)
Sando Martinčić-Ipišić (Croatia)
Angelo Montanari (Italy)
Pavol Návrat (Slovakia)
Jerzy R. Nawrocki (Poland)
Nadia Nedjah (Brasil)
Franc Novak (Slovenia)
Marcin Paprzycki (USA/Poland)
Wiesław Pawłowski (Poland)
Ivana Podnar Žarko (Croatia)
Karl H. Pribram (USA)
Luc De Raedt (Belgium)
Shahram Rahimi (USA)
Dejan Raković (Serbia)
Jean Ramaekers (Belgium)
Wilhelm Rossak (Germany)
Ivan Rozman (Slovenia)
Sugata Sanyal (India)
Walter Schempp (Germany)
Johannes Schwinn (Germany)
Zhongzhi Shi (China)
Oliviero Stock (Italy)
Robert Trappl (Austria)
Terry Winograd (USA)
Stefan Wrobel (Germany)
Konrad Wrona (France)
Xindong Wu (USA)
Yudong Zhang (China)
Rushan Ziatdinov (Russia & Turkey)
 Informatica 41 (2017) 259–274 259 
Bipartivity Index based Link Selection Strategy to Determine Stable 
and Energy-Efficient Data Gathering Trees for Mobile Sensor 
Networks 
Natarajan Meghanathan 
Professor, Department of Computer Science, Jackson State University 
Jackson, MS, USA 
E-mail: natarajan.meghanathan@jsums.edu 
 
Keywords: link stability, mobile sensor networks, bipartivity index, data gathering trees, simulations 
Received: August 12, 2016 
Bipartivity Index (BPI) has been used in complex network analysis to quantify the extent of partitioning 
of the vertices of a network graph into two disjoint partitions; the edges between vertices within the 
same partition are called frustrated edges. The BPI values for a network graph ranges from 0 to 1 (the 
BPI of a network graph that is truly bipartite and has no frustrated edges is 1). Our hypothesis in this 
research is that the end nodes of a short distance link (the distance between the end nodes is 
significantly smaller than the transmission range per node) in a mobile sensor network (MSN) are more 
likely to share a significant fraction of their neighbors and such links are more likely to be stable. We 
introduce a notion called the egocentric network of an edge (adapted from egocentric network for a 
node) comprising of the end nodes of the edge and their neighbors (as vertices) and the edges incident 
on the end nodes (as edges). Our claim is that an edge whose egocentric network has a lower BPI score 
is more likely to be a stable short distance link, with a relatively larger fraction of shared neighborhood, 
and could be preferred for inclusion while determining stable data gathering trees for MSNs. Through 
extensive simulations, we show that the BPI-based DG trees are significantly more stable and energy-
efficient compared to the DG trees determined using the predicted link expiration time (LET), currently 
the best known strategy. 
Povzetek: Prispevek s pomočjo BPI indeksa ugotavlja stabilna in energijsko učinkovita drevesa za 
mobilne senzorske mreže. 
 
1 Introduction 
Mobile Sensor Networks (MSNs) are an emerging 
category of wireless sensor networks in which the sensor 
nodes are considered to move independent of each other. 
MSNs could be used for applications in which an entire 
region (that is being monitored) could be effectively 
covered by letting the sensor nodes to move rather than 
be static. For example [9], the pollutant concentration in 
an area (like the downtown of a city) could be effectively 
measured by fixing the sensor nodes in mobile vehicles 
(like cars) that move through the area. For most of the 
applications of wireless sensor networks (including those 
of the MSNs), the data recorded by the sensor nodes is 
forwarded to a control center (called the sink) through 
one of several network-wide communication topologies 
(like chains [11], clusters [7], trees [18], connected 
dominating sets [16], etc). Among these communication 
topologies, the data gathering trees (DG trees) have been 
observed to be energy-efficient [18] as they comprise of 
the minimum number of links needed to span all the 
sensor nodes and there are no redundant transmissions. In 
the case of DG trees, the leaf nodes merely sense the data 
and transmit them to an upstream intermediate node that 
would in turn aggregate its own data with data received 
from all of its child nodes and forward the aggregated 
data to an upstream node that is on the path to the root 
node of the DG tree. For the rest of the paper, the terms 
'node' and 'vertex', 'link' and 'edge', 'network' and 'graph', 
'data gathering' and 'data aggregation', 'construction' and 
'configuration' mean the same. These terms are used 
interchangeably unless stated. 
MSNs inherit all the constraints of their static 
counterpart (like energy and memory-constrained sensor 
nodes as well as limited network bandwidth); mobility of 
the nodes is an additional constraint that needs to be 
handled. Due to node mobility, the network topology 
changes dynamically with time and any communication 
topology (like DG trees) that is setup among the sensor 
nodes needs to be frequently reconfigured. Significant 
amount of energy might be lost if network-wide 
broadcasts are frequently initiated for reconfiguring the 
communication topology in use. This motivates the need 
to determine stable communication topologies that could 
exist for a longer time. 
In [19], the authors took the first step towards using 
DG trees for MSNs and proposed a distributed algorithm 
for determining stable DG trees in MSNs using the 
concept of predicted link expiration time (LET) [31] that 
has been earlier successfully used for mobile ad hoc 
260 Informatica 41 (2017) 259–274 N. Meghanathan  
 
networks [20, 22]. In [24], the authors proposed a generic 
algorithm to determine maximum bottleneck link weight 
(MaxBLW)-based DG trees for static sensor networks: 
the bottleneck link weight for a path from a node to the 
root node of the DG tree is the minimum of the weights 
of the constituent links on the path and the MaxBLW-DG 
algorithm determines a DG tree in which the path from 
any node to the root node of the tree is the path with 
maximum value for the bottleneck link weight. In this 
paper, we explain a distributed version of the MaxBLW-
DG algorithm to determine ALGC-based DG trees 
wherein the link weight is the link stability score (LSS) 
computed based on this strategy. For performance 
comparison purposes, we use the distributed version of 
the MaxBLW-DG tree algorithm to also determine the 
LET-based DG trees [19] wherein the weight of a link is 
its predicted LET. 
The LET-based strategy is the only available link 
selection strategy that has been successfully 
demonstrated so far [19] for determining stable DG trees 
in MSNs. However, the LET formulation [19, 31] does 
not consider the distance between the constituent end 
nodes of a link and is prone to choosing links that could 
incur a larger transmission energy and ultimately 
contributing towards larger energy consumption per 
round. We opine that links whose constituent end nodes 
are closer to each other (i.e., the distance between the end 
nodes of the link is appreciably lower than the 
transmission range per node) are more likely to be stable 
(and vice-versa) as it would take a while for such end 
nodes to move out of the transmission range of each 
other. We refer to such links as "short distance" links. 
Also, as the energy lost per transmission is directly 
proportional to the square of the distance [28] over which 
the transmission is made, we claim that the DG trees 
comprising of short distance links are more likely to be 
both stable (and vice-versa) as well as incur lower energy 
consumption per round. Moreover, the LET approach 
[31] requires a sensor node to be aware of its own 
location and mobility as well as that of its neighbors. 
This would require the sensor nodes to be equipped with 
energy-draining hardware/software systems (like GPS 
[8]) that would make them location and mobility aware. 
All of the above observations form the motivation for the 
research conducted in this paper.  
The high-level contribution of this paper is that we 
show the use of a spectral graph-theoretic metric called 
Bipartivity Index (BPI) [6] to quantify the extent of 
shared neighborhood between the end vertices of an edge 
and thereby model the link stability score (LSS) for the 
edge. The BPI has been widely used in complex network 
analysis [6] to quantify the extent of partitioning of a 
network graph into two disjoint partitions of vertices; the 
edges between vertices within the same partition are 
referred to as frustrated edges. BPI values range from 0 
to 1 [6]. A network graph is said to be truly bipartite 
(such a partitioning also has no frustrated edges) if its 
BPI is 1 [6]. We propose to use a notion called the 
"egocentric network of an edge" (adapted from the notion 
of egocentric network of a node [13]) to quantitatively 
evaluate the extent of shared neighborhood between the 
end vertices of a link. The egocentric network of an edge 
u-v (denoted EGu-v) comprises as vertices - the end nodes 
of the edge and their neighbors, and edges - the links 
incident on the end nodes of the edge. We claim that an 
edge u-v is more likely to be a stable short distance link 
with a larger fraction of shared neighborhood if the 
egocentric network EGu-v of the edge has a lower BPI. 
Accordingly, we model for an edge u-v: the LSS(u-v) as 
1 - BPI(EGu-v).  
We provide a high-level justification for the above 
modeling as follows (more details are presented in 
Section 4). If the end nodes of an edge u-v do not have 
any shared neighbors, then the egocentric network of the 
edge u-v would comprise of node u and the neighbors of 
node v in one of the two partitions, and node v and the 
neighbors of node u in the other partition; all the edges 
would connect the vertices in one partition to the other 
partition and there would be no frustrated edges within 
either partition (a frustrated edge is an edge involving 
vertices that are in the same partition [6]). Such an 
egocentric network is truly bipartite and will have a BPI 
of 1. Whereas, if the end nodes of an edge u-v have one 
or more shared neighbors, the egocentric network of the 
edge (when analyzed for bipartivity) would comprise of 
one or more frustrated edges contributing to a BPI less 
than 1. We anticipate the BPI for the egocentric network 
of an edge u-v to reduce with increase in the number of 
shared neighbors for the end nodes u and v. 
The rest of the paper is organized as follows: Section 
2 outlines the maximum bottleneck link weight-based 
algorithm for determining data gathering trees in sensor 
networks. Section 3 reviews related work, including the 
strategy of using the predicted link expiration time (LET) 
to determine stable data gathering trees for MSNs. 
Section 4 introduces the notions of short distance links, 
egocentric network for an edge and bipartivity index 
(BPI) as well as illustrates their use to quantify the extent 
of shared neighborhood and stability of links. Section 5 
presents results of exhaustive simulations conducted to 
showcase the effectiveness of the BPI-based strategy to 
determine data gathering trees that are both stable as well 
as energy-efficient compared to the LET-based DG trees. 
Section 6 concludes the paper. 
2 Distributed algorithm to construct 
a maximum bottleneck link 
weight-based data gathering tree 
In this section, we describe a distributed version of the 
algorithm to construct maximum bottleneck link weight-
based data gathering (MaxBLW-DG) trees for mobile 
sensor networks. A centralized version of the MaxBLW-
DG algorithm has been earlier proposed in [24] and a 
distributed implementation of the algorithm to determine 
LET-based stable data gathering trees has been discussed 
in [19]. The distributed version of the MaxBLW-DG 
algorithm discussed here could be applied for any 
measure of link weight. For this section, we assume the 
link weights are randomly generated in the range [0...1]. 
Bipartivity Index based Link Selection… Informatica 41 (2017)259–274 261 
In sections 4 and 5, the weight of a link depends on the 
link selection strategy (BPI or LET) employed.  
2.1 Assumptions and definitions 
We assume the sensor nodes to operate in a fixed 
transmission range, R. We assume the underlying 
network is modeled as a unit-disk graph wherein there 
exists a link between any two nodes if the Euclidean 
distance between them is within the transmission range, 
R. We assume the network to be homogeneous (i.e., all 
the nodes have an equal transmission range). We define 
the fraction of link distance (fld) as the ratio of the 
Euclidean distance between the end nodes of the link and 
the transmission range per node. In the case of 
heterogeneous networks (each node operating with a 
different transmission range), the fraction of link distance 
could be measured as the ratio of the Euclidean distance 
between the end nodes of the link and the maximum of 
the transmission ranges of the two end nodes. The data 
gathering algorithms (discussed in Sections 2 and 3) and 
the BPI strategy discussed in Section 4 could be used for 
both homogeneous and heterogeneous networks. For a 
directed edge u → v, we refer to node u as the upstream 
node and node v as the downstream node. In the context 
of link weights, we assume the links/edges are undirected 
(bidirectional): i.e., the weight of a directed edge u → v 
is the same as the weight of the directed edge v → u.  
We define a round of data gathering to comprise of 
steps in which the sensor nodes individually sense the 
data within their sensing range (typically the sensing 
range of a sensor node is at most half its transmission 
range [36]), aggregate and forward only a representative 
version of the data (like the average temperature in a 
region) to the sink through a network-wide 
communication topology (like a data gathering tree) 
spanning all the sensor nodes. The size of the aggregated 
data is assumed to be the same as the size of the data 
collected at the individual sensor nodes. The root node of 
a DG tree is called the LEADER node and is chosen by 
the sink at the time of tree construction. 
We assume the sensor nodes to be both TDMA 
(Time Division Multiple Access) and CDMA (Code 
Division Multiple Access)-enabled [35]. An upstream 
node communicates with its own immediate downstream 
child nodes using a TDMA schedule (one time slot per 
downstream node); such communication between every 
upstream node with their own downstream nodes can 
occur in parallel (using unique CDMA codes). The above 
assumptions and definitions hold good for both the BPI 
and LET-based MaxBLW-DG trees studied in this paper. 
When used for constructing the LET-based DG trees, 
we assume a sensor node to be aware of its current 
location, velocity and direction of movement at any time 
instant and mentions the same in a location update vector 
(LUV) [19] included in the control messages broadcast 
as part of tree discovery. Such an assumption is not 
required for the MaxBLW-DG algorithm that makes use 
of BPI (discussed in Section 4) as the BPI scores could 
be computed without a priori knowledge about the 
location and mobility of the nodes. Each node maintains 
a Link Weight Table comprising of the estimates of the 
bottleneck link weights to the neighbor nodes that sent it 
the TREE-CONSTRUCT message (see Section 2.3 for 
more details about the message). 
2.2 Initialization of state information at 
the sensor nodes 
Each sensor node locally maintains state information 
about the data gathering tree that is currently being used 
or newly configured. The state information comprises of 
the following fields (with their initial values indicated in 
parenthesis): estimated bottleneck link weight (-∞), 
upstream node id (NULL), tree level (0), LEADER node 
id and sequence number (the latest sequence number in 
the TREE-INITIATE message broadcast by the sink). 
The estimated bottleneck link weight is the value for the 
currently known maximum weight for a link on the path 
to the LEADER node of the DG tree. The upstream node 
id corresponds to the neighbor node that lies on the 
currently estimated maximum bottleneck link weight 
path to the LEADER  node. The tree level corresponds to 
the number of hops on the maximum bottleneck link 
weight path to the LEADER node. The sequence number 
corresponds to the latest sequence number for a TREE-
CONSTRUCT message received by the node.  
2.3 Initiation of the Tree-construct 
message 
Whenever the sink fails to receive the aggregated data 
from the LEADER node of the DG tree used in the 
previous round of data aggregation, the sink queries all 
the sensor nodes to send it their estimates of the weight 
of the links from their neighbor nodes. The sink 
calculates the estimated weight of a node as the sum of 
the estimated weights of the directed edges originating 
from the node (as reported by its neighbors); the sink 
selects the node with the largest estimated weight to be 
the LEADER node (root node) of the new DG tree that is 
to be setup and sends a TREE-INITIATE message 
(including a sequence number) to the chosen root node to 
begin the construction of the new DG tree. The sequence 
number for the tree construction process is a 
monotonically increasing value maintained at the sink 
and the sink sends the latest value of the sequence 
number to the LEADER node to facilitate the sensor 
nodes to uniquely identify the control messages that are 
exchanged with regards to the new DG tree being 
constructed. 
The LEADER node broadcasts a TREE-
CONSTRUCT message to its neighbors; the message has 
a 5-element tuple: <sequence number, LEADER node id, 
upstream node id, sender's estimated bottleneck link 
weight, tree level>. The sequence number is the one that 
is sent by the sink to the LEADER node. For the TREE-
CONSTRUCT message broadcast by the LEADER node, 
the values for the upstream node id, sender's estimated 
bottleneck link weight and tree level are respectively the 
LEADER node id, +∞ and 0. For the TREE-CONSTRCT 
message broadcast by the other nodes: the upstream node 
262 Informatica 41 (2017) 259–274 N. Meghanathan  
 
id is the id of the node that the sender of the message 
considers to be the best node that would connect it to the 
LEADER node through a path estimated to be the one 
with the maximum bottleneck link weight (the value of 
which is also indicated in the message). The tree level 
field indicates the number of hops on the estimated 
maximum bottleneck link weight path from the sender 
node to the LEADER node of the DG tree. 
2.4 Propagation of the Tree-construct 
message 
When a node receives the TREE-CONSTRCT message 
(from a neighbor node) with a higher sequence number, 
it assumes that the DG tree that had been used until then 
no longer exists and resets its state information to the 
values listed in Section 2.2. The receiving node (say, 
node v) decides to further process the TREE-
CONSTRUCT message from a neighbor node (say, node 
u) if all the following conditions are met: (i) The 
upstream node id in the message is different from the id 
of the receiving node itself. (ii) The tree level value in 
the message is less than or equal to the tree level value 
maintained as part of the state information at the 
receiving node. (3) The value for the sender's estimated 
bottleneck link weight in the message is larger than the 
value for the receiver's estimated bottleneck link weight. 
(4) The weight of the directed edge from the sender node 
to the receiver node is larger than the latter's estimated 
bottleneck link weight for the path to the LEADER node. 
If all the above four conditions are met, the receiver node 
(node v) makes the following updates to its state 
information: (i) The receiver node updates its estimated 
bottleneck link weight for the path to the LEADER node 
to the minimum of the sender's (node u's) estimated 
bottleneck link weight value in the TREE-CONSTRUCT 
message and the weight of the directed edge u → v. (ii) 
The receiver node updates its upstream node id to that of 
the sender's node id. (iii) The value for the tree level is 
set to one more than the value for the tree level in the 
TREE-CONSTRUCT message. After making the above 
updates, the receiver node also rebroadcasts the TREE-
CONSTRUCT message in its neighborhood by changing 
the values for the upstream node id, sender's estimated 
bottleneck link weight and tree level fields in the message 
to the most recently updated values for these fields in its 
state information.  
Overall, a node receiving the TREE-CONSTRUCT 
message decides to further rebroadcast the message only 
if it can increase (through the sender node that sent it the 
message) its estimate for the bottleneck link weight path 
to the LEADER node (thus minimizing unnecessary 
retransmissions). Each node (other than the LEADER 
node) will be able to do so at least once because its initial 
value for the estimated bottleneck link weight is -∞ and 
all edge weights are positive as well as the value for the 
sender's estimated bottleneck link weight in the TREE-
CONSTRUCT message broadcast by the LEADER node 
is +∞. At the end of the tree construction process, each 
node (other than the LEADER node) would have joined 
the DG tree through an upstream node that is on the 
maximum bottleneck link weight path to the LEADER 
node. 
2.5 Propagation of the Tree-link-failure 
message 
Whenever an upstream node fails to receive an 
aggregated data packet from one of its downstream child 
nodes, the upstream node decides that the link to the 
child node has broken and initiates a TREE-LINK-
FAILURE message (included with a sequence number 
corresponding to the value sent by the LEADER node in 
the TREE-CONSTRUCT message) with the number of 
hops the message can get propagated equal to the tree 
level value for the initiating upstream node. The TREE-
LINK-FAILURE message is essentially reverse 
broadcast higher up the currently used DG tree so that 
the LEADER node can receive the failure message and 
initiate the construction of a new DG tree. Nodes that lie 
downstream of the failed link get to learn about the tree 
failure when a TREE-CONSTRUCT message with a 
higher sequence number (larger than the current value for 
the sequence number known) is received. 
3 Related work  
In this section, we first discuss related work data 
gathering in mobile sensor networks and then focus our 
discussion specifically on related work on determining 
stable data gathering trees in mobile sensor networks. 
3.1 Related work on data gathering in 
mobile sensor networks 
To the best of our knowledge, other than the work 
presented in Section 3.2 and the related works discussed 
below, the existing works (e.g., [10, 14, 32, 33]) in the 
literature on mobile sensor networks take the following 
hybrid approach: The regular data sensing nodes are 
considered static and there exists one or more mobile 
data collecting nodes that move around the static sensor 
nodes; a data gathering topology involving the data 
collecting nodes is constructed and maintained, if 
needed. Since all the sensor nodes are not considered 
mobile (the type of mobile sensor networks considered in 
our research) and the data gathering topology constructed 
is not network-wide (i.e., spanning all the sensor nodes), 
we do not delve further on related works based on the 
above approach.  
Among the very few network-wide spanning 
topology-based data gathering algorithms available in the 
literature for mobile sensor networks, most of the work 
focused on extending the classical LEACH (Low Energy 
Adaptive Clustering Hierarchy) [7] algorithm for static 
sensor networks to adapt to mobile environments. 
Variants of LEACH that have been proposed for MSNs 
focus on choosing the cluster heads by taking into 
account the residual energy available at the sensor nodes 
[2], mobility of the sensor nodes [29], stability of the 
links incident on a node [5] or proximity of the sensor 
nodes to certain landmarks [12]. Another work [30] 
related to cluster head selection proposed to set up a 
Bipartivity Index based Link Selection… Informatica 41 (2017)259–274 263 
panel of cluster heads (some of which serve as backup) 
to facilitate cluster reconfiguration due to node mobility.  
In [34], the authors proposed a cluster independent 
data collection tree (CIDT) protocol for mobile sensor 
networks that first partitions the entire network into 
clusters with a cluster head plus member nodes for each 
cluster and then chooses certain sensor nodes as data 
collection nodes (DCNs) that have better connection with 
the cluster heads. A data gathering tree of the DCNs is 
constructed and reconfigured over time when broken due 
to node mobility. The DCNs are selected in such a way 
that the links to the cluster heads and the links to the 
adjacent DCNs in the data gathering tree are stable. We 
opine that the CIDT protocol would incur a lot of control 
overhead (with respect to bandwidth and energy 
consumption) as two topologies (a cluster topology 
comprising of cluster heads plus their links to the 
member nodes and a tree topology of DCNs) have to be 
maintained in the network at any time. Though the two 
topologies have been formulated to be independent of 
each other, (due to node mobility) the identification of 
cluster heads and the DCNs has to be often initiated to 
maintain connectivity of the cluster heads to one or more 
near by DCNs.  
In [15], the authors propose a directed acyclic graph 
(DAG)-based topology for determining data gathering 
trees in mobile sensor networks. Whenever a data 
gathering tree is required, the sink constructs a DAG of 
the underlying network and runs a maximum bottleneck 
node weight-based data gathering (MaxBNW-DG) 
algorithm on the DAG. In this pursuit, the sink initiates 
data collection from all the nodes in the network on one 
or more multi-hop paths; the paths traversed by the data 
in cycle-free manner constitute a DAG of the network. 
The weight of a sensor node is determined based on the 
theory of thermal fields applied on the utility of the data 
sensed by the node as well as that of its neighbors. The 
sink then initiates a distributed version of the MaxBNW-
DG algorithm on the DAG such that each sensor node is 
located on a maximum bottleneck node weight path to 
the sink node. The bottleneck node weight for a path in 
[15] is calculated as the minimum of the weights of the 
intermediate node on the path; ties are broken in favor of 
paths of lower hop count. Due to node mobility, there 
may not be paths from one or more nodes to the sink 
node on the DAG. Similar to [34], we opine that a 
significant control overhead (in a mobile sensor network) 
would be encountered to first construct a DAG and then 
run a distributed version of the MaxBNW-DG algorithm 
on the DAG.  
3.2 Related work on stable data gathering 
trees in mobile sensor networks 
In [21], the authors had proposed a benchmarking 
algorithm to determine a sequence of stable data 
gathering trees that would exist for the longest time such 
that the number of tree transitions is the bare minimum. 
When a DG tree is required at a time instant t, the idea is 
to determine an intersection of the network graphs 
existing at time instants t, t+1, t+2, ... t+k such that the 
intersection graph is connected from time instants t ... t+k 
and not connected from time instants t ... t+k+1. That is, 
the inclusion of the graph at time instant t+k+1 to the 
intersection graph of time instants t ... t+k would 
disconnect the intersection graph from time instants t ... 
t+k+1. However, the algorithm is centralized in nature 
and would require the topology changes to be known a 
priori from the beginning to the end of the simulation 
session. On the other hand, the focus of research in this 
paper is to employ a distributed algorithm for 
determining stable data gathering trees using the BPI 
approach from complex network analysis - this approach 
does not require any a priori knowledge about the 
network topology changes as well as about the location 
and mobility of the nodes; we would just need the one-
hop neighborhood information at every node. 
In [19], the authors proposed distributed algorithms 
to determine the predicted link expiration time (LET)-
based data gathering trees for longer tree lifetime and the 
minimum distance spanning tree (MST)-based data 
gathering trees for longer node lifetime (time of first 
node failure due to exhaustion of energy) and longer 
network lifetime (time at which the network gets 
disconnected due to the failure of one or more nodes). 
The predicted link expiration time (LET) of a link i – j 
between two nodes i and j, currently at (Xi, Yi) and (Xj, 
Yj), and moving with velocities vi and vj in directions θi 
and θj (with respect to the positive X-axis) is computed 
using the formula proposed in [31]: 
                      
22
2222 )()()(
),(
ca
bcadRcacdab
jiLET


  (1) 
 
where a = vi*cosθi – vj*cosθj; b = Xi – Xj;  
 c = vi*sinθi – vj*sinθj; d = Yi – Yj 
  
The MST-based DG trees aim to minimize the 
largest Euclidean distance between the end nodes of a 
link in the DG tree, but are not as stable as the LET-DG 
trees [19]. Due to repeated tree reconfigurations, the gain 
obtained in the node lifetime (85-150% more than that of 
the LET-DG trees) does not equally get transferred to the 
gain obtained in network lifetime (only 15-130% more 
than that of the LET-DG trees) [19]. The LET-DG trees 
fit within the criteria of finding maximum bottleneck link 
weight-based DG trees (i.e., the objective is to maximize 
the minimum LET for a link on the path from any node 
to the LEADER node); whereas, the MST-DG trees fit 
within the criteria of finding minimum bottleneck link 
weight-based DG trees (i.e., the objective is to minimize 
the maximum value for the distance between the end 
nodes of the link on the path from any node to the 
LEADER node). In this paper, we model the short 
distance links as links with larger BPI/link stability score 
(measure of link weight; for further details, see Section 
4) and run the maximum bottleneck link weight-based 
algorithm to maximize the minimum link weight on the 
path from any node to the LEADER node of the DG tree; 
we show that by doing so, we can simultaneously incur a 
larger tree lifetime as well as a lower energy 
264 Informatica 41 (2017) 259–274 N. Meghanathan  
 
consumption per round (see Section 5 for the simulation 
results).   
In [24], the authors proposed a generic algorithm to 
determine maximum bottleneck node weight 
(MaxBNW)-based data gathering trees wherein the root 
node is the node with the largest weight. In [24], the 
weight of a node has been modeled as the sum of the 
weights of the links incident on it. The bottleneck node 
weight for a path from a node to the root node of the DG 
tree is the minimum of the weights of the nodes on the 
path. The MaxBNW-DG algorithm aims to determine a 
DG tree in which the path from any node to the root node 
is the path with the maximum bottleneck node weight. In 
[24], it has been observed that the MaxBNW-DG trees 
have different characteristics compared to the MaxBLW-
DG trees. The focus of research in this paper is to 
determine MaxBLW-DG trees by modeling the link 
weight as a measure of the stability of the link using the 
algebraic connectivity approach from complex network 
analysis that does not need the location and mobility 
information of the nodes. 
4 Bipartivity index (BPI)-based link 
selection strategy 
In this section, we describe the Bipartivity Index (BPI) 
[6]-based link selection strategy adapted from complex 
network analysis to quantify the stability of links (i.e., 
the link weights) in a mobile sensor network. We 
compute the link stability score (LSS) for an edge by 
analyzing the bipartivity of the "egocentric network for 
the edge" that is adapted from the notion of egocentric 
network of a node [13]. The egocentric network of a 
node [13] in a graph is a sub graph comprising of: 
vertices - the nodes and its neighbors and edges - the 
links involving the node and/or its neighbors. We define 
the egocentric network of an edge in a graph to be a sub 
graph comprising of: vertices - the end nodes of the edge 
and their neighbors and edges - the links incident on the 
end nodes of the edge.  
Our hypothesis for this research is based on the 
observation that links (we refer to as short distance links) 
whose end nodes are close enough to each other (vis-a-
vis the transmission range per node) are more likely to be 
stable (and vice-versa) compared to links for which the 
distance between the end nodes is closer to the 
transmission range per node. We define the fraction of 
link distance (fld) for an edge as the ratio of the 
Euclidean distance between the end nodes of the edge 
and the transmission range per node. For a short distance 
link, fld is expected to be appreciably less than 1. Our 
hypothesis is that the end nodes of a short distance link 
are more likely to share a significant fraction of their 
neighbors (and vice-versa) and we could compute the 
BPI for the egocentric network of the link to quantify the 
extent of this shared neighborhood that can be in turn 
used as the link stability score (LSS). Note that the 
egocentric network of an edge could be independently 
(and identically) constructed by each of the two end 
nodes of the edge based on the one-hop neighborhood 
information received from the other node (as part of 
periodic beacon exchange). 
A graph is said to be truly bipartite [4, 6] if we could 
partition the vertices of the graph into two disjoint sets 
such that all the edges in the graph are those that connect 
the vertices in one partition to vertices in the other 
partition and that there are no edges (called frustrated 
edges [6]) between vertices within the same partition. 
However, all network graphs cannot be expected to be 
truly bipartite. Hence, Estrada and Rodriguez-Velazquez 
[6] proposed the notion of bipartivity index (BPI) to 
measure the extent of bipartivity in a graph. The 
bipartivity index of a graph ranges from 0 to 1. If a graph 
is truly bipartite, then the bipartivity index is 1 and there 
are no frustrated edges between vertices within the same 
partition [6]. If a graph is not truly bipartite, then the 
bipartivity index will be less than 1. Estrada and 
Rodriguez-Velazquez [6] proposed a mechanism that 
will allow us to identify a partitioning of the vertices into 
two disjoint partitions as well as identify the frustrated 
edges (if the graph is not truly bipartite) involving 
vertices within the same partition. The mechanism 
proposed by Estrada Rodriguez-Velazquez [6] is to 
determine the eigenvalues of the adjacency matrix of the 
graph (to determine the bipartivity index, as shown in 
formulation 2) and use the signs (positive or negative) of 
the entries in the eigenvector corresponding to the 
smallest eigenvalue of the adjacency matrix to determine 
the partitioning of the vertices. If λ1, λ2, λ3, ..., λn are the 
eigenvalues of the adjacency matrix of a graph G of n 
vertices, then the bipartivity index (BPI) of G is given by 
the formulation below [6]:   
 
 

 



n
j
n
j
jj
n
j
j
GBPI
1 1
1
)sinh()cosh(
)cosh(
)(


                 (2) 
 
To measure the extent of shared neighborhood of the 
end vertices of an edge in a graph, we propose to 
compute the bipartivity index on the egocentric network 
of the edge and use the complement of the bipartivity 
index (1 - BPI) as the link stability score (LSS) for the 
edge. That is: LSS(u-v) = 1 - BPI(EGu-v), where EGu-v is 
the egocentric network graph of the edge u-v. We justify 
the above proposal as follows (also illustrated in Figures 
1-3: for an edge u-v where u1-u4 are four neighbors, 
other than vertex v, for a vertex u; and v1-v4 are four 
neighbors, other than vertex u, for a vertex v):  
If the end vertices of an edge u-v do not have any 
shared neighbors (i.e., u1-u4 and v1-v4 are all distinct 
vertices: as shown in Figure 1), then we could partition 
the vertices in the egocentric network of the edge to two 
disjoint partitions such that vertex u and the neighbors of 
vertex v are in one partition (referred to as partition-u) 
and vertex v and the neighbors of vertex u are in the 
other partition (referred to as partition-v). The egocentric 
network of the edge u-v with no common neighbors for 
the end vertices would be a truly bipartite graph (as in 
Bipartivity Index based Link Selection… Informatica 41 (2017)259–274 265 
Figure 1) as the only edges in the graph would be edges 
connecting vertices in partition-u to vertices in partition-
v. The bipartivity index of such an egocentric network 
graph would be 1.0 and as per our hypothesis, the link 
stability score for the edge would be 0.0. 
 
 
Figure 1: Example for a Truly Bipartite Egocentric 
Network of an Edge u-v. 
 
   
   2-a: BPI(EGu-v) = 0.75           2-b: BPI(EGu-v) = 0.85 
       LSS(u-v) = 0.25                     LSS(u-v) = 0.15 
 
 
2-c: BPI(EGu-v) = 0.92 
LSS(u-v) = 0.08 
Figure 2: Bipartivity Index of the Egocentric Network 
Graph of an Edge and its Link Stability Score (Varying 
the Number of Shared Neighbors for a Fixed Number of 
Edges in the Egocentric Network). 
    
   3-a: BPI(EGu-v) = 0.83           2-b: BPI(EGu-v) = 0.85 
       LSS(u-v) = 0.17                     LSS(u-v) = 0.15 
 
 
3-c: BPI(EGu-v) = 0.87 
LSS(u-v) = 0.13 
Figure 3: Bipartivity Index of the Egocentric Network 
Graph of an Edge and its Link Stability Score (Varying 
the Number of Edges in the Egocentric Network for a 
Fixed Number of Shared Neighbors). 
 
On the other hand, if the end vertices of an edge u-v 
share one or more of their neighbors: then, the 
eigenvector-based decomposition for bipartivity [6] 
applied on the egocentric network of the edge would 
group the two end vertices u and v together in one 
partition (referred to as partition u-v) and the neighbors 
of u and v together in the other partition (referred to as 
partition: neighbors of u and v). The BPI of such 
egocentric network graphs would be less than 1 (due to 
the presence of the frustrated edge u-v in the same 
partition) and the actual magnitude of the BPI would 
depend on the actual number of neighbors for the two 
end vertices (i.e., on the number of vertices in the other 
266 Informatica 41 (2017) 259–274 N. Meghanathan  
 
 
 
Figure 4: Illustration of the Eigenvector-based Partitioning of the Egocentric Networks of the Edges and the 
Bipartivity Index and Link Stability Scores of the Edges in an Example Graph. 
 
partition: neighbors of u and v) as well as on the 
number of shared neighbors (i.e., on the number of edges 
connecting the vertices in partition u-v to the vertices in 
the partition: neighbors of u and v). For a given 
egocentric network graph EGu-v with a certain number of 
edges and is not truly bipartite (as shown in Figures 2-a, 
2-b and 2-c): the larger the number of shared neighbors 
(i.e., fewer the number of vertices in the partition: 
neighbors of u and v), the lower the BPI (and larger will 
be the LSS score for the edge u-v). Likewise, for a given 
egocentric network graph EGu-v with a certain number of 
shared neighbors and is not truly bipartite (as shown in 
Figures 3-a, 3-b and 3-c): the larger the number of 
vertices in the partition: neighbors of u and v (i.e., the 
larger the number of edges connecting the vertices in 
partition u-v to the vertices in the partition: neighbors of 
u and v), the larger the BPI (and lower will be the LSS 
score for the edge u-v). 
In Figure 4, we illustrate the computation of the 
bipartivity index of the egocentric networks of the edges 
(with coordinates of the vertices as indicated in a grid) in 
an example graph. The egocentric networks for none of 
the edges have been observed to be truly bipartite. We 
illustrate the eigenvector-based partitioning of the 
egocentric networks of three edges: 6-7, 3-4 and 1-5 that 
have different values for the fraction of link distance 
(fld). We observe the bipartivity-based LSS values (1 - 
bipartivity index) for these three edges to increase with 
decrease in the fraction of link distance (fld) values. 
Overall, we see the expected trend between fld and the 
bipartivity-based LSS values for the edges: the LSS 
values are more likely to be higher for edges with lower 
fld values and vice-versa. 
5 Simulations 
In this section, we first present the simulation 
environment and the notion of normalized 
comprehensive relative performance (NCRP) score to 
identify the link selection strategy that effectively 
balances the tradeoffs with respect to the performance 
metrics, and then discuss in detail the results of the 
Bipartivity Index based Link Selection… Informatica 41 (2017)259–274 267 
simulations obtained by running the distributed version 
of the MaxBLW-DG algorithm incorporated with BPI as 
well as the LET-based link selection strategies. The 
simulations were conducted in a discrete-event simulator 
implemented in Java for mobile sensor networks. The 
simulator was earlier successfully used for other related 
studies (e.g., [19][21][23]) for mobile sensor networks. 
The medium access control (MAC) layer is assumed to 
be ideal to extract the best possible performance from the 
data gathering algorithm and the link selection strategies.  
5.1 Simulation environment 
In this sub section, we present the simulation parameters 
(network density, maximum node velocity, data size and 
the number of rounds as well as the frequency of LSS 
updates), the mobility model, the energy consumption 
model, DG tree update policy and channel access policies 
as well as define the structural metrics and performance 
metrics.  
Simulation Parameters: The network dimensions is 
100m x 100m (Area A = 10,000 m2) and the sink is 
assumed to be outside the network: at (50, 300). The 
number of nodes (N) in the network is set to be 50 and 
100, and the transmission range (R) per node values used 
are 25m and 35m. The average number of neighbors per 
node is computed using the formula: πR2N/A. 
Accordingly, we have the following scenarios of network 
density: low density (N = 50, R = 25m, Avg. # neighbors 
per node = 9.8), low-moderate density (N = 50, R = 35m, 
Avg. # neighbors per node = 19.2), moderate-high 
density (N = 100, R = 25m, Avg. # neighbors per node = 
19.6) and high density (N = 100, R = 35m, Avg. # 
neighbors per node = 38.5). The maximum velocity of a 
node (vmax) is set to be: 1 m/s (low mobility), 3 m/s 
(low-moderate mobility), 5 m/s (moderate-high mobility) 
and 10 m/s (high mobility). Thus, we have a total of 
sixteen scenarios of various combinations of network 
density and node mobility. We generated 100 instances 
of node mobility profiles for each of the above sixteen 
scenarios of network density and node mobility and 
averaged the results (with respect to the performance 
metrics and structural metrics discussed below) obtained 
for the MaxBLW-DG algorithm incorporated with the 
BPI and LET-based link selection strategies run on these 
100 instances. 
Mobility Model: To start with, the nodes are 
uniform-randomly distributed throughout the network. 
Mobility of the nodes is modeled according to the 
Random Waypoint model [3] with the nodes moving 
continuously (zero pause time) and independent of each 
other. A node decides to move from its current location 
to a randomly chosen location within the network with a 
velocity uniform-randomly chosen from [0...vmax]; after 
reaching the chosen location, the node continues its 
movement by randomly choosing another location with a 
different randomly chosen velocity from the above range. 
A node continues its movement like this throughout the 
simulation. We record the instances of direction change 
and the corresponding location and velocity to construct 
(offline) a mobility profile for each node and feed in this 
mobility profile to the MaxBLW-DG tree algorithm. 
Energy Consumption Model: Nodes are assumed to 
be of sufficient energy so that there are no node failures 
due to exhaustion of energy. The energy consumed at a 
node for data aggregation is the sum of the energy lost in 
receiving the aggregated data from each of its child 
nodes, fusing its own data with that of the aggregated 
data and transmitting the final aggregated data to its 
upstream node in the DG tree. The energy consumed at a 
node for broadcast tree discovery is the sum of the 
energy lost to receive the broadcast control message from 
each of its neighbors and to the transmit the control 
message in its neighborhood, if the conditions for 
rebroadcast are met. The energy consumption model 
used is a first-order radio model [28] that has been used 
in several of the previous work [7, 11] in the literature. 
According to this model: (i) the energy consumed at a 
sensor node to transmit a k-bit message over a distance d 
is given by: ETX(k, d) = Eelec*k + amp

*k*d2, where 
Eelec = 50 nJ/bit is the energy lost to run the radio 
transmitter or receiver circuitry and amp

= 100 
pJ/bit/m2 is the energy lost to run the transmitter 
amplifier; (ii) the energy lost at a sensor node to 
broadcast a k-bit message to all its neighbors within the 
transmission range R is simply given by ETX(k, R); the 
energy consumed at a sensor node to receive a k-bit 
message is ERX(k) = Eelec *k. The total energy 
consumed at a sensor node to receive k-bit broadcast 
messages transmitted by all of its n-neighbors is simply 
given by n * ERX(k). We do not take into consideration 
the energy lost due to periodic beacon exchange as both 
the LET and BPI-based link selection strategies 
considered in this research use it to determine the link 
weights. 
Data Size and Frequency of LSS Updates: We 
conduct the simulations for 2000 rounds (one round for 
every 0.25 seconds: a total of 500 seconds). The LSS 
scores of the links are estimated in the neighborhood of 
the nodes for every second. For each round: data gets 
aggregated across the network, starting from the leaf 
nodes and proceeding all the way to the LEADER node 
of the DG tree; the LEADER node forwards the final 
aggregated data to the sink. The data size is assumed to 
remain the same during network-wide aggregation. That 
is, the size of the aggregated data is assumed to be the 
same as the size of the data collected at the individual 
sensor nodes. The data size is 2000 bits and the size of 
the control messages used for tree configuration and 
maintenance is assumed to be 400 bits (sufficiently large 
enough to accommodate the various fields in the control 
messages).  
DG Tree Update Policy: Every time a DG tree is 
needed, the sink collects the weights of the links of the 
sensor nodes using a network-wide broadcast. The node 
with the largest sum of the link weights is considered as 
the root node (a.k.a. LEADER node) and the sink node 
sends a control message to the LEADER node to initiate 
tree discovery (a process also called tree 
268 Informatica 41 (2017) 259–274 N. Meghanathan  
 
reconfiguration). A DG tree is used as long as it exists: 
this is referred to as the Least Overhead Routing 
Approach (LORA) [1] in the literature of mobile ad hoc 
networks. 
Channel Access Policy: Note that in a particular 
timeslot, an intermediate node could collect data from 
only one of its child nodes (using Time Division Multiple 
Access, TDMA [35]) if the latter has its aggregated data 
available, and an intermediate node could transmit 
upstream its aggregated data only after receiving the 
same from each of its child nodes and aggregating with 
its own. An intermediate node could collect data from 
one of its child nodes at the same time (using Code 
Division Multiple Access, CDMA [35]) as any other 
intermediate node collects data from any of its child 
nodes. We assume that sufficient number of CDMA and 
TDMA codes are available at the sensor nodes (as 
needed) to facilitate data aggregation in the minimum 
number of time slots. 
Structural Metrics: We evaluated the following three 
structural metrics: (S-i) Tree Height, TH: The tree height 
is the maximum of the level numbers of the vertices (i.e., 
the number of hops) from the root node of the DG tree 
(with the root node considered to be at level 0). (S-ii) 
Fraction of Leaf Nodes, FLN: The fraction of leaf nodes 
is the ratio of the number of leaf nodes to the total 
number of nodes in the network graph. (S-iii) Average 
Number of Child Nodes per Intermediate Node, CNI: 
The average number of child nodes per intermediate 
node is the weighted average of the number of child 
nodes per intermediate node considered across all 
intermediate nodes.  
Performance Metrics: We evaluated the following 
three performance metrics: (P-i) Tree Lifetime, TL: The 
tree lifetime is the number of rounds a DG tree exists 
before one or more of its links fail due to node mobility, 
averaged over the duration of a simulation session. (P-ii) 
Aggregation Delay per Round, ADR: The aggregation 
delay per round is the minimum number of timeslots 
(computed as per algorithm [25]) it takes for data to get 
aggregated along the edges of the DG tree and reach the 
root node, averaged across all the rounds. (P-iii) Energy 
Consumption per Round, ECR: The energy consumed 
per round is the sum of the energy consumed at each of 
the nodes for data aggregation in the network plus the 
energy lost due to broadcast tree discoveries if the DG 
tree was reconfigured at the beginning of the round. We 
average the energy consumed across all the rounds of a 
simulation session.  
5.2 Normalized comprehensive relative 
performance (NCRP) score 
As described in Section 5.4, we observe a complex 
tradeoff between the three performance metrics: tree 
lifetime, energy consumption per round and aggregation 
delay per round. Since the performance metrics incur 
different levels of magnitude, we propose to bring the 
values incurred for these metrics on a common scale of 0 
to 1 using the method of normalization and propose to 
prefer the link selection strategy that incurs the largest 
value for the normalized score (or the complement of the 
normalized score, as appropriate) with respect to the 
individual metrics and/or with respect to the normalized 
comprehensive relative performance (NCRP) score 
(introduced below). In other words, the idea is to 
normalize the values incurred for each of the 
performance metrics incurred for the BPI and LET link 
selection strategies for a particular simulation scenario 
and compute a normalized comprehensive relative 
performance (NCRP) score with respect to the 
performance metrics (as shown below).  
As we seek for a larger tree lifetime (see Section 
5.4), lower energy consumption per round (see Section 
5.5) and lower aggregation delay per round (see Section 
5.6), we use the normalized values for the tree lifetime 
(TL), but complement of the normalized values for the 
energy consumption per round (ECR) and aggregation 
delay per round (ADR) to compute the NCRP score as a 
weighted average (weight = 1/3 for each metric) of these 
three values.  
Complement of Norm. ECR = 1 - Normalized ECR    
Complement of Norm. ADR = 1 - Normalized ADR 
NCRP = {Normalized TL + Complement of Norm. 
ECR + Complement of Norm. ADR}/3   ...........(3)          
5.3 Structural metrics 
In this section, we illustrate the results obtained with 
respect to the structural metrics for the DG trees 
determined based on the BPI and LET strategies. For 
lower energy consumption and lower aggregation delay 
per round, we would desire to have DG trees with a 
lower number of child nodes per intermediate node (so 
that an intermediate node can spend less energy in 
receiving data from each of its child nodes as well as 
aggregate data from its child nodes in fewer time slots) 
and at the same time a larger fraction of leaf nodes (so 
that the energy lost due to receptions could be lower and 
the number of nodes that readily have the data to transmit 
could be larger). However, we observe that it would not 
be possible to simultaneously maximize the fraction of 
leaf nodes as well as minimize the number of child nodes 
per intermediate node. As the fraction of leaf nodes in a 
DG tree increases, the fraction of intermediate nodes in 
the DG tree is bound to decrease and hence the number 
of child nodes per intermediate node is bound to only 
increase. We also observe a similar trend in the results 
(see Figures 5-6) for the structural metrics obtained for 
the DG trees based the LET and BPI strategies.  
The LET-based DG trees incur a larger fraction of 
leaf nodes (desirable for lower energy consumption and 
lower aggregation delay per round) and lower height 
(desirable for lower aggregation delay per round), but 
also simultaneously incur a larger number of child nodes 
per intermediate node. The BPI-based DG trees incur a 
relatively lower number of child nodes per intermediate 
node (desirable for lower energy consumption and lower 
aggregation delay per round), but also incur a lower 
fraction of leaf nodes. Thus, as envisioned previously, we 
observe a tradeoff between the fraction of leaf nodes and 
the number of child nodes per intermediate node.  
Bipartivity Index based Link Selection… Informatica 41 (2017)259–274 269 
 
            
(a) vmax = 1 m/s                        (b) vmax = 3 m/s                       (c) vmax = 5 m/s                       (d) vmax = 10 m/s 
Figure 5: Average Fraction of Leaf Nodes. 
 
            
(a) vmax = 1 m/s                        (b) vmax = 3 m/s                       (c) vmax = 5 m/s                       (d) vmax = 10 m/s 
Figure 6: Average Number of Child Nodes per Intermediate Node. 
 
            
(a) vmax = 1 m/s                        (b) vmax = 3 m/s                       (c) vmax = 5 m/s                       (d) vmax = 10 m/s 
Figure 7: Average Tree Height. 
 
Since the structural metrics are not dependent on 
node mobility, for a given network density: we observe 
the values incurred for the structural metrics to be 
independent of node mobility. For a given level of node 
mobility, we observe the fraction of leaf nodes as well as 
the average number of child nodes per intermediate node 
(incurred for both the LET and BPI-based DG trees) to 
decrease with increase in network density. On the other 
hand, for a given level of node mobility, we observe the 
tree height to increase with increase in network density 
(especially, as we increase from 50 to 100 nodes).  
5.4 Tree lifetime 
The BPI-DG trees incur significantly larger values for 
the tree lifetime (see Figure 8) compared to that of the 
LET-DG trees. The lifetime of the BPI-DG trees could 
be as large as 12 times the lifetime of the LET-DG trees 
(especially in scenarios of high network density and low 
node mobility). Even in the worst case (scenarios of low 
network density and high node mobility), the lifetime of 
the BPI-DG trees is at least 60% larger than the lifetime 
of the LET-DG trees. When considered across all the 16 
scenarios of network density and node mobility (refer 
Figure 9), the normalized values (with respect to tree 
lifetime) for the BPI-DG trees is at least 0.85; whereas, 
the normalized values for the LET-DG trees is at most 
0.52.  
 
 
     
(a) vmax = 1 m/s                        (b) vmax = 3 m/s 
 
     
(c) vmax = 5 m/s                        (d) vmax = 10 m/s 
Figure 8: Absolute Value of Average Tree Lifetime. 
 
From Figure 8, we could observe that for a fixed 
level of node mobility: the average lifetimes for the BPI-
DG trees relatively increase with increase in network 
density, whereas the average lifetimes for the LET-DG 
trees relatively decrease with increase in network 
density. From Figure 9, for a given level of network 
density: we could observe that the relative performance 
270 Informatica 41 (2017) 259–274 N. Meghanathan  
 
of the LET-DG trees with respect to tree lifetime 
improves with increase in node mobility. On the other 
hand, the relative performance of the BPI-DG trees with 
respect to tree lifetime remains almost the same or only 
marginally degrades with increase in node mobility. 
Thus, with respect to tree lifetime, the BPI-DG trees are 
relatively more scalable (i.e., are robust to increase in 
network density for a given level of node mobility) and 
remains relatively about the same (with increase in node 
mobility for a given network density). Such observations 
on the relative performance of the link selection 
strategies cannot be easily assessed by simply looking at 
the actual values incurred for tree lifetime in Figure 8 (or 
for that matter any other performance metric). 
 
     
(a) LET                                  (b) BPI 
Figure 9: Normalized Value of Average Tree Lifetime. 
5.5 Aggregation delay per round 
From Figures 10-11, we observe the LET-DG trees to 
incur lower ADR values for all conditions of network 
density and node mobility. For a given level of node 
mobility, the difference in the magnitude of the ADR 
values between the BPI-DG trees and the LET-DG trees 
increases with increase in network density. For a given 
network density, the ADR values incurred for the DG 
trees based on a particular link selection strategy remain 
about the same (there is no particular or a significant 
trend of variation) at different levels of node mobility. 
As we prefer a link selection strategy to yield lower 
aggregation delay per round for the DG trees, we plot the 
complement of the normalized ADR values (instead of 
just the normalized ADR values) of Figure 10 in Figure 
11. The ADR values (shown in Figure 10) incurred for 
both the LET-DG and BPI-DG trees appear to be directly 
proportional and positively correlated with the height of 
the DG trees (shown in Figure 7). Though the absolute 
ADR values (Figure 10) are observed to increase with 
increase in network density for a given level of node 
mobility, there is no change in the trend of the 
normalized ADR values (a measure of the relative 
performance) incurred for the DG trees based on both 
LET and BPI (for a given level of node mobility, the 
complement of the normalized ADR values for either 
LET or BPI almost remains the same with increase in 
network density).  
Unlike the exceptionally high values for the tree 
lifetime incurred with the BPI-DG trees and relatively 
very poor tree lifetime (see Figures 8-9) observed for the 
LET-DG trees, (the complement of) the normalized ADR 
values incurred for the DG trees determined based on 
LET and BPI are not far different. In the case of tree 
lifetime, the difference in the normalized values for the 
lifetime of the LET and BPI- based DG trees is at least 
0.33 and is as large as 0.92 (with a high median of 0.69). 
On the other hand, in the case of aggregation delay per 
round, the difference in the complement of the 
normalized ADR values of the LET and BPI-based DG 
trees is at most 0.27 (with a low median of 0.11). 
 
     
(a) vmax = 1 m/s                        (b) vmax = 3 m/s 
 
     
(c) vmax = 5 m/s                        (d) vmax = 10 m/s 
Figure 10: Absolute Value of Average Aggregation 
Delay per Round (in time units). 
 
     
(a) LET                                  (b) BPI 
Figure 11: Complement of the Normalized Value of 
Average Aggregation Delay per Round. 
 
5.6 Energy consumption per round 
The BPI-DG trees incur lower values for energy 
consumption per round (ECR) for all scenarios of 
network density and node mobility (see Figures 12-13), 
and the LET-DG trees incur larger ECR values for all 
scenarios. The relatively better performance of the BPI-
based DG trees with respect to ECR could be primarily 
attributed to the less frequent network-wide broadcasts 
(due to a larger tree lifetime) and the short distance 
nature of the links that are part of the transmissions 
during data aggregation. For a given level of node 
mobility: the difference in the ECR values between the 
BPI and LET-based DG trees increases with increase in 
network density (attributed to the relatively unstable 
LET-based DG trees with increase in network density). 
Of course, for a fixed network density, the difference in 
the ECR values increase with increase in node mobility. 
Though both LET and BPI incur an increase in the 
magnitude for the ECR values with increase in network 
Bipartivity Index based Link Selection… Informatica 41 (2017)259–274 271 
density and/or node mobility, the ECR values for the 
LET-DG trees are significantly larger than those incurred 
for the BPI-DG trees.  
 
     
(a) vmax = 1 m/s                        (b) vmax = 3 m/s 
 
     
(c) vmax = 5 m/s                        (d) vmax = 10 m/s 
Figure 12: Absolute Value of Average Energy 
Consumption per Round (in Joules). 
For a given level of node mobility (see Figure 13): 
the complement of the normalized ECR values for the 
BPI-DG trees increases with increase in network density; 
whereas, the complement of the normalized ECR values 
for the LET-DG trees decreases with increase in network 
density. Thus, for a given level of node mobility: the 
relative performance (with respect to ECR) of the BPI-
DG trees vis-a-vis the LET-DG trees improves with 
increase in network density. On the other hand, (see 
Figure 13), for a given network density: the complement 
of the normalized ECR values for the LET-DG trees 
slightly increase with increase in node mobility; whereas, 
the complement of the normalized ECR values for the 
BPI-DG trees slightly decrease with increase in node 
mobility, more visibly in networks of high density. 
 
     
(a) LET                                  (b) BPI 
Figure 13: Complement of the Normalized Value of 
Average Energy Consumption per Round. 
5.7 Analysis of the relative performance 
tradeoff based on the NCRP scores 
We observe the lifetime incurred for the BPI-DG trees to 
be significantly larger than that of the LET-DG trees. 
Likewise, the average energy consumption per round 
incurred for the BPI-DG trees is lower than that incurred 
for the LET-DG trees. On the other hand, the aggregation 
delay per round incurred for the LET-DG trees is lower 
than the aggregation delay per round values incurred for 
the BPI-DG trees. This illustrates a complex {tree 
lifetime, energy consumption per round} vs. 
{aggregation delay per round} tradeoff. We use the 
normalization approach (introduced in Section 5.2) to 
analyze this tradeoff with respect to the above three 
performance metrics between LET and BPI. We observe 
(from Figure 14) the BPI-based link selection strategy to 
effectively balance this tradeoff and incur larger values 
for the NCRP score under all the 16 different scenarios 
of network density and node mobility. A closer look at 
the actual values (from Figures 8, 10 and 12) and 
normalized values (from Figures 9, 11 and 13) for the 
three individual performance metrics illustrates the same. 
 
     
(a) vmax = 1 m/s                        (b) vmax = 3 m/s 
 
     
(c) vmax = 5 m/s                        (d) vmax = 10 m/s 
Figure 14: Normalized Comprehensive Relative 
Performance Score for the Link Selection Strategies. 
With respect to the impact of network density and 
node mobility: the NCRP scores for the LET-DG trees 
decrease with increase in network density (for a fixed 
level of node mobility) and increase with increase in 
node mobility (for a fixed network density). On the other 
hand, the NCRP scores for the BPI-DG trees increase 
with increase in network density (for a fixed level of 
node mobility) and remain about the same with increase 
in node mobility (for a fixed network density). Thus, we 
observe the overall relative performance of the BPI-DG 
trees to only improve (or remain the same) with increase 
in network density and/or node mobility. 
Table 1 provides a comprehensive overview of the 
simulation results, identifying the link selection strategy 
that yields the most desirable values for the structural 
metrics and performance metrics. With respect to the 
structural metrics, we desire to have larger values for the 
fraction of leaf nodes and lower values for the number of 
child nodes per intermediate node and tree height. With 
respect to the performance metrics, we desire to have 
larger values for tree lifetime and lower values for the 
energy consumption per round and aggregation delay per 
round. With respect to the NCRP score (see equation 3 
for the formulation), we desire to have values closer to 1. 
On these lines, the LET strategy returns the most 
desirable values for two of the three structural metrics as   
272 Informatica 41 (2017) 259–274 N. Meghanathan  
 
Table 1: Link Selection Strategies (LET vs. BPI) that Incur the most Desirable Values for the Structural Metrics and 
Performance Metrics as well as the NCRP Score. 
 
well as the aggregation delay per round (all of which 
are not dependent on node mobility); however, the BPI 
strategy is useful to discover stable DG trees (larger tree 
lifetime) that also incur a lower energy consumption per 
round (attributed to the less frequent network-wide 
broadcasts and short distance nature of the links). The 
relatively better performance of the BPI-DG trees with 
respect to the tree lifetime and energy consumption per 
round and competitive values for the aggregation delay 
per round lift the normalized comprehensive relative 
performance (NCRP) scores to be above that of the LET 
strategy. The minimum and maximum difference in the 
NCRP scores incurred for the BPI-DG trees vis-a-vis the 
LET-DG trees are respectively 0.15 (observed in 
scenario of low network density and high node mobility) 
and 0.55 (observed in scenario of high network density 
and low node mobility). The median difference in the 
NCRP scores is 0.38.  
6 Conclusions 
The high-level contribution of this paper is a 
proposal to use the Bipartivity Index (BPI) metric (a 
spectral graph-theoretic metric used in complex network 
analysis) to determine stable data gathering (DG) trees 
for mobile sensor networks (MSNs). Our hypothesis in 
this research is that the end nodes of short distance links 
(the Euclidean distance between the end nodes of the link 
is far less than the transmission range of the nodes) are 
more likely to share a significant fraction of their 
neighborhood (and vice-versa). As short distance links 
are more likely to be stable too (and vice-versa), we 
propose to use the BPI strategy to evaluate and quantify 
the extent of shared neighborhood of the end vertices of 
the edges for determining stable DG trees in mobile 
sensor networks.  
We model the neighborhood of the end vertices of an 
edge u-v as an egocentric network EGu-v comprising of 
the end vertices and their neighbors as nodes and the 
edges incident on the end vertices as links. We have 
shown (through detailed theoretical analysis and 
illustrative examples) that edges whose egocentric 
networks have smaller values for the BPI are more likely 
to be short distance links. The egocentric network for an 
edge and its BPI score could be independently 
determined by the two end vertices of the edge based on 
just the one-hop neighborhood information and without 
knowledge about the location and mobility of the nodes. 
We quantify the link stability score (LSS) for an edge u-v 
as 1 - BPI(EGu-v). We define the bottleneck link weight 
of a path as the minimum of the weights of the 
constituent links on the path. Whenever a DG tree is 
required, we determine the maximum bottleneck link 
weight-based DG tree for which the bottleneck link 
weight of the path from any node to the root node is the 
maximum (the root node is the node with the largest sum 
of the LSS scores of its incident links). 
We have compared the performance of the BPI-
based DG trees with that of the DG trees determined 
based on the predicted link expiration time (LET) - the 
only well-known strategy so far [19] to determine stable 
DG trees for MSNs. We observe the BPI-DG trees to be 
significantly more stable as well as incur a lower energy 
consumption per round compared to that of the LET-DG 
trees. On the other hand, we observe the aggregation 
delay per round incurred for the LET-DG trees to be 
lower than the aggregation delay per round incurred for 
the BPI-DG trees. We thus observe a complex {tree 
lifetime, energy consumption per round} vs. 
{aggregation delay per round} tradeoff. We attribute this 
tradeoff to the unstable nature of the LET-DG trees 
(leading to more energy-intensive network-wide 
broadcast tree discoveries) and lower height as well as a 
larger fraction of leaf nodes (contributing to a lower 
aggregation delay per round).  
Finally, we propose the use of a normalization-based 
approach to evaluate the relative performance of the link 
selection strategies in a scale of 0...1 and thereby 
# nodes Tr. range vmax FLN  CNI TH TL ECR ADR NCRP 
50 25m 
1 m/s LET BPI LET BPI BPI LET BPI 
3 m/s LET BPI LET BPI BPI LET BPI 
5 m/s LET BPI LET BPI BPI LET BPI 
10 m/s LET BPI LET BPI BPI LET BPI 
50 35m 
1 m/s LET BPI LET BPI BPI LET BPI 
3 m/s LET BPI LET BPI BPI LET BPI 
5 m/s LET BPI LET BPI BPI LET BPI 
10 m/s LET BPI LET BPI BPI LET BPI 
100 25m 
1 m/s LET BPI LET BPI BPI LET BPI 
3 m/s LET BPI LET BPI BPI LET BPI 
5 m/s LET BPI LET BPI BPI LET BPI 
10 m/s LET BPI LET BPI BPI LET BPI 
100 35m 
1 m/s LET BPI LET BPI BPI LET BPI 
3 m/s LET BPI LET BPI BPI LET BPI 
5 m/s LET BPI LET BPI BPI LET BPI 
10 m/s LET BPI LET BPI BPI LET BPI 
Bipartivity Index based Link Selection… Informatica 41 (2017)259–274 273 
overcome the difficulty arising in analyzing the tradeoffs 
among the performance metrics whose values fall under 
different levels of magnitude (as is the case for the three 
performance metrics studied in this paper). We illustrate 
the use of the normalization-based approach to identify 
the link selection strategy that best balances the 
performance tradeoff as well as whose relative 
performance is more scalable (with increase in network 
density and/or node mobility). We observe the BPI-based 
link selection strategy to yield DG trees that incur the 
largest values for the normalized comprehensive relative 
performance (NCRP) scores under all the 16 scenarios of 
network density and node mobility. To vindicate the 
larger NCRP scores, we observe the BPI-based DG trees 
to simultaneously incur larger values for tree lifetime and 
lower values for energy consumption per round and not 
so relatively high values for aggregation delay per round. 
We observe the comprehensive relative performance of 
the BPI-DG trees to be more scalable with increase in 
network density and not much affected with increase in 
node mobility.  
7 Acknowledgment 
The work leading to this paper is funded through the 
Minority Leaders Program, “A Stable Trustworthy 
Neighborhood Scheme for Secure Mobile Sensor 
Networks,” funded by the Clarkson Aerospace/US Air 
Force Research Lab. Subcontract Number: JACK 15-
S7700-02-C2. The AFRL public clearance approval 
number for this article is 88ABW-2017-0475. The views 
and conclusions contained in this document are those of 
the author and should not be interpreted as necessarily 
representing the official policies, either expressed or 
implied, of the funding agency. 
8 References 
[1]  M. Abolhasan, T. Wysocki, E. Dutkiewicz, "A 
Review of Routing Protocols for Mobile Ad hoc 
Networks," Ad hoc Networks, vol. 2, no. 1, pp. 1-
22, 2004. 
[2]  T. Banerjee, B. Xie, J. H. Jun and D. P. Agarwal, 
"LIMOC: Enhancing the Lifetime of a Sensor 
Network with Mobile Clusterheads," Proceedings 
of the Vehicular Technology Conference Fall, pp. 
133-137, Baltimore, MD, USA, September 30 - 
October 3, 2007. 
[3]  C. Bettstetter, H. Hartenstein and X. Perez-Costa, 
“Stochastic Properties of the Random-Way Point 
Mobility Model,” Wireless Networks, vol. 10, no. 5, 
pp. 555-567, September 2004. 
[4]  T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. 
Stein, Introduction to Algorithms, 3rd Edition, MIT 
Press, July 2009. 
[5]  S. Deng, J. Li and L. Shen, "Mobility-based 
Clustering Protocol for Wireless Sensor Networks 
with Mobile Nodes," IET Wireless Sensor Systems, 
vol. 1, no. 1, pp. 39-47, March 2011. 
[6]  E. Estrada and J. A. Rodriguez-Velazquez, 
"Spectral Measures of Bipartivity in Complex 
Networks," Physical Review E 72, 046105, pp. 1-6, 
2005. 
[7]  W. Heinzelman, A. Chandrakasan and H. 
Balakarishnan, “Energy-Efficient Communication 
Protocols for Wireless Microsensor Networks,” 
Proceedings of the Hawaaian International 
Conference on Systems Science, Maui, HI, USA, 
January 2000. 
[8]  B. Hofmann-Wellenhof, H. Lichtenegger and J. 
Collins, Global Positioning System: Theory and 
Practice, 5th Edition, Springer, October 2013. 
[9]  B. Hull, V. Bychkovsky, Y. Zhang, K. Chen, M. 
Goraczko, A. Miu, E. Shih, H. Balakrishnan and S. 
Madden, “CarTel: A Distributed Mobile Sensor 
Computing System,” Proceedings of the 4th 
International Conference on Embedded Networked 
Sensor Systems, pp. 125-138, Boulder, CO, USA, 
November 2006. 
[10]  Y. Lai, J. Xie, Z. Lin, T. Wang and M. Liao, 
"Adaptive Data Gathering in Mobile Sensor 
Networks using Speedy Mobile Elements," Sensors, 
vol. 15, no. 9, pp. 23218-23248, 2015. 
[11]  S. Lindsey, C. Raghavendra and K. M. Sivalingam, 
"Data Gathering Algorithms in Sensor Networks 
using Energy Metrics," IEEE Transactions on 
Parallel and Distributed Systems, vol. 13, no. 9, pp. 
924-935, September 2002. 
[12]  C-M. Liu, C-H. Lee and L-C. Wang, "Distributed 
Clustering Algorithms for Data Gathering in 
Wireless Mobile Sensor Networks," Journal of 
Parallel and Distributed Computing, vol. 67, no. 
11, pp. 1187-1200, November 2007. 
[13]  P. V. Marsden, "Egocentric and Sociocentric 
Measures of Network Centrality," vol. 24, no. 4, pp. 
407-422, October 2002. 
[14]  M. Ma and Y. Yang, "SenCar: An Energy-Efficient 
Data Gathering Mechanism for Large-Scale 
Multihop Sensor Networks," IEEE Transactions on 
Parallel and Distributed Systems, vol. 18, no. 10, 
pp. 1476-1488, October 2007. 
[15]  M. Macuha, M. Tariq and T. Sato, "Data Collection 
Method for Mobile Sensor Networks Based on the 
Theory of Thermal Fields," Sensors, vol. 11, no. 7, 
pp. 7188-7203, July 2011. 
[16]  N. Meghanathan, “A Data Gathering Algorithm 
based on Energy-aware Connected Dominating Sets 
to Minimize Energy Consumption and Maximize 
Node Lifetime in Wireless Sensor Networks,” 
International Journal of Interdisciplinary 
Telecommunications and Networking, vol. 2, no. 3, 
pp. 1-17, July-September 2010. 
[17]  N. Meghanathan, "Exploring the Performance 
Tradeoffs among Stability-Oriented Routing 
Protocols for Mobile Ad hoc Networks," Network 
Protocols and Algorithms – Special Issue on Data 
Dissemination for Large scale Complex Critical 
Infrastructures, vol. 2, no. 3, pp. 18-36, November 
2010. 
[18]  N. Meghanathan, "A Comprehensive Review and 
Performance Analysis of Data Gathering 
Algorithms for Wireless Sensor Networks," 
274 Informatica 41 (2017) 259–274 N. Meghanathan  
 
International Journal of Interdisciplinary 
Telecommunications and Networking, vol. 4, no. 2, 
pp. 1-29, April-June 2012. 
[19]  N. Meghanathan, "Link Expiration Time and 
Minimum Distance Spanning Trees based 
Distributed Data Gathering Algorithms for Wireless 
Mobile Sensor Networks," International Journal of 
Communication Networks and Information 
Security, vol. 4, no. 3, pp. 196-206, December 
2012. 
[20]  N. Meghanathan, "Routing Protocols to Determine 
Stable Paths and Trees using the Inverse of 
Predicted Link Expiration times for Mobile Ad hoc 
Networks," International Journal of Mobile 
Network Design and Innovation, vol. 4, no. 4, pp. 
214-234, June 2012.  
[21]  N. Meghanathan and P. Mumford, "A 
Benchmarking Algorithm to Determine the 
Sequence of Stable Data Gathering Trees for 
Wireless Mobile Sensor Networks," Informatica – 
An International Journal of Computing and 
Informatics, vol. 37, no. 3, pp. 315-338, October 
2013. 
[22]  N. Meghanathan, Recent Advances in Ad Hoc 
Networks Research, Nova Science Publishers, 
August 2014. 
[23]  N. Meghanathan, "Stability-based and Energy-
Efficient Distributed Data Gathering Algorithms for 
Mobile Sensor Networks," Ad hoc Networks, vol. 
19, pp. 111-131, August 2014. 
[24]  N. Meghanathan, "A Generic Algorithm to 
Determine Maximum Bottleneck Node Weight-
based Data Gathering Trees for Wireless Sensor 
Networks," Network Protocols and Algorithms, vol. 
7, no. 3, pp. 18-51, November 2015. 
[25]  N. Meghanathan, "A Benchmarking Algorithm to 
Determine Minimum Aggregation Delay for Data 
Gathering Trees and an Analysis of the Diameter-
Aggregation Delay Tradeoff," Algorithms, vol. 8, 
no. 3, pp. 435-458, July 2015. 
[26]  N. Meghanathan, "A Greedy Algorithm for 
Neighborhood Overlap-based Community 
Detection," Algorithms, vol. 9, no. 1, p. 8: 1-26, 
2016. 
[27]  P. De Meo, E. Ferrara, G. Fiumara and A. Provetti, 
"On Facebook, Most Ties are Weak," 
Communications of the ACM, vol. 57, no. 11, pp. 
78-84, November 2014. 
[28]  T. S. Rappaport, Wireless Communications: 
Principles and Practice, 2nd edition, Prentice Hall, 
January 2002. 
[29]  G. Santhosh Kumar, M. V. Vinu Paul and K. Jacob 
Poulose, "Mobility Metric based LEACH-Mobile 
Protocol," Proceedings of the 16th International 
Conference on Advanced Computing and 
Communications, pp. 248-253, Chennai, India, 
December 2008. 
[30]  H. K. D. Sarma, R. Mall and A. Kar, "E2R2: 
Energy-Efficient and Reliable Routing for Mobile 
Wireless Sensor Networks," IEEE Systems Journal, 
vol. 10, no. 2, pp. 604-616, April 2015. 
[31]  W. Su and M. Gerla, “IPv6 Flow Handoff in Ad 
hoc Wireless Networks using Mobility Prediction,” 
Proceedings of the IEEE Global 
Telecommunications Conference, pp. 271-275, 
December 1999.  
[32]  D. Tao, S. Tang and H. Ma, "Low Cost Data 
Gathering using Mobile Hybrid Sensor Networks," 
Lecture Notes in Computer Science, vol. 7363, pp. 
193-206, July 2012. 
[33]  Y-C. Tseng, F-J. Wu and W-T. Lai, "Opportunistic 
Data Collection for Disconnected Wireless Sensor 
Networks by Mobile Mules," Ad Hoc Networks, 
vol. 11, no. 3, pp. 1150-1164, May 2013. 
[34]  R. Velmani and B. Kaarthick, "An Energy Efficient 
Data Gathering in Dense Mobile Wireless Sensor 
Networks," International Scholarly Research 
Notices Sensor Networks, vol. 2014, Article ID: 
518268, 10 pages, 2014.  
[35]  A. J. Viterbi, CDMA: Principles of Spread 
Spectrum Communication, 1st Edition, Prentice 
Hall, April 1995. 
[36]  H. Zhang and J. C. Hou, "Maintaining Sensing 
Coverage and Connectivity in Large Sensor 
Networks," Wireless Ad hoc and Sensor Networks: 
An International Journal, vol. 1, no. 1-2, pp. 89-
123, January 2005. 
 
 Informatica 41 (2017) 275–282 275
  
 
Power and Limitations of Formal Methods for Software Fabrication: 
Thirty Years Later  
Edgar Serna M. and Alexei Serna A. 
Facultad de Ciencias Básicas e Ingeniería, Corporación Universitaria Remington. Medellín, Antioquia, Columbia 
E-mail: edgar.serna@uniremington.edu.co, alexei.serna@uninremington.edu.co 
 
Keywords: formalization, automation, formal languages, software lifecycle, software quality 
Received: February 16, 2017 
 
In 1987, Michael Jackson presented his work "Power and Limitations of Formal methods for software 
fabrication" at the AIT Conference, which analyzed the advantages and limitations of formal methods up 
to that time. His conclusion was that formal methods had undoubted capabilities and advantages, but they 
also had serious limitations that prevented their widespread acceptance and adoption. The aim of this 
paper is to present the current context of formal methods compared with what Jackson described three 
decades ago. A tour of the strengths and limitations of formal methods is taken through a review of 
literature in the timeline of the past thirty years. The conclusion is that little progress has been made on 
this issue in relation to the situation presented by Jackson, and formal methods still need more work from 
academia, industry and the community. 
Povzetek: Prispevek analizira napredek formalnih metod s primerjavo z Jacksonovo metodo izpred 
trideset let.
1  Introduction
The idea of making mathematics an area of increased use 
and applicability in different disciplines and contexts can 
be traced to ancient Greece, where Pythagoras, Plato, 
Aristotle and Euclid tried to make its study and use 
accessible to a wide audience [1]. Beyond incipient 
astronomy, the development of physics, public works and 
the little there was of mechanics, however, its context 
continued to be limited to accounting and commercial 
calculations [2]. 
For a long time, initiatives were developed with 
similar objectives, and although some have been relatively 
successful, especially with the emergence of the 
engineering disciplines and scientific specializations, the 
situation appears to remain in other areas [3]. Despite 
these achievements, the general idea is that mathematics 
is a field of knowledge that is extremely complicated and 
difficult to learn and apply to social realities. This attitude 
has created many myths that have taken hold in the 
formative process, where students manifest a fear of 
taking mathematics courses whose content is higher than 
that in other courses [4, 5). Beyond the fact that these 
myths may or may not be true, the reality is that there is 
still no generalized context in which mathematics is 
appreciated for what it is and not what it seems. For 
example, one can mention that without the contributions 
of mathematics, major scientific and engineering 
developments would not have materialized in areas such 
as astronomy, physics, chemistry, natural sciences and, 
more recently, computer science [6]. In the latter, it was 
adopted as formal methods with the idea of mathematizing 
processes to develop software and design hardware [7].  
Although many moments of the appearance of formal 
methods can be found in the history of these sciences and 
many authors have submitted contributions in this regard, 
it was not until the 1960s that the concept was taken 
seriously, after the enactment of the so-called software 
crisis [8]. The community then directed its gaze to 
mathematics as a lifeline that had helped other disciplines, 
with the aim of integrating it into software development to 
solve this crisis and ones that might appear later. 
Since that time, various researchers, scientists, 
authors and organizations have been given the task of 
mathematizing software development and automating 
their tests in search of better-quality products. At this time 
there has been progress, but at the same time, many 
problems have been found due to the limitations 
diagnosed in formal methods [9]. In 1987, Michael 
Jackson [10], a British computer scientist and professor, 
made a presentation on what he considered the advantages 
and limitations of formal methods at that time. Three 
decades after his presentation, it is time to review whether 
mathematics in computer sciences has overcome those 
weaknesses and built upon its strengths, if it continues on 
the same path, or if it needs more time to achieve what was 
proposed as a lifesaver for software problems.    
Building on the work of Jackson, the aim of this article 
is to take a tour through three decades of publications on 
the advantages and limitations of formal methods and 
determine how far we have advanced in their potentiation 
and/or improvement. This work presents what authors 
have proposed, innovated and applied regarding the 
formalization of software development, and the results 
and conclusions of Jackson are contrasted with the current 
reality. In addition, current and future challenges for 
industry, academia and the community regarding the 
acceptance and widespread use of formal methods are 
described. 
276 Informatica 41 (2017) 275–282 E. Serna M. et al.  
 
2  Method 
To develop this review, it was applied the methodology 
proposed by Serna [47] to perform reviews of the 
literature. The search was performed in the following 
databases: ScienceDirect, ACM Digital Library, Scopus, 
and Web of Science. It was searched the term Formal 
Method(s) combined with the terms definition, 
description, power, limitations, best practices, effective 
development, and development, first in the title or 
abstract. By applying this method were selected 78 works 
including articles, books, works in events and websites. To 
this population, the sample was applied the 
inclusion/exclusion criteria (thematic pertinence, author's 
relevance, focus, quality of results, practical application, 
and others) in order to determine if the content contributes 
to the achievement of the goals set in this review. After 
this procedure, we get 48 works, and after performing a 
quick reading and applying the concepts of quality in order 
to determine the value of each of them for this research the 
final sample was constituted by 40 works. 
3 Jackson’s conclusions 
The work of Jackson [10] had the goal of presenting an 
analysis of the state of formal methods for the decade of 
the 1980s. In his presentation, he argued that it would be 
tedious and boring to describe the advantages and 
limitations alone, and for this reason, he dedicated almost 
all of the content to analyzing the fact that the problem 
was not so much with the formal methods but rather with 
the body of knowledge itself or the practice of developing 
software at that time. He asserted that these limitations 
could not be overcome solely by improving formal 
methods because they had been imposed by the inherent 
informality of that practice. To this end, it was one thing 
to describe the real world with mathematics and a very 
different thing to do so with natural language because the 
ambiguity of the latter creates complications for 
translation. 
He also argued that formal methods offered a range of 
formalization but did not indicate which option to select 
nor how to apply it in the development of software, a task 
that corresponded to the developer based on experience 
and skills. For him, at that time, it was thought that 
software development was like a manufacturing process, 
in which descriptions were written in some language and 
assumed to be analogous to the parts of a mechanical 
product, where language was the raw material to 
manufacture them. It was also assumed at that time that 
software development was primarily a task of 
composition, not of decomposition, which duplicated the 
thinking in the forties of engineers in the construction of 
rockets, who felt that the process consisted of five 
descriptions: a guidance system, a propulsion system, fuel, 
a structure and aerodynamic principles, when, to satisfy 
them all, only the structure, fuel and streamlining were 
needed. According to Jackson, that process is what defines 
composition, but it demands creativity and invention that 
cannot always be automated. 
He concluded that formal methods have undeniable 
advantages and potential, but for the practice of software 
development at the time, they also had serious limitations 
that hampered their widespread acceptance. Although 
most of his work was devoted to demonstrating that most 
of the blame belonged to the practice of software 
development, he listed some advantages and limitations, 
which are presented in Table 1. 
Power 
Their descriptions are accurate and non-
ambiguous 
Their descriptions can be manipulated by 
symbols 
Mathematics provides a high degree of reliability 
The math is based on a large body of knowledge 
 
Limitations 
They restrict the developer to a single language  
They are focused on transformations between 
descriptions 
They tend not to be methods 
Not all software projects can be formalized 
Formalisms tend to be isolated from each other 
Research focuses on individual formalisms 
A broad integration of formalisms is required 
Table 1: Power and limitations of formal methods [10]. 
Three decades have passed since these claims, and 
there remains in the environment the feeling that formal 
methods cannot become an alternative for developing 
reliable, secure and quality software. In the next section, 
the development of formal methods over the 30 years after 
the work of Michael Jackson is described. 
4 The last thirty years of formal 
methods 
In Seven myths of formal methods, Anthony Hall [11] 
presents his analysis of seven myths that existed at that 
time: 1) they ensure that the software is perfect, 2) they 
prove that the program is correct, 3) they are only useful 
in critical systems, 4) they involve complex mathematics, 
5) they increase the cost of development, 6) they are 
incomprehensible to customers, and 7) nobody uses them 
in real projects. The author claims that many of the things 
said for or against formal methods were generated from 
the experiences of developers when applying them but 
that, although there might be some uncertainty, the reality 
was that they had more advantages than disadvantages. In 
his own experience, said Hall, these myths had to be 
reformulated and established as a type of process because, 
by then, the transfer from academia to industry was 
working consistently. To Gaudel [12], the advantages and 
problems of formal methods were limited to specification 
and design, as shown in Table 2. 
According to Young [13], positive or negative 
opinions on formal methods had generated controversy for 
years, while the formal methods community fell short in 
explaining what they are and what their advantages are 
due to using descriptions and language that only 
Power and Limitations of Formal…  Informatica 41 (2017) 275–282   277 
community members understood and recognized. 
However, likewise, he asserted that there was a lack of will 
in the software engineering community to recognize the 
true value of formal methods. For him, if the goal was to 
improve the quality of software, it was necessary for both 
communities to work together to achieve it. For their part, 
Barroca and McDermid [14] felt that formal methods 
could be used in two different ways: to produce 
specifications for conventional systems development and, 
from that point, to generate formal specifications to verify 
the accuracy of the program. For these authors and at that 
time, the benefits of formal methods were as follows: 1) 
they ensure a consistent interpretation of the specification, 
2) they allow verification of the application, and 3) they 
remove the ambiguity of language. They also listed 
weaknesses: 1) their development status is low, 2) the 
specification cannot be validated, 3) they only have 
mathematical interpretations, and 4) non-functional 
requirements cannot be adequately articulated in the 
context.  
 Power Limitations 
Specification 
They make it 
possible to 
analyze and 
encourage it 
It is structured 
and reusable 
It is testable 
Correctness 
cannot be 
formalized 
To express the 
properties, it is 
necessary to 
formulate certain 
aspects of the 
application 
domain 
They only allow 
external 
verification 
Design 
It is the best way 
to prevent human 
errors 
It is a good 
approximation to 
zero failures 
Correctness tests 
are improved  
They are difficult 
to implement 
It is still 
necessary to 
check the design 
with a tool 
The mathematical 
rigor does not 
completely 
eliminate errors 
The tests do not 
provide security 
Table 2: Power and limitations of formal methods [12]. 
For Robert Vienneau [15], changes in computing in 
the 1970s and 1980s generated revolutionary ideas that 
materialized in formal methods, but there was not yet a 
unified philosophy about them. Although formal methods 
were promulgated as a technology applicable to the entire 
software life cycle, the author wondered why they were 
not more widely known. He asserted that part of the 
problem was educational and that many of its limitations 
would never be overcome, but he was convinced that some 
restrictions would be addressed through research and 
practice. Table 3 shows the limitations and advantages that 
this author described. 
Power 
They can be used to verify a system 
They complement the natural language descriptions 
and give them accuracy 
They can show that an implementation satisfies a 
specification 
More precise specifications are achieved 
Internal communication is improved 
They provide the ability to verify designs before 
running them during the test 
They offer higher quality and productivity 
 
Limitations 
They cannot be used to validate a system 
They can never replace the knowledge that the engineer 
has of the system 
They can never fully replace tests 
Their applicability is doubtful in systems with many 
lines of code 
Table 3: Power and limitations of formal methods [15]. 
Liu and Adams [17] concluded that developers 
utilized formal methods with the hope of refining 
processes and improving specifications but often did not 
achieve their goals due to the limitations of this 
technology: the refinement rules are not sufficient to 
guarantee that a refined specification satisfies the 
requirements, and in addition, these rules cannot be 
reutilized and are difficult to apply in practice. For this 
reason, they recommended modifying the existing 
refinement rules, if the objective is to make formal 
methods widespread. According to Craigen et al. [18], at 
that time, formal methods were a developing technology, 
and therefore, they exhibited limitations, like any other 
such technology. Moreover, it was necessary to determine 
two key aspects: 1) what were the boundaries between the 
real world and the world of mathematics and 2) what were 
the internal limitations of the mathematics. They felt it was 
difficult to address these issues because, at that time, 
research placed the real world in doubt; therefore, 
mathematizing the needs of the client became an informal 
process. This limitation hindered the widespread growth 
and recognition of formal methods among software 
professionals. 
Bowen and Hinchey [19] asserted that, for some 
reason, in the 1990s, formal methods had become one of 
the most controversial techniques in software engineering. 
They took as a foundation the work of Hall [12], and they 
added perspectives that they considered to be new myths: 
they delay the development process, tools do not support 
them, they are not really methods, they only apply to 
software, they are not necessary, they do not have support, 
and the formal-methods community always uses formal 
methods. For them, the problem was that more real 
relationships between academia and industry were 
required, it was necessary to spread experiences (positive 
and negative) by using them more widely, more research 
was needed, and it was necessary to demystify 
mathematics. Rushby [20] said that, in that decade, formal 
methods had progressed from an academic curiosity to an 
industrial reality, and he presented an analysis of their 
278 Informatica 41 (2017) 275–282 E. Serna M. et al.  
 
past, present and future. He concluded that to achieve 
widespread adoption, it was necessary to improve the tools 
and the scale of their applications, that theoretical research 
should provide better characterization, and that the 
software industry must have an open mind regarding this 
technology because the hardware industry was already 
enjoying its benefits. 
Steve Easterbrook [21] described three case studies in 
which they applied formal methods to model 
requirements. They argued that, in contrast to other 
projects in which Requirements Engineering was used 
very early on to validate needs, in their experience, they 
were able to improve the specification. They concluded 
that the benefits of formal modeling are that it reduces 
process costs, it enables more effective verification and 
validation, and maintenance is structured better. 
Meanwhile, for Kneuper [22], formal methods could help 
improve the reliability of software development but did 
not solve all problems. The author described the 
limitations that make universal solution by formal 
methods impossible: complete formalization is not 
possible, there is no guarantee that the informal user 
requirements are correct and complete, it is difficult 
(almost impossible) to ensure that the program is correct, 
they do not determine correct tests, abstraction does not 
accurately reflect the application, they are applied on a 
small scale, the technical development is insufficient, and 
developers do not have the proper mathematical training. 
According to John Knight and his team [23], by that 
time, formal methods had proven benefits, but there were 
several reasons they lacked broader acceptance: they 
extend the development cycle, they require complicated 
mathematics, and the existing tools are inadequate and 
incompatible with other software packages. However, 
after applying formal methods to critical application 
specifications, they concluded that, although several of 
those reasons could be valid, the main issue was to try to 
build a comprehensive evaluation framework for the 
specification. Given that until then, it had not been 
achieved, it became a stumbling block that the industry 
could not solve, but given the orientation of the subsequent 
research, it could be solved in future work. For Jeannette 
Wing [24], formal methods had limitations in ensuring the 
security of systems, but they delineated the boundaries of 
the systems and characterized their behavior more 
accurately, defining their desired properties with precision 
and providing a specification in terms of time. She 
explained that their limitations were because the system 
operates in an environment, and therefore, formal methods 
cannot provide total security. She also said that the future 
of formal methods was promising because, in that decade, 
many research initiatives were conducted.  
Jones et al. [25] conducted research on the 
contributions of formal methods to requirements 
engineering and found that some challenges still remained 
to be overcome: how to couple informality to the formality 
of requirements, better manage changes, allow 
traceability, improve accuracy in the validation of the 
specification, offer better alternatives for non-functional, 
reconcile some inconsistencies in notations, allow 
multiple notations and create a body of knowledge that 
includes all parts involved in this phase. In the same vein, 
Heylighen [26] stated that the validation of knowledge 
requires formal expressions of the same and that 
mathematical determinism requires greater co-pagination 
with operational determinations. He believed that any 
formalization process has advantages: it removes 
ambiguity, it defines the extension of terms, it is 
independent of time, mathematical language is universal, 
and it is reusable and testable. However, it also has 
limitations: it is generally isolated from the context, it has 
intrinsic limitations, and it assumes normal conditions as 
implicit contexts, whereas causal factors determine 
context dependence. He predicted that it was necessary to 
overcome the limitations and potentiate the advantages to 
popularize the formalization of software. 
Wordsworth [27] summarized the benefits of formal 
methods as follows: 1) successful cases have been 
sufficiently reported, and 2) although the basis is 
mathematical, it is not always necessary. His caveats, 
however were that formal methods demand some degree 
of mathematical sophistication, they are not taken 
seriously in programs of study, users are satisfied with 
traditional methods, the requirements should be specified 
more precisely, and developers prefer to code without 
complete specification. According to [28], formal 
methods had not yet achieved greater penetration; a wide 
gap persisted between research in academia and industry 
application. They maintained that the fact that industry 
still did not believe in formal methods was due to the loss 
of scalability, limited access to specialists and the 
immaturity of tools and techniques. 
Peter Amey [29] presented what he called the reality 
of formal methods and described a series of cases of 
successful software development. He inquired why, 
despite its utility, the approach still was not widely used 
in industry, and he concluded that it was because, 
typically, they were trying to use development at 
inopportune times. That is, the error was not how but 
when. He suggested it would be advisable to start with 
specification and then reach verification and validation. 
Edmonds and Bryson [30] argued that the idea of formal 
Power  
They provide units of measure 
They facilitate the detection of errors 
They ensure proper operation 
They reduce errors 
They help improve abstraction 
They perform rigorous analysis 
They are reliable 
They allow effective test cases 
Limitations 
They require informality to guarantee the specification 
It is not easy to see that the implementation satisfies the 
specification 
It is not possible to guarantee that the tests are correct 
The language features are complex 
Technical environments do not always recognize a 
formal specification 
Table 4: Power and limitations of formal methods [37]. 
Power and Limitations of Formal…  Informatica 41 (2017) 275–282   279 
methods offered two advantages: 1) the specification is 
unambiguous, and 2) it can be self-handled syntactically. 
However, they felt that formal language presented 
difficulties when trying to translate to or from other 
languages, which generated two problems: 1) natural 
language translation is slow, and 2) coding is delayed. 
Martin Gogolla [31] summarized the benefits and 
problems of formal methods through a literature review 
and classified his findings based on indicators: domain of 
application, persons, properties, tools, understanding, 
development and general criticism. He concluded that the 
success or failure of formal methods was not determined 
by their mathematical properties but by the low usability 
of existing tools. 
For Glass [32], formal methods had existed for a long 
time but still did not achieve the impact they should have, 
even though the specification is considerably more 
understandable, the confidence of the analyst increases 
because they identify the key problem to solve, errors in 
the final product are reduced, and maintenance costs are 
reduced due to the knowledge acquired. Hall [33] 
attempted to demonstrate the benefits of formal methods 
and asserted that achieving them was not automatic 
because there was no better way nor better method to do 
so. He thought that they were only part of the solution to 
the problems of software development and that their 
success depended largely on a clear integration: that 
intelligent use is required, that researchers should dedicate 
more time to them, that practical issues of integration and 
access are as important as the theoretical issues, and that 
developers must let go of the fear of formalisms. 
Sommerville [34] wrote that since the 1980s, many 
engineers and researchers had proposed the use of formal 
methods as the best way to improve the quality of software 
products, but that dream still had not come true. He 
concluded that there were four reasons: 1) the emergence 
of new methods and management proposals that have 
helped improve the quality and success of Software 
Engineering; 2) a new market where quality seems to be 
of secondary importance; 3) the limited reach of formal 
methods, which still do not adequately exceed 
specifications; and 4) limited scalability because large 
projects are still not satisfied. Still, for him, formalization 
was an excellent way to discover errors in specification. 
David Parnas [35] argued that in the last 40 years, 
three alarming gaps had appeared in the software field: 1) 
between research and practice, 2) between software 
development and traditional engineering disciplines, and 
3) between computer sciences and classical mathematics. 
He argued that advocates of formal methods proposed 
them as the solution to any of those gaps, even though, up 
until that time, they could not be verified. He concluded 
that formal methods had been left with only that 
perspective, and it was time to rethink them. To achieve 
this rethinking, he proposed the following: 1) software has 
problems, but some formal methods with problems will 
not solve them, 2) more research is needed as is fewer 
defensive efforts, 3) movement should be slow, not all at 
once, 4) abstractions should be simple, but true, and 5) our 
role in the model must be as engineers, not as philosophers 
and logicians. For the IET [36], formal methods offered 
the following advantages: they allow a consistent and 
reasonably complete specification, they reduce the 
likelihood of error and the cost of detection, and they 
permit identifying ambiguity in the specification and 
verification of security requirements. Nonetheless, they 
were not widely used because the industry did not give 
them real opportunities and because academia did not 
view them seriously.  
In the article by Batra et al. [37], it became clear that 
by then, the demand for incorporating formalization in 
Information Systems had increased because the 
specification represented actual requirements, and the 
formal methods could ensure that the implementation met 
the specifications and demands of security, reliability and 
quality, although they also had weaknesses. Table 4 
describes the advantages and limitations of formal 
methods for these authors. 
In the results of a survey conducted by Fitzgerald [38] 
on the impact of formal methods on the cost, time and 
quality of the software, the opinions were divided: 25% 
reported a decrease and 20% an increase in time; in terms 
of cost, 33% said there was a reduction and 8% said it 
increased; and in quality, 88% asserted that it improved 
and 8% were not sure. In [9] identified eight obstacles to 
the research, teaching and practice of formal methods: 1) 
there is insufficient research and teaching, 2) support tools 
are not sufficient for use on a large scale, 3) students are 
not taught Computer Science or Software Engineering in 
mathematical terms, 4) no foundations are strengthened 
and no attempt is made to present new functions, 5) there 
are not enough graduates with mathematical knowledge to 
serve the industry, 6) professors of computer science and 
Software Engineering do not receive the same training 
they did 30 years ago, 7) formal methods lack tools for 
managing versions and configuration control, and 8) 
professionals who are trained in formal methods do not 
find support among their industry fellows, so they tend to 
abandon the practice. 
Ishikawa et al. [39] stated that formal methods were 
increasingly attracting more attention as a solution to the 
high demand for efficient and reliable software, but a gap 
had developed between knowledge and teaching with 
which software engineers were educated and what was 
required to implement these methods. Meanwhile, for 
Mayo et al. [40], the limitations of formal methods lay in 
semantics and traceability, as software and hardware are 
formally tested during the development process only for 
explicit statements made ahead of time and as the 
requirements are validated only in the semantics in which 
they were tested. According to Gross et al. [41], the 
exhaustive testing of software systems is intractable and 
expensive, but if formal methods are incorporated 
throughout the design process, errors can be identified as 
they are introduced and the total cost of development 
dramatically reduced.  
5 Analysis of results 
After presenting a review of the timeline of formal 
methods in the past 30 years, looking for the advantages 
and limitations published by various authors, it is difficult 
280 Informatica 41 (2017) 275–282 E. Serna M. et al.  
 
to determine whether the advantages of formal methods 
have been improved or new ones found. Regarding the 
limitations, it cannot be stated clearly whether they have 
been overcome or if instead they have become more acute 
or others have appeared. The conclusions and 
presentations in these decades are divided between 
positivism and negativism toward formal methods, even 
predicting its possible disappearance from the software 
development stage. In any case, after analyzing these 
conclusions, the map of reality of formal methods in these 
three decades can be seen from three dimensions: from 
academia, from the community and from the industry. 
Next, we consider each from the perspective of the results. 
In the academic dimension, many works report that 
formal methods do not achieve the expected penetration 
because the curriculum in Computer Science and Software 
Engineering still does not pay adequate attention to them. 
Thus, these professionals are not educated in applied 
mathematics, and the few who are have not found peers in 
industry interested in sharing their knowledge. In this 
sense, academia should provide an opportunity for formal 
methods and include them in its content, and in addition, 
professors need more training to avoid improvising in the 
classroom. While software problems will not be solved in 
this way from one moment to another, if progress has 
already been made, it is better to exploit formal methods 
to see if they can improve software quality [42]. 
Furthermore, education systems should take 
responsibility for the fact that students have mythicized 
mathematics because they are structured to educate 
everyone equally and in all areas. With this approach, the 
skills that each individual may have for one or another 
discipline are wasted because they do not receive a 
vocational orientation that tells them how to orient their 
educational needs. The reality is that to understand 
mathematics, one must first develop logical and abstract 
reasoning, but education systems have not contemplated 
this foundation. Hence, the student prefers more 
theoretical or less logical areas because they require less 
effort. This reality is combined with the fact that software 
development is highly abstract because it is a non-tangible 
engineering product, and only the outputs and not the 
processes can be observed. This characteristic 
differentiates it from other products, such as civil 
engineering, in which the manufacturing process is 
constantly evident. The result is that professionals are not 
adequately trained in mathematics, and thus, formal 
methods are not practiced nor experienced to exploit their 
advantages and overcome their limitations. 
In the community dimension, the opinion is reiterated 
in the literature review that formal methods have 
demonstrated their power in the formalization of 
specification. For many authors, Requirements 
Engineering is where formal methods have had the 
greatest acceptability and where the most success stories 
are found. The effect has been that the community has 
devoted less effort to further strengthening procedures and 
tools to elicit and specify requirements, dedicating itself 
instead to possibilities in other phases of the life cycle. 
Some authors criticize this trend in that the community 
believes that formal methods have already exceeded their 
goal and that thus another goal should be developed, when 
the reality is that there are still many problems in the 
formalization of specification. One recommendation is 
that what has been achieved so far with requirements 
should first be strengthened and standardized, and other 
possibilities can then be considered. 
Another issue with the formal methods community is 
that it is perceived as closed to the participation of other 
stakeholders because language has limited its 
communication to the strictly mathematical [43] and 
because transdisciplinary work is not considered as an 
alternative. With the development of different disciplines, 
many researchers interested in contributing from their 
specialty to the development of other specialties have 
appeared, as in the case of Neurocomputation for 
understanding the brain and how people learn. If the 
formal methods community were more open to 
contributions from areas such as philosophy, psychology 
or didactics, it could achieve better results than it has so 
far. 
On the other side is the software industry, a crucial 
factor in the current map of the reality of formal methods. 
For years, software was developed in laboratories and with 
military support because its potential was considered only 
from that perspective. Over time, society realized that 
software could be expanded as a solution to other needs, 
commercialization began, and the software industry 
appeared with the aim of development for sale. However, 
software was nonetheless adopted and adapted with the 
methodology that military scientists had built in their 
laboratories and applied to develop the products offered 
on the market. This approach to software development 
triggered what NATO deemed a crisis in the late 1960s. It 
is understandable that industry has intended to mainly 
produce to sell and make a profit, but software is a product 
that is not manufactured but rather is created (developed); 
therefore, different procedures are needed from the ones 
used, for example, to manufacture an aircraft or turbine.  
With respect to formal methods, the industry still does 
not assimilate them at their full potential because it 
believes that they delay processes and reduce usefulness. 
The issue is that, if not in industry, where can the 
proposals of the community and of academia be verified 
and validated? The three dimensions must work in unison 
so that a software-dependent society can enjoy better-
quality software products. This need does not mean that 
formal methods are the immediate and magical solution to 
the software crisis, but as an alternative, it is worth the 
effort to give them an opportunity while there is no other 
alternative.  
6 Conclusions 
The increasing complexity of systems in this century is a 
challenge for research in Computer Science. The hardware 
and software that make up these systems have gone in a 
few years from a few components and lines of code to 
hundreds of thousands. One need only compare the reality 
that Jackson described in his work, three decades ago, with 
the one in which we currently live, in the midst of a 
software-dependent society with high demands for 
Power and Limitations of Formal…  Informatica 41 (2017) 275–282   281 
quality, reliability and product security. However, the 
software component costs half or more of the total value 
of the development of a system, and its applications 
impact economies around the world. For all these reasons, 
it is necessary to innovate with regard to development 
processes because, although many think otherwise, we 
still have not overcome the so-called software crisis of the 
60s.  
Two key issues can be identified from this analysis of 
the reality of formal methods in the past three decades. 1) 
The processes for developing software are still performed 
as they were more than 50 years ago. Very little innovation 
has occurred in this sense, and interest seems geared more 
to proposing and selling new languages and 
methodologies than to positioning and strengthening the 
ones that exist and have been proven to work. New 
approaches are not always the solution, and often what is 
achieved is to increase the range of options but hinder the 
work of developers. 2) Formal methods are mathematical 
and therefore are not yet widespread. In this sense, we 
must understand that mathematics is based on the 
understanding and application of logic and pure 
abstractions, and although computers exist in physical 
reality, software is basically responsible for representing 
and manipulating non-physical data. That is, mathematics 
represents the physical reality, while software models and 
simulates it. 
Formal methods were developed over decades and 
have introduced principles, paradigms and influential 
conceptual innovations into computer science for the 
development software. This is reflected in the fact that a 
quarter of the Turing Awards between 1966 and 2013 
recognize work with a significant component in formal 
methods. Nonetheless, they still seem to be at a 
crossroads: as an advantage, they seem well developed 
and are supported by a large number of applications, users 
and important critical developments; however, as a 
limitation, they have ceased to be a major component in 
computer science and engineering training, few professors 
are working on them, course offerings at the 
undergraduate and graduate levels are scarce, they are 
difficult to apply in important projects, and only a small 
number of graduates welcome them as a source of work. 
Moreover, education systems do not adequately develop 
the logical interpretation and abstracting ability of 
students [44, 45], which are necessary for logical 
reasoning. This area of study even seems to have 
decreased in recent years, which has made new 
generations increasingly prejudiced regarding 
mathematics [46]. 
Thirty years after the work of Michael Jackson, the 
outlook for formal methods seems not to have changed 
much. The advantages and limitations that this author 
described remain, and others seem to have become more 
acute because variables entered the scene that at the time 
were unknown: the abandonment of mathematics as the 
center of the universe of educational systems, the social 
demand for new and innovative products in a very short 
time frame, the increasing complexity of problems and the 
emergence of tools that attempt to displace developers, 
among others. The facts that the community of formal 
methods remains isolated from the reality of Computer 
Science and that the industry does not provide the 
necessary space to popularize its application are also of 
little help.  
We can thus conclude that, as a solution to the 
problems of software development, formal methods still 
have a long way to go. Work over the past thirty years has 
been slow and achievements few. It has not been possible 
to form a suitable environment to establish formal 
methods as an area of academic and industrial interest, and 
professors have lacked the training and experience to 
include them in the curriculum. In addition, students still 
perceive mathematics as an obstacle to be overcome to 
graduate, rather than as an important part of the learning 
process. If the formal methods community is integrated 
and works with other knowledge disciplines, if the 
industry works a little more and if academia grounds its 
theories in an attempt to overcome these limitations, 
formal methods could constitute an alternative way for 
software to achieve the security and quality expected by 
society.  
7 References 
[1] Cohen, B. (1995). A brief history of ‘Formal 
Methods’. Formal aspects of computing, 1(3), 1-10. 
[2] Neugebauer, O. (1969). The Exact Sciences in 
Antiquity. USA: Dover Publications. 
[3] Serna, M.E. (Ed.) (2015). Avances en ingeniería. 
Medellín: Editorial Instituto Antioqueño de 
Investigación. 
[4] Dani, S. (1993). 'Vedic Mathematics': Myth and 
Reality. Economic and Political Weekly, 28(31), 
1577-1580. 
[5] Dowling, P. (1998). The Sociology of Mathematics 
Education: Mathematical Myths/Pedagogic Texts. 
London: Falmer Press. 
[6] Holloway, C. (1997). Why Engineers Should 
Consider Formal Methods. 16th Digital Avionics 
Systems Conference. Irvine, USA, 16-22. 
[7] Butler, R. (2001). What is Formal Methods? NASA. 
Online [Aug 2016]. 
[8] Naur, P. & Randell, B. (1968). Software 
Engineering. Scientific Affairs Division NATO. 
Germany, Garmisch. 
[9] Bjørner, D. & Havelund, K. (2014). 40 years of 
formal methods- Some obstacles and some 
possibilities? Lecture Notes in Computer Science, 
8442, 42-61. 
[10] Jackson, M. (1987). Power and limitations of formal 
methods for software fabrication. Journal of 
Information Technology, 2(2), 1-6. 
[11] Hall, A. (1991). Seven Myths of formal methods. 
IEEE Software, 7(5), 11-19. 
[12] Gaudel, M. (1991). Advantages and limits of formal 
approaches for ultra-high dependability. 
Proceedings of the Sixth International Workshop on 
Software Specification and Design. Como, Italy, 
237-241. 
[13] Young, W. (1991). Formal Methods versus Software 
Engineering - Is there a conflict? Fourth Testing, 
282 Informatica 41 (2017) 275–282 E. Serna M. et al.  
 
Analysis, and Verification Symposium. Victoria, 
Canada, 188-899. 
[14] Barroca, L. & McDermid, J. (1992). Formal 
Methods: Use and relevance for the development of 
safety critical systems. The Computer Journal, 
35(6), 579-599. 
[15] Vienneau, R. (1993). A review of Formal Methods. 
Technical Report, Kaman Science Corporation. 
[16] Gerhart, S., Craigen, D. & Ralston, T. (1994). 
Experience with Formal Methods in Critical 
Systems. IEEE Software, 11(1), 21-28. 
[17] Liu, S. & Adams, R. (1995). Limitations of formal 
methods and an approach to improvement. 
Proceedings Asia Pacific Software Engineering 
Conference. Brisbane, Australia, 498-507. 
[18] Craigen, D., Gerhart, S. & Ralston, T. (1995). 
Industrial applications of formal methods to model, 
design and analyze computer systems - An 
international survey. New Jersey: Noyes Data. 
[19] Bowen, J. & Hinchey, M. (1995). Seven more myths 
of formal methods - Dispelling industrial prejudices. 
Lecture Notes in Computer Science, 873, 105-117. 
[20] Rushby, J. (1996). Mechanized Formal Methods: 
Progress and prospects. Lecture Notes in Computer 
Science, 1180, 43-51. 
[21] Easterbrook, S. et al. (1996). Experiences using 
formal methods for requirements modeling. 
Technical Report NASA-CR-203085. 
[22] Kneuper, R. (1997). Limits of Formal Methods. 
Formal Aspects of Computing, 3(1), 1-16.  
[23] Knight, J., Dejong, C., Gibble, M. & Nakano, L. 
(1997). Why are formal methods not used more 
widely?  Fourth NASA Langley Formal Methods 
Workshop. Virginia, USA, 1-12. 
[24] Wing, J. (1998). A symbiotic relationship between 
formal methods and security. Proceedings Conf. on 
Computer Security, Dep. and Assu: From Needs to 
Solutions. Williamsburg, USA, 26-38 
[25] Jones, S., Till, D. & Wrightson, A. (1998). Formal 
Methods and Requirements Engineering - 
Challenges and Synergies. Journal of Systems and 
Software, 40(3), 263-273. 
[26] Heylighen, F. (1999). Advantages and limitations of 
formal expression. Foundations of Science, 4(1), 25-
56. 
[27] Wordsworth, J. (1999). Getting the best from formal 
methods. Information and Software Technology, 41, 
1027-1032. 
[28] Broadfoot, G. & Broadfoot, P. (2003). Academia and 
industry meet - Some experiences of formal methods 
in practice. Tenth Asia-Pacific Software Engine. 
Conference. Chiang Mai, China, 49-58. 
[29] Amey, P. (2004). Dear Sir, yours faithfully - An 
everyday story of formality. In Redmill, F. & 
Anderson, T. (Eds.), Practical Elements of Safety. 
UK: Springer, 3-15. 
[30] Edmonds, B. and Bryson, J. (2004). The 
insufficiency of Formal Design Methods. 
Proceedings Third International Joint Conference 
on Autonomous Agents and Multiagent Systems. 
New York, USA, 938-945. 
[31] Gogolla, M. (2004). Benefits and problems of 
Formal Methods. LNCS, 3063, 1-15. 
[32] Glass, R. (2004). The mystery of Formal Methods 
Disuse. Communications of the ACM, 47(8), 15-17. 
[33] Hall, A. (2005). Realising the benefits of Formal 
Methods. LNCS, 3785, 1-4. 
[34] Sommerville, I. (2009). Software Engineering. New 
York: Pearson. 
[35] Parnas, D. (2010). Really rethinking ‘formal 
methods’. Computer, 34(1), 28-34. 
[36] IET. (2011). Formal Methods. The IET. 
[37] Batra, M., Malik, A. & DAVE, M. (2013). Formal 
methods - Benefits, challenges and future direction. 
Journal of Global Research in Comp. Scien., 45, 21-
25. 
[38] Fitzgerald, J. (2013). Industrial deployment of 
formal methods: Trends and challenges. In 
Romanovsky, A. and Thomas, T. (Eds.), Industrial 
Deployment of System Engineering Methods. Berlin: 
Springer, 123-143. 
[39] Ishikawa, F., Yoshioka, N. & Tanabe, Y. (2015). 
Keys and roles of formal methods education for 
industry: 10 Year experience with Top SE Program. 
Proceedings First Workshop on Formal Methods in 
SEE & Training. Oslo, Norway, 35-42. 
[40] Mayo, J., Armstrong, R. & Hulette, G. (2015). 
Digital System Robustness via Design Constraints: 
The Lesson of Formal Methods. In 9th IEEE 
International Systems Conference. Vancouver, 
Canada, 109-114. 
[41] Gross, K., Fifarek, A. & Hoffman, J. (2016). 
Incremental Formal Methods Based Design 
Approach Demonstrated on a Coupled Tanks 
Control System. Proceedings 17th International 
Symposium on High Assurance Systems 
Engineering. Orlando, USA, 181-188. 
[42] Alvear, A. & Quintero, G. (2015). Integrating 
software development techniques, usability, and 
agile methodologies. Actas de Ingeniería 1, 94-103. 
[43] Polansky, J. & Sinclair, M. (2014). The importance 
of training in formal methods in Software 
Engineering. Revista Antioqueña de las Ciencias 
Computacionales y la Ingeniería de Software 
(RACCIS), 4(2), 52-56. 
[44] Serna, M.E. (2013). Prueba funcional del software - 
Un proceso de Verificación constante. Medellín: 
Editorial Instituto Antioqueño de Investigación. 
[45] Serna, M.E. & Serna, A.A. (2013). Is it in crisis 
engineering in the world? A literature review. 
Revista Facultad de Ingeniería, 66, 197-206. 
[46] Tucker, A., Kelemen, C. & Bruce, K. (2001). Our 
curriculum has become Math-Phobic! Proceedings 
32th Technical Symposium on Computer Science 
Education. Charlotte, USA, 243-247. 
[47] Serna, M.E. (2016). Methodology for perform 
reliable literature reviews. Revista Investigación 
Económica, in press. 
 
 Informatica 41 (2017) 283–288 283
  
Improved Lane Departure Warning Method Based on Hough 
Transformation and Kalman Filter 
Minjian Liang1,2, Zhou Zhou1 and Qingsong Song1 
1 School of Information Engineering, Chang’an University, Nan Er Huan Zhong Duan, Xi’an, 710064, China 
Email: qssong@chd.edu.cn 
2 Guangdong Special Equipment Inspection and Research Institute: Branch of Zhuhai,  
West Renmin Street, Zhuhai, 519002, China 
 
Keywords: automobile active safety, lane departure warning, Hough transform, Kalman filter 
Received: June 24, 2016 
Abstract: Lane departure warning is the key issues of automobile active safety problems. In this paper, an 
improved lane departure warning method is proposed based on Hough transformation and Kalman filter, 
noted as HK-LDWS. At first, the captured colour lane videos are decomposed into frames which are 
transformed and truncated into binary images subsequently. Secondly, Hough transformation is explored 
to detect lane lines in the truncated binary images, and Kalman filter is used to predict and track the 
detected lines. Finally, lane departure warnings are delivered out regarding as the predetermined safe 
distance based on the lateral distances. The actual road test results show that HK-LDWS can track lanes 
and make all departure warning correctly besides that it costs less than 35 milliseconds for per frame 
processing. HK-LDWS is an efficient solution for the lane departure warning problem. 
Povzetek: S pomočjo Houghove transformacije in Kalmanovega filtra je razvita nova metoda za povečano 
avtomobilsko varnost.
1 Introduction 
Lane departure is actually a kind of response distortion of 
sensors, and lane departure warning system, as one kind 
of key technologies for intelligent vehicles, can at least 
reduce 24% of traffic accidents which happened due to 
lane departure [1]. It has an important practical 
significance to develop lane departure warning system for 
driver safety alert, traffic accident avoidance, and even for 
ambient intelligence [2]. Lane departure warning system 
mainly consists of a HUD (head up display), camera, 
controller and sensor; when the lane departure system is 
running, camera (usually placed in the side of the body or 
the mirror position) will always collect driving lane 
marking line, through image processing parameters 
obtained in the current position of the car in the driveway. 
When the detected vehicle deviation lane, the sensor will 
collect data and the operating state of vehicle driver, and 
then the controller sends out the alarm signal; the whole 
process is about 0.5 seconds that will provide more time 
for the driver; and if the driver turn on the lights, the 
normal line change, then lane departure warning system 
will not make any tips. road and vehicle state perception, 
lane departure evaluation algorithm and signal display 
interface are composed he three basic modules of lane 
departure warning system. 
Chen et al. propose a hyperbolic road model based 
lane recognition and departure warning algorithm, which 
at first searches out edge points of lane marks, and then 
identifies marks’ parameters by exploring least squares 
fitting method, tracks identified lanes by particle filter 
algorithm, and finally judges the departure via spatial-
temporal model [3]. Liu et al. propose another lane 
recognition algorithm based on deformable template and 
genetic algorithm, which to obtain the best assignments of 
the template parameters use a genetic algorithm to search 
the maximum value of likelihood function [4]. Wang et al. 
propose a B-Snake model based lane detecting and 
tracking method which use CHEVP (Canny/ Hough 
Estimation of Vanishing Point) to determine the initial 
position of lines, and then use minimum mean square error 
to update the control points of lines [5]. Lin et al propose 
a fuzzy algorithm for lane detection and tracking [6]. 
Spline curve model is also verified for lane departure 
warning system [7].  
Although most of the algorithms are demonstrated to 
be with good alarm performance, there are still 
improvement requirements as far as noise-robustness and 
time-consuming are concerned, however.  In view of those 
two concerned, here a lane departure warning method 
based on Kalman filter-based matching and tracking is 
proposed, which is noted as HK-LDWS. Firstly, the 
captured lane video is by frame transformed into binary 
images and cut by an assigned proportion coefficient. 
Secondly Hough transformation is explored to detect lines 
in the images, and HK-LDWS is designed to match and 
track the detected lines. And finally, a lane departure 
warning algorithm based on lateral distance is realized to 
deliver one alarms. The experiments show that HK-
LDWS is can make early-warning accurately and 
efficiently. 
2 HK-LDWS 
HK-LDWS consists of four modules, i.e., video image 
pre-processing, lane detection, tracking and departure 
warning, as shown in Figure 1. The pre-processing module 
carries out lane image binarization and truncation. The 
284 Informatica 41 (2017) 283–288 M. Liang et al.  
 
detection module use Hough transformation to extract 
lines in the images. The following tracking module 
explores Kalman filter to track the extracted lines based 
on the assumption that the lines are moving linearly and 
continuously. The last module provides departure alarms 
based on the calculated deviated distances from the right 
lanes. 
Video image
Image preprocessing
Lane detection
Matching and Tracking
Departure warning
 
Figure 1: HK-LDWS modules.  
2.1 Image pre-processing 
This module executes two calculations--binarization and 
truncation. In order to improve the noise robustness of 
HK-LDWS, the input video images are at first transformed 
by frame into grey level images, and then converted into 
0-1 binary images. Although lots of pre-processing 
methods have been proposed for image segmentation such 
as edge or corner detection, fuzzy logic [8-9], since being 
as one classic non-parametric and unsupervised adaptive 
threshold selection method here the Otsu method is 
exploited, where “0” points correspond to the pixels not or 
maybe not on the lines, “1” to the pixels on or maybe on 
the lines within the images [10]. The Otsu method is a 
global dynamic binarization method, also known as Otsu 
method, is a kind of grey image value of two commonly 
used algorithms. The basic idea of the algorithm is: use a 
grey image per the grey target size, divided into two parts 
part and the background, in this two within class variance 
minimum and maximum variance when the threshold 
value is the optimal binarization threshold. 
Further the binary images are truncated by an assigned 
proportion coefficient to reduce the detection area and 
avoid noise interferences ahead of vehicle. Notes the 
original size of the images as m n , m , n  represents 
image height and width respectively. Notes the assigned 
proportion coefficient as a， [0,1]a . The images are 
truncated from top to bottom with the proportion a , i.e., 
the top a  are cut off, the bottom (1 )a  are remained. 
The remaining (1 )a  is taken as the region of interest 
(ROI), which is to be explored in the following modules. 
2.2 Lane detection 
In this module, Hough transformation is used to extract 
lines contained in the ROIs. The lanes are regarded as 
straight lines since almost all the expressways are 
designed with the least curvatures [11]. 
  Note one line is described as the Equation 
*cos( ) *sin( )x y    . It can be said that those 
collinear ( , )x y  pixels in the Cartesian coordinate are 
mapped to the same one ( , )  point in the polar 
coordinate.  
  Being one of the most typical line detection 
algorithms Hough transformation executes accumulation 
calculation in the ( , )   coordinate space where the 
maximum accumulation values correspond to the 
assignments to the extracted lines’ parameters. The 
detailed lane detection process is as follows and shown in 
Fig. 2.  
(i) Uniformly divide the ( , )   space into small 
districts. Each district has an accumulator
( , )acc   , of which the initial value is set 
zero. And in addition, all the “1” pixels within 
ROI makes up of the initial point set;  
(ii) If the point set is empty, the detection process 
ends; otherwise randomly pick up a pixel from 
the set to calculate out value of   corresponding 
to all  , and then add up by one to the 
( , )acc   , to which the district corresponds 
having the largest  ;  
(iii) Delete the picked-up point from the set; 
(iv) Judge whether there is an accumulator of which 
the value exceeds to some threshold thr or not. 
 
Figure 2: Lane detection process based on Hough 
transformation. 
 
Improved Lane Departure Warning Method Based... Informatica 41 (2017) 283–288 285 
If no, back to step(ii); if yes, determine a line 
based on the accumulator value, and then reset 
the accumulator to be zero, and delete all the 
pixels on this line from the set; 
(v) Back to step(ii). 
2.3 Matching and tracking 
Since most expressways are designed to be structured, and 
the consideration that the lane model should be with cheap 
computational consumption, it is assumed that the lanes 
within the two consecutive frames are straight and 
continuous, and the vehicle is moving at the same one 
speed. Therefore, the motion model of the lanes relative to 
vehicles can be supposed to be uniform and linear 
[12]. Based on this supposition, Kalman filter is explored 
here for line matching and tracking. 
  Assume that in the ROIs of each frames it can be 
extracted out at most 20 lines, all of which make up of a 
line repository, i.e., {( , )}ik ik  , 1,2,...,20i  , 0k 
.    Note [ , , , ]
T
k k k k kX      as the detected line 
status, /k kd dk  , /k kd dk   represent radial 
velocity and angular velocity respectively, k  is time step. 
  Lane update equation is supposed, 
1 1k k kX AX W                                                       (1) 
  Observation equation, 
k k kZ HX V                                                             (2) 
Where A  is the state transition matrix (STM), 
1 0 1 0
0 1 0 1
0 0 1 0
0 0 0 1
A
 
 
 
 
 
 
. STM is used to find the solution 
of linear system which represented by state-space and in 
the time-variant case, there are many different functions 
satisfying these requirements that related to the structure 
of system. STM need to be determined before analysis on 
the time-varying solution can continue. 
1 0 0 0
0 1 0 0
H
 
  
 
 is the measurement matrix, kW  
and kV  are system noise and observation noise 
respectively. 
  Kalman filter is executed via two phases--time 
update and measurement update. Time update calculates 
the estimations of the state vectors through the prediction 
of current state vectors and their covariance matrix. 
Measurement update obtains the corrections of the state 
vectors through residuals computation based on the state 
vectors’ estimations and the latest measurements. It is just 
via the prediction-correction cycle for Kalman filter to 
complete the recursive estimation of the state vectors. 
Time update equation, 
1
ˆ ˆ
k kX AX
 
                                                                 (3) 
1
T
k kP AP A Q

                                                      (4) 
Measurement update equation, 
1[ ]T Tk k kK P H HP H R
                                       (5) 
 ˆ ˆ ˆk k k k kX X K Z HX                          (6)                                       
( )k k kP I K H P
                                                        (7) 
where ˆ kX

, kK , ˆ kX , kZ  are the prediction value, 
the gain, the estimation value, and the observation value 
at time step k  respectively. The covariance matrix for the 
system noise and the observation noise is noted as Q  and 
R  respectively, which are initialized as follows, 
0.05 0 0 0
0 0.05 0 0
0 0 0.05 0
0 0 0 0.05
Q
 
 
 
 
 
 
, 
1 0
0 1
R
 
  
 
 
Specifically, the proposed HK-LDWS is as follows, 
the procedure is shown with Fig.3. 
(i) Initialize the line repository to be zero, i.e. 
{( , )}ik ik   all elements are zero, 
1,2,...,20i  , 0k  , and initialize another 
20 matching counters iN , 1,2,...,20i  , to be 
zeros also. One line in the repository has a 
counter. 
(ii) Input the extracted lines which being outputted 
from the lane detection module at time step k . 
(iii) Match the extracted lines with the already 
existing lines in the repository at time step 
( 1)k   one by one, calculate the distance iD  
between each of them, i.e., 
( 1) ( 1)| | | | *i ik j k ik j k imageD Size       
， imageSize  is the width of the ROIs, 
, 1,2,...,20i j  ; 
(iv) Update the values of those matching counters iN
, 1,2,...,20i  . if 
*
miniD D , the match is 
successful, the extracted line {( , )}ik ik   is 
replaced by the estimated line ( 1) ( 1)( , )j k j k    
which is generated by Kalman filter, and in 
addition, the corresponding counter iN  is 
updated as 1i iN N  ; If the match fails, the 
repository remains as it was at time step ( 1)k 
, but iN  is updated 1i iN N  . Here 
*
minD  is 
286 Informatica 41 (2017) 283–288 M. Liang et al.  
 
noted as a predetermined threshold for the 
matching distances. 
(v) Track the lines in the repository by Kalman filter 
formulated by Equation (3) ~ (7). Only those 
lines of which the value iN  exceed to some 
threshold are being tracked. Under the 
consideration of noise-robustness and 
computational efficiency, the threshold is set to 
be ten, i.e., only those lines which have been 
uninterruptedly detected in no less than ten 
consecutive ROIs are to be tracked by Kalman 
filter; those lines ( 10iN  ) are not to be tracked 
at all. 
(vi) Update the line repository with the estimated 
lines by Kalman filter. 
(vii)  Iterate until the end. 
 
 
Figure 3: Kalman filter-based matching and tracking 
procedure. 
2.4 Departure warning 
The lane departure warning algorithm based on lateral 
distance is realized here [13]. At first, the polar 
coordinates of the lines obtained from the matching and 
tracking module are transformed into Cartesian 
coordinates. And further, the intersection positions that the 
tracked lines pass through the bottom boundary of the 
ROIs are calculated out. Therefore, the minimum distance 
among the intersections and the central point at the bottom 
boundary of the ROI can be obtained. If the minimum is 
less than some predetermined safe distance threshold, the 
module delivers sound alarms. 
3 Experiments 
The performances of HK-LDWS especially computational 
efficiency and warning accuracy are evaluated with the 
situations that the proportion coefficient a  is assigned 
different values. It is defined three performance indices, 
i.e., effective detection rate (
e
dr ), time consumption per 
frame ( ft ), and number of missing alarms ( an ). 
It is calculated per frame for the distances ( d ) between 
the extracted lines from Hough transformation and the 
estimated lines generated by Kalman filter as the same 
way as the distance iD  during matching are calculated. If 
the value of d  is more than five pixels, the extracted 
lines are considered invalid, otherwise the extracted are 
effective. Note the number of frames within which the 
extracted are effective as 
e
fN , the total number of frames 
in the test video as 
v
fN . There is /
e e v
d f fr N N . Note 
the total consumption time for one whole video as vt . 
There is /
v
f v ft t N . Missing alarm happens once, i.e., 
1a an n  , when the departure occurs it should be 
delivered out an alarm but no alarm.  
The value of the truncated proportion coefficient a  is 
assigned 0.3, 0.4, 0.5, 0.6, and 0.7 respectively. The test 
result is shown in Table 1. According to Table 1, while a  
is assigned 0.3, 0.4, 0.5, the means of the distance d  are 
usually less than five pixels, and it has no missing alarms. 
However, by contrast while a  is chosen 0.6 and 0.7, d  
reach 7.8635 and 13.6823 respectively, and in addition the 
corresponding number of missing alarms an  gets to 12 
and 27 respectively.  
When a  is set 0.5, the time consumption per frame vt  
is just 33.92ms. When a  are set 0.3 and 0.4, vt  rise up to 
44.39ms and 45.13ms respectively. By contrast while a  
increases up to 0.6, vt  decreases to 30.86ms. Compared 
with the situation that 0.5a  , it is reduced by 9% in time 
consumption, however, the number of effective extracted 
frames is only 262, the effective detection rate (
e
dr ) is 
only 77.74%, and in addition the number of missing 
alarms reaches to 12. The computational efficiency for the 
situation that 0.7a  is also improved where the time 
consumption per frame is less than 30ms, the warning 
accuracy gets even worse, the effective detection rate is 
just a bit more than 50%, the number of missing alarms is 
up to 20 times, however. 
It can be said that the situation that 0.5a   
corresponds to the best performances as far as 
computational efficiency and warning accuracy are 
Improved Lane Departure Warning Method Based... Informatica 41 (2017) 283–288 287 
concerned, where the effective detection rate reaches up to 
96.14%, the time consumption per frame is about 
33.92ms, and there are no missing alarms. The 
computational efficiency for the situations that 0.5a   
is better than that for the situation 0.5a  , the warning 
accuracy is worse, however. The warning accuracy for the 
situations that 0.5a   is as the same good as that for the 
situation 0.5a  , but their computational efficiency is 
slightly worse. The performances corresponding to the 
situation 0.5a   are fully fit for practical requirements, 
therefore it is chosen for the HK-LDWS.   
It is shown in Fig.4 for the extracted lines from Hough 
transformation and the estimated lines generated by 
Kalman filter under three different situations (left 
0.4a  , middle 0.5a  , right 0.6a  ) where the 
extracted lines are denoted blue solid, the estimated are 
red dashed. For the situations  are 0.4 and 0.5, the 
extracted are almost coincided with the estimated, for 
0.6a  , there are obvious deviations between the 
extracted and the estimated, however. 
The calculated concreted values of   and   with the 
situation 0.5a   are demonstrated with Fig.5, where the 
curves obtained from Kalman filter are more smoothly 
than that from Hough transformation. Therefore, it can be 
said that HK-LDWS is more noise-robust than the method 
purely based on Hough transformation. 
During tracking the residuals of Kalman filter are 
recorded based on Equation (6). Note the residual of   
and    as   and   respectively, which are 
illustrated with Fig.6. As being marked via the rectangular 
frames there are four regions where obvious abrupt 
changes happen, and correspondingly the lane departure is 
being occurred indeed, the hypothesis that the motion 
model of the lanes is uniform and linear does not hold at 
this point. After delivering out an alarm the HK-LDWS 
reset itself. 
a
 
(a)   
   
(b) 
 
(c) 
Figure 4:  The extracted lines and the estimated lines 
with different a  assignments. 
a  
 
  
effective 
detection rate 
(
e
dr ) 
time consumption per 
frame ( ft ) /ms 
number of missing 
alarms ( an ) mean 
standard 
deviation 
0.3 3.2537 2.0784 337 327 97.03% 45.13 0 
0.4 3.4178 2.1891 337 322 95.55% 44.39 0 
0.5 3.2701 2.3588 337 324 96.14% 33.92 0 
0.6 7.8635 11.5493 337 262 77.74% 30.86 12 
0.7 13.6823 50.2165 337 187 55.49% 29.05 27 
Table 1: Test results for different truncated proportion coefficient assignments. 
d
v
fN
e
fN
 
(a) the curves of   
 
(b) the curves of   
Figure 5: The curve of lines parameter. 
288 Informatica 41 (2017) 283–288 M. Liang et al.  
 
4 Conclusion 
HK-LDWS is proposed as a method of lane departure 
warning based on Kalman filter-based matching and 
tracking. The pre-processing module gives out the 
truncated binary images for the subsequent detection 
module, which extract the lane lines by Hough 
transformation. And then Kalman filter is explored to 
track the extracted lines based on the assumption that the 
lines are moving linearly and continuously. Finally, the 
lane departure warning algorithm based on lateral distance 
is realized to deliver out deviation alarms. The actual road 
test results show that HK-LDWS takes less than 35 
milliseconds for per frame processing, can accurately and 
quickly track lanes and make all departure warning 
correctly. Its performance of time consumption and 
tracking accuracy fit for the actual requirements. It’s 
concluded that the proposed HK-LDWS is an efficient 
solution for the lane departure warning problems. Time 
series based image analysis using different pre-processing 
algorithms need to be studied in future work. 
5 Acknowledgement 
This work was supported in part by the Fundamental 
Research Funds for the Central Universities (Grant No. 
310824162022) and National Natural Science Foundation 
of China (Grant No. 61201406). 
6 References 
[1] J. M. Clanton, D. M. Bevly and A. S. Hodel. A Low-
Cost Solution for an Integrated Multisensor Lane 
Departure Warning System, IEEE Transactions on 
Intelligent Transportation Systems, 10(1): 47 – 59, 
2009. 
[2] Marcello Cinque, Antonio Coronato, Alessandro 
Testa. On Dependability Issues in Ambient 
Intelligence Systems, International Journal of 
Ambient Computing and Intelligence, 3(3): 18-27, 
2011. 
[3] B. Chen, Lane recognition and departure warning 
based on hyperbolic model, Journal of Computer 
Applications, 33(9): 2562-2565, 2013. 
[4] T. Liu, N. Zheng and H. Cheng. A novel approach of 
road recognition based on deformable template and 
genetic algorithm, in Proceedings Of the 2003 IEEE 
International Conference on Intelligent 
Transportation Systems, Piscataway, NJ, USA, 
Vol.2, pp. 1251-1256, 2003. 
[5] Y. Wang, E. K. Teoh and D. Shen. Lane detection 
and tracking using B-Snake, Image & Vision 
Computing, 22(4): 269–280, 2004. 
[6] J. Wang, C. Lin and S. Chen. Applying fuzzy method 
to vision-based lane detection and departure warning 
system, Expert systems with applications, 37(7): 
113-126, 2010. 
[7] S. Mammar, S. Glaser and M. Netto. Time to line 
crossing for lane departure avoidance: a theoretical 
study and an experimental setting, IEEE 
Transactions on Intelligent Transportation Systems, 
7(7): 226-241, 2006. 
[8] Sudipta Ghosh, Debasish Kundu, and Gopal Paul. A 
Fuzzy Logic Approach in Emotion Detection and 
Recognition and Formulation of an Odor-Based 
Emotional Fitness Assistive System, International 
Journal of Synthetic Emotions (IJSE), 6(2): 14-34, 
2015. 
[9] Nilanjan Dey, Moumita Pal, Achintya Das. A 
Session Based Blind Watermarking Technique 
Within the NROI of Retinal Fundus Images for 
Authencation Using DWT, Spread Spectrum and 
Harris Corner Detection, International Journal of 
Modern Engineering Research (IJMER), 3(3): 749-
757, 2012. 
[10] H. Tian, S. K. Lam and T. Srikanthan. Implementing 
Otsu's thresholding process using area-time efficient 
logarithmic approximation unit, in Proceedings of 
2003 International Symposium on Circuits & 
Systems, vol.4, IV-21-IV-24, 2003. 
[11] S. K. Lee, W. Kwon and J. W. Lee. A vision based 
lane departure warning system, Proc. Of the 1999 
IEEE/RSJ International Conference on Intelligent 
Robots & Systems, pp.160-165, 1999. 
[12] C. Mei, J. Todd and P. Dean. AURORA: A Vision 
Based Roadway Departure Warning System, 
Carnegie Mellon University technical report, 1997. 
[13] P. Hsiao, K. Hung, S. Huang, W. Kao, C. Hsu and Y. 
Yu. An embedded lane Departure Warning System, 
Proc. IEEE 15th International Symposium on 
Consumer Electronics (ISCE), Singapore, pp.162-
165, 2011. 
 
 
(a)  
 
(b)  
Figure 6: The residuals of Kalman filter during 
tracking with the situation 0.5a  . 


Informatica 41 (2017) 289–304 289
Decision Tree Based Data Reconstruction for Privacy Preserving Classification
Rule Mining
G. Kalyani
Research Scholar, Acharya Nagarjuna University, India
E-mail: kalyanichandrak@gmail.com
M.V.P. Chandra Sekhara Rao
Professor, Dept of CSE, RVR & JC College of Engineering, Guntur, India
B. Janakiramaiah
Professor, Dept of CSE, DVR & Dr.HS MIC College of Technology, Vijayawada, India
Keywords: privacy, sensitive patterns, data reconstruction, decision tree, classification rules
Received: June 8, 2017
Data sharing among the organizations is a general activity in several areas like business promotion and
marketing. Useful and interesting patterns can be identified with data collaboration. But, some of the
sensitive patterns that are supposed to be kept private may be disclosed and such disclosure of sensitive
patterns may effects the profits of the organizations that own the data. Hence the rules which are sensitive
must be concealed prior to sharing the data. Concealing of sensitive patterns can be handled by modifying
or reconstructing the database before sharing with others. However, to make the reconstructed database
usable for data analysts the utility or usability of the database is to be maximized. Hence, both privacy
and usability are to be balanced. A novel method is proposed to conceal the classification rules which are
sensitive by reconstructing a new database. Initially, classification rules identified from the database are
made accessible to the owner of the data to spot out the sensitive rules that are to be concealed. In the next,
from the non-sensitive rules of the database, a decision tree will be constructed based on the classifying
capability of the rules, from which a new database will be reconstructed. Finally, the released reconstructed
database to the analysts reveals only non-sensitive classification rules. Empirical studies proved that the
proposed algorithm preserves the privacy effectively. In addition to that utility of the classification model
on the reconstructed database was also be preserved.
Povzetek: Predstavljena je metoda strojnega učenja, ki skrbi za privatnost podatkov.
1 Introduction
Significant improvements in data storage have led to rise
in inexpensive data storage techniques for databases. Im-
provements in storing and analyzing enormous amounts of
data present a challenge to people and organizations for
transforming this data into valuable knowledge. Data min-
ing, which involves extorting the patterns that are novel and
valuable from mass repositories of data, is efficient in trans-
forming the data into knowledge.
Various data mining algorithms are in usage for mining
interesting patterns from the collected data. Patterns like
classification rules, association rules and clusters can be
discovered with mining techniques. On the other side, in
order to get the mutual benefits data will be shared among
the collaborated organizations. But, some sensitive infor-
mation or patterns may exist with in the data which is to
be maintained as private, since the revelation of sensitive
information or pattern may affect the business deals of the
data owner and violates the privacy issues of the data owner
as an end user. Hence, along with the need of sharing and
collaborative mining, the importance of protecting the in-
formation or patterns against disclosure is one of the most
important point in the security issues of data mining [1, 2].
To preserve the sensitive information or patterns from un-
wanted disclosure, privacy preserving data mining (PPDM)
has emerged as a security area in data mining and database
field [1, 12].
1.1 Classification of approaches in PPDM
PPDM is an interesting research area in the data mining
community. It concentrates on the privacy issues of
individuals or organizations which are violated due to the
disclosure of sensitive information or patterns. PPDM
converts the original database into a transformed database
in such way that no sensitive data or pattern can be mined
from the transformed database. Various methodologies
exists in the literature, for this transformation to protect
sensitive information or knowledge. A taxonomy for the
PPDM techniques based on a set of parameters is discussed
and the taxonomy is shown in Figure 1.
290 Informatica 41 (2017) 289–304 G. Kalyani et al.
Based on the parameter whether the data owner requires
privacy for the data or knowledge, PPDM techniques were
classified as:
– Data Hiding Techniques (Protecting Sensitive
Data)
Data hiding approaches[3, 6, 11] investigate about
maintaining the privacy of data or information before
applying the data mining techniques on the database.
These approaches concentrate on the exclusion of pri-
vate information from the database before sharing the
data with others. Perturbation, sampling, suppression,
transformation[17], etc. are the general techniques
used to create a transformed database. The final aim of
data hiding is, after sharing the transformed database
receiver has to get valid data mining results without
disclosing the private data of the data owner.
– Knowledge Hiding Techniques (Protecting Sensi-
tive Knowledge)
Knowledge hiding approaches [4, 13] investigate on
the protection of sensitive knowledge inferred from
the data(instead of the data), by applying the mining
tools on the original database. The ultimate goal of
knowledge hiding techniques is no sensitive knowl-
edge is to be mined by applying the data mining tech-
niques on the transformed database. Knowledge hid-
ing approaches mainly deals with the following tech-
niques.
– Data Distortion Technique: This technique
tries to protect the knowledge by changing the
parameters associated with the sensitive knowl-
edge. These techniques works by altering 0s to
1s or vice versa in the specified transactions of
the database, which may generate unwanted side
effects in the new database[16].
– Data Blocking Technique: In this technique,
0’s and 1’s related to the data of the sensitive
knowledge will be replaced by “?”(Unknown) in
selected transactions instead of doing insertion
and deletion of items [15].
– Reconstruction Based Technique: This tech-
nique reconstructs a database from the sanitized
knowledge, extracted from the original database.
When compared to the heuristic methods side ef-
fects will be reduced in reconstructed database
[8, 19, 21].
The paper concentrates on protecting the sensitive knowl-
edge by reconstructing the database from the non-sensitive
knowledge mined from the original database i.e. knowl-
edge hiding based on reconstruction based technique.
1.2 Problem motivation
In business organizations, classification techniques reveals
a set of classification rules.Among the rules mined, some
are crucial for decision making and there by to increase
their profits. In order to get some mutual benefits, organi-
zations share their data with others also. By getting their
data, others also can identify all the classification rules. In
some cases the person who owns the data does not want to
reveal some of the rules to others even though the data was
shared with them. The set of rules which are crucial and
important for gaining the profits must be kept confidential
i.e. they ,must not be revealed to others even they have ap-
plied classification techniques on the shared data. The set
of rules which are to be hidden from disclosure to others
are called as sensitive classification rules.
The focus of this paper is on the privacy of classifica-
tion rules mined from the databases. The need of privacy
in classification rule mining was explored with an example
scenario [21]. A credit card company agreed to share their
credit card approvals to a new home loan company. When
people have applied for the credit card, their data will be
maintained as a separate record in the database of Credit
Card Company. The attributes financial status, experience,
gender, salary, age and address are maintained for every
person. The class label is maintained as the approval re-
sult of their credit card application. After getting the data
from the credit card company, the home loan company con-
structs a classification model to categorize the applicants
of home loan. Based on the classification model and pre-
dicted results, the home loan company can decide the ap-
proval of the home loan to the applicants. The home loan
company gets benefited by avoiding the approvals to the
wrong applicants based on the data taken from the credit
card company. The home loan company can also make use
of the credit card company database in another manner to
improve their business. By changing the class label to the
address attribute, the home loan company can identify the
appropriate group or individual customers to send adver-
tising mails about their offers. Hence, to avoid such type
of advertising to their customers, the credit card company
should modify their database before sharing with the home
loan company in such a way that classification rules which
are useful for identifying a group of valued customers must
not be revealed to the home loan company. The above sce-
nario clearly indicates the need of preserving the sensitive
classification rules before sharing the data with the others.
2 Literature review
In the perspective of privacy in classification rule mining,
the major part of the work in research concentrates on the
privacy of individual data. In [5], privacy of individual data
can be achieved by data reduction. In the data reduction
method, the effect of non-sensitive knowledge on the sen-
sitive knowledge was analyzed. For preserving the privacy
of individual data, a decision tree can be constructed by col-
Decision Tree Based Data Reconstruction for. . . Informatica 41 (2017) 289–304 291
Figure 1: The Taxonomy of the PPDM Techniques.
lecting data from multiple parties without revealing others
to their data was proposed in [7, 9].
In [18], the authors projected a classification rule hid-
ing method based on reconstruction of categorical datasets.
The methodology modifies the tuples of the original
database which contains the values related to both sensi-
tive and non-sensitive classification rules and then makes
use of the tuples related to the non-sensitive rules to create
its transformed database.
This paper [3] projected a novel method to defend the
sensitive classification rules based on the reconstruction
process for categorical datasets. Initially, the owner of the
data will identify a set of sensitive rules that are to be con-
cealed among the rules revealed from the original database.
Later, the set of non-sensitive rules along with the charac-
teristics extracted from the database are used to construct
a decision tree. Finally, the new database is reconstructed
which reveals only non-sensitive classification rules.
In [20], the authors proposed a template-based technique
to protect against the threats caused by data mining func-
tionality. The technique focuses on two points: preserving
the privacy of knowledge and increase the usefulness of
non-sensitive knowledge that can be derived from the data.
Sensitive rules are indicated by a set of “privacy templates”.
Template includes the sensitive information which is to be
concealed, a set of corresponding attributes, and the rela-
tionship between the two. Authors proved that suppressing
the attribute values is an efficient approach to protect sen-
sitive rules. For a large dataset, identifying an optimal pos-
sibility for suppression may be hard, because it needs to do
optimization over all suppression’s.
In [14], Verykios et al. projected a method for hiding the
classification rules which are considered as private. Hiding
is achieved before publishing the data on the web through
data perturbation approach in categorical databases. The
method used the characteristics of sequential covering clas-
sification algorithms. Modification will be done to the tu-
ples of sensitive rules in such a way that the alterations are
spread to the tuples of the significant non-sensitive rules.
The spreading will be proportional to the rank in the rule
set. So that, the method guarantees that the sensitive rules
are hidden and maintains the current structure of the rule
set, thereby the usefulness of the new database is maxi-
mized. Authors have proposed another distribution method
with a modification to the basic method. Authors have
proved that both the methods are effective in terms of pri-
vacy and usefulness of the new database.
3 Proposed method
3.1 Problem statement
Consider a database (D) consists of n tuples comprises of
m dimensions along with associated labels known as class
with number of distinct classes as C. By applying a classifi-
cation rule mining algorithm on D, number of classification
rules (CR) can be discovered. Given a set of classification
rules among CR which are treated as sensitive classifica-
tion rules (SCR ⊂ CR) by domain expert (the data owner),
the process of classification rule hiding is to appropriately
reconstruct a database with the intention of mining the re-
constructed database (D1) by using any classification rule
mining algorithms, reveals all the non-sensitive classifica-
tion rules (NSCR=CR - SCR) that are revealed from the
original database, whereas all the SCR are shielded from
revelation and new rules (originally non-existent rules) can-
not be mined.
292 Informatica 41 (2017) 289–304 G. Kalyani et al.
3.2 Framework
The framework shown in Figure 2 addresses the problem
statement. From the original database, a number of clas-
sification rules are discovered by applying any classifica-
tion algorithm, which are useful to the data owner for fore-
casting purpose. The data owner or domain expert iden-
tifies the sensitive classification rules which must be pre-
served from revelation when classification rule mining al-
gorithms are applied on the database before sharing the
data with the others. The proposed method for classifica-
tion rule hiding reconstructs a sanitized database by consid-
ering the original database, set of classification rules gen-
erated and a set of identified sensitive classification rules
as input. By applying the classification rule mining al-
gorithm on the reconstructed database, only non-sensitive
classification rules which are discovered from the original
database, are only be discovered and all the sensitive rules
will be hidden from disclosure.
3.3 RCRH (Reconstruction based
Classification Rule Hiding)Method
The proposed algorithm for classification rule hiding was
reconstruction based algorithm, i.e. the transformed
database will be reconstructed from the set of NSCR. The
required input for the classification rule hiding is, the
database D, classification rules CR mined from D and a
set of sensitive classification rules SCR among CR which
were decided by data owner depending on to whom they
wish to share the database. The result of the algorithm is a
reconstructed database D1.
The proposed algorithm first eliminates the SCR from
CR which are the possible classification rules from D (step
2 to 4 of Algorithm 1). Then for every rule in CR, calculate
a measure called as capability of the rule. The Capability of
the rule indicates the number of the tuples that are correctly
classified by that rule (step 5 to 6 of Algorithm 1). The
process of calculating the capability for a rule was shown in
Algorithm 2. Then arrange the rules in the decreasing order
of their capability values because high capability indicates
the maximum ability of classifying the data in the database
D. Now consider the rules in order and construct a decision
tree with the non-sensitive classification rules only.
The construction of the decision tree will be as follows:
Consider the rules in decreasing order of their capability
values. Calculate the information gain of all the attributes
of the database with respect to D. Information gain of an
attribute is the measure of the difference in entropy before
and after the tuples are divided into groups based on that
attribute (step 7 to 9 of Algorithm 1). The information gain
of an attribute is calculated as: Gain(A)= Entropy(D)- En-
tropy(D,A). Entropy(D) and Entropy(D,A) can be calcu-
lated by using the equations (1) and (2). The process of
calculating the info-gain of an attribute was shown in Al-
gorithm 4.
E (D) =
c∑
i=1
−Pi log2 Pi (1)
Where D is the database, c is number of distinct class la-
bels, Pi is the probability of the ith class label.
E (D,A) =
∑
V ∈A
P (V ) ∗ E (V ) (2)
Where D is a Database, A is an attribute for which entropy
is calculated, V is value of an attribute, P (V) probability of
value V, E (V) is entropy of value V.
Consider the rule in CR in the decreasing order of capa-
bility values. The attributes of that rule are considered in
decreasing order of their info-gain values. By considering
the attributes in the order of info-gain, construct a path in
the decision tree with the attribute having the highest info-
gain at the root node. The possible values of that attributes
in database D are considered as possible branches from that
node. The path will be extended in the similar manner by
considering all the attributes in considered rule. The ca-
pability of a rule will be considered as a measure for the
branch created in the decision tree. The class label of that
rule is given as a leaf node in the branch. For the next rules,
based on the order of the attributes path will be checked in
the decision tree. If the path matches with the existing path
it continues and whenever the match fails, the new path will
be constructed from that point. The same process will be
repeated to all the non-sensitive rules of D (step 10 to 16 of
Algorithm 1).
After the decision tree has constructed, then the trans-
formed database will be reconstructed from the decision
tree. The process of reconstructing the database will be
applied to all the paths of the decision tree by considering
only one path at a time. Hence, consider a single path in
the decision tree. A path in the decision tree is associated
with capability which indicates the influence of that rule on
the database D. Insert number of tuples in the transformed
database D1 equal to the capability of that path in the
decision tree.
The path in the decision tree may not contain all the
attributes of the database D. Hence, if tuples are added in
the database for a path in the decision tree, the tuples in
the constructed database may contain some missing values
related to the attributes which were not existed in the path
of the decision tree (step 17 to 23 of Algorithm 1).
The missing values in the reconstructed database are
to be filled by using methods to fill the missing values
efficiently. The process of filling the missing values is
shown in Algorithm 5. Consider all the attributes of the
D1 as TA (step 3 of Algorithm 5). Select an attributes
of the D1 which are having not null values, i.e. the set
of the attributes which are having some data values as
SA ( step 4 of Algorithm 5). Identify the combination of
the distinct values in the set of attributes SA, as a string
which is indicated by C (step 5 of Algorithm 5). Scan
the database D to retrieve the set of tuples which matches
Decision Tree Based Data Reconstruction for. . . Informatica 41 (2017) 289–304 293
Figure 2: The Proposed Framework for Classification Rule Hiding.
Algorithm 1: RCRH(Reconstruction based Classification Rule Hiding).
Data: Original Database D, Classification rules CR, Sensitive Classification Rules SCR.
Result: Transformed Database D1
1 begin
2 for every rule r ∈ CR do
3 if r ∈ SCR then
4 CR = {CR− r}; /* discard the sensitive rules */
5 for every rule i = 1 to |CR| do
6 capb_i = Capability(R) ; /* calculating the classifying ability of the rule */
7 Ent_D = Entropy (D); /* entropy of database D */
8 for every attribute A ∈ D do
9 Ig_A = Info-gain(A,D); /* calculating the gain of attributes in D */
10 while (|CR| > 0 ) do
11 RL={r/r,∀k ∈ CR, capb_r ≥ capb_k} ; /* select the rule with max capability */
12 while (RL is not empty ) do
13 attri={x/x ∈ i and ∀y ∈ i, Ig_x ≥ Ig_y} ;
14 create attri as non-terminal node of DT; /* creating a path in the tree */
15 Discard attri from RL ;
16 Assign class label of RL as terminal node; /* adding of terminal node */
17 for every path P ∈ DT do
18 count=0 ;
19 repeat
20 Generate a tuple in D1 with the attributes in P; /* adding of tuples in D1 */
21 count++;
22 until (count==capb_P));
23 Fill_Missing_Values(D1) /* to fill the missing values in D1 */
24 Return D1;
294 Informatica 41 (2017) 289–304 G. Kalyani et al.
Algorithm 2: Function Capability(R).
Data: Original Database D,Classification rule R.
Result: Capability of Rule R.
1 begin
2 Count=0;
3 for each tuple T ∈ D do
4 if T ∈ R then
/* if tuple is classified by rule R */
5 Count++;
6 Return Count;
Algorithm 3: Function Entropy(D).
Data: Original Database D, Number of distinct class labels C.
Result: Entropy of D.
1 begin
2 Ent_D = 0;
3 for i= 1 to C do
/* getting no.of tuples with i
th class label */
4 Tc = Select count(*) from D where class=Ci;
5 L = log( Tc|D| );
6 Ent_D = Ent_D + ( Tc|D| ) ∗ L;
7 Return ( - Ent_D);
Algorithm 4: Function Info_gain(A,D).
Data: Database D, No.of distinct values V in A, Ent_D, No.of distinct class labels C.
Result: Information gain of A
1 begin
2 Ent_A = 0 ;
3 for i= 1 to V do
4 Tv = select * from D where A=Vi; /* getting tuples with ith value of A */
5 E_Tv = 0;
6 for j = 1 to C do
/* getting the no.of tuples with j
th class label */
7 Tvc = select count(*) from Tv where class=Cj ;
8 L = log( Tvc|Tv|) ∗ L;
9 E_Tv = E_Tv + ( Tvc|Tv| ) ∗ L;
10 Ent_A = Ent_A+ ( |Tv||D| ) ∗ (−E_Tv); /* calculating entropy of A */
11 Ig_A = (Ent_D) - (Ent_A); /* Gain of attribute A */
12 Return Ig_A;
Decision Tree Based Data Reconstruction for. . . Informatica 41 (2017) 289–304 295
Algorithm 5: Fill_Missing_Values(D1).
Data: Original Database D,Reconstructed Database D1.
Result: Reconstructed Database D1
1 begin
2 repeat
3 TA[ ]= attributes in D1 ;
/* get all the attributes of D
1
*/
4 SA[ ] = attributes which are not empty in D1 ;
5 Let C be the combination of values in the attributes of SA ;
6 for every tuple t ∈ D do
7 if (values of SA[] in t == C) then
/* values of the SA[]attributes in tuple */
8 temp = temp ∪ t
/* add the tuple to temp buffer */
9 for each tuple t ∈ temp do
10 Count the occurrences of each distinct value in the attributes other than SA[] ;
11 Select the attribute A in which more number of occurrence are related to the same distinct value V ;
12 Insert value ’V’ in attribute ’A’ in D1;
13 until (SA[ ]==TA[ ]);
to the combination C in the set of the attributes SA. Let
the retrieved tuples be in the buffer temp (step 6 to 8 of
Algorithm 5). By scanning the tuples in the temp buffer,
count the number of occurrences of each distinct value of
the attribute which does not belong to the set SA (step 9
to 10 of Algorithm 5). Then, select the attribute which has
the major importance i.e. occurrence of a particular value
in the attribute is more than the other values (step 11 of
Algorithm 5). The selected value is filled with the value
which has the maximum number of occurrences (step 12
of Algorithm 5). Repeat the process of filling the missing
values by considering the new set of selected attributes SA,
which are filled with the values until the selected attributes
are equal to the total set of attributes in the database i.e. all
the attributes are filled completely.
Let us consider a small example to demonstrate the work-
ing of the proposed method. Table 1 shows the sample
database considered for the demonstration. The database
contains 30 tuples with 6 attributes A0 to A5 which are
binary-valued attributes with two possible values True and
False. A class label which has two distinct classes C0 and
C1 is associated with each tuple.
Table 2 includes the 12 classification rules which are
identified by applying a classification rule mining on the
database of Table 1. Rule number 7 of Table 2 is considered
as the sensitive classification rule which requires protection
from the disclosure.
Consider the non-sensitive rules among the rules mined
from the database to construct a decision tree from which
the database was reconstructed in classification rule hiding.
Hence, among the 12 rules discovered from the database,
we are considering 11 rules (other than the rule 7 which
is sensitive). For every non-sensitive rule calculate the
Figure 3: The Decision Tree Path for Rule 1 in Table 2.
capability which indicates the classification ability of that
rule on the database (Steps 5 to 6 of Algorithm 1). The
rules and their capability values are shown in Table 2.
Calculate the measure info-gain for every attribute A0
to A5. The info-gain specifies how much information we
gained by doing the split using that particular attribute.
The attribute which will have maximum info-gain will be
better for splitting the database (Steps 7 to 8 of Algorithm
1). The info-gain values of the attributes A0 to A5 are
shown in Table 3.
Construction of the decision tree is as follows: consider
the non-sensitive rules in the decreasing order of their ca-
pability. Hence, consider the rule A2=False & A5=False
⇒ C0 which has highest capability 8. The rule contains the
attributes A2 and A5. Order these attributes based on the
info-gain. Hence the attributes will be considered in the
order A5 and A2. Create a path in the decision tree with
the values of the rule in the order A5 and A2. The tree is
as shown in Figure 3.In figures False is indicated with "F"
and True is indicated with "T".
296 Informatica 41 (2017) 289–304 G. Kalyani et al.
Table 1: The Sample Database.
Tuple.No A0 A1 A2 A3 A4 A5 Class
1 TRUE FALSE FALSE FALSE TRUE FALSE C0
2 TRUE TRUE FALSE FALSE FALSE TRUE C0
3 FALSE FALSE TRUE TRUE FALSE TRUE C1
4 FALSE TRUE FALSE TRUE TRUE FALSE C0
5 TRUE TRUE FALSE FALSE FALSE TRUE C0
6 TRUE TRUE TRUE TRUE FALSE FALSE C1
7 TRUE TRUE FALSE TRUE TRUE FALSE C0
8 FALSE FALSE TRUE FALSE TRUE TRUE C1
9 TRUE TRUE FALSE TRUE FALSE TRUE C0
10 FALSE FALSE TRUE TRUE TRUE TRUE C1
11 FALSE TRUE FALSE FALSE FALSE TRUE C0
12 TRUE FALSE TRUE FALSE TRUE FALSE C0
13 TRUE TRUE FALSE TRUE TRUE TRUE C1
14 FALSE FALSE FALSE TRUE TRUE TRUE C1
15 FALSE FALSE FALSE TRUE TRUE FALSE C0
16 TRUE FALSE FALSE TRUE FALSE FALSE C0
17 TRUE TRUE TRUE FALSE FALSE TRUE C1
18 TRUE FALSE FALSE TRUE FALSE TRUE C0
19 TRUE FALSE TRUE FALSE FALSE FALSE C0
20 FALSE FALSE FALSE TRUE FALSE TRUE C1
21 FALSE TRUE FALSE TRUE TRUE TRUE C1
21 FALSE TRUE FALSE TRUE TRUE TRUE C1
22 FALSE TRUE FALSE TRUE TRUE FALSE C0
23 FALSE TRUE TRUE TRUE FALSE FALSE C0
24 FALSE TRUE FALSE FALSE TRUE FALSE C0
25 TRUE TRUE TRUE FALSE FALSE TRUE C0
26 TRUE FALSE TRUE TRUE FALSE FALSE C0
27 TRUE FALSE FALSE TRUE FALSE TRUE C0
28 TRUE FALSE TRUE TRUE TRUE TRUE C1
29 TRUE TRUE TRUE TRUE TRUE TRUE C0
30 TRUE TRUE FALSE TRUE FALSE FALSE C0
Figure 4: The Decision Tree Path for Rule 2 in Table 2.
Then consider the next rule with maximum capability 5
which is A0=False & A1=False & A5 =True ⇒ C1. Cre-
ate a path in the decision tree corresponding to this rule as
shown in Figure 4.
The next rule in order is A0=True & A2=False &
A4=False & A5=True⇒ C0 with next maximum capabil-
ity 5. The tree after creating a path in the decreasing order
of their info-gain values is as shown in Figure 5.
By repeating the process for all the non-sensitive rules
the complete decision tree can be constructed. The com-
plete decision is as shown in Figure 6.
The first path in the decision tree which is with A2 and
A5 attributes with false value and class label as C0 is con-
sidered and corresponding to this path, 8 (the capability
of rule) tuples are inserted into the reconstructed database.
The remaining attributes are indicated by null values. To
fill these null values consider the combination of the values
in the attributes, in which values are available. In this case
it is False, False, C0 for the attributes A2, A5 and class cor-
respondingly. By comparing this combination in the orig-
inal database, the number of tuples found is 8. Count the
number of occurrences of each distinct value in each of the
attributes A0, A1, A3 and A4. The value True occurred 4,
5, 6 and 6 times in A0, A1, A3 and A4 attributes respec-
tively. The value False occurred 4, 3, 2 and 2 times in A0,
A1, A3 and A4 attributes respectively. Since, the majority
of the occurrences are for A3 and A4 by value True, the
Decision Tree Based Data Reconstruction for. . . Informatica 41 (2017) 289–304 297
Table 2: Capability Values of the Classification Rules.
Rule.No Classification Rules Capability
1 A2 = False & A5 = False⇒ C0 8
2 A1 = False & A2 = True & A5 = False⇒ C0 3
3 A0 = False & A1 = True & A2 = True & A5 = False⇒ C0 1
4 A0 = True & A1 = True & A2 = True & A5 = False⇒ C1 1
5 A0 = False & A1 = False & A5 = True⇒ C1 5
6 A0 = False & A1 = True & A3 = False & A5 = True⇒ C0 1
7 A0 = False & A1 = True & A3 = True & A5 = True⇒ C1 –
8 A0 = True & A2 = False & A4 = False & A5 = True⇒ C0 5
9 A0 = True & A2 = True & A4 = False & A5 = True⇒ C0 1
10 A0 = True & A1 = False & A4 = True & A5 = True⇒ C1 1
11 A0 = True & A1 = True & A2 = False & A4 = True & A5 = True⇒ C1 1
12 A0 = True & A1 = True & A2 = True & A4 = True & A5 = True⇒ C0 1
Table 3: Information Gain of the Attributes in Table 1.
S.No Attribute Name Info - Gain
1 A0 0.0598
2 A1 0.0258
3 A2 0.0598
4 A3 0.0304
5 A4 0.0258
6 A5 0.1835
Figure 5: The Decision Tree Path for Rule 3 in Table 2.
Figure 6: The Complete Decision Tree of all the Rules in
Table 2.
missing values of A3 and A4 are filled with value True.
Now for the tuples corresponding to the first path, the val-
ues are available for A2, A3, A4, A5 and class. Repeat the
process for filling of A0 and A1 by considering the com-
bination values in these attributes. After all the attributes
are filled up the next path in the tree will be considered in
the similar manner until the process of generation and fill-
ing will be completed for all the paths in the constructed
decision tree. Finally, the reconstructed database obtained
is shown in Table 4.
4 Evaluation measures
To assess the performance or efficiency of an algorithm
some metrics are to be considered. Classification rule hid-
ing algorithms are also be assessed with a set of measures.
The four metrics for the evaluation of the proposed method
are as follows:
The first measure is Hiding Failure, which measures the
fraction of sensitive classification rules that are revealed
from the reconstructed database. Through this, the amount
298 Informatica 41 (2017) 289–304 G. Kalyani et al.
Table 4: The Reconstructed Database.
Tuple.No A0 A1 A2 A3 A4 A5 Class
1 FALSE TRUE FALSE TRUE TRUE FALSE C0
2 FALSE TRUE FALSE TRUE TRUE FALSE C0
3 FALSE TRUE FALSE TRUE TRUE FALSE C0
4 FALSE TRUE FALSE TRUE TRUE FALSE C0
5 FALSE TRUE FALSE TRUE TRUE FALSE C0
6 FALSE TRUE FALSE TRUE TRUE FALSE C0
7 FALSE TRUE FALSE TRUE TRUE FALSE C0
8 FALSE TRUE FALSE TRUE TRUE FALSE C0
9 TRUE FALSE TRUE FALSE FALSE FALSE C0
10 TRUE FALSE TRUE FALSE FALSE FALSE C0
11 TRUE FALSE TRUE FALSE FALSE FALSE C0
12 FALSE TRUE TRUE TRUE FALSE FALSE C0
13 TRUE TRUE TRUE TRUE TRUE FALSE C0
14 FALSE FALSE TRUE TRUE TRUE TRUE C1
15 FALSE FALSE TRUE TRUE TRUE TRUE C1
16 FALSE FALSE TRUE TRUE TRUE TRUE C1
17 FALSE FALSE TRUE TRUE TRUE TRUE C1
18 FALSE FALSE TRUE TRUE TRUE TRUE C1
19 FALSE TRUE FALSE FALSE FALSE TRUE C0
20 FALSE TRUE FALSE FALSE FALSE TRUE C0
21 TRUE TRUE FALSE FALSE FALSE TRUE C0
22 TRUE TRUE FALSE FALSE FALSE TRUE C0
23 TRUE FALSE FALSE TRUE FALSE TRUE C0
24 TRUE FALSE FALSE TRUE FALSE TRUE C0
25 TRUE FALSE FALSE TRUE FALSE TRUE C0
26 TRUE TRUE FALSE TRUE TRUE TRUE C1
27 TRUE TRUE TRUE FALSE FALSE TRUE C0
28 TRUE TRUE TRUE FALSE FALSE TRUE C0
29 TRUE FALSE TRUE TRUE TRUE TRUE C1
30 TRUE TRUE TRUE TRUE TRUE FALSE C0
of sensitive knowledge that is preserved can be also be es-
timated.
The second and third measures are related to the side-
effects of the hiding process. Second metric Miss Cost is
one that deals with the fraction of the non-sensitive classi-
fication rules which are mined from D and cannot be mined
from the reconstructed database D1. The third metric Arti-
factual Rules is the fraction of the rules which are not de-
rived from the original database D, but can be derived from
the reconstructed database D1.
The fourth measure is the Usability of the reconstructed
database. It is measured through the ability of an attribute
to classify the database. In order to increase the usability
of the reconstructed database the classification model con-
structed from the reconstructed database should be as close
as to the model constructed with the original database.
It means the parameter information gain of the attributes
in the reconstructed database must be with the minimum
difference with the information gain of the attributes in the
original database. Hence usability is calculated as the sum
of the differences between the information gains of the
attributes in D and D1.
4.1 Hiding Failure (HF)
The hiding failure is calculated as follows:
HF =
|SCR(D1)|
|SCR(D)|
where |SCR(D1)| indicates the number of sensitive classi-
fication rules revealed fromD1, and |SCR(D)|denotes the
number of sensitive classification rules discovered from D.
4.2 Miss Cost (MC)
The miss cost is calculated as:
MC =
|NSCR(D)| − |NSCR(D1)|
|NSCR(D1)|
Where |NSCR(D)| refers to the number of non-sensitive
classification rules revealed from D and |NSCR(D1)|
Decision Tree Based Data Reconstruction for. . . Informatica 41 (2017) 289–304 299
Table 5: Characteristics of the Datasets.
S.No
Name of the
Database
No.of
Instances
No.of
Attributes
1 PIMA - DIABETES 768 9
2
GERMAN CREDIT
RATING 1000 21
3
CONGRESSIONAL
VOTING RECORDS 435 17
4 MUSHROOM 8124 23
refers to the number of non-sensitive classification rules
discovered from D1.
4.2.1 Artifactual Rules (AR)
This is measured as:
AR =
|CR′| − |CR ∩ CR′|
|CR′|
Where |CR| and |CR′| stands for, number of classification
rules that are generated from D and D1 respectively.
4.2.2 Usability
The difference between the gains of the attributes is mea-
sured as:
U =
√∑m
i=1(
oi−ri
oi
)2
m
∗ 100
Where oi and ri are the gain ratios for the ith attribute on
D and D1 and m is the number of attributes in D.
A classification rule hiding algorithm with no hiding fail-
ure and artifactual rules i.e. 0% of HF and AR and with re-
duced miss cost and high usability of the D1 is considered
as an efficient algorithm.
5 Experimental results
Experiments were conducted by considering the real
life databases PIMA-DIABETES, GERMAN CREDIT
RATING, CONGRESSIONAL VOTING RECORDS and
MUSHROOM which are available in UCI data reposi-
tory[10]. The characteristics of the databases used in the
experiments were shown in Table 5.
The results of the proposed method are compared with
a classification rule hiding method by considering gain ra-
tios, proposed by Natwichai in [3]. The Natwichai(Gain)
method was also a reconstruction based method. Initially it
constructs a decision tree from non-sensitive classification
rules, and then each path is simply generated as a set of
tuples in reconstructed database. In the proposed method,
after constructing the tree from the non-sensitive classifica-
tion rules and at the time of reconstructing the database the
missing values are identified efficiently by considering the
probability of the possible values in the original database.
Hence the usability of the reconstructed database increases
by reducing the miss cost and artifactual rules.
Experiments were conducted with four classification al-
gorithms: C4.5(J48), PART, BF TREE and AD TREE
which are rule based algorithms available in weka tool.
In the experiments, same classification algorithm was used
twice i.e. once on D and second onD1 to discover the clas-
sification rules which are used to evaluate the performance
measures. All the experiments were done by selecting only
one classification rule as sensitive rule while all the remain-
ing as non-sensitive rules. After the classification rules are
generated by the algorithm, randomly one rule is selected
as sensitive.
By applying C4.5, PART, BF TREE and AD TREE
algorithms on the PIMA-DIABETES database the gen-
erated classification rules are 20, 13, 3 and 21 with an
accuracy of 84.5, 81.25, 77.21 and 79.69 respectively.
After reconstructing the database by using the proposed
algorithm, the rules generated are 20, 12, 2 and 20 with
an accuracy of 83.98, 80.48, 76.02 and 79.04 respec-
tively. With C4.5 on the reconstructed PIMA-DIABETES
database one non-sensitive rule was loosed, and one new
rule was generated.
Similarly, by applying C4.5, PART, BF TREE and AD
TREE algorithms on the GERMAN CREDIT RATING
database the generated classification rules are 103, 78, 39
and 21 with an accuracy of 85.5, 89.7.84.2 and 75.4 respec-
tively. After reconstructing the database by using the pro-
posed algorithm, the rules generated are 101, 77 38 and 20
with an accuracy of 84.9, 89.01, 87.6 and 75.1 respectively.
With C4.5 on GERMAN CREDIT RATING reconstructed
database one non-sensitive rule was loosed, and three new
rules were generated.
By applying C4.5, PART, BF TREE and AD TREE al-
gorithms on MUSHROOM database the generated classifi-
cation rules are 25, 13, 7 and 21 with an accuracy of 100,
100, 99.95 and 99.9 respectively. After reconstructing the
database by using the proposed algorithm, the rules gen-
erated are 23, 12, 6 and 20 with an accuracy of 100, 100,
98.53 and 98.14 respectively. With C4.5 and AD TREE on
Mushroom reconstructed database one non-sensitive rule
was loosed, and with AD TREE one new rule was gener-
ated.
By applying C4.5, PART, BF TREE and AD TREE algo-
rithms on CONGRESSIONAL VOTING database the gen-
erated classification rules are 6, 7, 36 and 21 with an ac-
curacy of 97.24, 97.47, 98.39 and 97.93 respectively. Af-
ter reconstructing the database by using the proposed algo-
rithm, the rules generated are 4, 6, 36 and 19 with an ac-
curacy of 96.25, 95.87, 98.14 and 96.89 respectively. With
all the four algorithms on CONGRESSIONAL VOTING
reconstructed database one non-sensitive rule was loosed,
and 1 and 2 new rules were generated with PART and BF
TREE respectively.
The results of the experiments with the proposed method
and Natwichai (Gain) method [3] on four databases with
four classification algorithms were shown in Table 6 and
Table 7 respectively.
300 Informatica 41 (2017) 289–304 G. Kalyani et al.
(a)
(b)
Figure 7: (a)Comparison of Miss Cost on Four Databases. (b)Comparison of Artifactual Rules on Four Databases.
Decision Tree Based Data Reconstruction for. . . Informatica 41 (2017) 289–304 301
(a)
Figure 8: Comparison of Difference between Gains of the Attributes on Four Databases.
302 Informatica 41 (2017) 289–304 G. Kalyani et al.
Table 6: Experimental Values of Proposed Method.
Database ClassificationAlgorithm
Original
Database
Reconstructed
Database
Performance
Measures
No.of
Rules
Accuracy
of the Model
No. of
Rules
Accuracy
of the Model HF MC AR
PIMQ-DIABETES
C4.5 20 84.15 19 83.98 0 1 1
PART 13 81.25 12 80.48 0 0 0
BF TREE 3 77.21 2 76.02 0 1 1
AD TREE 21 79.69 21 79.04 0 0 1
GERMAN CREDIT
RATING
C4.5 103 85.5 101 84.9 0 3 2
PART 78 89.7 78 89.01 0 0 1
BF TREE 39 84.2 38 87.6 0 0 0
AD TREE 21 75.4 20 75.1 0 0 0
CONGRESSIONAL
VOTING RECORDS
C4.5 6 97.24 4 96.25 0 1 0
PART 7 97.47 6 95.87 0 1 1
BF TREE 36 98.39 36 98.14 0 1 2
AD TREE 21 97.93 19 96.89 0 1 0
MUSHROOM
C4.5 25 100 23 100 0 1 0
PART 13 100 12 100 0 0 0
BF TREE 7 99.95 6 98.53 0 0 0
AD TREE 21 99.9 20 98.14 0 1 1
Table 7: Experimental Values of Natwichai (Gain) Method.
Database ClassificationAlgorithm
Original
Database
Reconstructed
Database
Performance
Measures
No. of
Rules
Accuracy
of the Model
No. of
Rules
Accuracy
of the Model HF MC AR
PIMQ-DIABETES
C4.5 20 84.15 16 71.32 0 4 1
PART 13 81.25 10 66.25 0 2 0
BF TREE 3 77.21 3 25.73 0 1 2
AD TREE 21 79.69 17 64.51 0 4 1
GERMAN CREDIT
RATING
C4.5 103 85.5 89 71.87 0 16 3
PART 78 89.7 73 81.93 0 5 1
BF TREE 39 84.2 35 77.56 0 4 1
AD TREE 21 75.4 16 61.03 0 4 0
CONGRESSIONAL
VOTING RECORDS
C4.5 6 97.24 3 68.62 0 3 0
PART 7 97.47 5 75.69 0 2 1
BF TREE 36 98.39 29 80.25 0 8 2
AD TREE 21 97.93 18 84.93 0 3 1
MUSHROOM
C4.5 25 100 21 89.06 0 5 2
PART 13 100 10 86.92 0 2 0
BF TREE 7 99.95 5 81.39 0 1 0
AD TREE 21 99.9 17 89.62 0 4 1
Generally the performance metrics are to be evaluated in
terms of the percentage as % of hiding failure, % of miss
cost, % of artifactual rules and % of the difference between
the gains of the attributes. The comparison of these param-
eters for both proposed and Natwichai (Gain) methods was
plotted in the Graphs. In both the methods, percentage of
hiding failure was zero i.e. no sensitive rules will be gen-
erated from the reconstructed databases. So the Graphs are
included only for the other two parameters i.e. miss cost,
artifactual rules and difference in gains of the attributes in
D and D1. The graphs were drawn in python by consider-
ing the Natwichai (Gain) and proposed method on X-axis,
the classification algorithms used to generate the rules from
the databases are on Z-axis and the parameter used for com-
parison in terms of percentages on Y-axis.
The comparison of the miss cost on four databases is
shown in Figure 7(a). with C4.5, PART, BF TREE and
AD TREE algorithms on PIMA-DIABETES the % of miss
cost with proposed method was 5.3, 0, 50 and 0 respec-
tively. with same algorithms on GERMAN CREDIT RAT-
ING database the % of miss cost with proposed method
was 2.9, 0, 0 and 0 respectively. with same algorithms on
CONGRESSIONAL VOTING RECORDS database the %
of miss cost with proposed method was 20, 16.67, 2.8 and
5 respectively. with same algorithms on MUSHROOM
database the % of miss cost with proposed method was
4.17, 0, 0 and 5 respectively. In all the four databases the
percentage of miss cost was reduced in proposed method
when compared to the existing method.
The comparison of artifactual rules on four databases is
shown in Figure 7(b). with C4.5, PART, BF TREE and
AD TREE algorithms on PIMA-DIABETES the % of arti-
factual rules with proposed method was 5.3, 0, 50 and 4.8
respectively. with same algorithms on GERMAN CREDIT
RATING database the % of artifactual rules with proposed
method was 1.9, 1.3, 0 and 0 respectively. with same
algorithms on CONGRESSIONAL VOTING RECORDS
database the % of artifactual rules with proposed method
was 0, 16.67, 5.7 and 0 respectively. with same algorithms
on MUSHROOM database the % of artifactual rules with
proposed method was 0, 0, 0 and 5 respectively. In all the
four databases the percentage of ghost rules generated was
reduced in the proposed method when compared to the ex-
isting method.
The comparison of the difference between the informa-
tion gains of the attributes in four databases is shown in
Figure 8(a). with C4.5, PART, BF TREE and AD TREE
algorithms on PIMA-DIABETES the % of difference be-
tween the information gains with proposed method was
27.42, 14.83, 26.12 and 12.23 respectively. with same al-
gorithms on GERMAN CREDIT RATING database the
% of difference between the information gains with pro-
posed method was 21.19, 15.1, 24.3 and 25.5 respectively.
with same algorithms on CONGRESSIONAL VOTING
RECORDS database the 5.3 respectively. with same al-
gorithms on MUSHROOM database the % of difference
between the information gains with proposed method was
10.9, 7.7, 7.3 and 5.5 respectively. The proposed algorithm
reduces the difference in gains of the attributes thereby in-
Decision Tree Based Data Reconstruction for. . . Informatica 41 (2017) 289–304 303
creasing the usability of the reconstructed database which
is going to be released without compromising on privacy of
the sensitive rules.
Hence, the experimental assessment clearly indicates
that the proposed method will reconstruct a database by
hiding all the sensitive rules, with minimum loss in non-
sensitive rules, minimum artifactual rules generated and
by improving the usability of the reconstructed database.
6 Conclusion
Preserving the privacy of sensitive classification rules is a
very important issue in application areas that involves col-
laboration with data sharing. A new algorithm is projected
for defending the sensitive classification rules from disclo-
sure. With the projected method which is reconstruction
based classification rule hiding, new database will be re-
constructed from which sensitive rules will not be disclosed
and the side effects of the hiding process miss cost and ar-
tifactual rules are kept minimal. Moreover, the usability of
reconstructed database will be maximized to make it useful
with valid data mining results for a data analyst. The ex-
perimental analysis of the results is the evidence to indicate
that the proposed algorithm is effective, i.e. it can preserve
the privacy and data utility very well.
References
[1] Mahdi Aghasi and Rozita Jamili Oskouei (2016),Pri-
vacy Preserving Data Mining Survey of Classifica-
tions, Soft Computing Applications, Advances in In-
telligent Systems and Computing,Springer.
[2] Neha Jain and Lalit Sen Sharma, An Ontology based
on the Methodology Proposed by Ushold and King,
International Journal of Synthetic Emotions, Volume
7 Issue 1, January 2016 , Pages 13-26, DOI: 10.4018/
IJSE.2016010102.
[3] Reena, Raman Kumar, Effect of Randomization for
Privacy Preservation on Classification Tasks, Pro-
ceedings of the International Conference on Informat-
ics and Analytics(ICIA-2016), ICPS: ACM Interna-
tional Conference Proceeding Series.
[4] Kalles, Dimitris, Vassilios S. Verykios, and Athana-
sios Papagelis. "Hiding decision tree rules by data set
operations." Information, Intelligence, Systems and
Applications (IISA), 2015 6th International Confer-
ence on. IEEE, 2015.
[5] Aldeen, Yousra Abdul Alsahib S., Mazleena Salleh,
and Mohammad Abdur Razzaque. "A comprehensive
review on privacy preserving data mining." Springer-
Plus 4.1 (2015): 694.
[6] Xu L.,Jiang C.,Wang J.,Yuan J.and Ren
Y.(2014).Information security in big data: pri-
vacy and data mining.IEEE, 1149-1176.
[7] Dhanalakshmi, M., and E. Siva Sankari. (2014).Pri-
vacy preserving data mining techniques-survey.In
Proc.of International Conference on Infor-
mation Communication and Embedded Sys-
tems(ICICES),IEEE.
[8] Chouragade, Komal N., and Trupti H. Gurav.
(2014).A Survey on Privacy-Preserving Data Mining
using Random Decision Tree.Int.J.Science and Re-
search,pp 2891-2894.
[9] Taneja S., Khanna S.,Tilwalia S. and Ankita
.(2014).A Review on Privacy Preserving
Data Mining:Techniques and Research Chal-
lenges.International Journal of Computer Science
and Information Technologies pp 2310-2315.
[10] Lichman M. (2013). UCI Machine Learning Repos-
itory [http://archive.ics.uci.edu/ml]. Irvine, CA: Uni-
versity of California, School of Information and Com-
puter Science.
[11] Stephen O’Shaughnessy and Geraldine Gray, Devel-
opment and Evaluation of a Dataset Generator Tool
for Generating Synthetic Log Files Containing Com-
puter Attack Signatures, : International Journal of
Ambient Computing and Intelligence (IJACI),2011,
DOI: 10.4018/jaci.2011040105.
[12] Alka Gangrade , Durgesh Kumar Mishra , Ravindra
Patel, Classification Rule Mining through SMC for
Preserving Privacy Data Mining: A Review, Inter-
national Conference on Machine Learning and Com-
puting IPCSIT vol.3 (2011) c© (2011) IACSIT Press,
Singapore.
[13] Hatice Gunes and Maja Pantic, Automatic, dimen-
sional and Continuous Emotion recognition, In-
ternational Journal of Synthetic Emotions, Vol-
ume 1, Issue 1 c© 2010, IGI Global, DOI:
10.4018/jse.2010101605.
[14] Delis A, Verykios V.S, Tsitsonis A.(2010) A Data
Perturbation Approach to Sensitive Classification
Rule Hiding,25th Symposium On Applied Computing.
[15] Gkoulalas-Divanis A, Verykios VS (2010) Associa-
tion rule hiding for data mining. Springer, New York.
[16] Kadampur M., D.V.L.N S.(2010)A noise Addition
scheme in Decision tree for privacy preserving data
mining. Journal of Computing,137-144.
[17] Upadhayay A. K., Aggarwal A.,Masand R. and Gupta
R. (2009).Privacy Preserving Data Mining: A New
Methodology for Data Transformation. Proc. of the
First International Conference on Intelligent Human
Computer Interaction, Springer India.
304 Informatica 41 (2017) 289–304 G. Kalyani et al.
[18] Aliki Katsarou, Aris Gkoulalas-Divanis, and Vassil-
ios S. Verykios (2009) Reconstruction-based Classifi-
cation Rule Hiding through Controlled Data Modifi-
cation, IFIP International Federation for Information
Processing, Volume 296; Artificial Intelligence Appli-
cations and Innovations III,Springer pp. 449–458.
[19] Juggapong Natwichai Xue Li Maria E. Orlowska
(2006) Reconstruction-based Algorithm for Clas-
sification Rules Hiding, Seventeenth Australasian
Database Conference (ADC2006), Hobart, Australia.
Conferences in Research and Practice in Information
Technology (CRPIT), Vol.49.
[20] K. Wang, B.C.M. Fung, P.S. Yu (2005) Template-
Based Privacy Preservation in Classification Prob-
lems, 5th IEEE International Conference on Data
Mining, pp. 466-473.
[21] Natwichai J., Li X.,Orlowska M. (2005), Hiding clas-
sification rules for data sharing with privacy preser-
vation, Proceedings of 7th International Conference
on Data Warehousing and Knowledge Discovery,
Lecture Notes in Computer Science, Springer, pp.
468–467.
 Informatica 41 (2017) 305–315 305 
Accelerating XML Query Processing on Views 
Yin-Fu Huang and Yu-Hsien Cho 
National Yunlin University of Science and Technology 
123 University Road, Section 3, Touliu, Yunlin, Taiwan 640, R.O.C. 
E-mail: huangyf@yuntech.edu.tw, http://mdb.csie.yuntech.edu.tw 
 
Keywords: XPath, labeling schemes, materialized views, T-Bitmap, tag index, value index, navigation, twig query 
Received: April 13, 2016 
 
With the widespread use of the eXtensible Markup Language (XML), more and more applications store 
and query XML documents in XML database systems. Thus, how to efficiently process a query and find 
the specified patterns conforming the query from XML documents is a crucial issue. In this paper, some 
processing methods are employed on XML documents to improve document retrieval. First, a 
materialized view is built from an original document for each query. Then, on each materialized view, 
auxiliary structures such as T-Bitmap and indexes are also built to further accelerate query processing. 
Finally, four experiments are conducted to show the superiority of the proposed approach. 
Povzetek: Predstavljena je metoda za hitrejše iskanje po bazah XML dokumentov. 
1 Introduction 
Since XML (eXtensible Markup Language) was widely 
used to exchange data over the web, more and more 
applications store and query XML documents in XML 
database systems. Different from other data formats, an 
XML document is composed of elements and values with 
a nested structure, and could be modeled as a tree 
structure. XPath and XQuery are the standard XML 
query languages proposed by W3C. They can be used to 
describe patterns with specified predicates on multiple 
elements with tree structured relationships. However, 
how to efficiently process a query and find the specified 
patterns conforming the query from XML documents is a 
crucial issue. 
In the past, different methods have been proposed in 
querying XML documents. One of research directions 
was to build materialized views on XML documents. The 
goal is to reduce the number of visited nodes during tree 
traversing by searching from the root of a materialized 
view, rather than from the root of an original XML 
document tree. Another research direction was to 
construct index or access methods to query XML 
documents for facilitating query processing. In this 
paper, we integrate the methods from these two research 
directions as our motivation for accelerating XML query 
processing on views. The reason is that the performance 
of a materialized view is better than a non-materialized 
view because not only these data can be accessed without 
re-materialization, but also they can be fetched faster by 
building indexes on these data beforehand. Besides, a 
materialized view is usually used in accessing a large 
amount of data, such as data warehouse applications, in 
support of management’s decision-making process 
through OLAP queries, almost read operations In short, 
the motivation is for decision makers to accelerate XML 
query processing in a data warehouse. 
In summary, we highlight the contributions of this 
paper as follows: 
1) In this study, we build materialized views from an 
XML document for each query to reduce the search 
space of queries, and also build auxiliary structures 
such as T-Bitmap and indexes to further accelerate query 
processing. 
2) Comprehensive experiments are conducted to verify 
the superiority of the proposed approach. 
3) The space vs. time issue is explored when multiple 
materialized views are integrated together to save the 
space. 
The remainder of this paper is organized as follows. 
Section 2 presents the previous work proposed in 
querying XML documents. In Section 3, basic concepts 
such as query processing and materialized views on 
XML documents are introduced. In Section 4, we 
propose a system architecture consisting of view 
processing and query processing. In Section 5, four 
experiments are conducted to show the superiority of our 
approach. Finally, we make conclusions in Section 6. 
2 Previous work 
As mentioned in Section 1, one research direction on 
querying XML documents was to build materialized 
views on XML documents to reduce the number of 
visited nodes during tree traversing, thereby leading to 
faster query processing. Godfrey et al. [1], and Murthy 
and Banerjee [2] proposed SQL/XML syntax for query 
processing on views, whereas Halevy [3] and Jayavel et 
al. [4] proposed various query syntax such as join to 
handle views and focused on the problem of evaluating 
XML queries over XML views of relational data. 
However, users must be familiar with these various query 
syntax. Katsifodimos et al. [5] considered choosing the 
306 Informatica 41 (2017) 305–315 Y.-F. Huang et al. 
best views to materialize within a given space budget to 
improve the performance of a query. Roantree and Liu 
[6] approach is to segment a materialized view into 
fragments to minimize the effect of view changes. 
Bonifati et al. [7] presented an algebraic approach for 
propagating source updates to materialized views. Wu et 
al. [8, 9] proposed a bitmapped materialized views 
approach for optimizing XML queries. Gosain et al. [10] 
provided a survey of materialized view evolution 
methods, which aims at studying the materialized view 
evolution in relational databases and data warehouses as 
well as in a distributed setting. Gosain and Sachdeva [11] 
drew several conclusions about the status quo of 
materialized view selection and a future outlook is 
predicted on bridging the large gaps that were found in 
the existing methods. 
Another research direction was to construct index or 
access methods to query XML documents, also 
improving query processing. Some studies investigated 
constructing index methods to query XML documents 
[12-15]. Bruno et al. [12] and Jiang et al. [13] used a 
structure join method to determine element relationships 
based on the numbering scheme. This method has good 
performances for an ancestor-descendant axis, but it 
might fetch useless nodes for a parent-child axis, because 
all descendant nodes must be accessed to check if they 
are real children. Therefore, Huang and Wang developed 
an efficient query processing algorithm for retrieving 
XML documents [14]. Hsu et al. also proposed a path 
clustering method based on the concept of summary 
indexes for the processing of both structural and content 
queries on XML documents [15]. Karthiga and 
Gunasekaran [16] used tree-based association rules to 
mine the semantics from XML documents, which 
provide information on both the structure and the content 
of XML documents. The mined knowledge is used to 
provide the quick answers to queries and an approach 
called path based indexing is used to improve the speed 
of data retrieval. Alghamdi et al. [17], and Thi Le et al. 
[18] proposed approaches to optimizing twig queries by 
utilizing the semantics/constraints defined in XML 
schemas. Furthermore, Ordonez focused on the 
optimization of linear recursive queries in SQL [19]. 
Subramaniam and Haw [20] proposed an XML labeling 
scheme that helps quick determination of structural 
relationship among XML nodes and supports dynamic 
updates without relabeling nodes in case of update 
occurrences. Belgamwar et al. [21] follows an upside 
down approach which explicitly stores the values and 
only reconstructs the internal nodes, if needed. As a 
solution, they proposed a compressed internal storage 
format for native XML database systems where the inner 
structure of the gathered documents is virtualized. Ferro 
and Silvello [22] introduced a new paradigm where 
traditional approaches based on traversing trees are 
replaced by a brand new one based on basic set 
operations which directly return the desired subtree, 
avoiding to create it. Tudor [23] proposed an 
optimization model for XML data processing based on a 
heuristic algorithm to extract data from XPath views. 
3 Basic concepts 
3.1 XML documents 
XML is a markup language which was proposed by W3C 
in 1996. The main purpose of the standard language is to 
provide data descriptions and data exchanges across 
different platforms. Like other markup languages, the 
contexts of XML are declared between start and end tags; 
however, especially different from others, the tags can be 
flexibly defined by users to describe data, and 
furthermore XML is supported in different platforms and 
systems. That is why it becomes the most common 
format for data exchanges. 
An XML document is with a nested structure, and it 
could be represented as a rooted, ordered, and labeled 
tree structure. Figure 1 and Figure 2 illustrate an XML 
document and its corresponding tree representation, 
respectively. In the document, there is a unique root 
element called “root” and one of the descendant 
elements, called “Book”, has seven child element nodes; 
i.e., Title, Chapter, Para, Author with an attribute node 
“Id”, Publisher, Name, Email, and their texts. The 
symbols as shown in Figure 2 are circles, rectangles, and 
triangles; they represent elements, texts, and attributes, 
respectively. 
 
<?xml version="1.0" standalone="yes"?>
<root>
   <store>
      <Books category="Technology">
         <Book>
            <Title>How to know XML</Title>
            <Chapter>
               Introduction to XML
               <Para>Your First XML</Para>
            </Chapter>
            <Author Id="Q345">John</Author>
            <Publisher>
               <Name>XML tech</Name>
               <Email>John@hpdiy.zzn.com</Email>
            </Publisher>
         </Book>
         <Book>
              
         </Book>
      </Books>
   </store>
</root>
...
 
Figure 1: XML document. 
3.2 XPath 
XPath (XML Path Language) is an expression language 
for addressing and querying an XML document. In 
XPath expressions, each step is separated by "/" and 
contains three components: axis, node test, and 
predicate. Axis defines the relationship to be followed in 
the document tree. Node test defines what kind of nodes 
Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 307 
is required. Predicate is optional and provides the 
capability to filter nodes, according to selection criteria. 
Given an XPath example “//child::Publisher 
[child::Name='XML tech'] /child::Email” , it is to get the 
email of the publisher whose name is “XML tech”. When 
navigating the XML document, it must start from the 
root element “root”, then the descendant node 
“Publisher”. Beneath “Publisher”, we search the child 
nodes to find the node called “Email”. Besides, during 
the search, it must have a child node called “Name” 
whose text matches with the specified predicate “XML 
tech”. In general, the example above can be abbreviated 
to “//Publisher [Name ='XML tech'] /Email”. 
 
 
Figure 2: XML document tree. 
3.3 Labeling schemes 
One of the major query searches is to determine the 
relationships between nodes. In order to determine 
element relationships quickly, several different labeling 
schemes have been proposed. O'Connor and Roantree 
categorized labeling schemes into containment schemes, 
prefix schemes and prime number schemes [24]. Here, 
labeling schemes are classified into prefix-based ones 
and region-based ones (or containment schemes). 
Dewey code [25] is a prefix-based labeling scheme 
that records the position information of a node, according 
to the path from the root to the node. For example, 
Dewey-id of node “Para” is 1.1.1.1.2.2, and indicates that 
we can get node “Para” if we search alone the path (the 
first node of level 1, the first node of level 2, the first 
node of level 3, the first node of level 4, the second node 
of level 5, the second node of level 6). Besides, since 
(1.1.1.1.2) is the prefix of (1.1.1.1.2.2), the relationship 
between node “Chapter” (1.1.1.1.2) and “Para” 
(1.1.1.1.2.2) can be deduced as a parent-child one. 
However, the drawback of the prefix-based labeling 
scheme is its lengthy Dewey codes, especially when the 
levels of an XML document tree are too deep. 
The region-based labeling scheme [12] is another 
numbering scheme. The label contains three elements 
(start, end, level) where the start value and end value 
forms a region. The region of an upper-level node (i.e., 
ancestor or parent) must cover those of lower-level nodes 
(i.e. children or descendants). In other words, if node A 
covers node B, then A.start < B.start and B.end < A.end. 
Besides, the level value represents the node level in a 
document tree. With the coverage information, we can 
determine the relationships between nodes quickly. As 
for the labeling, we can label each node by traversing an 
XML document tree in a depth-first search way. 
3.4 XML document storage 
An XML documents can be stored in a few different 
forms, such as in flat files, in relational databases, and in 
native XML databases. For an XML document to be 
stored in flat files, we need to parse the files in advance 
before accessing them. Although it is the simplest form, 
the parsing time would be very lengthy when the XML 
document size is too large. Besides, it also incurs multi-
user access and concurrency control problems. For an 
XML document to be stored in relational database, since 
the XML document is a tree structure, it must use some 
middleware to translate the XML format into relational 
tables. Besides, when querying the XML document, it is 
also necessary to translate a query into an SQL 
statement, and execute join operations repeatedly among 
different relation tables, so that it exposes lower 
efficiency. Native XML databases aim to provide 
complete XML document storage and manipulation. 
Different from other database systems, native XML 
databases use an XML document as a basic unit of 
storage, and defines an XML model used to store and 
retrieve XML documents. 
3.5 Materialized views 
A view is a virtual and derived table defined by users for 
facilitating to express a complicated query. Rather than 
physically stored as parts of a database, a view definition 
is merely recorded by the database system. It is evaluated 
only when a user issues a query involving this view. 
However, a materialized view is the one which is 
physically stored in the database, in addition to its 
definition. Absolutely, the performance of a materialized 
view is better than a non-materialized view because not 
only these data can be accessed without re-
materialization, but also they can be fetched faster by 
building indexes on these data beforehand. Thus, a 
materialized view is usually used in accessing a large 
amount of data, such as a data warehouse or in business 
intelligence applications, where we need to take more 
time to query them. A data warehouse is a subject-
oriented, integrated, time-variant, and nonvolatile 
collection of data in support of management’s decision-
making process. It can be accessed by decision makers 
through OLAP queries, almost read operations. In short, 
our design is for decision makers to accelerate XML 
query processing in a data warehouse. 
In this paper, based on native XML databases, we 
use materialized views to query required data from an 
original document. Here, a materialized view can be 
defined using the “CREATE MATERIALIZED VIEW” 
308 Informatica 41 (2017) 305–315 Y.-F. Huang et al. 
function and an XPath expression. For the materialized 
views on an original document, we build auxiliary files 
and construct indexes using numbering schemes to avoid 
unnecessary sub-tree traversal, thereby improving the 
navigation efficiency of a query. 
4 System architecture 
4.1 Overview 
In order to achieve faster query processing on the views 
defined in a native XML database, we propose a system 
architecture consisting of an offline phase and an online 
phase, as shown in Figure 3. In the offline phase called 
view processing, we build view-relevant structures such 
as T-Bitmap and indexes to accelerate later query 
processing. In the online phase called query processing, 
the system can promptly respond to view-based queries, 
utilizing the T-Bitmap and indexes built beforehand. 
 
 
Figure 3: System architecture. 
4.2 View processing 
In this section, the motivation of view materialization is 
introduced first. Then, we build relevant structures such 
as T-Bitmap and indexes on materialized views to further 
accelerate query processing. 
4.2.1 View pre-processing 
Usually, an Xpath expression is used to address and 
query an XML document. However, for the query 
execution, the system always searches an XML 
document tree from the root. When a query is frequently 
executed, the system performance would be degraded 
since a large amount of unnecessary sub-tree traversal 
cannot be avoided. For the query with an Xpath 
expression as shown in Figure 4, we can define a 
materialized view beforehand, which is rooted from node 
“Books” with an attribute “category” matching with the 
specified predicate “Technology”, as shown in Figure 5. 
Then, the materialized view can be created from the 
original document, as shown in Figure 6. Thus, rather 
than traversing the original document tree always from 
the root, the system only needs to search the materialized 
view, thereby improving the navigation efficiency of the 
query. 
 
 
Figure 4: Query with an XPath expression. 
 
Figure 5: View definition. 
<?xml version="1.0" standalone="yes"?>
      <Books category="Technology">
         <Book>
            <Title>How to know XML</Title>
            <Chapter>
               Introduction to XML
               <Para>Your First XML</Para>
            </Chapter>
            <Author Id="Q345">John</Author>
            <Publisher>
               <Name>XML tech</Name>
               <Email>John@hpdiy.zzn.com</Email>
            </Publisher>
         </Book>
         <Book>
            <Title>Small World</Title>
            <Chapter>
               Q&A
               <Para>The One</Para>
            </Chapter>
            <Author Id="A854">Jimmy</Author>
            <Publisher>
               <Name>Network</Name>
               <Email>Jimmy@hpdiy.zzn.com</Email>
            </Publisher>
         </Book>
      </Books>
...
  
Figure 6: Materialized view. 
Before building T-Bitmap and indexes on a 
materialized view to further accelerate query processing, 
we must determine the relationships between nodes (i.e., 
parent-child axes and ancestor-descendant axes) in a 
materialized view using the region-based labeling 
scheme as mentioned in Section 3.3. We traverse a 
materialized view and label nodes in a depth-first search 
way. When a node is visited first, its start value is 
created; when we leave the node, the end value is 
labeled. After traversing the whole materialized view, all 
the nodes in the view are completely labeled as shown in 
Figure 7. 
 
CREATE MATERIALIZED VIEW mv AS( 
SELECT extract(sys_nc_rowinfo$, 
'/root/store/Books[@category="Technology"]') 
FROM XMLTABLE); 
XPath:/root/store/Books[@category="Technology"] 
/Book[Title="How to know XML"]/Publisher 
/Email=‘John@hpdiy.zzn.com’ 
Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 309 
Book(1,25)
Title(2,4)
How to know 
XML
(3,3)
Introduction 
to XML
(6,6)
Para(7,9)
Your First 
XML
(8,8)
Chapter(5,10)
Id(12,14)
Q345
(13,13)
John(15,15)
Author(11,16)
Publisher(17,24)
Name(18,20)
XML tech
(19,19)
John@hpdiy.zzn.com
(22,22)
Email(21,23)
Books(0,251)
…
 
Figure 7: Labeling nodes in a depth-first search way. 
4.2.2 Building T-Bitmap 
T-Bitmap is a bit string type, which is used to record 
what descendant nodes are beneath a current node. First, 
a dictionary recording the positions in T-Bitmap and the 
corresponding tags is created, as shown in Table 1. Then, 
the T-Bitmap value on each node can be calculated using 
OR operators. For an example as shown in Figure 8, the 
T-Bitmap of node “Publisher” can be calculated by 
combining the T-Bitmaps of “Name”, “Email”, and itself 
using OR operators. 
Table 1: Dictionary: positions in T-Bitmap and 
corresponding tags. 
Position 0 1 2 3 
Tag Books Book Title Chapter 
Position 4 5 6 7 
Tag Para Author Id Publisher 
Position 8 9 - - 
Tag Name Email - - 
 
        Name         0000000010
OR   Email         0000000001
-----------------------------------------------------
                            0000000011
OR  Publisher   0000000100
-----------------------------------------------------
                                 0000000111
                         (b)
Title(2,4)
(0010000000)
Chapter
(5,10) Author(11,16) Publisher(17,24)
Book(1,25)
(0111111111)
Para(7,9)
(0000100000)
Id(12,14)
(0000001000)
Name(18,20)
(0000000010)
Email(21,23)
(0000000001)
(0000000111)(0000011000)(0001100000)
OR operator
OR operator
Level2
Level3
Level4
(a)
Level1
...
Books(0,251)
(1111111111)
OR operator
 
Figure 8: Combing T-Bitmaps using OR operators. 
4.2.3 Building index 
Here, we use labeling codes to build two kinds of index 
trees; i.e., tag index trees and value index trees. To 
illustrate the tag index construction, we extend the 
storage model in Figure 7. As shown in Figure 9, we can 
see a lot of nodes with the same tag names but with the 
different labeling codes; e.g., node “Email(21, 23)” and 
“Email(46, 48)”. We can build the index tree of each tag 
using the start values in labeling codes as keys and the 
well-known B+-tree algorithm, as shown in Figure 10 
where the pointers of a leaf node in the tag index tree 
indicate the positions of corresponding nodes in the 
materialized view. For the query with an XPath: 
“Book//Email”, when processing the current node 
“Book(26, 50)”, we can use the tag index tree of “Email” 
to locate each leaf node by following the dotted path, and 
find out node “Email(46, 48)” covered by node 
“Book(26, 50)”. 
  
Figure 9: Extended storage model. 
  
Figure 10: Tag index tree. 
Besides, we can also build a value index tree 
according to the text values of nodes in the document, as 
shown in Figure 11. The construction method is the same 
as that used to build tag index trees. However, we 
generate only one value index tree for each materialized 
view, and the records of a leaf node are with the [text, 
start, pointer] format where the pointers also indicate the 
positions of corresponding nodes in the materialized 
view. 
4.3 Query processing 
In this section, the query transformation based on a 
materialized view is introduced first. Then, according to 
different axes specified in the transformed query, we 
make use of the T-Bitmap and indexes built in the view 
processing to accelerate query processing. Finally, we 
also introduce subsequent processing for a query 
specifying the particular predicate in an Xpath 
expression. 
John,15
Introduction to XML,6 Q345,13 The One,33
pA854,38
pHow to know XML,3
pIntroduction to XML,6
pJimmy@hpdiy.zzn.com,47
pJimmy,40
pJohn,15
pNetwork,44
pJohn@hpdiy.zzn.com,22
pQ345,13
pSmall World,28
pQ&A,31
pThe One,33
PYour First XML,8
pXML,19
 
Figure 11: Value index tree. 
310 Informatica 41 (2017) 305–315 Y.-F. Huang et al. 
4.3.1 Query pre-processing 
After materialized views are created, a query should be 
transformed based on its corresponding materialized 
view. For the query as shown in Figure 4, its Xpath 
expression (traversing from node root) is transformed 
into a new one (traversing from node “Books”) as shown 
in Figure 12. 
 
 
Figure 12: Query transformation based on a materialized 
view. 
Before utilizing T-Bitmap and indexes to accelerate 
query processing, we must deal with parent-child axes 
and/or ancestor-descendant axes specified in the 
transformed query. We execute navigation or indexing 
according to different axes, and recursively check nodes 
in the document tree. 
4.3.2 Navigation 
For a parent-child axis, we use a navigation way to 
search nodes in the document tree. Here, T-Bitmap can 
be used to avoid unnecessary search during the 
navigation, since it provides the information whether 
result nodes are beneath the current processing node. For 
a query “/R/A/C” as shown in Figure 13, we can use an 
AND operator to determine whether node C is beneath 
the current processing node. First, for current node R, 
Query(11010)^R(11111)=Query(11010) indicates node 
C is beneath node R. Then, for the next node A1, 
Query(01010)^A1(01100)Query(01010) indicates node 
C cannot be beneath node A1, and we do not need to 
search the sub-tree rooted at node A1. Next, for node A2, 
Query(01010)^A2(01110)=Query(01010) indicates node 
C is beneath node A2. Finally, we recursively check 
nodes in the document tree until node C is found. 
4.3.3 Indexing 
For a query “/A//B∙ ∙ ∙” specifying an ancestor-
descendant axis as shown in Figure 14, although we can 
also use T-Bitmap to search node B, the search based on 
a parent-child axis would go through a lot of unnecessary 
intermediate nodes. Therefore, we use indexes to directly 
search a descendant node, instead of using T-Bitmap. As 
mentioned in Section 4.2.3, we use the start value in the 
labeling code of the ancestor as the key to search the tag 
index tree of the descendant. Then, we check whether the 
descendant node is covered by the ancestor node; if yes, 
we fetch the descendant node and proceed to parse the 
query downward. 
4.3.4 Subsequent processing 
In this section, we investigate the processing for the 
query specifying a particular predicate. One is the query 
specifying values, and another is the twig query. 
4.3.4.1 Query specifying values 
For a query “/Books/Book[Author = ‘Jimmy’]/Publisher 
= ‘XML tech’ ”, two values are specified for filtering 
nodes in the document. As shown in Figure 15, we use 
the value index tree as mentioned in Section 4.2.3 to find 
out value nodes “Jimmy(12)”, “XML tech1(7)”, and 
“XML tech2(15)”, and then put them into their 
corresponding queues, respectively. When processing 
node “Book” in the query, we fetch node “Book1”, and 
find that node “Jimmy(12)” is not covered by node 
“Book1”; i.e., [2,9] < 12. Then, for the next node 
“Book2”, although node “Book2” covers node 
“Jimmy(12)”, node “XML tech1(7)” is not covered by 
node “Book2”. Then, we choose the next value node 
“XML tech2(15)” and do the same inspection. Finally, 
we find out node “Book2” is the result node. 
Book1
1111
Books
Book2
John
XML
tech1
Jimmy
XML
tech2
0111 0111
0010 0001 0010 0001
Author1 Author2Publisher1 Publisher2
(1,18)
(2,9) (10,17)
(3,5) (6,8) (11,13) (14,16)
4 7 12 15
Book
Jimmy
0111
0010 0001
Author Publisher
Query: /Books/Book[Author = ‘Jimmy’]/Publisher = ‘XML tech’
BookBooks PublisherAuthor
T-Bitmap
XML
tech
Jimmy
XML
tech2
XML
tech1
V1 V2
V1 queue V2 queue
Books 1111
 
Figure 15: Value node processing. 
SELECT extract(sys_nc_rowinfo$,'/Books 
/Book[Title="How to know XML"]/Publisher 
/Email=' John@hpdiy.zzn.com '') 
FROM mv; 
R
A
C
Query
11010
01010
00010
R
A3A2A1
B B C B D
11111
01100
00100 00100 00010 00100 00001
01110 01101
R A B C D
T-Bitmap
 
Figure 13: Navigation using T-Bitmap. 
A
B
.......
......
Ancestor-descendant 
relationship
intermedia nodes
 
Figure 14: Unnecessary intermediate nodes. 
Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 311 
4.3.4.2 Twig Query 
Besides, for a query “/Book[Publisher]/Author” as shown 
in Figure 16(a), we can also process it in the same way as 
done in Section 4.3.2. As shown in Figure 16(b), two 
solutions “Book, Publisher, Author1, Jim” and “Book, 
Publisher, Author2, Jimmy” exist in the document. They 
can be found by 1) merging Path1 and Path2, and 2) 
merging Path1 and Path3, as shown in Figure 16(c). 
 
Book
Publisher Author
Book
Publisher Author1 Author2
Jim Jimmy
PublisherBook Author
T-Bitmap
Book
Publisher
Book
Author1
Book
Author2
twig query
/Book[Publisher]/Author
(a)
XML document
(b) (c)
Path 1 Path 2 Path 3
 
Figure 16: Twig query processing. 
5 Experiments 
In this section, four experiments are conducted to show 
the superiority of our approach proposed in this paper. 
These experiments are written in Java (JDK1.7) and 
conducted on an Intel Pentium4 3GHz CPU with 3G 
main memory in Windows 7. In the first experiment, we 
present the comparisons among different methods. In the 
second experiment, we investigate the effect of query 
types on different methods, especially on our method. In 
the third experiment, we use synthesis documents to 
analyze our method, and try to find some characteristics. 
In the last experiment, we address the space vs. time 
issue if multiple materialized views can be integrated 
together to save the space; in other words, more than one 
query would search from a materialized view. 
5.1 Comparisons among different methods 
In this experiment, we compare the search ways in 
different methods as shown in Figure 17. The first 
method is the original search way which is always from 
the root of a document tree. The second method was 
proposed by Godfrey et al. [1], which searches from the 
root of a materialized view. The last method is ours 
which also searches from the root of a materialized view, 
but with the aid of auxiliary data structures. 
To fairly compare with the method proposed by 
Godfrey et al., an XML benchmark available on the 
XMark site is used in the experiment, which has data size 
113.794MB and 1,513,518 nodes. Also, we follow the 
similar query types and comparison ways used in the 
experiments conducted by Godfrey et al. As shown in 
Table 2, there are twelve different types of queries tested 
in the experiment. To contrast with the searching ways as 
shown in Figure 17, the columns as shown in Table 3 are 
1) query types, 2) searching time on the original tree, 3) 
view creation time, 4) searching time using Godfrey et 
al.' method, and 5) searching time using our method. The 
experimental results show that our method performs 
much better than the Godfrey et al.' method in all the 
queries, especially for long-path and twig queries Q3, 
Q7, Q8, Q9, and Q11. This is why our method uses the 
T-Bitmap and index structures to accelerate query 
processing. 
5.2 Effect of query types on different 
methods 
In this experiment, we use the same XML benchmark as 
the first experiment, but different queries as shown in 
Table 4. For the searching axis, Q1 and Q2 are based on 
the parent-child axis, Q3 and Q4 on the ancestor-
descendant axis, and the others on mixed axes. For the 
query types, Q1, Q3, Q5, Q6, and Q7 are path queries, 
whereas the others are twig queries. Furthermore, Q1, 
Q3, Q4, Q5, Q6, Q7, and Q10 are with value predicates. 
In this experiment, as shown in Table 5, our method 
is still the best one among different methods in all the 
queries. Even taking the worst case Q8 as an example, 
our method is 206 times faster than the original way, and 
88 times faster than the Godfrey et al.' method. The 
reason is, as shown in Table 6, the number of nodes 
visited for Q8 in our method is only 1/28 times of the 
original way, and is only 1/10 times of the Godfrey et al.' 
method. 
For the path queries (i.e., Q1, Q3, Q5, Q6, and Q7), 
both execution time of the Godfrey et al.' method and 
ours is less than one second. For the twig queries (i.e., 
Q2, Q8, and Q9), they cost more execution time than the 
path queries since a large amount of nodes are visited. 
However, for the similar twig queries (i.e., Q4 and Q10), 
they do not cost much execution time since only a very 
small amount of nodes are required to visit. In summary, 
the advantages of our method are 1) when dealing with 
twig queries, we only need to check T-Bitmap and skip 
an entire sub-tree if not matched, and 2) when dealing 
with the queries with value predicates, we can use the 
value index tree to achieve efficient processing. 
From begining to end
rootOriginal search
Index search
Godfrey method Our method
 
Figure 17: Search ways in different methods. 
 
312 Informatica 41 (2017) 305–315 Y.-F. Huang et al. 
5.3 Experiments on synthesis documents 
In this experiment, six synthesis documents with 
different fanout are used to analyze our method for three 
different types of queries as shown in Table 7. Q1 is 
based on the parent-child axis, Q2 is based on the 
ancestor-descendant axis, and Q3 is a twig query with 
three predicates and based on mixed axes. As shown in 
Table 8, we find that the execution time of each query 
increases as the fanout increases. Moreover, regardless of 
the complexity in Q3, it still costs almost the same time 
as Q1 and Q2 using our method; i.e., its execution time 
would not increase significantly even if it is a twig query 
with three predicates. However, for Q3 using the original 
way and/or the Godfrey et al.' method, their execution 
time increases seriously as shown in Table 9. Especially 
for the Godfrey et al.' method, the execution time for 
fanout30 is 202 times slower than that for fanout5. 
5.4 Experiments on space vs. time 
In this experiment, we explore the space vs. time issue 
when multiple materialized views are integrated together. 
In order to achieve the premise that more than one query 
can search from a materialized view, we reuse seven 
queries (i.e., Q3, Q4, Q5, Q6, Q7, Q8 and Q10) as shown 
in Table 4. Then, we find that 1) Q3, Q6, and Q7 can 
search from the materialized view built based on Q3, 2) 
Q4 and Q10 can search from the materialized view built 
based on Q4, and 3) Q5 and Q8 can search from the 
materialized view built based on Q8. Thus, after the 
integration, we have three materialized views for these 
seven queries. The data space and execution time 
between no integration and integration are shown in 
Table 10. Taking the group (Q3, Q6, Q7) as an example, 
the overall data space is 3+1+3=7(KB) and the total 
execution time is 3+2+12=17(ms) if each query has its 
own materialized view; however, after the integration, 
the data space is only 3(KB), but the total execution time 
increases to 3+5+26=34(ms). 
In order to explore the relationship between data 
space and execution time for these two strategies (i.e., 
no-integration and integration), we define two terms: 1) 
amount ratio for data space and 2) speed ratio for 
execution time as follows. 
Table 2: Twelve kinds of queries. 
Q1 /site/regions/europe/item/mailbox/mail 
Q2 /site//item/mailbox/mail 
Q3 /site//africa/item/description/parlist/listitem 
Q4 /site//person/profile/interest[@category] 
Q5 /site//person/profile[age]/interest[@category="category620"] 
Q6 /site//person/profile[contains(age,"18")]/education 
Q7 /site//open_auction[@id="open_auction5"]//date 
Q8 /site//category/description[text]/parlist/listitem 
Q9 /site//category/description[text/keyword]/parlist/listitem 
Q10 /site/*/*/item/mailbox/mail 
Q11 /site//*//africa/item/name 
Q12 /site//item/mailbox[count(mail)] 
Table 3: Comparisons among different methods. 
Time(ms) 
Query Original tree 
% View 
creation 
* View non-
indexed 
# View 
indexed 
Q1 193718 30131 64980 5752 
Q2 195146 32147 84574 20599 
Q3 310315 31137 100130 593 
Q4 185361 32890 99890 29436 
Q5 193474 31147 147344 6865 
Q6 191034 29702 65248 9106 
Q7 203447 27112 132522 279 
Q8 212511 28317 60353 232 
Q9 241417 30169 64384 916 
Q10 185357 31603 80727 19811 
Q11 199274 28417 65059 320 
Q12 209115 31731 99283 20687 
  % View creation: Views created for both methods 
  * View non-indexed: Godfrey et al.' method 
  # View indexed: Ours 
 
Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 313 
( )
( )
strategy
space strategy
amount ratio
space original
                         (1) 
( )
( )
strategy
time strategy
speed ratio
time original
                               (2) 
where space(strategy) is the overall data space used in 
the no-integration or integration strategies, time(strategy) 
is the total execution time required in the no-integration 
or integration strategies, space(original) is the data space 
of the original document, and time(original) is the 
execution time on the original document. 
Then, we can use these two terms to judge which 
strategy is better for the system performance as follows. 
integration no integration
no integration integration
amount ratio speed ratio
amount ratio speed ratio


 

 
 
(3) 
or  
( ) ( )
( ) ( )
space integration time no integration
space no integration time integration



  (4) 
For Equation (4), the former term represents that the 
data space benefits for the integration strategy can be 
Table 4: Different kinds of queries. 
Q1 
Q2 
/site/open_auctions/open_auction[@id="open_auction5"]/initial 
/site/open_auctions/open_auction[annotation/author]/bidder/date 
Q3 
Q4 
//site//open_auctions//open_auction[@id="open_auction0"]//current 
//person[@id="person0"][creditcard]//watch 
Q5 
Q6 
Q7 
Q8 
Q9 
Q10 
/site/regions//item[@id="item0"]//mail 
//open_auction[@id="open_auction0"]/bidder/date 
/site/open_auctions/open_auction[@id="open_auction0"]/../end 
/site/regions//item[//text/bold]//location 
//closed_auctions/closed_auction[//description/text]/seller 
//people/person[@id="person0"][//business]/name 
Table 5: Execution time. 
Time(ms) 
Query Original tree 
View 
creation 
View non-
indexed 
View 
indexed 
Q1 31818 14920 193 8 
Q2 68769 15731 37949 1883 
Q3 29718 11787 140 3 
Q4 30989 10056 103 3 
Q5 27976 10705 119 3 
Q6 30123 11723 48 2 
Q7 32680 12135 139 12 
Q8 4344137 222996 1853537 21111 
Q9 1560306 246697 201444 16185 
Q10 28425 10374 70 2 
Table 6: Number of nodes visited. 
Nodes 
Query Original tree 
View non-
indexed 
View 
indexed 
Q1 48005 57 5 
Q2 259604 87331 23794 
Q3 24001 64 3 
Q4 31502 24 8 
Q5 34124 50 3 
Q6 25960 8 5 
Q7 24005 64 3 
Q8 1204343 432651 43502 
Q9 736217 102736 39003 
Q10 25647 24 5 
Table 7: Three kinds of queries. 
Q1 /root/L1/R 
Q2 //root//R 
Q3 /root//L1[//R][//Q][//S] 
 
314 Informatica 41 (2017) 305–315 Y.-F. Huang et al. 
gained (i.e., data space reduced) whereas the latter term 
represents that the execution time benefits for the no-
integration strategy can be gained (i.e., execution time 
reduced). If Equation (4) with the equal weighting 
between space and time is true, the integration strategy 
should be adopted. According to the data taken from 
Table 10, we can calculate the former term and latter 
term in Equation (4), as shown in Table 11. From the 
statistical data, we find that the integration strategy is 
better for group (Q3, Q6, Q7) and group (Q4, Q10), but 
the no-integration strategy is better for group (Q5, Q8). 
Absolutely, different strategies can be adopted for 
different query groups at the same time to make the 
system performance in the best status. 
6 Conclusions 
In this paper, we employ some processing methods on 
XML documents to improve document retrieval. The 
goal is to reduce the number of visited nodes during tree 
traversing, thereby leading to faster query processing. To 
achieve this goal, we focus on the usage of database 
views. First, we build a materialized view from an 
original document for each query. Then, on each 
materialized view, we also build auxiliary structures such 
as T-Bitmap and indexes to further accelerate query 
processing. According to different axes specified in an 
Xpath expression, we have different techniques to handle 
them. Finally, through the experiments, we 1) compare 
the performances among different methods, 2) 
investigate the effect of query types on them, 3) use 
Table 8: Comparisons for different fanout in our method. 
Time(ms) 
 Q1 Q2 Q3 
Fanout5 358 326 392 
Fanout10 571 554 569 
Fanout15 1065 969 1023 
Fanout20 1514 1432 1486 
Fanout25 2164 2023 2137 
Fanout30 2701 2579 2658 
Table 9: Comparisons for different fanout in different methods. 
Q3 Time(ms) 
 
Original tree 
View non-
indexed 
View 
indexed 
Fanout5 143618 419 392 
Fanout10 278615 796 569 
Fanout15 436672 2700 1023 
Fanout20 655059 10815 1486 
Fanout25 678379 38339 2137 
Fanout30 879231 84779 2658 
Table 10: Comparisons between no integration and integration. 
No integration Space(KB) Time(ms) 
Q3 3 3 
Q4 1 3 
Q5 2 3 
Q6 1 2 
Q7 3 12 
Q8 56226 21111 
Q10 1 2 
 
Integration Space(KB) Time(ms) 
(Q3, Q6, Q7) 3 3+5+26 
(Q4, Q10) 1 3+2 
(Q5, Q8) 56226 16491+21111 
Table 11: Comparisons based on Equation (4). 
 (Q3, Q6, Q7) (Q4, Q10) (Q5, Q8) 
Former term 3/7=0.43 1/2=0.5 56226/56228=1 
Latter term 17/34=0.5 5/5=1 21114/37602=0.56 
 
Accelerating XML Query Processing on Views Informatica 41 (2017) 305–315 315 
synthesis documents to analyze our method, and 4) 
address the space vs. time issue if materialized views are 
integrated together. 
References 
[1]  Godfrey P, Gryz J, Hoppe A, et al. Query rewrites 
with views for XML in DB2. In: Ioannidis Y, Lee 
D, Ng R, eds. Proceedings of the IEEE 25th 
International Conference on Data Engineering, 
Shanghai, China, 2009. 1339-1350 
[2]  Murthy R, Banerjee S. XML schemas in Oracle 
XML DB. In: VLDB Endowment, eds. Proceedings 
of the 29th International Conference on Very Large 
Data Bases, Berlin, Germany, 2003. 1009-1018 
[3]  Halevy Y. Answering queries using views: a 
survey. Very Large Data Bases Journal, 2001, 10: 
270-294 
[4]  Jayavel S, Jerry K, Eugene S, et al. Querying XML 
views of relational data. In: VLDB Endowment, 
eds. Proceedings of the 27th International 
Conference on Very Large Data Bases, Rome, Italy, 
2001. 261-270 
[5]  Katsifodimos A, Manolescu I, Vassalos V. 
Materialized view selection for XQuery workloads. 
In: Fuxman A, eds. Proceedings of ACM SIGMOD 
International Conference on Management of Data, 
Scottsdale, Arizona, USA, 2012. 565-576 
[6]  Roantree M, Liu J. A heuristic approach to 
selecting views for materialization. Software: 
Practice and Experience, 2014, 44: 1157-1179 
[7]  Bonifati A, Goodfellow M, Manolescu I, et al. 
Algebraic incremental maintenance of XML views. 
ACM Transactions on Database Systems, 2013, 38: 
14:1-14:45 
[8]  Wu X, Theodoratos D, Wang W H, et al. 
Optimizing XML queries: bitmapped materialized 
views vs. indexes. Information Systems, 2013, 38: 
863-884 
[9]  Wu X, Theodoratos D, Kementsietsidis A. 
Configuring bitmap materialized views for 
optimizing XML queries. World Wide Web, 2015, 
18: 607-632 
[10]  Gosain A, Sabharwal S, Gupta R. Architecture 
based materialized view evolution: a review. 
Procedia Computer Science, 2015, 48:256-262 
[11]  Gosain A, Sachdeva K. A systematic review on 
materialized view selection. In: Satapathy S et al., 
eds. Proceedings of the 5th International 
Conference on Frontiers in Intelligent Computing: 
Theory and Applications. Advances in Intelligent 
Systems and Computing, Vol. 515. Springer, 
Singapore, 2017. 663-671 
[12]  Bruno N, Koudas N, Srivastava D. Holistic twig 
joins: optimal XML pattern matching. In: Franklin 
M J, eds. Proceedings of ACM SIGMOD 
International Conference on Management of Data, 
Madison, WI, USA, 2002. 310-321 
[13]  Jiang H, Wang W, Lu H, et al. Holistic twig joins 
on indexed XML documents. In: VLDB 
Endowment, eds. Proceedings of the 29th 
International Conference on Very Large Data 
Bases, Berlin, Germany, 2003. 273-284 
[14]  Huang Y F, Wang S H. An efficient XML 
processing based on combining T-Bitmap and index 
techniques. In: Biaz S, Bellaachia A, eds. 
Proceedings of the IEEE International Symposium 
on Computers and Communication, Marrakech, 
Morocco, 2008. 858-863 
[15]  Hsu W C, Liao I E, Wu S Y, et al. An efficient 
XML indexing method based on path clustering. In: 
Alhajj R S, eds. Proceedings of the 20th IASTED 
International Conference on Modeling and 
Simulation, Banff, Alberta, Canada, 2009. 339-344 
[16]  Karthiga D, Gunasekaran S. Optimization of query 
processing in XML document using TAR and path 
based indexing. International Journal of Computer 
Science and Network Security, 2013, 13: 119-127 
[17]  Alghamdi N S, Rahayu W, Pardede E. Object-based 
semantic partitioning for XML twig query 
optimization. In: Barolli L et al., eds. Proceedings 
of the IEEE 27th International Conference on 
Advanced Information Networking and 
Applications, Barcelona, Spain, 2013. 846-853 
[18]  Thi Le D X, Maghaydah M, Orgun M A, et al. 
Optimization of XML queries by using semantics in 
XML schemas and the document structure. In: Lin 
X et al., eds. Proceedings of the 14th International 
Conference on Web Information Systems 
Engineering, Nanjing, China, 2013. 343-353 
[19]  Ordonez C. Optimization of linear recursive queries 
in SQL. IEEE Transactions on Knowledge and Data 
Engineering, 2010, 22: 264-277 
[20]  Subramaniam S, Haw S C. ME labeling: a robust 
hybrid scheme for dynamic update in XML 
databases. In: Ismail M, Ramli N, eds. Proceedings 
of the IEEE 2nd International Symposium on 
Telecommunication Technologies, Langkawi, 
Malaysia, 2014. 126-131 
[21]  Belgamwar H C, Dhore S M, Rathod P U, 
Deshmukh S S, Nandanwar G S. Review on storing 
and indexing XML documents upside down. 
International Journal for Engineering Applications 
and Technology, 2015, Manthan-15 
[22]  Ferro N, Silvello G. Descendants, ancestors, 
children and parent: a set-based approach to 
efficiently address XPath primitives. Information 
Processing & Management, 2016, 52:399-429 
[23]  Tudor N L. Query optimization against XML data. 
Studies in Informatics and Control, 2016, 25:173-
180 
[24]  O’Connor M F, Roantree M. Desirable properties 
for XML update mechanisms. In: Proceedings of 
the 2010 EDBT/ICDT Workshops, Lausanne, 
Switzerland, 2010 
[25]  Lu J, Ling T W, Chan C Y, et al. From region 
encoding to extended Dewey: on efficient 
processing of XML twig pattern matching. In: 
VLDB Endowment, eds. Proceedings of the 31st 
International Conference on Very Large Data 
Bases, Trondheim, Norway, 2005. 193-204 
316 Informatica 41 (2017) 305–315 Y.-F. Huang et al. 
 
 Informatica 41 (2017) 317–331 317 
Optimization, Modeling and Simulation of Microclimate and Energy 
Management of the Greenhouse by Modeling the Associated Heating and 
Cooling Systems and Implemented by a Fuzzy Logic Controller using 
Artificial Intelligence 
Didi Faouzi 
Materials and Renewable Energy Research Unit M.R.E.R.U University of Abou-bakr Belkaïd 
B.P. 119, Tlemcen, Algeria  
E-mail: didifouzi19@yahoo.com / didifouzi19@gmail.com 
 
Nacereddine Bibi-Triki 
Materials and Renewable Energy Research Unit M.R.E.R.U University of Abou-bakr Belkaïd 
B.P. 119, Tlemcen, Algeria  
E-mail: n_bibitriki@hotmail.fr 
 
Bentchikou Mohamed 
University of Yahia Fares, Médéa, Laboratory Director (LBMPT), Médéa Algeria  
E-mail: bentchikou.mohamed@univ-medea.dz 
 
Abderrahmane Abène  
Euro-Mediterranean Institute of Environment and Renewable Energies, University of Valenciennes, France 
E-mail: a.abene@yahoo.fr 
 
Keywords: optimization, modelling, simulation, heater systems, intelligent control, fuzzy logic, greenhouse, cooling 
pads  
Received: May 1, 2017 
 
Agricultural greenhouse aims to create a favorable microclimate to the requirements of growth and 
development of culture, from the surrounding weather conditions, produce according to the cropping 
calendars fruits, vegetables and flower species out of season and widely available along the year. It is 
defined by its structural and functional architecture, the quality thermal, mechanical and optical of its 
wall, with its sealing level and the technical and technological accompanying. The greenhouse is a very 
confined environment, where multiple components are exchanged between key stakeholders and them 
factors are light, temperature and relative humidity.  This state of thermal evolution is the level sealing of 
the cover of its physical characteristics to be transparent to solar, absorbent and reflective of infrared 
radiation emitted by the enclosure where the solar radiation trapping effect otherwise called "greenhouse 
effect" and its technical and technological means of air that accompany. New climate driving techniques 
have emerged, including the use of control devices from the classic to the use of artificial intelligence such 
as neural networks and / or fuzzy logic, etc...  As a result, the greenhouse growers prefer these new 
technologies while optimizing the investment in the field to effectively meet the supply and demand of these 
fresh products cheaply and widely available throughout the year. In north Africa, greenhouse cultivation 
is undergoing significant development. To meet an increasingly competitive market and conditioned by 
increasingly stringent quality standards, "Greenhouse" production systems (heating and air-conditioning 
systems) Become considerably sophisticated and then disproportionately expensive. That is why locks who 
want to remain competitive must optimize their investment by controlling production conditions. The aim 
of our work is to model heating and air conditioning systems whose goal of heating and cooling the air 
inside our model and implemented in our application of climate control are due to the fuzzy logic that Has 
the role of optimizing the cost of the energy supplied using MATLAB software. 
Povzetek: V Matlabu je razvit inteligentni sistem/dom za steklenjake v Alžiriji. 
1 Introduction 
Increased demand and requirement of fresh products 
consumers throughout the year, led parallelly to a rapid 
development of agricultural greenhouse, which is today 
modern and quite sophisticated. 
Agricultural greenhouse aims to create a favorable 
microclimate to the requirements of the plant, necessary 
for its growth and development, from the surrounding 
weather conditions. it produces based cropping 
calendars, off-season products, cheap and widely 
available along the year. [8] 
318 Informatica 41 (2017) 317–331 D. Faouzi et al.  
It is defined by its structural and functional 
architecture, the optical quality, thermal and 
mechanical coverage and the accompanying technical 
means. it is considered as a very confined environment 
where many components are exchanged between them, 
and in which  the main factor involved in this medium 
is light, temperature and relative humidity [7-9]. to 
manage the greenhouse microclimate, greenhouse 
growers often use methods such as passive static 
ventilation (opening), shade screens, evaporative 
cooling etc ... and occasionally the active type. these 
methods are less expensive but more difficult to 
manage and optimize [11-14]. 
The first objective is to improve the thermal 
capacity of the greenhouse (greenhouse). 
This is, to characterize the behavior of the complex 
system that is the greenhouse with its various 
compartments (ground, culture, cover, indoor and 
outdoor environment). To develop non-stationary 
mathematical models usable for simulation, 
optimization and the establishment of laws and control 
of simple and effective regulation. 
These models must reproduce the essential 
properties of the mechanisms and interactions between 
different compartments. they must be both specific 
enough to obey the dynamic and real behavior of the 
greenhouse system, and fairly small to be easily 
adaptable to the phases of the simulation. 
Good modulation instructions depending on the 
requirements of the plants to grow under shelter and 
outdoor climatic conditions, result in a more rational 
and efficient use of inputs and equip the best production 
performance. 
The greenhouse climate is modified by artificial 
actuators, thus providing the best conditions in the 
immediate environment of energy costs and it requires 
a controller, which minimizes the power consumption 
while keeping the state variables as close as possible 
optimal harvest.  
In this paper, or using fuzzy logic which is a 
powerful way to optimize and facilitate the global 
management of modern greenhouse, while providing 
through simulation interesting and encouraging which 
results in an optimization of favorable state variable 
values for the growth and development of protected 
cultivation [10-12-13]. 
The greenhouse originally conceived as an 
enclosure bounded by a wall transparent to solar 
radiation, as is the case with the conventional 
greenhouse, which is widely used in our country, 
amplifies certain parameters of the surrounding climate 
and shows conditions that are not favorable to Growth 
and the development of protected crops. This type of 
traditional greenhouses answered fairly well in the 
countries of the Mediterranean basin, is confronted to 
the intense nocturnal cooling, which sometimes results 
in the reversal of the internal temperatures and 
complications of overheating and hygrometric 
variations According to the seasons. Extreme variations 
in these parameters, often observed within shelters, 
constitute a nuisance that can hinder growth and crop 
development and, at best, penalize yield and product 
quality. To meet this equation of supply and demand, 
greenhouse systems have developed over time, thus 
imposing a great mastery of management and 
knowledge to achieve a better production [4]. 
This type of greenhouse, equipped and materialized 
by climate support; Are a means of transforming local 
conditions into an operational microclimate favorable 
to the growth and development of sheltered crops [1]. 
Technological progress has made considerable 
progress in the development of agricultural 
greenhouses. They become very sophisticated (heating 
systems, air conditioning, accessories and 
accompanying technical equipment, control computer 
etc.). 
New climate control techniques have emerged, 
including the use of control devices, ranging from the 
classical to the application of artificial intelligence, now 
known as neural networks and / or fuzzy logic [2].  
The air conditioning of modern greenhouses, 
allows to keep the crops under shelters under conditions 
compatible with the agronomic and economic 
objectives. Serrists opt for competitiveness.  
They must optimize their investments, the cost of 
which is becoming more and more expensive [3].  
The agricultural greenhouse can be profitable 
insofar as its structure is improved, the materials of the 
well chosen walls, depending on the nature and type of 
production, the technical installations and 
accompanying equipment must be judiciously defined.  
Many equipment and accessories have appeared to 
regulate and control the state variables [6]  such as 
temperature relative humidity, CO2 concentration etc ...  
At present the climatic computers of the 
greenhouses, solve the problems of regulation and 
ensure the observance of the climatic constraints 
Required by plants. From now on, the climate computer 
is a tool for dynamic production management, able to 
choose the most appropriate climate route, to meet the 
targets set, while minimizing inputs. 
• Physiological aspect: This relatively complex and 
insufficiently developed field requires total 
management and extensive scientific and 
experimental treatment. This allows us to 
characterize the behavior of the plant during its 
evolution, from growth to final development; This 
allows us to establish an operational  model 
• Technical aspects: The greenhouse system is 
subject to a large number of data, decisions and 
actions to be taken on the immediate climatic 
environment of the plant (temperature, hygrometry, 
CO2 enrichment, misting, etc.). The complexity of 
managing this environment requires an analytical, 
operational, numerical and computer-based 
approach to the system 
• Socio-economic aspect: The social evolution, will 
be legitimated by a demanding and pressing demand 
of fresh products throughout the year; This state of 
affairs, involves all socio-economic operators [5], to 
be part of a scientific, technological and kitchen 
Optimization, Modeling and Simulation of… Informatica 41 (2017) 317–331 319 
dynamics. This dynamic demands high 
professionalism. 
New techniques have emerged, including the use of 
climate control devices in a greenhouse (temperature, 
humidity, CO2 concentration, etc.). Up to the 
exploitation of artificial intelligence That the neural 
networks and / or fuzzy logic. 
The application of artificial intelligence in the 
industry has grown considerably, which is not the case 
in the field of agricultural greenhouses, where its 
application remains timid. It is from this state of affairs 
that we initiate research in this field and carry out a 
modeling based on meteorological data through 
MATLAB Simulink (Didi Faouzi, et al., 2016), to 
finally analyze the thermo - energy behavior of the 
greenhouse microclimate.  
In our work we have modeled greenhouse systems 
(heating and cooling system and optimized the use of 
energy in a greenhouse by a defined intelligent 
controller such as fuzzy logic (FLC) using the method 
of Mamdani (Didi Faouzi , Et al, 2016). 
2 Modeling of the greenhouse 
Our model is parameterized (state variables), meaning 
that spatial heterogeneity is ignored and that the internal 
content of flows at the boundary of the system 
boundary is uniformly distributed [1]. 
The model consists of a set of differential equations 
formulated as follows [4]: 
𝑐𝑎𝑝 ∗
∂T
∂t
= ∑(𝑝𝑢𝑖𝑠𝑠𝑎𝑛𝑐𝑒𝑖𝑛 − 𝑝𝑢𝑖𝑠𝑠𝑎𝑛𝑐𝑒𝑜𝑢𝑡) [W]   (1) 
Where: 
T: Is the temperature of the element under 
consideration (C °). 
𝑐𝑎𝑝 (J K-1): Is its thermal capacity and the incoming 
and outgoing thermal power are expressed in 
watts. 
3 Modeling of heating and cooling 
systems 
3.1 Modeling of heating systems 
The modeled heating system consists of two 
independent heating pipes: one under the canopy (lower 
pipes) and the other in the canopy (upper pipes). 
The heating system located under the canopy; 
Whose pipes are installed beneath the benches and on 
the sides of the walking paths must be dimensioned 
correctly, in order to contribute effectively to the 
internal climate of the greenhouse. 
Due to the importance of this heating system, it was 
modeled with a proportional controller described by 
[8]: 
Q𝑓𝑜𝑢𝑟𝑛𝑖𝑒_𝑡𝑢𝑦𝑎𝑢𝑥 = 𝐴sol ∗ ‖𝐾𝑝_𝑡𝑢𝑦𝑎𝑢𝑥 ∗ (𝑇𝑟𝑒𝑓 − 𝑇𝑎𝑖𝑟)‖0
250
                                         
(2) 
Q𝑓𝑜𝑢𝑟𝑛𝑖𝑒_𝑡𝑢𝑦𝑎𝑢𝑥: : Is the thermal power entering the 
pipes [W]. 
𝐾𝑝_𝑡𝑢𝑦𝑎𝑢𝑥 = 125 [W K
-1 m-2] : Is the constant of 
proportionality. 
𝐴sol:  Is the 
ground surface of the greenhouse [m2]. 
𝑇𝑎𝑖𝑟
 [° C] : Is the parameter controlled and 𝑇𝑟𝑒𝑓  [° C] is 
the desired value of the controlled variable. 
The term in parentheses ‖ ‖0
250 is limited to a value 
between zero and 250 [W m-2]. This limitation is made 
using the "Saturation Simulink®" block, which 
indicates the maximum and minimum power, That the 
generator can supply per square meter. 
The desired temperature 𝑇𝑟𝑒𝑓   is 20 ° C during the 
day period and 18 ° C during the night period. 
3.2 Modeling of cooling systems 
There are three common methods for cooling 
greenhouses: (1) natural ventilation (2) mechanical 
ventilation (3) mist cooling (misting). In our work 
mechanical ventilation is used [8]: 
The control system selected is described by:  
Q𝑜𝑢𝑣𝑒𝑟𝑡𝑢𝑟𝑒 = 𝐴sol ∗ 1 ∗ 10
−3 +
𝐴sol‖𝐾𝑝_𝑜𝑢𝑣𝑒𝑟𝑡𝑢𝑟𝑒(C_H2O𝑎𝑖𝑟 − C_H2O𝑟𝑒𝑓)‖0
1∗10−3
   (3) 
Where : 
Q𝐻2𝑂_𝑏𝑟𝑜𝑢𝑖𝑙𝑙𝑎𝑟𝑑  : Is the airflow through the opening 
[W]. 
𝐴sol [m
2] : Is the ground surface of the greenhouse 
[m2]. 
𝐾𝑝_𝑏𝑟𝑜𝑢𝑖𝑙𝑙𝑎𝑟𝑑  = 1 [m s
-1] : Is the minimum air flow. 
𝐾𝑝_𝑜𝑢𝑣𝑒𝑟𝑡𝑢𝑟𝑒 = 0,5 [m
4 s-1 kg-1] : Is the constant of 
proportionality. 
C_H2O𝑎𝑖𝑟  [Kg m-3] : Is the concentration of water 
vapor in air. And C_H2O𝑟𝑒𝑓   , is the desired 
moisture. 
4 Organization of the model 
Our model was developed using a form of organization 
according to the model proposed by (Jamisson M. Hill, 
2006) [7]. The model of the plant used was set for 
Douglas fir planting. [7] The plants were started at 0.57 
g dry weight and harvested at 1.67 g dry weight; A new 
growing season was recorded at each harvest. 
So after we got a complete list of equations that can 
show the relationships between quantities, it does not 
tell me how these equations need to be solved 
numerically on the computer. And even less, how they 
should be expressed and organized as part of the global 
model software. Mathematical equations must be 
translated into computer code, which, when compiled 
and executed, translates the raw input data into 
meaningful data. 
In our model, each block is defined by three sets of 
variable sets: inputs, state variables that describe the 
state of behavior and output that are directly dependent 
on that state. At each time step, the block may be called 
to execute the following commands [7]: 
1. Initialization / reset of outputs and states. 
2. Calculation of state derivatives. 
320 Informatica 41 (2017) 317–331 D. Faouzi et al.  
3. Integrate the state derivatives to calculate the 
next state. 
4. Calculates outputs according to the current 
state. 
This methodology is robust and simple and can be 
applied to a wide range of processes, particularly those 
involving weighted parameters (mass / energy balance) 
or transfer functions, so it is very sensitive to crop 
models . The main impetus for using Simulink is that 
this methodology is built into the program structure, 
allowing the user to focus on the side of the model 
diagram. 
% GUESS.m 
% Core routine GUESS model 
% GUESS (Greenhouse Use of Energy Seedling 
Simulation) 
% GUESS is a dynamic lumped parameter process 
based model of a Douglas Fir 
% seedling production greenhouse. GUESS models 
the dynamics of  
% photosynthesis, and carbon allocation, climate 
control, and energy use. 
 
t1 = cputime; 
orgpath=path; 
try 
path(orgpath,genpath('Subfunctions')); 
guessinit;              % User Defined Parameters 
guessread;              % Load, and process weather data 
guessmodel;             % Execute Simulink model 
guessoutput;            % Display results 
t2 = cputime; 
    t3 = t2-t1; 
catch err 
path(orgpath); 
rethrow(err); 
end 
path(orgpath); 
fprintf('Model took %4.2f seconds to execute\n', t3); 
5 Modeling of the fuzzy controller 
The fuzzy logic control (FLC) is very robust, it is a 
flexible method that can be easily modified, and can use 
several inputs and outputs. It is much simpler than its 
predecessors (linear algebraic equations), and still very 
fast And less costly to implement. Then the controllers 
by fuzzy logic are very simple and easy to use. This 
method basically consists of three parts: an input, a 
processing part and an output part [2]: 
1) The first part is an input: Indeed, it is represented 
in the membership functions. 
2) The second part is a part of treatment, so-called 
rules of decisions. 
3) The third and final part, is the exit step. The 
controller converts the results into specific values, 
which can be managed by another system. 
One of the first questions to ask when designing a 
Fuzzy Logic Controller (FLC) is: What are my inputs 
and outputs? Once this issue is resolved, the next item 
to deal with is the range of inputs and outputs. When 
we speak of fuzzy sets, this range is called universal 
space [4]. 
An output value controlled by the fuzzy logic 
theoretical (FLC) is developed using the MATLAB 
Simulink software. 
FLC is widely used when modeling the system 
implies that information is scarce and inaccurate, or 
when the system is described by a complex 
mathematical model. An example of this type of 
structure is the agricultural greenhouse and its variables 
such as the internal temperature. This state variable 
influences and activates the dynamic behavior of the 
greenhouse, it is non-linear. The internal temperature is 
one of the important and even main variables in the 
control and modeling of greenhouses. 
In addition, a FLC is efficient to deal with 
continuous functions using the membership function 
(MF) and the IF-THEN rules. In general, a FLC 
contains four parts: fuzzifier, rules of decisions. , Fuzzy 
inference engine and defuzzify. 
First, a set of input data is gathered and converted 
to a fuzzy set using fuzzy linguistic variables, fuzzy 
linguistic terms, and membership functions. This step 
is known as Fuzzification. Then, an inference is made 
on the basis of a set of rules. Finally, the resulting fuzzy 
output is matched to a net output using the MF 
(membership functions) in the defuzzification step. 
Mamdani is method of fuzzy inference. Is the 
method we used and applied in our work to optimize 
the management of the microclimate of our agricultural 
greenhouse model. This method has fuzzy rules of form 
(IF-THEN) that have been used to implement the 
modeling of the fuzzy controller (FLC). 
In many fuzzy applications, membership functions 
(MF) have been arbitrarily chosen as trapezoidal, 
triangular or Gaussian curves depending on the selected 
ranges. 
In our model, the sigmoid membership function is 
considered to define the input and triangular variables 
for the output variables (Figure 2). 
All membership functions are defined on the 
normalized domain [-1, 1] in the discourse universe. 
With eight linguistic values, as shown in Figure 1. 
This figure illustrates the fuzzy sets of membership 
functions that contain seven fuzzy sets. The linguistic 
values of the fuzzy sets used are: 
Very cold (TVCOLD), COLD (TCOLD), 
Uncooked (TCOOL), OK (TGOOD), Low warm 
(TSH), Warm (TH), Very hot (HST) Designed on the 
basis of expert knowledge and in specialized literature. 
We added to our model of the greenhouse an 
intelligent regulator using the fuzzy logic and we chose 
the Mamdani method with a single input, we started by 
first defining the input data and the outputs, and by the 
following has been attempted to link the membership 
functions in a logical manner in order to respond to the 
following steps. The characteristic variables of the 
system to be controlled and the instructions define the 
input variables of the fuzzy controller. The 
characteristic variables are in general the output 
Optimization, Modeling and Simulation of… Informatica 41 (2017) 317–331 321 
variables and, where appropriate, other measures the 
dynamic evolution of the process. The output variables 
of the fuzzy controller are the commands to apply to the 
process. The knowledge base consists of a database and 
a rule base. The database includes: 
• Fuzzy sets associated with the input and output 
variables of the fuzzy controller, 
• Scaling factors (input) (normalization) and output 
(demoralization). 
Then the range of variations (the fuzzy sets) and the 
membership functions for the input and the output were 
defined, and each part of the membership function was 
called by a significant name (Figure 3). 
After defining the membership functions, the 
inference rules have been implemented in such a way 
as to achieve optimum control as desired, for example 
if the climate inside the greenhouse becomes lime the 
regulator will automatically Lowering the temperature 
by closing a heating system or opening a cooling 
system or by any other means and in order to keep the 
required instruction which will be translated by the 
following command [5]: 
1. If (Ti is TVCOLD) then (FOG1FAN1 is 
OFF)(FOG2FAN2 is OFF)(FOG3FAN3 is OFF)(NV is 
OFF)(Heater1 is ON)(Heater2 is ON)(Heater3 is ON) 
(1) 
2. If (Ti is TCOLD) then (FOG1FAN1 is 
OFF)(FOG2FAN2 is OFF)(FOG3FAN3 is OFF)(NV is 
OFF)(Heater1 is ON)(Heater2 is ON)(Heater3 is OFF) 
(1)  
3. If (Ti is TCOOL) then (FOG1FAN1 is 
OFF)(FOG2FAN2 is OFF)(FOG3FAN3 is OFF)(NV is 
OFF)(Heater1 is ON)(Heater2 is OFF)(Heater3 is OFF) 
(1)  
 
 
 
Figure 1:  Creating Input and Output. 
322 Informatica 41 (2017) 317–331 D. Faouzi et al.  
4. If (Ti is TGood) then (FOG1FAN1 is 
OFF)(FOG2FAN2 is OFF)(FOG3FAN3 is OFF)(NV is 
OFF)(Heater1 is OFF)(Heater2 is OFF)(Heater3 is 
OFF) (1)  
5. If (Ti is TSH) then (FOG1FAN1 is 
OFF)(FOG2FAN2 is OFF)(FOG3FAN3 is OFF)(NV is 
ON)(Heater1 is OFF)(Heater2 is OFF)(Heater3 is OFF) 
(1)  
6. If (Ti is TH) then (FOG1FAN1 is 
ON)(FOG2FAN2 is OFF)(FOG3FAN3 is OFF)(NV is 
OFF)(Heater1 is OFF)(Heater2 is OFF)(Heater3 is 
OFF) (1)  
7. If (Ti is TVH) then (FOG1FAN1 is 
ON)(FOG2FAN2 is ON)(FOG3FAN3 is OFF)(NV is 
OFF)(Heater1 is OFF)(Heater2 is OFF)(Heater3 is 
OFF) (1)  
8. If (Ti is TEH) then (FOG1FAN1 is 
ON)(FOG2FAN2 is ON)(FOG3FAN3 is ON)(NV is 
OFF)(Heater1 is OFF)(Heater2 is OFF)(Heater3 is 
OFF) (1)  
 
Figure 2:  Membership function of the command. 
 
Figure 3.  Membership functions for input and output variables. 
Optimization, Modeling and Simulation of… Informatica 41 (2017) 317–331 323 
Ti :  Indoor temperature. 
TVCOLD : Temperature very cold. 
TCOLD : Temperature cold. 
TCOOL : Temperature is cool. 
TSH : Temperature increases slowly. 
TH : Temperature is hot. 
TVH : Temperature is very hot. 
HE :  super hot temperature. 
The explanation of the previous decision rules is as 
follows: 
1) If the temperature (TVCOL) inside the 
greenhouse is very cold, then it is lower than the 
set temperature, the fuzzy controller sends a 
signal to automatically trigger all mechanical 
ventilation and ventilation systems and gives the 
order of " Open all heating systems". 
2) If the temperature (TCOLD) inside the 
greenhouse is cold , then it is therefore somewhat 
below the set temperature then the fuzzy 
controller sends a signal to automatically trigger 
all mechanical cooling and ventilation systems 
and gives an order for " Open a single heating 
system". 
3) If the temperature inside the greenhouse is cool 
therefore the controller gives the same previous 
order. 
4) If the temperature (TSH) inside the greenhouse 
increases slowly, then the controller gives the 
order to stop all heating, cooling and mechanical 
ventilation systems and leave the operation of the 
ventilation natural. 
5) If the temperature (TH) inside the greenhouse is 
hot, then it is slightly higher than the temperature 
of the set point, then the fuzzy controller sends a 
signal to automatically trigger all the cooling, 
heating and ventilation systems and gives Order 
to "open mechanical ventilation system" (forced). 
6) If the temperature (HVT) inside the greenhouse is 
very hot, it is higher than the set point 
temperature, the fuzzy controller sends a signal to 
automatically trigger all heating systems and 
allow the operation of two mechanical ventilation 
systems. 
7) If the temperature (TEH) inside the greenhouse is 
super hot, it is therefore superior to the set point 
temperature. Then the blurred controller sends a 
signal to automatically open all mechanical 
cooling and ventilation systems and stop " 
Operation of heating systems" and natural 
ventilation mechanical ventilation. 
We save the file (.fis) to load it into the workspace and 
retrieve it in the Simulink Fuzzy block under the same 
name of the saved file. 
The simulation of our system was done by 
MATLAB SIMULINK. The results of the MATLAB / 
SIMULINK software indicate the high capacity of the 
proposed technique to control the internal temperature 
of the greenhouse even in the event of a rapid change 
of atmospheric conditions. The modeling of the system 
Is defined in the form of this block diagram introduced 
in our Simulink shown in Figure (4 and 5). Its goal is to 
achieve the set temperature of 20 ° C required by the 
internal environment of our greenhouse. Indeed, by 
varying the ranges of inferences, the efficiency of the 
regulator has been increased around this set point. It 
would also be possible to modify the inference rules or 
the forms of the membership functions used. 
For the validation of our model we used the full 
windows version of MATLAB Simulink R2012b 
(8.0.0.783), 64bit (win64). The simulation was 
performed on a TOSHIBA laptop. The latter is 
equipped with a 700 GB hard drive, and 5 GB of RAM. 
Simulink parts of the model were performed in 
"Accelerator" mode which first generated a compact 
representation of C code in the diagram. 
Then compiled and executed. Simulink diagrams 
are obtained in the form of a sub-model integrated into 
blocks, thus decomposing the global model.  
Our model is validated independently. Simulink 
diagrams resulting from the implementation and 
validation of our greenhouse model and its systems 
(heating, ventilation, cooling and misting, etc.) are 
shown in the figures below. 
The informatics code (program) that gives the order 
to start the simulation for the figures below is as 
follows: 
% guessread 
% Didi Faouzi 
% 1-1-2015 -- 5-24-2016 
% Opens, reads, processes, and interpolates (per 
minute basis) hourly 
% weather data from a text file and converts it into 
weather vectors for 
% use by the Simulink model.  Also calculates vapor 
pressures, and 
% Weather data must be in tab, space, or comma 
delimited text format  
% And MUST include in the following columns in this 
order 
% date column 
% time column 
% Running time in days from first data point 
% Outdoor temperature F or C 
% Outdoor Relative Humidity % of VP Saturation 
% Dew point temperature 
% Wind Direction in degrees/radians E of N(azimuth) 
% Wind Speed in mph or m/s 
% Solar radiation in Langley s/hr or W/m2 
%  
% start cool = column with running time in hrs, min 
or seconds 
% start row = column at end of header 
% NOTE:--------- 
% This M-File cannot be executed standalone, and 
must be called 
% from within the main GUESS module.  To run 
GUESS, type GUESS in the  
% command window prompt, and press <enter>. 
324 Informatica 41 (2017) 317–331 D. Faouzi et al.  
% -------------------------------------------------------------
 
Figure 4: Schema Simulink represents our Fuzzy Logic Controller. 
Optimization, Modeling and Simulation of… Informatica 41 (2017) 317–331 325 
-- 
% last modified 4-30-06 
%----------- 
% Read data 
%----------- 
%try                     % Look for errors 
tic                      % Start timer 
warning off 
fprintf('\n Weather Data File Processing\n') 
fprintf('Reading File......'); 
W = dim read(Settings. file. file path, Settings. file. 
delimiter, ... 
    Settings. file. start row, Settings. file. start cool); 
switch Settings. sim. frequency 
    case 'hourly' 
        Hours = W(:,1);              % Convert to hourly data 
set  
    case '15min' 
        Quarters = W(:,1); 
    %case '5min' 
    %case '2min' 
    case '1min' 
        Minutes = W(:,1); 
    otherwise     
        error('\n Invalid time step \n'); 
    return 
end 
Date Raw     = W(:,2);            % Days elapsed since 
start of growing season 
Temp Raw     = W(:,3);            % in deg C or F 
Reel H Raw     = W(:,4);            % rel. Humidity % 
%Dew P Raw     = W(:,5);           % Dew point Temp 
C or F; eliminate dew point 
WindDirRaw1 = W(:,6);            % Wind Direction in 
degrees from South, azimuth 
Wind Raw     = W(:,7);            % Wind Speed 
 
Figure 5.  Simulink representation of the heating system model. 
 
Figure 6: Simulink Representation of the Ventilation and Cooling Systems Model. 
326 Informatica 41 (2017) 317–331 D. Faouzi et al.  
Solar Raw    = W(:,8);            % Solar insulation in 
Langley's or watts/m^2 
fprintf('DONE\n') 
% ---------------- 
% Unit Conversion  
% ---------------- 
% --------------- 
% Convert date into metric for ease of calculations 
% --------------- 
if units. temp == 'F' 
    Temp_C  = conv Fto C(Temp Raw); 
%    Dew P_C  = convFtoC(Dew P Raw); 
else if units. temp == 'C'     
    Temp_C = Temp Raw; 
%    Dew P_C = convFtoC(Dew P Raw); 
else 
    error ('Invalid temperature units'); 
end 
switch units. wind 
    case 'mph' 
        Wind Raw = convmph2mps (Wind Raw); 
    case 'mps' 
        % do nothing 
    otherwise 
        error ('Invalid wind speed units'); 
end 
    %Wind Raw = Power Law Wind Conversion 
switch units. solar 
    case 'Wm2' 
        % do nothing 
    case 'ly' 
        Solar Old = Solar Raw; 
        Solar Raw = convLyhr2Wm2 (Solar Raw); 
    case 'Btuf2h' 
        Solar Raw = convBtuhrft2toWm2(Solar Raw); 
    otherwise 
        error('Invalid solar units'); 
end           
switch units.rel H 
    case '%' 
        Reel H Raw = Reel H Raw / 100; 
    case '1'% do nothing 
    otherwise 
        error('Invalid humidity unit'); 
end 
switch units. wind dir 
    case 'deg' 
        Wind Dir Raw = WindDirRaw1 / 180*pi(); 
    case 'read' %do nothing 
    otherwise 
        error('Invalid wind direction unit'); 
end  
%------------------------- 
% Check sampling rate 
%------------------------- 
switch Settings. sim. frequency               
    case 'hourly'             % 1hr. sampling rate 
         Time Raw = Hours .*(60/Settings.sim.timestep); 
    case '15min'              % 15 min. sampling rate   
         Time Raw = Quarters 
.*(15/Settings.sim.timestep); 
    case '1min'               % 1 min. sampling rate 
         Time Raw = Min;  
end 
EO Time = Time Raw(length(Time Raw)); 
% PAR Conversion 
Light Raw     = conv2PAR('sunlight', Solar Raw, 
'W/m2');  
fprintf('Vapor Pressure, Wet bulb Calculations ......'); 
% Calculate humidity measurements: vapor pressure 
or humidity ratios 
% Calculate Saturation Vapor Pressures 
Sat VP Raw    = p sat(Temp_C);                  % Saturation 
Vapor Pressures 
% modify Sat VP to use Teten's formula (will run 
faster) 
% modify all to run with arrays; add *. and ./ 
VP Raw       = Reel H Raw .* Sat VP Raw;      % Vapor 
Pressures 
Hum Raw  = hum ratio (location. pressure, VP Raw); 
% humidity ratio (mass H2O/mass air) 
Wet Bulb Raw   = (wet bulb (Temp_C, Reel H Raw, 
location. pressure))'; 
Wet Bulb VP    = p sat (Wet Bulb Raw); 
Wet Bulb Hum Raw = hum ratio (location. pressure, 
Wet Bulb VP); 
fprintf('DONE\n'); 
fprintf('Wind Pressure Calculations ......'); 
% Calculate Wind Incidence Angle and Natural 
Ventilation Pressure Coeff. 
IncidenceAngleRaw1 = Wind Dir Raw - Greenhouse. 
Azimuth;   % Inlet 1 
IncidenceAngleRaw2 = Wind Dir Raw + Greenhouse. 
Azimuth;   % Inlet 2 
% Coefficient of Pressures 
CpRaw2 = ones(length(IncidenceAngleRaw1),1); 
CpRaw1 = ones(length(IncidenceAngleRaw1),1); 
for X = 1:length(IncidenceAngleRaw1) 
    CpRaw1(X) = calc Wind Press 
Coeff(IncidenceAngleRaw1(X)); 
    CpRaw2(X) = calc Wind Press   
Coeff(IncidenceAngleRaw2(X)); 
end 
% Ventilation Rate 
Wind Factor Raw = abs(CpRaw1 - 
CpRaw2)./(sort(abs(CpRaw1 - CpRaw2)));  
U Nat V Raw = Ventilation. Natural. CD is charge .* 
Wind Factor Raw .* Wind Raw;    % Ventilation Rate 
% ----------------------------------- 
% Calculate Wind Pressures in Pascals 
% Find Wind speed at eave height 
Wind Raw = WS Convert(Wind Raw, Settings. 
Climate. Wind. measured height, ... 
            Ventilation. Natural. Height, Settings. 
Climate. Wind. exponent); 
% Wind Pressure Inlet 1 
WindPressure1 = 0.5*Air D(Temp_C, location. 
pressure)*WindRaw.^2* CpRaw1; 
% Wind Pressure Inlet 2 
WindPressure2 =  0.5*Air D(Temp_C, location. 
pressure)*WindRaw.^2* CpRaw2; 
Wind Pressure Raw = WindPressure1 - 
Optimization, Modeling and Simulation of… Informatica 41 (2017) 317–331 327 
WindPressure2;  %Flow driven by pressure diff. 
fprintf('DONE\n'); 
% 
fprintf('Solar Radiation Calculations ......'); 
  
% --- Solar Altitude & Clearness Index --- 
% Solar Time 
Hrs = Time Raw/(60/Settings.sim.timestep) + 
Settings. sim. time lag; 
Clock time = mod(Hrs, 24); 
Day Raw = 1+ Hrs ./ 24; 
DayRaw2 = floor (Clock time ./ 24); 
Hr Angle = Hour Angle Correct(Clock time, Day 
Raw,... 
    location. long - location. std long); 
Declination = declination (Day Raw); 
% Solar Altitude 
Altitude = solar altitude (Declination, location.lat, Hr 
Angle); 
Altitude(Altitude < 0) = 0;    % can use -6 deg for civil 
twilight 
Altitude = (pi/180)* Altitude; % convert to radians 
% Clearness Index 
ET Solar = sin(Altitude) * Properties. Solar Constant;  
% Calculate ET Radiation 
K Index(1:length(Solar Raw),1)= 0.8; 
%K Index = Solar Raw./ET Solar;    % clearness index 
for i = 1:length(Time Raw)       % Correct for div 
by/zero  
    if Solar Raw(i) > ET Solar(i) % Can't have clearness 
index > 1          
        K  Index(i,1) = 1; 
    else if ET Solar(i) == 0 && i > 2      % Maintain 
previous value 
        K Index(i,1) = K Index(i-1,1);     % throughout 
the night  
    else 
        K Index(i,1) = Solar Raw(i)./ET Solar(i);    
%clearness index 
    end 
end 
% Diffuse vs. Direct 
[Diffuse Raw Direct Raw] = Split Beam(Solar Raw, 
K Index); %Rad. splitting 
fract Diff Raw = f Diffuse(K Index); 
% f Direct  = 1 - f Diffuse; 
fprintf('DONE\n'); 
%--------------------- 
% Long wave Sky Balance 
%--------------------- 
fprintf('Long wave Calculations ......'); 
e_sky = e Sky B(Temp Raw, VP Raw, K Index);  
T Sky Raw =  e_sky.^(1/4) .* Temp Raw; 
fprintf('DONE\n'); 
fprintf('Interpolating ......'); 
%------------------ 
% Interpolate to 1 minute time step for simulation 
%------------------ 
Settings. sim. Max time = EO Time; 
Time     = ((1:EOTime).'); 
Date     = interp1(Time Raw, Date Raw, Time, 'linear'); 
Temp     = interp1(Time Raw, Temp_C, Time, 
method); 
Reel H     = interp1(Time Raw, Reel H Raw, Time, 
method); 
%  Dew P     = interp1(Time Raw, Dew P_C, Time, 
method); 
Wind     = interp1(Time Raw, Wind Raw, Time, 
method); 
Solar    = interp1(Time Raw, Solar Raw, Time, 
method); 
U Nat V    = interp1(Time Raw, U Nat V Raw, Time, 
method); 
Wind Fact = interp1(Time Raw, Wind Factor Raw, 
Time, method); 
Diffuse  = interp1(Time Raw, Diffuse Raw, Time, 
methods); 
fract Diffuse = Diffuse ./Solar; 
fract Diffuse(~is finit(fract Diffuse))= 0; 
%Direct   = interp1(Time Raw, Direct Raw, Time, 
method); 
Sat VP    = interp1(Time Raw, Sat  V Raw, Time, 
method); 
Humidity = interp1(Time Raw, Hum Raw, Time, 
method); 
Wet bulbs    = interp1(Time Raw, Wet Bulb Raw, 
Time, method); 
WB Humidity  = interp1(Time Raw, Wet Bulb Hum 
Raw, Time, method); 
VP          = interp1(Time Raw, VP Raw, Time, 
method); 
Light       = interp1(Time Raw, Light Raw, Time, 
method);   % Light in PAR 
Angle       = interp1(Time Raw, Altitude, Time, 
method); 
T Sky        = interp1(Time Raw, T Sky Raw, Time, 
method); 
%--- Wind and Natural Ventilation  --- 
Wind Dir  = interp1(Time Raw, Wind Dir Raw, Time, 
method); 
Wind Pressure = interp1(Time Raw, Wind Pressure 
Raw, Time, method); 
fprintf('DONE\n'); 
fprintf('Generating Lookup Tables ......\n'); 
fprintf('Saturation Humidity ......'); 
%--- Saturation Vapor Look up Table --- 
P Sat Look Up Table. T = 0:0.5:55; 
P Sat Look Up Table. P = p sat(P Sat Look Up Table. 
T); 
%--- dP Sat Look Up Table --- 
dP Sat Look Up Table. T = 0:0.5:55; 
dP Sat Look Up Table. P = dp sat(dP Sat Look Up 
Table. T); 
%--- Humidity at Wet Bulb Table --- 
equiv WB Look Up Table. T = linspace(0,50,100); 
equiv WB Look Up Table .H = linspace(0,1,100); 
[T2 H] = mesh grid(equiv WB Look Up Table. T, 
equiv WB Look Up Table .H); 
equiv WB Look Up Table. Values = equiv WB 
humidity(T2, H/100, location. pressure); 
%--- Plant Stuff --- 
Init Plant; 
328 Informatica 41 (2017) 317–331 D. Faouzi et al.  
fprintf('DONE\n'); 
fprintf('Packing and Cleanup ......'); 
%--------------- 
% Create Weather structure in format wanted by 
Simulink 
% and organize into structure for ease of packaging 
% Simulink Format 
% Signal = [time step data];  use column vectors for 
both. 
% ---------------- 
Weather. Temp         = [Time Temp]; 
Weather. Reel H         = [Time Reel H]; 
Weather. Solar        = [Time Solar]; 
Weather. Wind         = [Time Wind]; 
Weather. Wind Fact     = [Time Wind Fact]; 
Weather. U Nat V        = [Time U Nat V]; 
Weather. Sat VP        = [Time Sat VP]; 
Weather. Humidity     = [Time Humidity]; 
Weather. Wet Bulb      = [Time Wet bulbs]; 
Weather. VP           = [Time VP]; 
Weather. Wind Dir      = [Time Wind Dir]; 
Weather. WB Humidity   = [Time WB Humidity]; 
Weather. Wind P        = [Time Wind Pressure]; 
Weather. Angle        = [Time Angle]; 
Weather. fract Diffuse = [Time fract Diffuse]; 
Weather. T Sky         = [Time T Sky]; 
%Weather. Direct      = [Time Direct]; 
%---------------- 
% Create timer object to override Simulink built-in 
clock for output 
% graphing and scoping 
% ------------------ 
%-------------- 
% Cleanup 
% Clear unneeded data 
%-------------- 
clear W Date Raw Reel H Raw  Dew P Raw  T_C Cp 
Raw VP Light Solar Old Temp Raw 
clear Hours Quarters Minutes  Light Reel H Sat VP 
Humidity Wet bulbs Wind 
clear Wind Humidity Dew P Dew P_C Incidence 
Angle  Hum Raw Cp Wet bulb 
clear Wind Dir WindDirRaw1  VP Raw Sat VP Raw 
VP Solar Old Wind Pressure Solar Raw 
clear Incidence Angle Raw Wet Bulb Raw Temp  Wet 
Bulb Hum Raw WB Humidity  Angle 
clear Diffuse Raw Direct Raw  Direct CpRaw1 
CpRaw2 Day Raw Altitude 
clear DayRaw2 WB Humidity Time Raw Temp_C 
Time Fill  K Index pack; 
fprintf('DONE\n\n'); 
disp('Ready for simulation!'); to warning on 
%catch 
%    disp('Corrupt/Invalid Weather data file OR'); 
%    disp('guessread is not a standalone m-file, run 
guessinit first'); 
%end. 
% Plant Growth Diagram  
figure(5) 
% Height Graph, subplot 1 
subplot(2,2,1) 
plot(Guess Output. date, Guess Output. Plant. 
Height); 
x label('Day'); 
y label('Height (cm)'); 
subplot(2,2,2) 
plot(Guess Output. date, Guess Output .Plant. Diam); 
x label('Day'); 
y label('Stem Diameter (mm)'); 
subplot(2,2,3) 
plot(Guess Output. date, Guess Output. Plant. 
Biomass); 
x label('Day'); 
y label('Total Dry Biomass (g)'); 
subplot(2,2,4) 
plot(Guess Output. date, Guess Output. Plant. Crops); 
x label('Day'); 
y label('Crops Harvested(#)'); 
h = top title('Plant Growth Characteristics'); 
set(h, 'Font Size', 14); 
% Indoor Temperature Distribution 
figure 
plot(Guess Output. date, Guess Output. temp) 
title('Temperatures'); 
x label('Day'); 
y label(axis); 
legend('outdoor',' indoor'); 
figure 
hold on 
hits (Guess Output. temp(:,2)); 
title('Indoor Temperature Distribution'); 
x label(axis); 
y label('freq.'); 
mean temp = mean(Guess Output. temp(:,2)); 
disp(sprint('Mean Indoor Temperature: %3.2f', mean 
temp)); 
s tdev = std(Guess Output. temp(:,2)); 
disp(sprint('Standard Deviation Indoor Temperature: 
%3.2f', stdev)); 
% Costs Diagram 
figure(2) 
plot(GuessOutput.date,[GuessOutput.Costs.total, 
Guess Output.Costs.gas,...  
 GuessOutput.Costs.electricity, 
GuessOutput.Costs.water]) 
legend('Total', 'Natural Gas', 'Electricity', 'Water', 
'Orientation', ... 'Horizontal', 'Location', 'Best') 
h = title('Energy Costs'); 
x label('Day') 
y label('Cost ($)') 
set(h, 'Font Size', 14) 
% Quantities Diagram 
figure(4) 
plot(GuessOutput.date,[GuessOutput.Quants.gas*Fue
l Converter,... 
GuessOutput.Quants.electricity, 
GuessOutput.Quants.water]) 
legend('Natural Gas(ft^3)', 'Electricity(kWh)', 
'Water(gal)', 'Orientation', ... 
       'Horizontal', 'Location', 'Best') 
   x label('Day') 
   y label('Energy Quantity') 
Optimization, Modeling and Simulation of… Informatica 41 (2017) 317–331 329 
   h = title('Energy Quantities'); 
   set(h, 'Font Size', 14). 
6 Simulation results  
6.1 Discussions on figures 
The simulation results demonstrate the capabilities and 
performance of the intelligent controller, as well as the 
robustness of the fuzzy control. They illustrate in FIG 
(7) the stability of the variation of internal temperatures 
during the day and at night ranging from 15 ° C. to 25 ° 
C. and a relative humidity ranging from 50 % To 80% 
in the greenhouses of the two regions of Dar El Beida 
(wetland) and Biskra (arid zone). The temperature 
gradient due to the greenhouse effect recorded between 
the interior and the external environment was positive 
and varied from + 2 ° C to + 14 ° C. It should be noted 
that the external moisture content of the Dar El Beida 
zone, varying from 50% to 95%, was higher than that 
of Biskra, which varies from 35% to 80%. Temperature 
disturbances were recorded during the fifth season from 
 
Figure 7. The evolution of the indoor and outdoor temperature in the form of scoop. 
 
Figure 8.  Evolution of indoor and outdoor moisture in the form of scoop. 
330 Informatica 41 (2017) 317–331 D. Faouzi et al.  
mid-autumn to early winter, between 20 November and 
20 January of the year. This was reflected in peaks in 
water temperatures and saturations due mainly to heat 
losses and signal noise caused by repeated cyclic 
activation and deactivation of heating and ventilation 
systems and equipment. On the whole, the application 
of the fuzzy method presents rather satisfactory results. 
The need to improve the thermal insulation of the 
greenhouse is essential to reduce the heat losses that 
generally occur during this cold season in the 
greenhouse. Improving the greenhouse effect means 
stabilizing the microclimate and reducing the number 
of activations and deactivations of the accompanying 
systems that partly pose the problem of climate 
management. 
In the figure (8), the moisture content within the 
greenhouse in the two wetlands and arid areas generally 
remains close to the optimum except in summer, where 
the rate of Humidity sometimes falls below the 
minimum threshold due to the ventilation necessary to 
ventilate, compensate and regulate the internal 
temperature. 
7 Conclusion 
The research work was initiated by a rich and 
interesting bibliographical study, which allowed us to 
discover this area of current affairs. A description of the 
types and models of agricultural greenhouses has been 
developed. Thermo hydric interactions, which occur 
within the greenhouse have been approached. The 
biophysical and physiological state of the plants 
through photosynthesis, respiration and 
evapotranspiration were exposed while taking into 
account their influences on the immediate environment 
and the mode of air conditioning. The models of 
regulation and climatic control have been approached 
from the use of conventional equipment to the use of 
artificial intelligence and / or fuzzy logic. Knowledge 
models and computer techniques have been established 
with a well-defined approach and hierarchy for optimal 
climate management of greenhouse systems, while 
naturally adopting the Mamdani method. 
The aim is to develop a robust and robust air-
conditioning control technology to deal with 
disturbances that may occur in the external and internal 
environments of the greenhouse system 
Our assignments are: 
1) To define the functioning of the complex system of 
the greenhouse with its various components 
(culture-immediate environment) using control 
models that optimally regulate the climate inside 
the greenhouse and reproduce the essential Of the 
properties, mechanisms and interactions between 
culture and its environment 
2) To solve the problem of follow-up by the 
modeling, design and development of an intelligent 
controller able to regulate the system by following 
a desired reference trajectory and a hierarchy of 
control and regulation of the couplings between 
The different inputs and / or outputs 
In this work we develop a climatic control by the 
Mamdani method based on multi-variable models in 
line for an adaptive neural model structure. The 
influence of random perturbations on system 
performance and optimization must be supported. 
We have conceptualized, modeled, configured and 
developed an intelligent controller by fuzzy logic based 
on the Mamdani method. The simulation of the optimal 
climate management of the indoor environment of the 
agricultural greenhouse was carried out for two 
different regions, one wet, and this is Dar El Beida in 
ALGER; The other arid and concerns the region of 
BISKRA. 
The simulation results highlight the capabilities 
and performance of the intelligent controller, as well as 
the robustness of the fuzzy control. They also illustrate 
the intelligibility of this control for optimal 
management during the four production seasons. We 
have also noted deficiencies that have arisen during the 
operation of the system during the fifth season, from 
mid-autumn to early winter (from 20 November to 20 
January) .This is mainly due to heat losses and 
Signaling caused by repeated cyclic activation and 
deactivation of air conditioning equipment (heating and 
ventilation). This is because these state variables are 
highly correlated and influenced by the external 
environment, Solar radiation in the visible and infrared 
and the physiological response of the culture, a reaction 
quite natural and implicit. 
The use of conventional controllers, the 
configuration of which no longer meets our 
expectations, nor the optimum climate regulation, but 
rather to some extent to the artificial intelligence 
technique, easily handled, to the serrists Characterized 
by its reliability and robustness in optimal climate 
management 
The advantages of fuzzy control must be listed and 
treated without ignoring the conventional approaches of 
the classical automatic. All this involves the search for 
a compromise between complexity, human experience, 
systems mastery, model realism, configuration mode, 
and the robustness of the control method for predictive 
performance. 
It should be noted that building robust models and 
practical, robust control methods are not limited by 
computer and / or digital tools, but rather by our 
knowledge and control of the dynamics of the 
ecosystem and its impact on the environment. Optimal 
climatic management of the greenhouse system. This 
implies that it is rather limited by the nature and quality 
of information on the ecosystem and its environment 
and by the faithful reproduction of this information for 
decision-making. 
We remain optimistic in the near future, as regards 
the use of artificial intelligence technique and its fuzzy 
logic branch, which is indicated by: 
1) Optimum climate control and regulation. 
2) The operating efficiency of the energy reserve 
due to the greenhouse effect. 
Optimization, Modeling and Simulation of… Informatica 41 (2017) 317–331 331 
3) Optimal management of the energy input 
necessary for the operation of the 
accompanying systems and equipment. 
4) Better productivity of sheltered crops. 
5) A significant decrease in human intervention. 
In the same way, it is necessary to point out the 
insufficiencies which may arise in the application of the 
fuzzy logic method and which are due at the moment to 
a misunderstanding or insufficient information of the 
ecosystems and their environments, Signal noise 
problems, the robustness of the fuzzy control, the peaks 
of the dominant variables; But for the moment all these 
interference problems and deficiencies can be solved by 
a realistic approach of the system. 
As for the prospects, they are numerous. The 
application of Artificial Intelligence was reserved 
mainly in the fields of industry, robotics and especially 
in the Agri-food industry, whereas it can intervene in 
the management of several systems and processes not 
or can be tackled up to this day. We propose in our field 
to promote these techniques to multiply the harvest 
seasons and to make the exploitation of our agricultural 
land profitable. 
8 References 
[1] Didi Faouzi, N. Bibi Triki and A. Chermitti, 2016. 
Optimizing the greenhouse micro-climate 
management by the introduction of artificial 
intelligence using fuzzy logic. Int. J. Computer 
Eng. Technology, 7: 78-92 , Volume 7, Issue 3, 
May-June 2016, pp. 78–92, Article ID: 
IJCET_07_03_007. 
[2] Didi Faouzi, N. Bibi-Triki, B. Draoui, A. Abène, 
2016 , Modeling, Simulation and Optimization of- 
agricultural greenhouse microclimate by the 
application of-artificial intelligence and/or fuzzy 
logic, International journal of scientific & 
engineering research, volume 7, issue 8, august-
2016 issn 2229-5518. 
[3] Didi Faouzi, N. Bibi-Triki, B. Draoui, A. Abène, 
2016 Comparison of modeling and simulation 
results management micro climate of the 
greenhouse  by fuzzy logic between a wetland and 
arid region, International Journal of 
Multidisciplinary Research and Modern 
Education (IJMRME) ISSN (Online): 2454 - 
6119, Volume II, Issue II, 2016 . 
[4] Didi Faouzi, N. Bibi-Triki, B. Draoui, A. Abène, 
2016, Modeling and Simulation of Fuzzy Logic  
Controller for the purpose of Optimizing the 
Management Micro Climate of the Agricultural 
Greenhouse, MAYFEB Journal of Agricultural 
Science Vol 2 (2016). 
[5] Didi Faouzi  , N. Bibi-Triki  , B. Draoui , A. 
Abène, 2017,  Greenhouse Environmental Control 
Using Optimized, Modeled and Simulated Fuzzy 
Logic Controller Technique in MATLAB 
SIMULINK, Computer Technology and 
Application 7 (2016) 273-286, doi: 
10.17265/1934-7332/2016.06.002. 
[6] Didi Faouzi, N. Bibi-Triki, B. Draoui, A. Abène. 
Dated 10th March 2017, The Optimal 
Management of the Micro Climate of the 
Agricultural Greenhouse through the Modeling of 
a Fuzzy Logic Controller, International 
Knowledge Press, Journal of Global Agriculture 
and Ecology (JOGAE), 7(1): 1-15, 2017, ISSN: 
2454-4205, Ref. No. IKP/JOGAE/17/0102. 
[7] Jamisson M.Hill, dynamic modeling of tree 
growth and energy use in a nursery greenhouse 
using MTLAB and Simulink, Cornell University, 
7/31/2006. 
[8] Marco Binotto (May 2014), "Greenhouse climate 
model an aid to estimate the influence of 
supplemental lighting on greenhouse climate", 
School of Science and Engineering at Reykjavík 
University. 
[9] Babuska, R., & Mamdani, E. H. (2008). Fuzzy 
Control. 
http://www.scholarpedia.org/article/Fuzzy_contr
ol. 
[10] [10] Breemen, A. v., & Vries, T. d. (2000). An 
Agent-Based Framework for Designing Multi- 
Controller Systems. Paper presented at the 
Proceedings of the Fifth International Conference 
on The Practical Applications of Intelligent 
Agents and Multi-Agent Technology, 
Manchester, U.K. 
[11] [11] Tan, V., Yoo, D.-S., & Yi, M.-J. (2008a). A 
Multiagent-System Framework for Hierarchical 
Control and Monitoring of Complex Process 
Control Systems. Paper presented at the 
Proceedings of the 11th Pacific Rim International 
Conference on Multi-Agents: Intelligent Agents 
and Multi-Agent Systems, Hanoi, Vietnam. 
[12] [12] Choi, J., Oh, S., & Horowitz, R. (2009). 
Distributed learning and cooperative control for 
multi-agent systems. Automatica, 45(12), 2802-
2814. doi: 10.1016/j.automatica.2009.09.025. 
[13] [13] McArthur, S. D. J., Davidson, E. M., 
Catterson, V. M., Dimeas, A. L., Hatziargyriou, 
N. D., Ponci, F., & Funabashi, T. (2007). Multi-
Agent Systems for Power Engineering 
Applications-Part I: Concepts, Approaches, and 
Technical Challenges. 22, 1743- 1752.doi: 
10.1109/tpwrs.2007.908471. 
[14] [14] Kelly, I. D., & Keating, D. A. (1998). Faster 
learning of control parameters through sharing 
experiences of autonomous mobile robots. 
International Journal of Systems Science 29(7), 
783-793. 
  
332 Informatica 41 (2017) 317–331 D. Faouzi et al.  
 
Informatica 41 (2017) 333–347 333
Improving Visual Vocabularies: a More Discriminative, Representative and
Compact Bag of Visual Words
Leonardo Chang, Airel Pérez-Suárez and José Hernández-Palancar
Advanced Technologies Application Center, 7A #21406, Siboney, Playa, Havana, Cuba, C.P. 12220
E-mail: {lchang,asuarez,jpalancar}@cenatav.co.cu
Miguel Arias-Estrada and L. Enrique Sucar
Instituto Nacional de Astrofísica, Óptica y Electrónica
Luis Enrique Erro No. 1, Sta. María Tonantzintla, Puebla, México, C.P. 72840
E-mail: {ariasmo,esucar}@inaoep.mx
Keywords: bag of visual words improvement, bag of features, visual vocabulary, visual codebook, object categorization,
object class recognition, object classification
Received: July 21, 2016
In this paper, we introduce three properties and their corresponding quantitative evaluation measures to
assess the ability of a visual word to represent and discriminate an object class, in the context of the
BoW approach. Also, based on these properties, we propose a methodology for reducing the size of the
visual vocabulary, retaining those visual words that best describe an object class. Reducing the vocabulary
will provide a more reliable and compact image representation. Our proposal does not depend on the
quantization method used for building the set of visual words, the feature descriptor or the weighting
scheme used, which makes our approach suitable to any visual vocabulary. Throughout the experiments
we show that using only the most discriminative and representative visual words obtained by our proposed
methodology improves the classification performance; the best results obtained with our proposed method
are statistically superior to those obtained with the entire vocabularies. In the Caltech-101 dataset, average
best results outperformed the baseline by a 4.6% and 4.8% in mean classification accuracy using SVM and
KNN, respectively. In the Pascal VOC 2006 dataset there was a 1.6% and 4.7% improvement for SVM and
KNN, respectively. Furthermore, these accuracy improvements were always obtained with more compact
representations. Vocabularies 10 times smaller always obtained better accuracy results than the baseline
vocabularies in the Caltech-101 dataset, and in the 93.75% of the experiments on the Pascal VOC dataset.
Povzetek: S pomočjo rudarjenja podatkov se prispevek ukvarja z iskanjem besed za razločevanje razredov
objektov.
1 Introduction
One of the most widely used approaches for represent-
ing images for object categorization is the Bag of Words
(BoW) approach [5]. BoW-based methods have obtained
remarkable results in recent years and they even obtained
the best results for several classes in the recent PASCAL
Visual Object Classes Challenge on object classification
[8]. The key idea of BoW approaches is to discretize the en-
tire space of local features (e.g., SIFT [22]) extracted from
a training set at interest points or densely sampled in the
image. With this aim, clustering is performed over the set
of features extracted from a training set in order to identify
features that are visually equivalent. Each cluster is inter-
preted as a visual word, and all clusters form a so-called
visual vocabulary. Later, in order to represent an unseen
image, each feature extracted from the image is assigned to
a visual word of the visual vocabulary; from which a his-
togram of occurrences of each visual word in the image is
obtained, as illustrated in Figure 1.
One of the main limitations of the BoW approach is that
the visual vocabulary is built using features that belong to
both the object and the background. This implies that the
noise extracted from the image background is also con-
sidered as part of the object class description. Also, in
the BoW representation, every visual word is used, regard-
less of its low representativeness or discriminative power.
These elements may limit the quality of further classifica-
tion processes. In addition, there is no consensus about
which is the optimal way for building the visual vocabu-
lary, i.e., the clustering algorithm used, the number of clus-
ters (visual words) that best describe the object classes, etc.
When dealing with relatively small vocabularies, clustering
can be executed several times and the best performing vo-
cabulary can be selected through a validation phase. How-
ever, this becomes intractable for large image collections.
In this paper, we propose three properties to assess the
ability of a visual word to represent and discriminate an
334 Informatica 41 (2017) 333–347 L. Chang et al.
1.) Local Feature 
Extraction
2.) Feature          
Description
3.) Visual       
Vocabulary
)  Ranking 
and Filtering
4.)  Bag of 
Words
1. k5
2. k2
3. k4
4. k1
1. k5
2. k2
3. k4
4. k1
5. k6
6. k3
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
*
FilteringRanking
k1
k2
k3
k4
k5
k6
k1 k2 kK. . .
k1 k2 kK. . .
k1 k2 kK. . .
k1 k2 kK. . .
Figure 1: Classical BoW approach overview (steps 1 to 4). First, regions/points of interest are automatically detected and
local descriptors over those regions/points are computed (step 1 and 2). Later in step 3, the descriptors are quantized into
visual words to form the visual vocabulary. Finally, in step 4, the occurrences in the image of each specific word in the
vocabulary for constructing the BoW feature are found. In this work, we propose to introduce step (∗) in order to use only
the most discriminative and representative visual words from the visual vocabulary in the BoW representation.
object class in the context of the BoW approach. We de-
fine three measures in order to quantitatively evaluate each
of these properties. The visual words that best represent a
class, best generalize over intra-class variability and best
differentiate between object classes will obtain the high-
est scores for these measures. A methodology for reducing
the size of the visual vocabulary based on these proper-
ties is also proposed. Our proposal does not depend on
the clustering method used to create the visual vocabulary,
the descriptor used (e.g., SIFT, SURF, etc.) or the weight-
ing scheme used (e.g., tf, tf-idf, etc.) Therefore, it can be
applied to any visual vocabulary to improve its representa-
tiveness, since it does not build a new visual vocabulary, it
rather finds the best visual words of a given visual vocabu-
lary.
Experiments conducted on the Caltech-101 [10] and Pas-
cal VOC 2006 [9] datasets, in a classification task, demon-
strate the improvement introduced by the proposed method.
Tested with different vocabulary sizes, different interest
points extraction and description methods, and different
weighting schemas, the classification accuracies achieved
using the entire vocabulary were always statistically infe-
rior to those achieved by several of the vocabularies ob-
tained by filtering the baseline vocabulary, using our pro-
posed vocabulary size reducing methodology. Moreover,
the best results were obtained with as few as the 13.4%
and 17.2%, in average, of the baseline visual words for the
Caltech-101 and Pascal VOC 2006 datasets, respectively.
Compared with a state-of-the-art mutual information based
method for feature selection our proposal obtains superior
classification accuracy results for the highest compression
rates and comparable results for the other filtering sizes.
The paper is organized as follows: Section 2 gives an
overview on related works for building more discrimina-
tive and representative visual vocabularies. Section 3 intro-
duces the proposed properties and measures for the evalua-
tion of the representativeness and distinctiveness of visual
words. The performance of our proposed method on two
data sets and a discussion of the obtained results are pre-
sented in Section 4. Finally, Section 5 concludes the paper
with a summary of our findings and a discussion of future
work.
2 Related work
Several methods have been proposed in the literature to
overcome the limitations of the BoW approach [25]. These
include part generative models and frameworks that use ge-
ometric correspondence [30, 23], works that deal with the
quantization artifacts introduced while assigning features
to visual words [15, 11], techniques that explore different
features and descriptors [24, 12], among many others. In
this section, we briefly review some recent methods aimed
to build more discriminative and representative visual vo-
cabularies, which are more related to our work.
Kersorn and Poslad [17] presented a framework to im-
prove the quality of visual words by constructing visual
words from representative keypoints. Also, domain spe-
cific non-informative visual words are detected using two
main characteristics for non-informative visual words: high
document frequency and a small statistical association with
all the concepts in the collection. In addition, the vector
space model of visual words is restructured with respect to
a structural ontology model in order to solve visual syn-
onym and polysemy problems.
Zhang et al. [29] proposed to obtain a visual vocabu-
lary comprised of descriptive visual words and descriptive
visual phrases as the visual correspondences to text words
and phrases. Authors state that a descriptive visual element
can be composed by the visual words and their combina-
tions and that these combinations are effective in represent-
Improving Visual Vocabularies: A More Discriminative. . . Informatica 41 (2017) 333–347 335
ing certain visual objects or scenes. Therefore, they define
visual phrases as frequently co-occurring visual word pairs.
Lopez-Sastre et al. [21] presented a method for building
a more discriminative visual vocabulary by taking into ac-
count the class labels of images. The authors proposed a
cluster precision criterion based on class labels in order to
obtain class representative visual words through a Recip-
rocal Nearest Neighbors clustering algorithm. Also, they
introduced an adaptive threshold refinement scheme aimed
to increase vocabulary compactness.
Liu [19] builds a visual vocabulary based on a Gaus-
sian Mixed Model (GMM). After K-Means clusters are ob-
tained, GMM is then used to model the distribution of each
cluster. Each GMM will be used as a visual word of the
visual vocabulary. Also, a soft assignment schema for the
bag of words is proposed based on the soft assignment of
image features to each GMM visual word.
Liu and Shah [20] exploit mutual information maximiza-
tion techniques to learn a compact set of visual words and
to determine the size of the codebook. In their proposal two
codebook entries are merged if they have comparable dis-
tributions. In addition, spatio-temporal pyramid matching
is used to exploit temporal information in videos.
Most popular visual descriptors are histograms of im-
age measurements. It has been shown that with histogram
features, the Histogram Intersection Kernel (HIK) is more
effective than the Euclidean distance in supervised learning
tasks. Based on this assumption, Wu et al. [28] proposed a
histogram kernel k-means algorithm which use HIK in an
unsupervised manner to improve the generation of visual
codebooks.
In [4], in order to use low level features extracted from
images to create higher level features, Chandra et al. pro-
posed a hierarchical feature learning framework that uses
a Naive Bayes clustering algorithm. First, SIFT features
over a dense grid are quantized using K-Means to obtain
the first level symbol image. Later, features from the cur-
rent level are clustered using a Naive Bayes-based cluster-
ing and quantized to get the symbol image at the next level.
Bag of words representations can be computed using the
symbol image at any level of the hierarchy.
Jiu et al. [16], motivated by obtaining a visual vocabu-
lary highly correlated to the recognition problem, proposed
a supervised method for joint visual vocabulary creation
and class learning, which uses the class labels of the train-
ing set to learn the visual words. In order to achieve that,
they proposed two different learning algorithms, one based
on error backpropagation and the other one based on cluster
label reassignment.
In [27], the authors propose a hierarchical visual word
mergence framework based on graph-embedding. Given a
predefined large set of visual words, their goal is to hierar-
chically merge them into a small number of visual words,
such that the lower dimensional image representation ob-
tained based on these new words can maximally maintain
classification performance.
Zhang et al. [31] proposed a supervised Mutual Infor-
mation (MI) based feature selection method. This algo-
rithm uses MI between each dimension of the image de-
scriptor and the image class label to compute the dimension
importance. Finally, using the highest importance values,
they reduce the image representation size. This method
achieve higher accuracy and less computational cost than
feature compression methods such as product quantization
[14] and BPBC [13].
In our work, similarly to [17, 21, 16], we also use the
class labels of images. However, we do not use the class la-
bels to create a new visual vocabulary but for scoring the set
of visual words, according to their distinctiveness and rep-
resentativeness for each class. It is important to emphasize
that our proposal does not depend on the algorithm used for
building the set of visual words, the descriptor used nor the
weighting scheme used. The previously mentioned charac-
teristics make our approach suitable to any visual vocabu-
lary since it does not build a new visual vocabulary, it rather
finds the best visual words of a given visual vocabulary. In
fact, our proposal could directly complement all the above
discussed methods, by ranking their resulting vocabularies
according to the distinctiveness and representativeness of
the obtained visual words, although is out of the scope of
this paper to explore it.
3 Proposed method
Visual vocabularies are commonly comprised by a lot of
noisy visual words due to intra-class variability and the in-
clusion of features from the background during the vocab-
ulary building process, among others. Later, for image rep-
resentation every visual word is used, which may lead to an
error-prone image representation.
In order to improve image representations, we introduce
three properties and their corresponding quantitative eval-
uations to assess the ability of a visual word to represent
and discriminate an object class in the context of the BoW
approach. We also propose a methodology, based on these
properties, for reducing the size of the visual vocabulary,
discarding those visual words that worst describe an object
class (i.e., noisy visual words). Reducing the vocabulary
in such a manner will allow to have a more reliable and
compact image representation.
We would like to emphasize that all the measures pro-
posed in this section are used during the training phase;
therefore, we can use all the knowledge about the data that
is available during this phase.
3.1 Inter-class representativeness measure
A visual word could be comprised of features from differ-
ent object classes, representing visual concepts or parts of
objects common to those different classes. These common
parts or concepts do not have necessarily to be equally rep-
resented inside the visual word because, even when similar,
object classes should also have attributes that differentiate
them. Therefore, we can say that, in order to represent an
336 Informatica 41 (2017) 333–347 L. Chang et al.
object class the best, a property that a visual word must
satisfy is to have a high representativeness of this class. In
order to measure the representativeness of a class cj in vi-
sual word k, the measureM1 is proposed:
M1(k, cj) =
fk,cj
nk
, (1)
where fk,cj represents the number of features of class cj
in visual word k and nk is the total number of features in
visual word k.
Figure 2 shows M1 values for two example visual
words. In Figure 2 a) the ‘blue’ class has a very high value
ofM1 because most of the features in the visual word be-
long to the # class, being the opposite for the classes 2
and 4 that are poorly represented in the visual word. Fig-
ure 2 b) shows an example visual word where every class is
nearly equally represented, therefore every class have sim-
ilarM1 values.
3.2 Intra-class representativeness measure
A visual word could be comprised of features from differ-
ent objects, many of them probably belonging to the same
object class. Even when different, object instances from
the same class should share several visual concepts. Tak-
ing this into account, we can state that a visual word best
describes a specific object class while more balanced are
the features from that object class comprising the visual
word, with respect to the number of different training ob-
jects belonging to that class. Therefore, we could say that,
in order to represent an object class the best, a property that
a visual word must satisfy is to have a high generalization
or intra-class representativeness over this class.
To measure the intra-class representativeness of a visual
word k for a given object category cj , the measure µ is
proposed:
µ(k, cj) =
1
Ocj
Ocj∑
m=1
∣∣∣∣
om,k,cj
fk,cj
− 1
Ocj
∣∣∣∣, (2)
where Ocj is the number of objects (images) of class cj in
the training set. om,k,cj is the number of features extracted
from object m of class cj in visual word k, and fk,cj is the
number of features of class cj that belong to visual word
k. The term 1/Ocj represents the ideal ratio of features
of class cj that guarantees the best balance, i.e., the case
where each object of class cj is equally represented in vi-
sual word k.
The measure µ evaluates how much a given class devi-
ates from its ideal value of intra-class variability balance.
In order to make this value comparable with other classes
and visual words, µ could be normalized using its maxi-
mum possible value, which is
2·Ocj−2
O2cj
.
Taking into account that µ takes its maximum value in
the worst case of intra-class representativeness, the mea-
sureM2 is defined to take its maximum value in the case
of ideal intra-class variability balance and to be normalized
by max(µ(k, cj)):
M2(k, cj) = 1−
Ocj
2 · (Ocj − 1)
Ocj∑
m=1
∣∣∣∣
om,k,cj
fk,cj
− 1
Ocj
∣∣∣∣.
(3)
Figure 3 shows the values ofM2 on two example visual
words. In Figure 3 a), the number of features from the
different images of the # class in the visual word is well
balanced, i.e., the visual word generalizes well over intra-
class variability for the # class, hence this class presents a
highM2 value. In contrast, in Figure 3 b) only one image
from the # class is well represented by the visual word.
As the visual word represents a visual characteristic only
present in one image, it is not able to well represent intra-
class variability, therefore, the # class will have a low value
ofM2 in this visual word.
3.3 Inter-class distinctiveness measure
M1 andM2 provide, under different perspectives, a quan-
titative evaluation of the ability of a visual word to describe
a given class. However, we should not build a vocabu-
lary just by selecting those visual words that best repre-
sent each object class, because this fact does not directly
imply that the more representative words will be able to
differentiate well one class from another, as a visual vo-
cabulary is expected to do. Therefore, we can state that, in
order to be used as part of a visual vocabulary, a desired
property of a visual word is that it should have high val-
ues of M1(k, cj) and M2(k, cj) (represents well the ob-
ject class), while having low values ofM1(k, {cj}C) and
M2(k, {cj}C) (misrepresents the rest of the classes), i.e.,
it must have high discriminative power.
In order to quantify the distinctiveness of a visual word
for a given class, the measure M3 is proposed. M3 ex-
presses how much the object class that is best represented
by visual word k is separated from the other classes in the
M1 andM2 rankings.
Let ΘM(K, cj) be the set of values of a given measure
M for the set of visual words K = {k1, k2, ..., kN} and
the object class cj , sorted in descending order of the value
ofM. Let Φ(k, cj) be the position of visual word k ∈ K
in ΘM(K, cj). Let Pk = mincj∈C(Φ(k, cj)) be the best
position of visual word k in the set of all object classes
C = {c1, c2, ..., cQ}. Let ck = arg mincj∈C(Φ(k, cj)) be
the object class where k has position Pk. Then, the inter-
class distinctiveness (measureM3), of a given visual word
k for a given measureM, is defined as:
M3(k,M) =
1
(|C| − 1)(|K| − 1)
∑
cj 6=ck
(Φ(k, cj)− Pk).
(4)
In Figure 4, the M3 measure is calculated for two vi-
sual words (i.e., k2 and k5) of a six visual words and three
Improving Visual Vocabularies: A More Discriminative. . . Informatica 41 (2017) 333–347 337
(a) (b)
k k
nk = 20
M1(k, ) =
M1(k, ) =
M1(k, ) =
0.85
0.10
0.05
nk = 20
0.35
0.30
0.35
M1(k, ) =
M1(k, ) =
M1(k, ) =
Figure 2: (best seen in color.) Examples ofM1 measure values for a) a visual word with a well-defined representative
class (# class with high M1 value, 2 and 4 classes with low M1 values) and b) a visual word without any highly
representative class (#, 2 and4 classes have low and very similarM1 values).
(a) (b)
k k
O = 4
fk, = 17
M2(k, ) = 0.9559
o‘red’,k, = 5
o‘blue’,k, = 4
o‘yellow’,k, = 4
o‘gray’,k, = 4
O = 4
fk, = 17
M2(k, ) = 0.4853
o‘red’,k, = 1
o‘blue’,k, = 13
o‘yellow’,k, = 1
o‘gray’,k, = 2
Figure 3: (best seen in color.) Examples ofM2 measure values for the # class in a) a visual word where there is a good
balance between the number of features of different images of the # class (highM2 value), and in b) the opposite case
where only one image for the # class is predominantly represented in the visual word (low M2 value). In the figure,
different fill colors of each feature in the visual word represent features extracted from different object images of the same
class.
338 Informatica 41 (2017) 333–347 L. Chang et al.
classes example. Visual word k2 is among the top items of
the representativeness ranking for every class in the exam-
ple. Despite this, k2 has low discriminative power because
describing well several classes makes harder the process
of differentiate one class from another. In contrast, visual
word k5 is highly discriminative because it describes well
only one class.
3.4 On ranking and reducing the size of
visual vocabularies
The proposed measures, provide a quantitative evaluation
of the representativeness and distinctiveness of the visual
words in a vocabulary for each class. The visual words
that best represent a class, best generalize over intra-class
variability and best differentiate between object classes will
obtain the highest scores for these measures. In this sec-
tion, we present a methodology for ranking and reducing
the size of the visual vocabularies, towards more reliable
and compact image representations.
Let ΘM1(K) and ΘM2(K) be the rankings of vocab-
ulary K, using measures M3(K,M1) and M3(K,M2),
respectively. ΘM1(K) and ΘM2(K) provide a ranking of
the vocabulary based on the distinctiveness of visual words
according to inter-class and intra-class variability, respec-
tively.
In order to find a consensus, Θ(K), between both rank-
ings ΘM1(K) and ΘM2(K) a consensus-based voting
method can be used; in our case, we decided to use the
Borda Count algorithm [7] although any other can be used
as well. The Borda Count algorithm obtains a final ranking
from multiple rankings over the same set. Given |K| visual
words, a visual word receive |K| points for a first prefer-
ence, |K| − 1 points for a second preference, |K| − 2 for
a third, and so on for each ranking independently. Later,
individual values for each visual word are added and a final
ranking obtained.
From this final ranking a reduced vocabulary can be ob-
tained by selecting the first N visual words. As pointed in
[20], the size of the vocabulary affects the performance and
there is a vocabulary size which can achieve maximal accu-
racy, which depends on the dataset, the number of classes
and the data nature, among others. In our experiments, we
explore different vocabulary sizes, over different datasets,
different interest points extraction and description methods,
different weighting schemas, and different classifiers.
4 Experimental evaluation
In this work we have presented a methodology for im-
proving BoW-based image representation by using only the
most representative and discriminative visual words in the
vocabulary. As it was stated in previous sections, our pro-
posal does not depend on the algorithm used for building
the set of visual words, the descriptor used nor the weight-
ing scheme used. Therefore, the proposed methodology
could be applied for improving the accuracy of any of the
methods reported in the literature, which are based on a
BoW approach.
The main goal of the experiments we present in this sec-
tion is to quantitatively evaluate the improvement intro-
duced by our proposal to the BoW-based image represen-
tation, over two standard datasets commonly used in object
categorization. The experiments were focused on: a) to
assess the validity of our proposal in a classic BoW-based
classification task, b) to evaluate the methodology directly
with respect to other kinds of feature selection algorithms,
and c) to measure the time our methodology spent in or-
der to filter the visual vocabulary built for each dataset. All
the experiments were done on a single thread of a 3.6 GHz
Intel i7 processor and 64GB RAM PC.
The experiments conducted in order to evaluate our pro-
posal were done in two well-known datasets: Caltech-101
[10] and Pascal VOC 2006 [9].
The Caltech-101 dataset [10] consists of 102 object cat-
egories. There are about 40 to 800 images per category
and most categories have about 50 images. The Pascal
VOC 2006 dataset [9] consists of 10 object categories.
In total, there are 5304 images, split into 50% for train-
ing/validation and 50% for testing. The distributions of
images and objects by class are approximately equal across
the training/validation and test sets.
4.1 Assessing the validity in a BoW-based
classification task
As it was mentioned before, the goal of the first experi-
ment is to assess the validity of our proposal. With this
aim, we evaluate the accuracy in a classic BoW-based clas-
sification task, with and without applying our vocabulary
filtering methodology.
In the experiments presented here, we use for image rep-
resentation the BoW schema presented in Figure 1 with the
following specifications:
– Interest points are detected and described using two
methods: SIFT [22] and SURF [2].
– K-means, with four differentK values, is used to build
the visual vocabularies; these vocabularies constitute
the baseline. For both Caltech-101 and Pascal VOC
2006 datasets we used K=10000, 15000, 20000 and
25000.
– Each of the baseline vocabularies is ranked using our
proposed visual words ranking methodology.
– Later, nine new vocabularies are obtained by filtering
each baseline vocabulary, leaving the 10%, 20%, ...,
90%, respectively, of the most representative and dis-
criminative visual words based on the obtained rank-
ing.
– Two weighting schemas are used for image represen-
tation: tf and tf-idf.
Improving Visual Vocabularies: A More Discriminative. . . Informatica 41 (2017) 333–347 339
ΘM(K, c1) ΘM(K, c2) ΘM(K, c3)
1. k2 k2 k1
2. k1 k5 k2
3. k3 k4 k4
4. k4 k6 k6
5. k6 k1 k5
6. k5 k3 k3
R
an
k
in
g 
p
os
it
io
n
Pk2 = 1, ck2 = 1
0.1
M3(k2,M) = 1(|C|−1)(|K|−1)
∑C
cj 6=ck2
(Φ(k2, cj) − Pk2)
M3(k2,M) =
Pk5 = 2, ck5 = 2
0.7
M3(k5,M) = 1(|C|−1)(|K|−1)
∑C
cj 6=ck5
(Φ(k5, cj) − Pk5)
M3(k5,M) =
Figure 4: (best seen in color.) Example ofM3 measure for two visual words. k2 has low discriminative power because it
represents well several classes, while k5 has high discriminative power because it describes well only one class.
– For both datasets we randomly selected 10 images
from all the categories for building the visual vocab-
ularies. The rest of the images were used as test im-
ages; but, as in [18], we limited to 50 the number of
test images per category.
After that, we tested the obtained visual vocabularies in
a classification task, using SVM (with a linear kernel) and
KNN (where K is optimized with respect to the leave-one-
out error) as classifiers. For each visual vocabulary, test
images are represented using this vocabulary and, a 10-fold
10-times cross-validation process is conducted, where nine
of the ten partitions are used for training and the other one
for testing the trained classifier. The mean classification
accuracy along the ten iterations is reported.
Figures 5 and 6 show the mean classification accuracy
results over the cross-validation process using SVM and
KNN, respectively, on the Caltech-101 dataset. Figures 7
and 8 show the same for the Pascal VOC 2006 dataset. In
Figures 5 to 8, subfigures (a) and (b) show the results us-
ing SIFT descriptor; results for SURF are shown in sub-
figures (c) and (d). Results for the two different weighting
schemas, i.e., tf and tf-idf, are shown in subfigures (a) (c),
and (b) (d), respectively.
It can be seen that in both datasets, for every config-
uration, our proposed methodology allows to obtain re-
duced vocabularies that outperformed the classical BoW
approach (baseline).
Table 2 summarizes the results presented in Figures 5
and 6 for the Caltech-101 dataset. The results in Figures
7 and 8 are summarized in Table 3. For every experiment
configuration, Tables 2 and 3 show the baseline classifica-
tion accuracy against the best result obtained by the pro-
posed method with both SVM and KNN classifiers. The
size of the filtered vocabulary in which the best result was
obtained is also showed.
4.1.1 Discussion
The experimental results presented in this section validate
the claimed contributions of our proposed method. As it
can be seen in Tables 2 and 3, the best results obtained with
our proposed method outperform those obtained with the
whole vocabularies. For the experiments conducted in the
Caltech-101 dataset, our average best results outperformed
the baseline by a 4.6% and 4.8% in mean classification ac-
curacy using SVM and KNN, respectively. In the Pascal
VOC 2006 dataset there was a 1.6% and 4.7% improvement
for SVM and KNN, respectively. As noticed on Figures 5
to 8, there is a trend of the performance with respect to the
filter size in the two considered datasets, i.e., for smaller
filter sizes higher accuracy.
In order to validate the improvement obtained by the pro-
posed method, the statistical significance of the obtained
results was verified. For testing the statistical significance
we used the Mann-Whitney test, with a 95% of confidence.
A detailed explanation about Mann-Whitney test, as well
as an implementation, can be found in [1]. As a result of
this test, it has been verified that the results obtained in both
datasets, by the proposed method, are statistically superior
to those obtained by the baseline.
In addition, the best results using the filtered vocabular-
ies were obtained with vocabularies several times smaller
than the baseline vocabularies, i.e., 6 and 10 times smaller
in average using SVM and KNN, respectively, for the
Caltech-101 datasets, and 8 and 5 times smaller in aver-
age for the Pascal VOC 2006 dataset with SVM and KNN,
respectively. Furthermore, vocabularies 10 times smaller
always obtained better accuracy results than the baseline
vocabularies in the Caltech-101 dataset, and in the 93.75%
of the experiments on the Pascal VOC 2006 dataset. Ob-
taining smaller vocabularies implies more compact image
representations, that will have a direct impact on the effi-
ciency of further processing based on these image repre-
sentations, and less memory usage.
Also, the conducted experiments provide evidence that a
large number of visual words in a vocabulary are noisy or
little discriminative. Discarding these visual words allows
for a better and more compact image representations.
4.2 Comparison with other kinds of feature
selection algorithms
The aim of the second experiment is to compare our pro-
posal with respect to other kind of feature selection algo-
rithm. With this purpose, we compare the accuracy of our
vocabulary filtering methodology with respect to the accu-
340 Informatica 41 (2017) 333–347 L. Chang et al.
Table 1: Computation time (in seconds) of visual vocabulary ranking compared to vocabulary building.
Dataset K
Vocabulary
building
(K-means)
computation time (s)
Vocabulary
ranking
(proposed method)
computation time (s)
Vocabulary
ranking
(MI-based method [31])
computation time (s)
Caltech-101
(188 248
training
features)
10000 4723.452 8.111 5.754
15000 6711.089 18.622 16.825
20000 7237.885 33.890 27.478
25000 9024.024 54.338 49.963
Pascal VOC 2006
(114 697
training
features)
10000 4126.487 7.985 5.285
15000 5922.134 18.320 15.879
20000 6441.980 30.259 28.458
25000 8563.692 51.743 49.132
(a) (b)
SIFT + tf + SVM
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
K = 25000 
K = 10000 
K = 15000 
K = 20000 
SIFT + tf-idf + SVM
K = 25000 
K = 10000 
K = 15000 
K = 20000 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
SURF + tf + SVM
K = 25000 
K = 10000 
K = 15000 
K = 20000 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
SURF + tf-idf + SVM
K = 25000 
K = 10000 
K = 15000 
K = 20000 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
(c) (d)
20.00
21.75
23.50
25.25
27.00
23.00
24.50
26.00
27.50
29.00
19.00
21.75
24.50
27.25
30.00
26.00
27.25
28.50
29.75
31.00
Figure 5: Mean classification accuracy results for SVM cross-validation on the Caltech-101 dataset. As can be seen, using
the reduced vocabularies always resulted in better classification accuracies.
Improving Visual Vocabularies: A More Discriminative. . . Informatica 41 (2017) 333–347 341
(a) (b)
SIFT + tf + KNN
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
K = 25000 
K = 10000 
K = 15000 
K = 20000 
SIFT + tf-idf + KNN
K = 25000 
K = 10000 
K = 15000 
K = 20000 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
SURF + tf + KNN
K = 25000 
K = 10000 
K = 15000 
K = 20000 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
SURF + tf-idf + KNN
K = 25000 
K = 10000 
K = 15000 
K = 20000 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
(c) (d)
2.00
3.50
5.00
6.50
8.00
2.00
3.50
5.00
6.50
8.00
2.00
4.25
6.50
8.75
11.00
2.00
4.25
6.50
8.75
11.00
Figure 6: Mean classification accuracy results for KNN cross-validation on the Caltech-101 dataset. As can be seen, using
the reduced vocabularies always resulted in better classification accuracies.
342 Informatica 41 (2017) 333–347 L. Chang et al.
SIFT + tf-idf + SVM
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
SURF + tf + SVM
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
(a) (b)
(c) (d)
SURF + tf-idf + SVM
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
SIFT + tf + SVM
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
33
33.9
34.8
35.6
36.5
36.8
37.4
37.9
38.5
39
35
35.9
36.8
37.6
38.5
37
38
39
40
41
K = 25000 
K = 10000 
K = 15000 
K = 20000 
K = 25000 
K = 10000 
K = 15000 
K = 20000 
K = 25000 
K = 10000 
K = 15000 
K = 20000 
K = 25000 
K = 10000 
K = 15000 
K = 20000 
Figure 7: Mean classification accuracy results for SVM cross-validation on the Pascal VOC 2006 dataset. As can be seen,
using the reduced vocabularies always resulted in better classification accuracies.
Improving Visual Vocabularies: A More Discriminative. . . Informatica 41 (2017) 333–347 343
SIFT + tf-idf + KNN
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
SURF + tf + KNN
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
(a) (b)
(c) (d)
SURF + tf-idf + KNN
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
SIFT + tf + KNN
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
10 20 30 40 50 60 70 80 90 100
(baseline)
Filtering size (%)
7
10.3
13.5
16.8
20
18.5
19.3
20
20.8
21.5
4.5
7.9
11.3
14.6
18
19.2
20
20.9
21.7
22.5
K = 25000 
K = 10000 
K = 15000 
K = 20000 
K = 25000 
K = 10000 
K = 15000 
K = 20000 
K = 25000 
K = 10000 
K = 15000 
K = 20000 
K = 25000 
K = 10000 
K = 15000 
K = 20000 
Figure 8: Mean classification accuracy results for KNN cross-validation on the Pascal VOC 2006 dataset. As can be seen,
using the reduced vocabularies always resulted in better classification accuracies.
344 Informatica 41 (2017) 333–347 L. Chang et al.
Table 2: Summarized results for the Caltech-101 dataset.
SVM KNN
Descriptor Weighting K Base- Best Best filter Base- Best Best filterschema line result size (%) line result size (%)
SIFT
tf
10000 24.05 26.54 20 3.60 7.86 10
15000 22.22 26.66 10 3.06 6.74 10
20000 21.22 25.28 20 2.54 5.84 10
25000 20.41 25.85 10 2.34 5.30 10
tf-idf
10000 23.96 27.87 40 3.41 7.28 10
15000 24.05 28.55 10 2.91 6.29 10
20000 24.45 27.53 20 2.60 6.13 10
25000 24.18 27.87 20 2.45 5.59 10
SURF
tf
10000 24.63 29.81 10 3.53 10.70 10
15000 22.43 28.82 10 3.26 8.77 10
20000 20.75 28.08 10 2.80 9.54 10
25000 19.56 27.66 10 3.17 8.55 10
tf-idf
10000 26.48 30.74 20 3.42 10.50 10
15000 26.39 30.21 20 2.97 8.70 10
20000 26.50 29.78 10 2.72 8.69 10
25000 26.62 30.47 30 2.57 7.61 10
Average 23.62 28.23 16.8 2.96 7.76 10
Table 3: Summarized results for the Pascal VOC 2006 dataset.
SVM KNN
Descriptor Weighting K Base- Best Best filter Base- Best Best filterschema line result size (%) line result size (%)
SIFT
tf
10000 34.97 36.11 10 13.16 18.74 10
15000 34.37 36.38 10 10.22 17.79 10
20000 33.69 36.30 10 9.31 16.89 10
25000 33.66 35.84 10 8.31 17.35 10
tf-idf
10000 37.86 38.27 10 19.25 21.12 10
15000 37.27 38.77 10 19.25 20.14 10
20000 37.86 38.84 10 19.37 19.72 90
25000 36.97 38.48 30 19.31 19.76 70
SURF
tf
10000 36.36 38.15 10 6.96 17.72 10
15000 36.49 37.54 10 6.37 16.17 10
20000 36.05 37.82 20 6.11 14.15 10
25000 35.39 36.78 20 5.91 14.49 10
tf-idf
10000 37.77 40.85 10 20.56 22.20 10
15000 37.72 39.74 10 20.19 21.81 10
20000 37.28 39.21 10 20.39 20.81 60
25000 37.63 38.31 10 20.22 21.28 10
Average 36.33 37.96 12.5 14.06 18.76 21.88
Improving Visual Vocabularies: A More Discriminative. . . Informatica 41 (2017) 333–347 345
racy of the MI-based method proposed in [31], in a classifi-
cation task; the experiment was done over the Caltech-101
dataset. As it was mentioned in Section 2, the MI-based
method proposed in [31] obtains the best results among the
feature selection and compression methods of image repre-
sentation for object categorization.
In the experiments presented here, we use for image rep-
resentation a BoW-based schema with the following speci-
fications:
– PHOW features (dense multi-scale SIFT descriptors)
[3].
– Spatial histograms as image descriptors.
– Elkan K-means [6], with five different K values (K=
256, 512, 1024, 2048 and 4096), is used to build the
visual vocabularies; these vocabularies constitute the
baseline.
– Each of the baseline vocabularies is ranked using the
MI-based method proposed in [31] and our proposed
visual vocabulary ranking methodology.
– Later, nine new vocabularies are obtained by filtering
each baseline vocabulary, leaving the 10%, 20%, ...,
90%, respectively.
– We randomly selected 15 images from each of the 102
categories of Caltech-101 dataset, in order to build
the visual vocabularies. For each category, 15 images
were randomly selected as test images.
We tested the obtained visual vocabularies in a classifi-
cation task, using a homogeneous kernel map to transform
a χ2 Support Vector Machine (SVM) into a linear one [26].
The classification accuracy is reported in Figure 9.
As it can be seen in Figure 9, for each value ofK used in
the experiment, our proposal obtains the best classification
accuracy results for the highest compression rates. Besides,
for the other filtering sizes our proposal and the MI-based
method attains comparable results.
4.3 Computation time of the visual
vocabulary ranking
The computation time of the visual vocabulary ranking
methodology has also been evaluated. Table 1 shows the
time in seconds taken for the ranking method in different
size vocabularies, for the Caltech-101 and the Pascal VOC
2006 dataset. In Table 1, the ranking time is compared with
the time needed to build the visual vocabulary.
As can be seen in Table 1, the proposed methodology can
be used to improve visual vocabularies without requiring
much extra computation time.
5 Conclusion and future work
In this work we devised a methodology for reducing the
size of visual vocabularies that allows to obtain more dis-
criminative and representative visual vocabularies for BoW
image representation. The vocabulary reduction is based
on three properties and their corresponding quantitative
measures that express the inter-class representativeness,
the intra-class representativeness and inter-class distinc-
tiveness of visual words. The experimental results pre-
sented in this paper showed that, in average, with only 25%
of the ranked vocabulary, statistically superior classifica-
tion results can be obtained, compared to the classical BoW
representation using the entire vocabularies. Therefore, the
proposed method, in addition to providing accuracy im-
provements, provides a substantial efficiency improvement.
Also, compared with a mutual information based method
our proposal obtained superior results for the highest com-
pression rates and comparable results for the other filtering
sizes.
As future work, we aim to propose a weighting schema
that takes advantage of the proposed measures, in order
to improve image representation. Also we would like to
explore the use of hierarchical classifiers for dealing with
inter-class variability. Finally, we also aim to define a mea-
sure that help us to automatically choose the filter size.
References
[1] Concepts and applications of inferential statistics,
2013.
[2] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc
Van Gool. Speeded-Up Robust Features (SURF).
Comput. Vis. Image Underst., 110(3):346–359, 2008.
[3] A Bosch, Andrew Zisserman, and X Munoz. Image
classification using random forests and ferns. IEEE
11th International Conference on Computer Vision
(2007), 23(1):1–8, 2007.
[4] Siddhartha Chandra, Shailesh Kumar, and C. V. Jawa-
har. Learning hierarchical bag of words using naive
bayes clustering. In Asian Conference on Computer
Vision, pages 382–395, 2012.
[5] Gabriella Csurka, Christopher R. Dance, Lixin Fan,
Jutta Willamowski, and Cédric Bray. Visual catego-
rization with bags of keypoints. In Workshop on Sta-
tistical Learning in Computer Vision, ECCV, pages
1–22, 2004.
[6] Charles Elkan. Using the triangle inequality to ac-
celerate k-means. In Tom Fawcett and Nina Mishra,
editors, ICML, pages 147–153. AAAI Press, 2003.
[7] Peter Emerson. The original borda count and partial
voting. Social Choice and Welfare, 40(2):353–358,
2013.
346 Informatica 41 (2017) 333–347 L. Chang et al.
56
58.5
61
63.5
66
59
61
63
65
67
61
63
65
67
69
64
65.5
67
68.5
70
64
65.25
66.5
67.75
69
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
M
ea
n
 A
cc
u
ra
cy
 (
%
)
 
K = 256 K = 512
K = 1024
K = 4096
K = 2048
(a) (b)
10 20 30 40 50 60 70 80 90 100
(whole 
vocabulary)Filtering size (%)
10 20 30 40 50 60 70 80 90 100
(whole 
vocabulary)Filtering size (%)
(c) (d)
10 20 30 40 50 60 70 80 90 100
(whole 
vocabulary)Filtering size (%)
(e)
10 20 30 40 50 60 70 80 90 100
(whole 
vocabulary)Filtering size (%)
10 20 30 40 50 60 70 80 90 100
(whole 
vocabulary)Filtering size (%)
Our method 
MI [Zhang et. al, 2014] 
Figure 9: Comparison of mean classification accuracy results, on the Caltech-101 dataset, between the proposed method-
ology and the MI-based method proposed in [31].
Improving Visual Vocabularies: A More Discriminative. . . Informatica 41 (2017) 333–347 347
[8] M. Everingham, L. Van Gool, C. K. I.
Williams, J. Winn, and A. Zisserman. The
PASCAL Visual Object Classes Challenge
2011 (VOC2011) Results. http://www.pascal-
network.org/challenges/VOC/voc2011/ work-
shop/index.html.
[9] M. Everingham, A. Zisserman, C. K. I.
Williams, and L. Van Gool. The PAS-
CAL Visual Object Classes Challenge 2006
(VOC2006) Results. http://www.pascal-
network.org/challenges/VOC/voc2006/ results.pdf.
[10] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learn-
ing generative visual models from few training ex-
amples: An incremental bayesian approach tested on
101 object categories. Comput. Vis. Image Underst.,
106(1):59–70, April 2007.
[11] Basura Fernando, Élisa Fromont, Damien Muselet,
and Marc Sebban. Supervised learning of gaussian
mixture models for visual vocabulary generation. Pat-
tern Recognition, 45(2):897–907, 2012.
[12] Peter V. Gehler and Sebastian Nowozin. On feature
combination for multiclass object classification. In
ICCV, pages 221–228. IEEE, 2009.
[13] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik.
Learning binary codes for high-dimensional data us-
ing bilinear projections. In CVPR 2013, 2013.
[14] H. Jégou, M. Douze, and C. Schmid. Product quan-
tization for nearest neighbor search. Pattern Analysis
and Machine Intellingence, 33(1):117?128, 2011.
[15] Hervé Jégou, Matthijs Douze, and Cordelia Schmid.
Product quantization for nearest neighbor search.
IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–
128, 2011.
[16] Mingyuan Jiu, Christian Wolf, Christophe Garcia,
and Atilla Baskurt. Supervised learning and code-
book optimization for bag of words models. Cogni-
tive Computation, 4:409–419, December 2012.
[17] Kraisak Kesorn and Stefan Poslad. An enhanced bag-
of-visual word vector space model to represent visual
content in athletics images. IEEE Transactions on
Multimedia, 14(1):211–222, 2012.
[18] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce.
Beyond Bags of Features: Spatial Pyramid Match-
ing for Recognizing Natural Scene Categories. 2006
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition Volume 2 CVPR06,
2(2169-2178):2169–2178, 2006.
[19] Gang Liu. Improved bags-of-words algorithm for
scene recognition. Journal of Computational Infor-
mation Systems, 6(14):4933 – 4940, 2010.
[20] Jingen Liu and Mubarak Shah. Learning human ac-
tions via information maximization. 2013 IEEE Con-
ference on Computer Vision and Pattern Recognition,
0:1–8, 2008.
[21] R.J. Lopez-Sastre, T. Tuytelaars, F.J. Acevedo-
Rodriguez, and S. Maldonado-Bascon. Towards
a more discriminative and semantic visual vocabu-
lary. Computer Vision and Image Understanding,
115(3):415 – 425, 2011. Special issue on Feature-
Oriented Image and Video Computing for Extracting
Contexts and Semantics.
[22] David G. Lowe. Distinctive image features from
scale-invariant keypoints. International Journal of
Computer Vision, 60(2):91–110, 2004.
[23] Zhiwu Lu and Horace Ho-Shing Ip. Image catego-
rization with spatial mismatch kernels. In CVPR,
pages 397–404. IEEE, 2009.
[24] Jianzhao Qin and Nelson Hon Ching Yung. Feature
fusion within local region using localized maximum-
margin learning for scene categorization. Pattern
Recognition, 45(4):1671–1683, 2012.
[25] Chih-Fong Tsai. Bag-of-words representation in im-
age annotation: A review. ISRN Artificial Intelli-
gence, 2012, 2012.
[26] A. Vedaldi and A. Zisserman. Efficient additive ker-
nels via explicit feature maps. Pattern Analysis and
Machine Intellingence, 34(3), 2011.
[27] Lei Wang, Lingqiao Liu, and Luping Zhou. A graph-
embedding approach to hierarchical visual word mer-
gence. IEEE Trans. Neural Netw. Learning Syst.,
28(2):308–320, 2017.
[28] Jianxin Wu, Wei-Chian Tan, and James M. Rehg. Ef-
ficient and effective visual codebook generation us-
ing additive kernels. Journal of Machine Learning
Research, 12:3097–3118, 2011.
[29] Shiliang Zhang, Qi Tian, Gang Hua, Qingming
Huang, and Wen Gao. Generating descriptive visual
words and visual phrases for large-scale image ap-
plications. IEEE Transactions on Image Processing,
20(9):2664–2677, 2011.
[30] Shiliang Zhang, Qi Tian, Gang Hua, Wengang Zhou,
Qingming Huang, Houqiang Li, and Wen Gao. Mod-
eling spatial and semantic cues for large-scale near-
duplicated image retrieval. Computer Vision and Im-
age Understanding, 115(3):403–414, 2011.
[31] Y. Zhang, J. Wu, and J. Cai. Compact representation
for image classification: To choose or to compress?
In CVPR 2014, 2014.
348 Informatica 41 (2017) 333–347 L. Chang et al.
 Informatica 41 (2017) 349–362 349 
An Output Instruction Based PLC Source Code Transformation 
Approach For Program Logic Simplification 
Arup Ghosh and Shiming Qin 
Unified Digital Manufacturing Laboratory, Department of Industrial Engineering, Ajou University 
Suwon 443-749, Republic of Korea 
E-mail: arupghosh22.3.89@gmail.com, taihejiang11@ajou.ac.kr 
 
Jooyeoun Lee and Gi-Nam Wang  
Department of Industrial Engineering, Ajou University, Suwon 443-749, Republic of Korea 
E-mail: jooyeoun325@ajou.ac.kr, gnwang@ajou.ac.kr 
 
Keywords: programmable logic controller, reengineering, ladder logic diagram, instruction list, XML 
Received: August 26, 2016 
 
Due to the growing size and complexity of the PLC (Programmable Logic Controller) programs used for 
controlling the industrial processes, there is an increasing need for an approach that can help the users 
to understand the control logics of the PLC programs easily, and can assist them to analyze the 
programming errors effectively. In this paper, we propose an approach that takes the source code file of 
PLC program as the input; and transforms it into a hierarchical-structured XML (extensible markup 
language) file. The XML file format is based on the PLC output instructions and their corresponding 
conditions. It helps the users to identify the actual cause of a programming error quickly. In addition, a 
novel technique is applied that decomposes the PLC program into several smaller and modular sub-logic 
blocks. This makes the control logic simpler and easier to follow. An additional software application has 
also been developed for state-based graphical visualization of the XML file. 
Povzetek: Prispevek opisuje metodo za poenostavitev PLC programov za industrijske procese.
1 Introduction 
The PLCs are a special type of computers that are used for 
automation of the industrial processes. A PLC controls an 
industrial process according to the control program 
embedded in its controller. In each execution of the PLC 
program, it takes the sensor signals as the inputs and 
produces a set of output control signals to the actuators. 
So, the program outputs and their corresponding 
conditions (which must be satisfied in order to receive that 
particular output) are the basis of a PLC program. The 
PLCs can be programmed by using several programming 
languages under the international standard IEC 61131-3, 
such as Ladder Logic Diagram (LLD), Function Block 
Diagram (FBD), Structured Text (ST), Instruction List 
(IL) etc. [1]. Among these languages, the LLD is the most 
popular PLC programming language in industries; and the 
IL is the most commonly used PLC programming 
language in Europe [1]–[3]. On various occasions, the 
programmers use a combination of these languages to 
write a PLC program. With the growing size and 
complexity of the PLC programs, it becomes more 
difficult to understand the program logics because of the 
low-level PLC programming languages. Moreover, if an 
error is detected in any PLC output, then the programmers 
have to analyze the complete program manually to find out 
the conditions that can cause such an error. It is very 
complicated and time-consuming job for a programmer to 
determine all the conditions that can affect a particular 
output. The situation becomes more critical when many 
programmers work together to develop the project. 
Moreover, if the routines of the PLC program are written 
in different languages, then understanding the control 
logics and/or determining the conditions associated with a 
program output become even harder. 
Our main aim is to transform the PLC program source 
code into a programming language and vendor 
independent XML file format that can help the users to 
understand the program logic easily; and can assist the 
programmers to analyze the programming errors quickly. 
In this paper, we present a PLC program source code 
reengineering approach, called Program Output based 
Source-code Transformation (POST) approach that takes 
the source code file of the PLC program saved in the IL 
language as the input, and produces a hierarchical-
structured XML file as the output. In that XML file, the 
program logic is interpreted in terms of the program output 
instructions and their corresponding conditions, thus the 
programmers can analyze the programming errors easily. 
In addition, POST applies a novel technique that 
subdivides the program logic blocks into several smaller 
and modular sub-logic blocks in order to make the 
program logic more simpler, clearer and well-organized. 
POST is applicable to all the programming languages and 
the PLC software vendors where the program source code 
can be saved in the IL language. For example, in case of 
Siemens PLC software [4], the programs written in the 
LLD, IL and FBD languages can automatically be saved 
350 Informatica 41 (2017) 349–362 A. Ghosh et al.  
in the IL language format. We have implemented and 
tested POST for Siemens and Allen-Bradley PLC software 
[5]. It can easily be extended for other types of PLC 
software as well. 
2 Problem Description 
In Figure 1, an example rung of a LLD program (written 
using Siemens Simatic Step 7 software [4]) is given. Each 
rung of a LLD program characterizes a specific rule or a 
set of rules. As can be seen in Figure 1, the rung has two 
output instructions and those outputs are dependent on 
three input instructions (or conditions) i.e., two Normally 
Open (NO) contacts and one Normally Closed (NC) 
contact. The NO and NC contacts actually represent the 
AND and AND-NOT boolean logic operations, 
respectively. These conditions are evaluated at the time of 
program execution in order to determine the data values of 
the output addresses. In practice, a LLD program can have 
thousands of such rungs partitioned into several program 
blocks, such as the Organization Blocks (OBs), Functions 
(FCs), Function Blocks (FBs) etc. It is very time-
consuming and laborious task for the programmers to 
identify the real cause of a programming error. This is 
because, if an error is found in any PLC output signal, then 
the programmers have to examine the complete program 
(i.e., each rung of every program blocks) manually to find 
out the exact conditions that can affect the value of the 
corresponding output address. In that condition candidate 
set, if an erroneous data value is found in the address field 
of an input instruction which is not a direct sensor input, 
then the programmers have to search again for the 
conditions that can affect the value of that address. This 
process continues until the root causes of the error (in 
other words, the faulty sensor inputs and/or the flaws in 
the program logic) are identified. In order to overcome this 
kind of difficulties, an attempt is given to transform the 
source code file of the PLC program into a well-organized 
and well-structured XML file, thus all the conditions 
attached to an output address can be determined 
automatically. This can help the users to fix the 
programming errors very quickly. 
The PLC programs are often written in a combination 
of different languages. The input-output instructions of a 
particular programming language also vary depending on 
the PLC software vendors. So, it is necessary to transform 
the PLC code into a vendor and language independent 
format, thus the users can understand the program 
instructions quickly and easily. An automated approach 
for program logic simplification is another important 
requirement for industries. Often PLC programs are 
written in a very low-level, non-graphical language such 
as the IL language. An IL language code equivalent to the 
LLD rung of Figure 1 is given in Figure 2. As can be seen, 
it is very hard to understand the rung logic from this kind 
of non-graphical PLC programs. Even for the programs 
written in graphical languages such as the LLD, FBD etc., 
it becomes difficult to understand the program logic with 
the growing size and complexity of the rung diagram 
(especially if the rung has several outputs, parallel 
branches and sub-branches). An example of such complex 
LLD rung is presented in Figure 3. It is easy to perceive, 
identifying the conditions or understanding the program 
logic behind a particular output is very difficult from this 
kind of ladder rungs. Therefore, an automated, systematic 
approach is required that can simplify the program logic 
of this kind of complex ladder rungs in an efficient way, 
thus the users can understand the program logic behind a 
 
Figure 1: A rung of a LLD program. 
 
Figure 2: The IL language representation of the LLD rung of Figure 1. 
An Output Instruction Based PLC Source...  Informatica 41 (2017) 349–362 351 
particular output easily. In this work, we have successfully 
addressed those needs. The rest of this paper is organized 
as follows: an overview about the existing works in this 
field is presented in Section 3. In Section 4, we have 
discussed POST approach in details. Section 5 contains 
our conclusive remarks of the work followed by a list of 
relevant references. 
3 Background Study 
Several approaches have been proposed in literatures for 
reengineering the PLC programs. They can broadly be 
classified into the following three categories: 
 Approaches focused on source-to-source translation: 
this type of approaches transform the PLC programs 
written in a particular programming language into 
another programming language. For example, the 
approaches proposed in [6]–[9] transform the LLD 
program into the IL language code. The main 
objective of these research works is to convert the 
PLC program into the IL language code thus it can be 
executed directly by the PLCs. In [10], a different 
type of approach was proposed that converts the LLD 
program into the ST language code. The main aim of 
this research work is to promote a particular type of 
technology and hardware.  
 Approaches focused on vendor interoperability: these 
research works propose an approach that can 
accomplish transferring the program source code 
among different vendors of PLC programming tools. 
In these works, the interoperability between the PLC 
programming tools is achieved by means of a 
middleware. In most of the cases, the XML 
technologies have been used for developing the 
interoperability middleware. Examples of such works 
include: [11]–[15]. 
 Approaches focused on alternative visualization: this 
type of approaches transforms the PLC program 
source code file into another file format for more 
efficient graphical visualization. For example, in 
earlier works [3] and [16], an approach was proposed 
that transforms the PLC program into a vendor and 
platform independent XML file format. In [17] and 
[18], an approach was proposed that transforms the 
PLC programs into the Finite State Machines (FSMs). 
In another article [19], the UML (Unified Modelling 
Language) state diagrams are used in place of FSMs 
for more efficient graphical visualization of the rung 
logic. 
 
Figure 3: An example complex ladder rung. 
352 Informatica 41 (2017) 349–362 A. Ghosh et al.  
Unfortunately, all the mentioned approaches are focused 
on an efficient graphical representation of the PLC 
program and/or the vendor and platform interoperability. 
None of these approaches fulfils all the requirements 
stated in Section 1 and Section 2, and hence, a completely 
different type of approach is needed. Our proposed 
approach POST can solve all those needs effectively. 
4 POST approach 
This section is divided into five subsections. In Subsection 
4.1, the overall structure of the output XML file (produced 
by POST approach) is given and in Subsection 4.2, the 
program logic simplification procedure of POST is 
discussed. The program error analysis procedure is 
presented in Subsection 4.3. In Subsection 4.4, the 
implementation details of POST approach is discussed and 
in Subsection 4.5, the output XML file format for a special 
instruction i.e., the block call instruction is given. In this 
paper, we discuss POST approach using the ladder rung 
diagrams of Siemens PLC programs (just for 
exemplification purpose). 
4.1 The Overall XML file structure 
POST takes the source code file of a PLC program saved 
in the IL language as the input and produces a well-
structured and well-organized XML file. It gives an 
efficient tree-based representation of the program logic to 
the users. The XML file structure outputted by POST is 
based on the output instructions and their corresponding 
conditions of each rung of the PLC program. The overall 
structure of the output XML file is given in Figure 4 (some 
XML nodes are not expanded in order to maintain the 
clarity of the image). As we can see, under the root node 
i.e. Program node, the Routine nodes are defined. A 
routine actually refers to a block of the PLC program. The 
Type attribute of the Routine nodes specifies the type of 
that routine i.e., OB or FB or FC etc. In a LLD program, 
the ladder rungs are always declared inside a routine and 
hence, under the Routine node, the Rung nodes are 
defined. The Number attribute represents the 
corresponding rung number in the routine. As can be seen 
in Figure 4, under the Rung node, the Output nodes are 
characterized. Each Output node basically represents a 
separate output of the corresponding rung. The Type 
attribute of the Output nodes refers to the type of that 
output instruction such as the Output Coil, Convert BCD 
to Integer (CBI), Move, Set or Reset Coil instruction etc. 
(see for instance: [20] and [21]). The Move and the CBI 
type instructions have the following two additional 
attributes: i) Source_Address or Source_Value attribute: 
represents the address or the value specified at the IN 
input; and ii) Target_Address attribute: represents the 
address specified at the OUT output (see Figure 3 and 
Figure 4). Similarly, the Output Coil type instructions 
have one additional attribute i.e., the Address attribute 
which characterizes the output address of the 
corresponding instruction (the same is also true for the Set 
and Reset Coil instructions). The additional attributes 
associated with an instruction (or an Output node) actually 
represent the addresses and the data values associated with 
that instruction. As can be seen in Figure 4, POST 
determines the number of additional attributes and their 
names (or formats) based on that particular type of 
instruction (also see [20] and [21]). 
In our original PLC program, the ladder rung of 
Figure 3 is actually the second rung of the function block 
FB 421. In Figure 4, we can find the Output nodes 
corresponding to the ladder rung of Figure 3 under the 
rung number 2 node. As can be seen in Figure 3, the rung 
consists of ten output instructions and hence, ten Output 
nodes are created under the rung number 2 node in the 
XML file of Figure 4. Actually, in the output XML file, a 
rung diagram is characterized on the basis of its output 
instructions and hence, under each Output node, we can 
find its corresponding conditions. For example, as can be 
 
Figure 4: The output XML file format. 
An Output Instruction Based PLC Source...  Informatica 41 (2017) 349–362 353 
seen in Figure 3, the first Move type output instruction 
(see the first branch) has only one corresponding condition 
(the condition type AND with address argument M1.5). 
As can be seen from Figure 4, this condition is correctly 
placed right below the corresponding Output node in the 
XML file. It is easy to perceive from the format of the 
output XML file, it follows exactly the same logic 
structure as in the original PLC program. However, if any 
rung has more than one output instruction, then the rung 
logic is split based on the corresponding output 
instructions. This is the first logic simplification measure 
taken by POST (this also simplifies the programming error 
analysis task – we will discuss on it later). In addition, as 
can be seen in Figure 4, the instructions and their 
corresponding properties are described by using simple 
descriptive language. This makes the program logic easy 
understandable, and programming language and platform 
independent. Please note that POST can also produce the 
output XML file based on the symbolic names (see Figure 
1). 
4.2 Program Logic Simplification by 
utilizing the Local Memory Definition 
The program output based source code transformation 
method simplifies the rung structure to a great extent. 
However, further logic simplification measures are needed 
to be taken particularly for the rungs with a large number 
of parallel and high depth sub-branches. The parallel 
branches and sub-branches of a LLD rung represent the 
OR boolean logic operations. This means that the Result 
of Logic Operation (RLO) of the parallel branches 
(respectively, sub-branches) is true, if RLO of any of those 
branch (respectively, sub-branch) is true [20, 21]. POST 
simplifies the logic of a rung diagram with parallel 
branches and sub-branches by using a bottom-up 
hierarchical decomposition procedure. More specifically, 
it characterizes a certain portion of the complete rung 
diagram (a sub-logic block) by utilizing the Local 
Memory Definition (LMD) [a local memory can be 
thought of as a virtual memory location where the RLO of 
its corresponding sub-logic block is stored]. The LMDs 
are then used successively (reused in a modular fashion) 
to define the complete rung logic. In Figure 5, the relevant 
part of the rung diagram of Figure 3 that depicts only the 
conditions associated with the first Output Coil instruction 
(address Q215.0) is given. As can be seen from Figure 5, 
even after the above mentioned logic simplification step 
(as stated in Subsection 4.1), the rung diagram has several 
parallel branches and sub-branches. For this reason, the 
program logic behind the output is difficult to follow and 
hence, is needed to be simplified further. 
 
Figure 5: The rung diagram (or rung logic) associated with the first Output Coil instruction (address Q215.0). 
354 Informatica 41 (2017) 349–362 A. Ghosh et al.  
From our practical experience, we have seen that the 
conventional procedure to understand this kind of 
complicated rung structure (as in Figure 5) is to analyze 
the rung diagram starting from its highest depth sub-
branches, and then consecutively proceeding towards the 
main branch and its parallel branches (in a bottom-up 
fashion). The LMD based logic simplification procedure 
of POST exactly follows this natural bottom-up modular 
decomposition approach. As we can see from Figure 5, the 
(relatively straightforward) rung logic corresponding to 
the highest depth sub-branches (branches inside the red 
colour box) will be defined by using the local memory 
LM0. Similarly, the rung logic corresponding to the next 
highest depth sub-branches (branches inside the green 
colour box) will be characterized by using the local 
memory LM1. It is easy to perceive, the definition of local 
memory LM0 can successively be utilized in the definition 
of local memory LM1. As can be seen from Figure 5, this 
LMD formulation procedure will be repeated in a bottom-
up fashion until all the parallel branches and sub-branches 
are characterized by using the LMDs. In Figure 5, the 
boxes and its associated local memory names represent 
how the rung structure can further be simplified by using 
the LMDs [The LMDs are restricted to maximum three 
parallel branches. As an example, see the branches inside 
the blue colour box. We will discuss more on it later.]. For 
simplicity, we can suppose that the RLO of the parallel 
sub-branches of a branch (respectively, the parallel 
branches of the main branch) is stored in a virtual memory 
location of the type local memory, and is used 
successively to evaluate the RLO of that branch 
(respectively, the main branch) by applying an AND 
boolean logic operation. 
The output XML file shown in Figure 6 depicts the 
condition set (or the rung logic) corresponding to the first 
Output Coil instruction of Figure 3 (also see the simplified 
rung diagram of Figure 5). As can be seen in Figure 6 (a), 
the rung logic or the rung diagram associated with the first 
Output Coil instruction is characterized based on the 
definition of local memory LM4. The definition of local 
memory LM0, LM1, LM2, LM3 and LM4 are shown 
 
Figure 6: XML file format for the rung diagrams with parallel branches and sub-branches. (a) The condition set 
corresponding to the first Output Coil instruction. (b) Local memory LM0 definition. (c) Local memory LM1 
definition. (d) Local memory LM2 definition. (e) Local memory LM3 definition. (f) Local memory LM4 definition. 
An Output Instruction Based PLC Source...  Informatica 41 (2017) 349–362 355 
separately in Figure 6 (b), Figure 6 (c), Figure 6 (d), Figure 
6 (e) and Figure 6 (f), respectively. As can be seen in those 
figures, under the Local-Memory node, the definition (or 
the rung logic) of the corresponding local memory is 
given. The Address attribute of the Local-Memory node 
represents the virtual address (or name) of the local 
memory. It is easy to see from Figure 6, the definitions of 
the local memories characterize the rung logic of exactly 
the same branches (or the sub-logic blocks) as depicted in 
Figure 5. For example, the definition of local memory 
LM0 (presented in Figure 6 (b)) covers the rung logic of 
the highest depth parallel sub-branches of the rung 
diagram of Figure 5 (see the red colour box). As can be 
seen in Figure 6 (b), the condition set of each parallel 
branches are presented under a separate Option node. The 
Option nodes basically represent the OR boolean logic 
operations. So, if the condition set under any Option nodes 
associated with a particular local memory is true, then the 
RLO value stored in that local memory is also true. In the 
same way, the local memory LM1 is defined (shown in 
Figure 6 (c)). Please note that the definition of local 
memory LM1 is characterized by using the definition of 
 
Figure 7: The state-based graphical representation of the rung logic corresponding to the first Output Coil instruction. 
356 Informatica 41 (2017) 349–362 A. Ghosh et al.  
local memory LM0 (in a modular fashion). The RLO value 
saved in the local memory LM1 can easily be determined 
by performing an AND boolean logic operation between 
the RLO value saved in the local memory LM0 and the 
RLO value of the other conditions (for simplicity, we can 
assume that the Local-Memory type attribute represents 
the AND boolean logic operation). It is easy to realize, this 
bottom-up hierarchical logic decomposition procedure 
provides an easy, systematic, step-by-step interpretation 
of the rung logic to the users. As can be seen in Figure 6 
(a), the overall rung logic corresponding to the first Output 
Coil instruction is characterized by using only a very few 
conditions (the same is also true for the LMDs – see Figure 
5 and Figure 6). This indeed makes the program logic 
behind an output easier to follow. 
The rung diagram connected with a particular output 
can have many same-depth parallel branches and sub-
branches. It is easy to perceive, if a large number of 
parallel branches or sub-branches are characterized by 
using a single local memory, then it can generate a very 
complex local memory definition. In order to avoid such 
issues, POST allows the users to explicitly bind the 
complexity of the LMDs through restricting (or setting) 
the number of parallel branches that the definition can 
cover. For example, as can be seen from Figure 5 and 
Figure 6, the LMDs are always restricted to maximum 
three parallel branches. We have also developed a 
software interface module (with the help of Graphviz 
Software [22]) that can provide an efficient graphical 
representation of the rung logic. As an example, in Figure 
7, the graphical representation of the rung logic associated 
with the first Output Coil instruction is shown. It is 
actually the state-based graphical representation of the 
XML file presented in Figure 6. A state or a node in the 
graph of Figure 7 actually represents a particular condition 
(in other words, indicates a program instruction and the 
associated addresses or data values). Please note that the 
RLO values corresponding to the state nodes of a 
particular path are needed to be ANDed in order to get the 
resultant RLO value of that path; and the RLO values of 
the different paths are needed to be ORed in order to 
determine the resultant RLO value of the corresponding 
sub-logic block (see Figure 5 and Figure 7). 
4.3 Program Error Analysis Procedure 
The program output instructions and their corresponding 
conditions based output XML file format not only 
provides an efficient program logic interpretation, but also 
makes it possible for a software module to accumulate all 
the conditions corresponding to an output automatically. 
If an incorrect data value is found in any output address, 
then the user has to pass that address (or any other output 
address) as the query input value to the condition search 
engine of POST. The condition search engine of POST 
analyzes the output address attribute values of all the 
Output nodes of the above stated XML file (as shown in 
Figure 4 and Figure 6), and generates a query output XML 
file that contains all the conditions (i.e., the program 
instructions and the associated addresses or data values) 
that can directly affect the value stored at that particular 
input address. Recall that the output address attribute of 
the Output nodes refers to the attribute that denotes the 
output address of the corresponding program instruction. 
For example, the Target_Address attribute is the output 
address attribute of the Move type instructions (see Figure 
3 and Figure 4). 
For the convenience of the readers, an example query 
output XML file is presented in Figure 8. The condition 
search engine of POST produced that XML file for the 
query input address MD3650 (see Figure 3 and Figure 4). 
As we can see, under the root node i.e. Query-Output 
node, the Routine and the Rung nodes are defined. It helps 
the users to identify the routine and the rung in which the 
output instruction is declared. The Output nodes and their 
corresponding Local-Memory and Condition nodes help 
the users to explicitly determine the output instructions 
and the conditions that can affect the value stored in that 
output address. As can be seen in Figure 8, two Move type 
output instructions (and their corresponding condition 
sets) can directly alter the value stored in the query input 
address MD3650. The first Move instruction is located in 
the second rung of the function block FB421 (shown in 
Figure 3 – see the second branch of the rung diagram), and 
the second Move instruction is located in the tenth rung of 
the function block FB423 (not illustrated for the space 
 
Figure 8: The format of the query output XML file (retrieved as a result of the query input: address MD3650). 
An Output Instruction Based PLC Source...  Informatica 41 (2017) 349–362 357 
reasons). The users then have to thoroughly inspect these 
condition sets to find out: i) if any sensor has failed or is 
transmitting an inaccurate reading; and ii) if there is any 
logical or conceptual flaw in the rung diagram (i.e., the 
output instructions and their corresponding conditions). If 
not then the users have to search again for the conditions 
corresponding to the output address (or addresses) from 
where an erroneous value is obtained as the input (please 
note that this address must have to the output address of 
an instruction that belongs to the above stated condition 
sets). This process continues until the actual cause of the 
error is detected (as mentioned in Section 2). It is easy to 
realize, POST makes this entire condition search process 
automatic and oversight easy; and hence, the program 
error analysis process becomes simple and fast. Please 
note that this becomes possible only because of the 
program output instructions and their corresponding 
conditions based source code transformation approach of 
POST. 
4.4 Implementation Details of POST 
Approach 
We have implemented POST approach in C++ language 
and tested it on the program source codes of Siemens and 
Allen-Bradley PLC software. POST takes the PLC 
program source code file saved in the IL language format 
as the input, and converts it into the above stated XML file 
format (as discussed in Subsection 4.1 and Subsection 
4.2). Whenever the starting tag of a routine (respectively, 
rung) is encountered in the program source code file, 
POST enters the name and type (respectively, number) 
information of that routine (respectively, rung) into the 
output XML file, following the same format as shown in 
Figure 4. Similarly, when the ending tag is detected, it 
 
Figure 9: A simple example ladder rung diagram. 
 
Figure 10: Step 1 – Identifying the program output instructions and their corresponding condition sets. 
358 Informatica 41 (2017) 349–362 A. Ghosh et al.  
closes the corresponding XML file node. The rung logic 
defined inside the starting and ending tag of a rung is 
copied into the computer memory for further processing. 
The rung logic (or IL code) to XML file conversion is a 
three-phase procedure. We discuss this three-phase 
procedure with the help of a simple ladder rung given in 
Figure 9. 
The first phase of the above stated three-phase source 
code transformation process is illustrated in Figure 10. As 
we can see, a string array, named Instruction Set array is 
used to store the source code of the ladder rung presented 
in Figure 9. In the first phase, as can be seen in Figure 10, 
the output instructions and their corresponding condition 
sets are determined. In order to accomplish this, at first, 
the program instructions stored in the Instruction Set array 
are converted into the specific instruction name format of 
POST (as discussed in Subsection 4.1 and Subsection 4.2). 
As can be seen in Figure 10, the IL language instructions 
are converted into the descriptive language instructions 
and are stored in a string array, called the Modified 
 
Figure 11: Step 2 – Formulating the local memory definitions. 
 
Figure 12: Step 3 – Transforming the results into the output XML file format. 
An Output Instruction Based PLC Source...  Informatica 41 (2017) 349–362 359 
Instruction Set array. In case of the source code of 
Siemens PLC software, the complex (or compound) 
instructions are decomposed into the core or basic 
instructions. For example, a Move type instruction is 
decomposed into the L, T, (optional) JNB instructions etc. 
[20, 21] (also see Figure 2). For this reason, POST has to 
inspect whether a set of core instructions in the Instruction 
Set array is actually equivalent to any such complex 
program instruction or not. If so, then POST replaces that 
instruction set with its corresponding descriptive language 
instruction and stores it in the Modified Instruction Set 
array (see Figure 10). The same measure is also taken for 
the instructions that have multiple inputs and outputs 
(such as the block call instructions – we will discuss more 
on it later). This sub-phase is skipped for the source code 
of Allen-Bradley PLC software. This is because, in that 
case, the complex instructions are not broken down into 
the core or basic instructions. 
In the next sub-phase of the first phase, the outputs 
and their corresponding condition sets are formulated 
from the Modified Instruction Set array. As we can see in 
Figure 10, POST first identifies all the output instructions 
in the Modified Instruction Set array, and then copies them 
into another array, named the Output Array. The condition 
set corresponding to each output is stored in a separate 
array (by using a pointer), named the Condition Array. As 
can be seen in Figure 10, all the instructions (except the 
output instructions) prior to an output instruction form the 
condition set corresponding to that output instruction, and 
are stored in the corresponding Condition Array. 
However, this axiom is not correct for the output 
instructions that are declared in a parallel branch or a sub-
branch. For example, as can be seen in Figure 9, the Move 
output instruction is declared in a parallel branch of the 
main branch. For this reason, the conditions declared in its 
previous branches that appear prior to it in the Modified 
Instruction Set array, cannot be considered as its 
corresponding conditions. As can be seen from Figure 10, 
in this case, we get an inequality in the number of open 
and close parentheses (three open and one close 
parentheses). So, all the conditions up to the second open 
parenthesis are needed to be eliminated in order to get the 
equality in the number of open and close parentheses (in 
other words, in order to obtain the actual condition set). 
As can be seen in Figure 10, after performing this 
elimination operation, the condition set corresponding to 
an output is stored in its corresponding Condition Array 
for further processing. 
In the second phase, POST further simplifies the 
program logic stored in each Condition Array by 
formulating the LMDs following the same procedure as 
discussed in Subsection 4.2. This phase in details is 
illustrated in Figure 11 (for the interest of space, the LMD 
formulation procedure is shown only for the Output Coil 
instruction). If the Condition Array does not hold any OR 
instruction, then this phase is skipped. As can be seen in 
Figure 11, POST first identifies the condition set 
associated with the highest depth parallel branches present 
in the Condition Array (the branch depth can easily be 
calculated from the number of open and close 
parentheses); and then replaces it with a new local 
memory definition. Recall from Subsection 4.2 that the 
RLO value stored in the local memory address associated 
with a LMD is needed to be ANDed with the RLO value 
of the other conditions present in the Condition Array in 
order to get the resultant RLO value of the Condition 
Array. It is easy to see, the exact same principle is 
followed in Figure 11. As can be seen in Figure 11, the 
local memory addresses are saved in a separate array, 
called the Local Memory Array and in the second 
dimension of that array, a pointer to another array, called 
the LMD Array is stored. In the LMD Array, the definition 
(or the condition set) of its corresponding local memory is 
stored. This process is repeated until all the parallel 
branches of the main branch are defined by using the local 
memories (in other words, until all the OR instructions are 
eliminated from the Condition Array – see Subsection 
4.2). Recall that in POST, the LMDs can be restricted to a 
limited number of parallel branches. If we set that number 
to two (implies that at most one OR instruction can be 
present in a particular LMD Array), then only the 
conditions inside the red colour arrows (see Figure 11) are 
characterized by using the local memory LM0 (and so on). 
In the third phase of the source code transformation 
process, the outputs and their corresponding condition sets 
are mapped into the output XML file. This phase is 
depicted in Figure 12. As can be seen, the condition sets 
associated with the local memories and the outputs are 
respectively written into the XML file following the same 
format as discussed throughout Subsection 4.1 and 
Subsection 4.2. In the output XML file, the OR 
instructions in the LMD Arrays are represented by using 
the Option nodes (as stated earlier). However, as can be 
seen in Figure 12, an OR (respectively, OR-NOT) 
instruction with an address argument is characterized by 
using the Condition node that has the Type attribute value 
AND (respectively, AND-NOT) and is defined under a 
separate Option node (this type of mapping is shown by 
using the red colour arrows). This is because, in the 
program source code of Siemens PLC software, an OR 
(respectively, OR-NOT) instruction with an address 
argument actually represents a parallel branch where only 
one AND (respectively, AND-NOT) instruction is 
declared (see Figure 9 and Figure 12). In the output XML 
file, POST always keeps the exact same logic structure as 
in the original PLC program (as stated earlier). Also note 
that in the output XML file, a Local-Memory type 
condition basically indicates an AND boolean logic 
operation and hence, the corresponding RLO value is 
needed to be ANDed with the RLO value of the other 
conditions in order to determine the resultant RLO value 
of the corresponding sub-logic block. 
In the above, we have discussed the implementation 
details of POST based on Siemens and Allen-Bradley PLC 
software. However, the proposed three-phase procedure is 
logically applicable to the source code of other PLC 
software as well (a little modification may be needed 
based on that particular software). 
360 Informatica 41 (2017) 349–362 A. Ghosh et al.  
4.5 Dealing with the Block Call 
Instructions 
In this subsection, we discuss the output XML file format 
for a special instruction i.e., the block call instruction. The 
block call instructions are used to call (or invoke) the 
program blocks, such as the FCs, FBs, System FCs 
(SFCs), System FBs (SFBs) etc. [20, 21, 23]. Unlike other 
instructions, the block call instructions have two types of 
parameters, namely the formal and the actual parameter. 
In addition, a program block can have multiple inputs and 
outputs. An example of such program block call is 
presented in Figure 13 (a) [the SFC20 block is used to 
copy the contents of a memory area given at input 
SRCBLK to another memory area given at output 
DSTBLK. If an error occurs, the returned error code is 
stored at the address given at output RET_VAL.]. As can 
be seen in Figure 13 (a), the SFC20 block has only one 
input and two outputs. However, some program blocks 
can have dozens of inputs and outputs. If POST 
characterizes all the inputs and outputs associated with a 
program block inside the Output node, it may generate a 
very complex XML node structure. In addition, POST has 
to define both the formal and actual parameters inside the 
Output node. In order to overcome this type of issues, 
under the Output node, an additional XML file node, 
called the Parameters node is created, inside which all the 
information related to the parameters of a program block 
is put together. In Figure 13 (b), the output XML file 
 
Figure 13: The output XML file format for the block call instructions. (a) The calling or invocation of the System 
Function SFC20. (b) The output XML file format for SFC20 block. 
 
Figure 14: An example of a block call instruction with an input parameter that is defined based on a set of conditions. 
(a) The calling or invocation of the System Function Block SFB4. (b) The output XML file format for SFB4 block. 
An Output Instruction Based PLC Source...  Informatica 41 (2017) 349–362 361 
format for SFC20 block is shown. As we can see, only the 
name and type information of the block call instruction is 
specified inside the Output node. The corresponding 
condition set is defined under the Output node following 
the same way as done before. As can be seen in Figure 13 
(b), under Parameters node, all the parameters of SFC20 
block are characterized. The Block_Input and the 
Block_Output nodes are created to define the inputs and 
the outputs of the program block. The Formal and the 
Actual attributes represent the formal and the actual 
parameters, respectively (the Actual attribute of the 
Block_Output node is the output address attribute of the 
block call instruction – also see Subsection 4.3). Please 
note that inside the Block_Input and the Block_Output 
nodes, an additional attribute i.e., the Comment attribute 
is incorporated in order to define the objectives of the 
parameters. However, the Comment attribute is an 
optional attribute and is generated only for the system 
library blocks (since the objectives of the parameters are 
known in advance). 
Another example of a program block call (a SFB4 
block call) and its corresponding XML file format are 
shown in Figure 14 (a) and Figure 14 (b), respectively. As 
can be seen in Figure 14 (b), the output XML file follows 
exactly the same structure as discussed above. In the 
Output node, an additional attribute i.e., the 
Instance_Data_Block_Address attribute is included thus 
the address of the corresponding instance data block can 
be incorporated (for more information, see [21] and [23]). 
However, this change is instruction specific (recall that 
POST determines the format of the XML node according 
to the functional specification of the corresponding 
instruction). A distinctive feature of this SFB4 block is 
that it has an input i.e., the input IN which is defined on 
the basis of a set of conditions (see Figure 14 (a)). 
Actually, at the time of program execution, the RLO value 
of the condition set is passed as the input IN value to the 
SFB4 block. If we define all these conditions inside the 
corresponding Block_Input node, then it generates a very 
complex XML node structure. In order to avoid this sort 
of problems, POST utilizes the concept of the local 
memory definitions. The local memory definitions are 
used to characterize the inputs that are defined on the basis 
of multiple conditions (following the same way as 
discussed in Subsection 4.2). As can be seen from Figure 
14 (a) and Figure 14 (b), the complete condition set 
corresponding to the actual parameter of input IN is 
defined by using the local memory LM1. Please note that 
the parallel branches of the main branch are characterized 
by using the local memory LM0 following exactly the 
same procedure as described in Subsection 4.2. 
It is easy to perceive from the above discussions: 
 the output XML file format is designed very 
carefully in such a way that the condition search 
engine of POST can accumulate all the conditions 
associated with a program output automatically 
and in a straightforward way 
 the rung logic associated with a program output 
is simplified further whenever it gets complicated 
(in other words, whenever the number of parallel 
branches and sub-branches exceeds a certain 
limit) 
 each type of node in the output XML file is 
designed keeping in mind the objective and the 
functional specification of the corresponding 
instruction 
These features of the output XML file indeed make the 
programming error analysis task simple, fast and oversight 
easy (because, there is no need to inspect each rung of 
every program blocks manually). In addition, the above 
discussed XML file format provides an easy, systematic 
and step-by-step interpretation of the program logic to the 
users which makes the error analysis task even more 
simpler. 
5 Conclusion 
This work is motivated by the need of an approach that 
can help the users to understand the PLC programs easily, 
and can assist them to analyze the programming errors in 
an efficient manner. In this paper, we have proposed a new 
approach, called POST that can satisfy all the mentioned 
needs effectively. POST takes the PLC program source 
code file as the input, and converts it into a program output 
instruction and its corresponding conditions based well-
structured XML file. In the XML file, the rung logic 
corresponding to an output is further simplified by using a 
novel local memory based technique, and is presented in a 
programming language and platform independent format. 
The proposed XML file format provides a systematic and 
step-by-step interpretation (in a bottom-up fashion) of the 
program logic to the users. In addition, the XML file 
format is designed in such a way that the condition search 
engine of POST can accumulate all the conditions that can 
affect the value stored at a given output address 
automatically. These features of POST indeed help the 
users to identify the actual cause of a programming error 
quickly and reliably. A software interface module has also 
been developed in order to provide an efficient state-based 
graphical representation of the rung logic to the users. 
Acknowledgement 
The authors wish to thank UDMTEK Co., Ltd. for 
supplying all the required software, tools and PLC 
programs used in this research work. This work was 
supported in part by the Ministry of Trade, Industry and 
Energy (MOTIE), Republic of Korea, under Grant 
10051146 and 10065737; in the part by the Small and 
Medium Business Administration, Republic of Korea, 
under Grant S2408982; and in part by the MOTIE and the 
Korea Institute for Advancement of Technology, Republic 
of Korea, under Grant N0001083. 
References 
[1] Liu, J. & Darabi. H. (2002). Ladder Logic 
Implementation of Ramadge-Wonham Supervisory 
Controller. Proceedings of the 6th International 
Workshop on Discrete Event Systems (WODES’02), 
Zaragoza, Spain, pp. 383–389. 
362 Informatica 41 (2017) 349–362 A. Ghosh et al.  
[2] Du, D., Liu, Y., Guo, X., Yamazaki, K., & 
Fujishima, M. (2009). Study on LD-VHDL 
conversion for FPGA-based PLC implementation. 
The International Journal of Advanced 
Manufacturing Technology, vol. 40, no. 11-12, pp. 
1181–1190. 
[3] Bani Younis, M. & Frey, G. (2004). Visualization of 
PLC Programs Using XML. Proceedings of the 
American Control Conference (ACC’04), Boston, 
USA, pp. 3082–3087. 
[4] Siemens Simatic Step S7 Software. Website: 
http://w3.siemens.com/mcms/simatic-controller-
software/en/step7/pages/default.aspx, last retrieved 
on 6th July, 2017. 
[5] Allen-Bradley RSLogix Software. Website: 
http://www.rockwellautomation.com/rockwellsoftw
are/products/rslogix.page, last retrieved on 6th July, 
2017. 
[6] Fen, G. & Ning, W. (2006). A Transformation 
Algorithm of Ladder Diagram into Instruction List 
Based on AOV Digraph and Binary Tree. 
Proceedings of the IEEE Region 10 International 
Conference (TENCON’06), Hong Kong, China, pp. 
1–4. 
[7] Hu, F., Fu, L., Liu, L., & Zhang, G. (2008). An 
Algorithm about Transforming PLC Ladder 
Diagram to Instruction List Based on Series-Parallel 
Merging Method. Proceedings of the Pacific-Asia 
Workshop on Computational Intelligence and 
Industrial Application (PACIIA’08), Wuhan, China, 
pp. 812–816. 
[8] Tan, A. & Ju, C. (2011). The Application of Maze 
algorithm in Translating Ladder Diagram into 
Instruction Lists of Programmable Logical 
Controller. Procedia Engineering, vol. 15, no. 1, pp. 
264–268. 
[9] Yan, Y. & Zhang, H. (2010). Compiling ladder 
diagram into instruction list to comply with IEC 
61131-3. Computers in Industry, vol. 61, no. 5, pp. 
448–462. 
[10] Huang, L., Liu, W., & Liu, Z. (2009). Algorithm of 
transformation from PLC ladder diagram to 
structured text. Proceedings of the 9th International 
Conference on Electronic Measurement & 
Instruments (ICEMI’09), Beijing, China, pp. 4-778–
4-782. 
[11] Estevez, E., Marcos, M., Iriondo, N., & Orive, D. 
(2007). Graphical Modelling of PLC-based 
Industrial Control Applications. Proceedings of the 
26th American Control Conference (ACC’07), New 
York City, USA, pp. 220–225. 
[12] Estevez, E., Marcos, M., Orive, D., Irisarri, E., & 
Lopez, F. (2007). XML based Visualization of the 
IEC 61131-3 Graphical Languages. Proceedings of 
the 5th IEEE International Conference on Industrial 
Informatics (INDIN’07), Vienna, Austria, pp. 279–
284. 
[13] Estevez, E., Marcos, M., Irisarri, E., Lopez, F., 
Sarachaga, I., & Burgos, A. (2008). A novel 
Approach to attain the true reusability of the code 
between different PLC programming Tools. 
Proceedings of the 7th IEEE International Workshop 
on Factory Communication Systems (WFCS’08), 
Dresden, Germany, pp. 315–322. 
[14] Estevez, E., Marcos, M., Orive, D., Lopez, F., 
Irisarri, E., & Perez, F. (2008). Middleware based on 
XML technologies for achieving true 
interoperability between PLC programming tools. 
Proceedings of the 17th World Congress of the 
International Federation of Automatic Control 
(IFAC’08), Seoul, Republic of Korea, pp. 8461–
8466. 
[15] Marcos, M., Estevez, E., Perez, F., & Der Wal, E. 
(2009). XML exchange of control programs. IEEE 
Industrial Electronics Magazine, vol. 3, no. 4, pp. 
32–35. 
[16] Lopez, F., Irisarri, E., Estevez, E., & Marcos, M. 
(2008). Graphical representation of factory 
automation Markup Languages. Proceedings of the 
13th IEEE International Conference on Emerging 
Technologies and Factory Automation (ETFA’08), 
Hamburg, Germany, pp. 29–32. 
[17] Frey, G. & Bani Younis, M. (2004). A Re-
Engineering Approach for PLC Programs using 
Finite Automata and UML. Proceedings of the IEEE 
International Conference on Information Reuse and 
Integration (IRI’04), Las Vegas, USA, pp. 24–29. 
[18] Bani Younis, M. & Frey, G. (2005). Formalization 
and Visualization of Non-binary PLC Programs. 
Proceedings of the 44th IEEE Conference on 
Decision and Control and European Control 
Conference (CDC-ECC’05), Seville, Spain, pp. 
8367–8372. 
[19] Bani Younis, M. & Frey, G. (2006). UML-based 
approach for the re-engineering of PLC programs. 
Proceedings of the 32nd Annual Conference on 
IEEE Industrial Electronics (IECON’06), Paris, 
France, pp. 3691–3696. 
[20] Siemens Simatic Ladder Logic (LAD) for S7-300 
and S7-400 Programming (Reference Manual). 
URL: 
https://cache.industry.siemens.com/dl/files/822/455
23822/att_82001/v1/s7kop__b.pdf, last retrieved on 
6th July, 2017. 
[21] Siemens Simatic Statement List (STL) for S7-300 
and S7-400 Programming (Reference Manual). 
URL: 
https://cache.industry.siemens.com/dl/files/446/455
23446/att_79269/v1/s7awl__b.pdf, last retrieved on 
6th July, 2017. 
[22] Graphviz Software. Website: 
http://www.graphviz.org, last retrieved on 6th July, 
2017. 
[23] Siemens Simatic System Software for S7-300/400 
System and Standard Functions (Volume 1/2, 
Reference Manual). URL: 
https://cache.industry.siemens.com/dl/files/574/121
4574/att_44504/v1/SFC_e.pdf, last retrieved on 6th 
July, 2017. 
 
Informatica 41 (2017) 363–377 363
An Effective Meta-Heuristic Cuckoo Search Algorithm for Test Suite
Optimization
Manju Khari
Department of Computer Engineering,
Ambedkar Institute Of Advanced Communication Technologies and Research, Delhi, India
E-mail: manjukhari@yahoo.co.in
Prabhat Kumar
Department of Computer Engineering,
National Institute of Technology Patna, Bihar, India
E-mail: prabhat@nitp.ac.in
Keywords: optimization, nature inspired algorithm, cuckoo search, test suite, test data
Received: March 12, 2016
Automation testing is the process of generating test data without any human interventions. In recent times,
nature-inspired solutions are planned, tested and validated successfully in many areas for the purpose of
optimization. One such meta heuristic technique is Cuckoo Algorithm (CA) that receives its sole inspi-
ration from the behavior of cuckoo, who has the ability to resolve complex issues using simple initial
conditions and limited knowledge of the search space. This paper presents a cost effective and time effi-
cient algorithm inspired from cuckoo for optimizing the test data. On comparing the proposed algorithm
with existing Firefly Algorithm (FA) and Hill Climbing (HC) algorithms, it was found that CA outperforms
both FA and HC in terms of the test data optimization process. The work done in the current study would
be helpful to testers in generating optimized test data which would result in saving of both testing cost and
time.
Povzetek: V prispevku je razvita izpopolnjena meta-hevristična metoda kukavičjega algoritma (Cockoo
algorithm), ki temelji na reševanju zapletenih problemov z reševanjem več lokalnih preprostih.
1 Introduction
Software testing is done with the intention of finding bugs
and enhancing the quality before delivering it to the client
[1]. As testing is very tedious and time consuming, it is
highly desirable that this cost be controlled as much as pos-
sible [2]. One way to control this cost is to reduce the data
which is used for testing. Test data optimization is a pro-
cess of reducing the test data sets and it can be successfully
applied to black box as well as white box testing [3].
Optimization mainly focuses on reducing the number
of available solutions depending upon certain parameters.
Merely reducing the number of test data does not accom-
plish the purpose of software testing. The test data should
effectively uncover all potential lapses that exist in the soft-
ware or product. For carrying out test data optimization,
mathematical functions are applied that can effectively fil-
ter out the required test data. The methods of finding the
optimal solutions among the various available options can
be regarded as the optimization problem [20]. This prob-
lem aims at maximizing or minimizing a real value function
from the allowed set of solutions.
With the help of optimization, we can filter out the fittest
test data that can easily test various properties related to
a software or product. Generally, test data optimization is
carried out with the help of various meta-heuristic algo-
rithms. They offer better solutions in comparison to the
traditional algorithm which finds a solution based on hit
and trial method. These algorithms have two components,
exploration (intensification) and exploitation (diversifica-
tion). Exploration means to search solution at a global scale
while exploitation aims at finding a good solution in a lo-
cal region. The combination of these two ensures a global
optimal solution which is achievable.
This paper deals with implementation of a meta-heuristic
CA for optimizing test data. In the year 2009, the behavior
of a cuckoo was initially modeled by Yang [25]. Since then
CA has been successfully used for solving various opti-
mization problems [26][27][28][31][33][35][37], but to the
best of author’s knowledge, no study has explored the ap-
plication of CA for test data optimization at such a large
scale. In order to reduce the testing efforts, CA is applied
for test data generation and compared with existing tradi-
tional algorithms such as FA and HC. The algorithms were
applied to 50 JAVA programs. Various parameters, such
as time taken by each algorithm to perform optimization,
were compared using a number of iterations. It was found
that CA outperformed FA and HC in each and every aspect.
364 Informatica 41 (2017) 363–377 M. Khari et al.
The rest of the paper is organized as follows: Section 2
describes the related work that has been done with context
to CA, Section 3 discusses various concepts related to test
data optimization, Section 4 explains the proposed method-
ology, Section 5 describes that has been collected for show-
ing that proposed methodology outperformed FA and HC,
Section 6 provides a detailed case study of the proposal
along with the statistics applied to JAVA programs for all
three algorithms, Section 7 describes the potential threats
and precautions that need to be taken while implementing
this algorithm and finally, Section 8 concludes the study
with a discussion of the future scope.
2 Related work
Developments in the field of ’automated test data genera-
tion’ were initiated in the early 70’s. Articles like ’Test-
ing large software with automated software evaluation sys-
tems’ by Ramamoorthy et al., in the year 1975 [4] and ’Au-
tomatic generation of floating-point test data’ by Miller et
al., in the year 1976 [5], are a few examples of the early
work in this field. Nevertheless, Clarke’s examines [6] of
the year 1976 is considered to be the first of its kind to pro-
pose an algorithm for automated test data generation which
is written in FORTRAN. The automated test-data genera-
tors can be divided into three classes - random, static and
dynamic. Random test data generation is easy to automate,
but problematic [7][9]. Firstly, it produces a sample of the
possible paths through the software under test (SUT). Sec-
ondly, it may be expensive to generate the expected output
data for a large amount of input data produced. Finally,
provided exceptions occur only rarely, the input domain
which causes an exception is likely to be small. Random
test-data generators may never hit on this small area of the
input domain.
During 70’s up to mid-80’s, people did research on test
data generation using symbolic execution. Researchers
have been working on test data generation since 1975, but
unfortunately, there is hardly any fully automated test data
generation working tool available in the current software
industry. At that time the language used for test data gen-
eration was FORTRAN. In the year 1987, Parther [7] pro-
posed a new idea for test data generation called path prefix
method. In the year 1990, Korel [8] provided a revolu-
tionary change by generating test data dynamically based
on actual value using pattern and explanatory search. In
the year 1996, Ferguson [9] examined an assertion oriented
and chaining approach i.e. goal-oriented test data genera-
tion [10]. Test data generation from dynamic data structure
[5] has encouraged many authors to work on hybrid test-
ing methods for detecting infeasible paths, thereby saving
computational time. Mutation testing technique is used to
improve the reliability of object-oriented software, authors
understand the traditional mutation testing method and ap-
ply on object-oriented programs that are known as class
mutation [12]. Xiao et al. [13] proposed a technique for
test data generation even if more or less path predicate is
unsolvable. However, the proposed method could not pro-
vide a good coverage.
A better way can be the combination of evolutionary
methods with dynamic symbolic execution [15][18], which
would wipe away the disadvantages of both the approaches.
Search-based methods have previously been applied for
testing object-oriented software by making use of method
sequences [19][20] or by the help of strongly typed genetic
programming [21][22]. While creating test data for object-
oriented software, since the early work of Tonella [23], the
authors experimented on issues of handling the techniques
that can be used to reverse engineer several design views
from the source code. These techniques aim at generat-
ing test suites to achieve high quality test data using mu-
tation testing [16][46][47][48] deployed in a search-based
test generation environment. Lakhotia et al. [24] estab-
lished a search based multi-objective approach in which a
random search, Pareto GA and weighted GA algorithm is
used for branch coverage. Various algorithms were also
proposed to optimize and prioritize the test suite. Evolu-
tionary algorithms (like CA) were among them. Amir et
al. used cuckoo search for solving structural optimization
tasks. It has been used in non-linear constrained optimiza-
tion problems [44].
In the year 2010, Yang et al. [25] examined CA to re-
solve the issue of engineering design optimization. The
results that were obtained proved to be better than particle
swarm optimizer. In the year 2011, Rajabioun [26] ana-
lyzed CA to solve nonlinear optimization issues which are
used to solve difficult problems. Chandrasekaran et al. in
the year 2011 [27], substantiated a hybrid CA integrated
with fuzzy logic for solving multi-objective unit commit-
ment problems. Yildiz [28] used this algorithm select op-
timal machining parameters for mining operations in the
year 2013. Valian et al. [29] verified CA to the forward
network for two classification problems.
Sur et al. [33] compared CA to particle swarm optimiza-
tion, differential evolution, and artificial bee colony algo-
rithm and establish CA on optimizing vehicle route in the
graph-based network. Gandomi et al. [31] proposed CA for
the purpose of truss structure optimization. Bulatovic et al.
[32] interpreted CA to solve the issues related to optimum
synthesis of a six bar double dwell linkage. Swain et al.
in the year 2014 [34] critiques bio-inspired CA for a neu-
ral network and applied it on noise cancellation. Prakash
et al. [35] examined CA for finding out optimal schedul-
ing in computational grid. Walton et al. [27] proved CA
for gradient-free optimization. Bhandaria et al. [37] ex-
plores CA and wind driven optimization for satellite image
segmentation for multilevel thresholding using Kapur’s en-
tropy. In the year 2015 Ahmed et al. [43] verified a combi-
natorial test suite technique using CA. This method is only
applicable to the functional testing process. Wang et al.
[36] investigates a robust algorithm used to enhance the
searching process of CA. Elazim et al. [45] maintained CA
in the multi-machine power system. Table 1 summarizes
An Effective Meta-Heuristic Cuckoo Search Algorithm for. . . Informatica 41 (2017) 363–377 365
the literature work.
The above studies show that CA has been applied for
finding optimal solutions to various problems. It has been
used in various applications from mining to finding routes.
However, it has not been used for generating test data
which can be used in testing the software products thereby
reducing the efforts of the testers and developers. We pro-
pose an approach to generate test data that is optimal for
performing testing in software development life cycle by
using the CA.
3 Key research concepts
The concepts involved in developing the proposed algo-
rithm along have been described in this section. This sec-
tion discusses the concepts of nature inspired CA, FA, HC,
and the objective function.
3.1 Cuckoo biological algorithm
It is an optimization algorithm developed by Yang et al. in
the year 2009. Cuckoos are by and large known for their
sweet voices, yet they have a forceful proliferation method.
They lay their eggs in the nest of other host birds. If the
host bird discovers that the eggs are alien then the host bird
shall either discard them or abandon the nest. CA has been
applied to various optimization problems. It is a nature in-
spired meta-heuristic algorithm which supports the theory
of ’survival of the fittest’[38][41]. A number of algorithms
work by beginning with an essential result and progres-
sively adding more and more data that prompts creation of
the best result from the pool of results. On the other hand, a
few algorithms begin with a pool of results and come down
to the best result, via disposing of the most exceedingly
bad results from the pool, by thinking about among the re-
sults. CA falls into the second category of algorithms. The
biological algorithms have been discussed in [39].
Each egg in a nest represents a solution and cuckoo egg
represents a new solution. The algorithm aims at using the
new and better solutions to replace the less good solutions.
It is based on three idealized rules which are given as:
1. Cuckoo chooses one egg at a time which has to be
dumped into a randomly chosen nest.
2. The nests containing high quality of eggs has to be
carried forward to the next generation.
3. Available host nests is of fixed quota and laid eggs are
discovered with the probability of (0, 1).
CA consists of two search capabilities: local search and
global search which is controlled by discovery probability.
In relation to testing, using CA, we can represent the eggs
present in nest, as test data giving a solution. CA will re-
place high quality with the low-quality eggs already present
in the nest. It increases the potential of getting good quality
test data for the SUT.
The advantage of using CA is that it uses only a single
parameter for optimization, unlike other algorithms which
makes it easier to implement and since it consists of local
search and a global search, it tends to give global optimal
solution to the problem under test or SUT.
3.2 Firefly algorithm
This algorithm is inspired by the nature of fireflies. It is
based on the collective behavior of fireflies. Fireflies are
known for their flashing behavior. The variation of the light
intensity and the formulation of attractiveness are the two
major factors affecting the behavior of the fireflies. They
use their flashlights to attract other fireflies. The algorithm
was developed by Yang in the year 2008.The main purpose
of flashing lights is to attract mating partners and warn off
potential predators [26][27].
The FA algorithm is based on the idealized rules. First,
all the fireflies are unisexual and the attraction between is
irrespective of gender. Attraction is based on the bright-
ness of fireflies such that low brightness firefly will move
towards high brightness firefly. Since light intensity tends
to decrease with the increase in distance between fireflies;
hence, the attractiveness is inversely proportional to the dis-
tance between the fireflies. Brightness and intensity of the
fireflies are determined using the objective function.
3.3 Hill climbing algorithm
This algorithm aims at finding superior results in an incre-
mental way. It transforms its state by software under test
and if the change delivers a superior result then the addi-
tion is carried out for performing the further evaluation. It
aims at finding local optimal solutions, thereby achieving
a result that is globally optimal. It is an iterative algorithm
that starts with a random solution to the problem with the
aim of finding out a better solution. The solution is changed
if an improvement is found. This process continues until no
further improvement can be made in the solution. It is used
widely where you want to reach goal state from starting
node. It also aims at maximizing or minimizing the tar-
get function [40]. Many variants of HC are available like
steepest ascent HC, stochastic HC, random restart HC [20].
HC starts with the process of assigning the random co-
ordinates to each test data. Then, first test data is taken as
input and its neighbors are discovered. The objective func-
tion values are compared to the neighbors and the selected
test data and best one is chosen as optimal test data. This
process is iterated for every original test data. Finally, the
optimized test data is produced by HC for the given SUT.
3.4 Objective function
This function is used to maximize / minimize some numeri-
cal values and is often used for finding the optimal solution
for a given problem. In the provided domain the objective
function’s best value is selected by evaluating its objective
366 Informatica 41 (2017) 363–377 M. Khari et al.
Year Author Key points
1975 Ramamoorthy et al.
[4]
• The paper explores the main features of automated software tools and various software evaluation system, which were available.
• Automated software tools were chosen because it has been found to be valuable to improve the reliability of a software.
1976 Miller et al. [5] • Two examples i.e. a matrix factorization subroutine and a sorting method are used to describe the types of data generation problems.
They are used instead of symbolic execution to generate test data.
• The programs with floating-point data are used for large savings of time and storage are made possible.
1976 Clarke [6] • The system proposed to generate test data for programs written in ANSI Fortran.
• System symbolically executes the path and creates a set of constraints on the program’s input variables.
• It uses linear programming when the set of constraints are linear.
1987 Prather et al. [7] • The novel technique i.e. "adaptive" is analysed for selection of subsequent paths and offers considerable advantages over existing
strategies in its computational requirements.
•Method ensures branch coverage and offers a considerable advantage in its computational requirements.
1990 Korel [8] • The approach for generating test data is extended to programs with dynamic data structures and a search based method on dynamic
data-flow analysis, along with backtracking is presented.
• In the approach, values of array indexes and pointers are used.
1996 Ferguson et al. [9] • The chaining approach for automated software test data generation, which is based on the theory of execution-oriented test data genera-
tion and also used for the search process.
• The approach used significantly improves the test data generation compared to the existing methods.
1996 Korel et al. [10] • The assertions are used to generate test data and is considered a tool for automatic runtime detection of software errors.
• The assertion is violated reducing to the problem of finding program input on which a selected statement is executed. It is done with the
help of white box testing.
2000 Frohlich et al. [11] • Authors experiments, how test suites with a given coverage level can be automatically generated from state chart diagrams.
• It is done by mapping the state chart elements to the STRIPS planning language.
2000 Kim et al. [12] • The authors examine the Class Mutation technique that assesses the quality of test data distinguish between mutated programs from the
original program.
• It is complimented with the help of the results of the case study, which were tested to investigate the applicability of the technique.
2001 Ernst et al. [15] Authors explore three results
• Describes techniques for dynamically discovering invariants.
• Reports on the Daikin’s application to two sets of the program.
• Analyzes scalability issues.
2004 McMinn [20] • The authors reviewed Meta-heuristic search techniques are high-level frameworks.
• It uses heuristics to seek solutions for combinatorial problems at a reasonable computational cost.
2005 Tonella [23] • This proceeding describes some of the most advanced techniques that can be employed to reverse engineer several design views from
the source code.
2006 Wappler et al. [21] • The authors investigate a tree-based representation of method call sequences that search for numeric test data.
• It automatically generates test programs that represent object-oriented unit test data.
2007 Xiao et al. [13] • The paper experiments various automated test generation techniques but chooses goal oriented approach.
• The goal-oriented approach as a promising approach to devising automated test-data generators using optimization techniques.
2007 Harman [14] • Authors examine optimization techniques on seven application of software engineering.
• Optimization techniques evolved from the operational research and metaheuristic research.
2008 Sofokleous et al. [18] • Authors prove dynamic test data generation framework based on genetic algorithms.
• They are the Batch-Optimistic and the Close-Up that provide an optimum set of test data with respect to the condition coverage criterion.
2010 Papadakis et al. [16] • Authors compares an approach conjoins program transformation and dynamic symbolic execution techniques in order to automate
successfully the test generation process.
2010 Yang et al. [25] • Authors provide extensive comparison study using some standard test functions and newly designed stochastic test functions.
• Examines CA to solve engineering design optimization problems.
2011 Fraser et al. [17] • EVOSUITE is critiqued, a search-based approach that optimizes test suites for satisfying distinct coverage goals.
• It achieves up to 18 times the coverage of a traditional approach.
2011 Rajabioun [26] • Authors observe CA which is suitable for continuous nonlinear optimization problems.
2011 Valian [29] • Authors interpret an algorithm which is employed for training feedforward neural networks for two benchmark classification problems.
2012 Walton et al. [27] • New modified CA robust algorithm has been analysed, modification of CA involves the addition of information exchange between the
best solutions.
2012 Prakash et al. [35] • CA examines an Optimal Job Scheduling in Grid computational resource allocation and Resource Discovery.
2013 Gandomi et al. [44] • The CA proves for solving structural optimization tasks with Leivy flights is first verified using a nonlinear constrained optimization
problem and also validation against structural engineering optimization problem.
2013 Civicioglu et al. [30] • The numerical optimization problem-solving successes of the proposed CA has been compared statistically over 50 different benchmark
functions.
2013 Yildiz et al. [28] • Authors examine a hybrid optimization approach based on differential evolution algorithm and also having receptor editing property of
the immune system.
2013 Gandomi et al. [31] • Authors examine CA to solve truss optimization problems.
• It also adds unique search features used in the proposed CA.
2013 Bulatovic et al. [32] • CA scrutinizes the procedure of optimum synthesis of mechanism parameters.
• The paper also highlights the dimensional synthesis of a six-bar linkage with turning kinematic pairs.
2014 Sur et al. [33] •Modified CA analyzes for discrete problem domain like that of the graph based problem and other combinatorial optimization problems.
2014 Swain et al. [34] • The authors verify CA for noise removal from a signal.
• It uses trained network to remove noise from sine signal, which was contaminated, with white Gaussian noise.
2014 Bhandaria et al. [37] • Authors examine on image segmentation is to extract meaningful objects by CA and wind driven optimization using entropy.
2015 Ahmed et al. [43] • CA maintains to construct optimized combinatorial sets.
• The strategy consists of different algorithms for construction.
2016 Wang et al. [36] • In robust CS, the pitch adjustment operation in harmony search (HS) that can be considered as a mutation operator is added to the
process of the cuckoo updating to speed up convergence.
• It is used for enhancing the capability of CS.
2016 Elazim et al. [45] • CS verifies for optimal Power System Stabilizers (PSSs) design in a multi-machine power system.
• The design problem of PSS is formulated as an optimization problem.
Table 1: List of research paper included for literature work.
An Effective Meta-Heuristic Cuckoo Search Algorithm for. . . Informatica 41 (2017) 363–377 367
function [42].The objective function is used to describe the
closeness of a given design solution to achieving its aims
in terms of numerical value. It works as a single figure of
merit for determining the quality of the test data which can
be compared to the objective function value of other test
data. By this, we can compare the results and choose the
better quality test data, hence optimal test data.
In this paper, we have considered the Sphere objective
function [43]. The two major types of objective functions
are:
– For Multi-Objective Optimization Problems: Some
problems have multiple criteria to fulfill a particular
objective[42]. They are: “Chakong and Haimes Func-
tion, Binh and Korn Function, Fonseca and Fleming
Function, Test Function, Kursave Function, Schaffer
Function N.1, Poloni’s Two Objective Function”.
– For Single Objective Optimization Problems: Some
problems have single criteria to fulfill a particular ob-
jective hence it is not guaranteed that there may exist
a single solution that will satisfy a particular objective
[42]. They are: “Levi Function, Rosenbrock Func-
tion, Booth Function, Beale Function, Sphere Func-
tion, Three Hump Camel Function, Goldstein Price
Function, and Cross in Tray Function”.
4 Proposed algorithm
This section explains the proposed methodology that has
been applied for optimizing test data. The technique is
based upon the natural behavior of cuckoo for producing
optimized test data that can be used for carrying out soft-
ware testing efficiently thereby, reducing the cost of the
product. The proposed technique uses the concept of ex-
ploration and exploitation for carrying out test data analysis
and finding out the result which is optimal in nature.
The approach devised is mainly applicable to path test-
ing. It ensures that proper path and code coverage is
achieved with the help of optimized test data. Initially, am-
ple amount of test data for the program or software under
test is generated. These initial test data are analogous to the
eggs which are laid by cuckoo. The paths on which testing
needs to be performed are equivalent to the nests where the
fitness of the eggs needs to be judged. Just as a cuckoo
searches for a random nest for disposing of its eggs, in the
same way, a test data is picked up randomly which is dis-
posed of in a random path. The host bird can discard the
egg or the path itself depending upon the conditions. Simi-
larly, the chosen test data is validated with the chosen path.
If they are found to be compatible then the test data is kept
in that path else it is kept in another path named buffer.
In simple words, a test data and a path are chosen at ran-
dom. If the chosen test data belongs to that path then the
test data is kept there itself, else it is kept in another path
called buffer. Initially, each test data undergoes this pro-
cess unless the initial list of test data becomes empty. Once
this initial list of test data becomes empty, we consider an-
other list called buffer. This is an extra path or nest which
keeps those test data for which a compatible path was not
found in first go. Before further picking of nests and eggs,
the test data with a minimum value of objective function is
discarded from each path. For further evaluations, random
picking of test data is done from the buffer list and not from
the initial list. A test data is picked up randomly along with
a path. If the test data belongs to that path then that test
data is kept in that path only, else it remains in the buffer.
This process continues until the buffer list becomes empty.
Every time when a match is found, the objective function
is applied to the available test data. The test data with a
minimum value of the objective function is discarded ev-
ery time. At the end, when the list empties, the optimized
test data belonging to a particular path are obtained. There
might be a case that a path might be discarded which means
that no test data might be available for evaluation purpose.
In such data, the path becomes obsolete and it becomes in-
eligible for performing testing.
Depending upon the type of test data, different values
are fed into the objective function for reaching optimized
results for a particular path. If the test data is of a numeri-
cal type, then the values are directly fed into the objective
function for carrying out analysis and finding the value. For
string type of test data, the value that is fed into the objec-
tive function is the average of the ASCII values of individ-
ual characters. Characters can be alphabetical or special
characters. For example, let the test data which has been
given for evaluation be ’mam’ then the value that we will
be opting for putting into objective function can evaluated
as mentioned in equation 1
m=109 (ASCII value of m), a=97 (ASCII value of a)
mam = (109 + 97 + 109)/3 = 105. (1)
105 is given as input to the objective function which
helps in determining the optimality of the test data which
are of type string. The pseudo code for the CA is depicted
below:
368 Informatica 41 (2017) 363–377 M. Khari et al.
Pseudo code of CA
Declaration
initlist: initial list.
buff: buffer or temporary list.
Input:Program under test.
Output:Optimize test data.
1. Initialize test data for the whole program in a path or nest in a list
called initlist.
2. Separate paths or nests for the program.
3. Initialize objective function.
4. While (initlist!=empty )
(a) Pick a random test data or egg
(b) Pick a random path or nest
i. If ( the chosen test data belong to the chosen path )
A. Keep it in the path or nest
B. Remove it from the initlist or path
ii. Else
A. Keep it in a list or nest call buff
B. Remove it from the initlist
iii. End if
5. End while
6. For each path or nest
(a) Find objective function for each test data using formula in
equation 1
(b) Find out test data or egg with minimum value of objective
function and remove it
7. End for
8. While (buff!=empty )
(a) Select a test data or egg randomly
(b) Select a path or nest randomly
i. If ( test data or egg belong to that path or nest )
A. Keep it in that path or nest
B. Remove it from buff list or nest
ii. Else
A. Put it back in the buff list or nest
iii. End if
(c) Calculate value of the objective function for each test or egg
data for the path or nest opted.
(d) Remove the test data or egg with a minimum value of the
objective function from the path or nest.
9. End while
10. Optimized test data or eggs for each path or nest.
5 Data collection
Evaluating the performance of any technique requires se-
lecting certain subject programs which form the basis for
evaluation. To evaluate the performance of our proposed
algorithm and to compare it with other algorithms, we have
selected fifty real time programs written in Java language.
The subject programs were chosen for test data generation
and optimization activity has been discussed in the follow-
ing subsection. The size of programs ranges from 30 to 250
lines of source code. A diversified range of programs were
chosen including mathematical problems such as finding
roots of quadratic equation, triangle classification problem,
computing the median of the triangle; general logical prob-
lems such as checking for the Armstrong number, magic
number, palindrome number; Pythagorean triplet, convert
a number to hexadecimal, octal or binary. All these pro-
grams are written in standard Java language that makes it
easier to work with.
To discuss the advantage of our algorithm over other
techniques, we have developed an analytical framework.
This framework evaluates our algorithm and compares it
with other algorithms on the basis of four parameters: op-
timized test data, iteration, duration, and complexity.
5.1 List of programs
Table 2 lists the JAVA programs used for the analysis of the
algorithms.
Program Description
P1. Find greatest of three given numbers.
P2. Find if a number is even or odd.
P3. Find if a number is positive or negative.
P4. Find if a year is a leap or not.
P5. Find if a number is multiple of 2,3,5,7 or not.
P6. Find smallest of three numbers.
P7. Find the area of figures depending upon choice entered by the
user.
P8. Find if a number id perfect square or not.
P9. Find if a number is in powers of two or not.
P10. Check if the number entered is palindrome or not.
P11. Find if a number is Armstrong number or not.
P12. Find if a number is perfect or not.
P13. Find if a number is a magic number or not.
P14. Find if a number is prime or not.
P15. Find the type of character entered by the user (character, spe-
cial character, number).
P16. Find if a string entered is palindrome or not.
P17. Perform a calculation on the entered numbers depending
upon the choice.
P18. Find if a triangle is a scalene, equilateral or isosceles.
P19. Find if the given three numbers entered from a Pythagorean
triplet or not.
P20. Find if the number entered is a single digit number or not.
P21. Calculate bonus depending upon the number of extra days
entered by the user.
P22. Find the quadrant of the points entered by the user.
P23. Find types of roots entered by the user.
P24. Implement linear search.
P25. Implement a binary search
P26. Convert a number to binary, hexadecimal or octal entered by
the user.
P27. Find greatest common divisor of the entered two numbers.
P28. Find least common multiple of entered two numbers.
P29. Find if the number entered is strong or not.
P30. Calculate bill depending upon the units entered by the user.
P31. Find if a triangle is right-angled or not.
P32. Find the volume of the given figure.
P33. Find the total surface area of a given figure.
P34. Find the total surface area of a given figure.
P35. Convert the temperature from Celsius to Fahrenheit, Kelvin
to Celsius and vice-versa.
P36. Find S.T.D codes or states depending upon the value entered
by the user.
P37. To allot section to a student depending upon the marks ob-
tained.
P38. Calculate profit or loss.
P39. Find details of any element depending upon the element en-
tered by the user.
P40. Find if the string entered by the user starts with a vowel or
not.
P41. Find the strength of a password.
P42. Find the sum of three numbers entered by the user.
Continued on next page
An Effective Meta-Heuristic Cuckoo Search Algorithm for. . . Informatica 41 (2017) 363–377 369
Continued from previous page
Program Description
P43. Interconvert currencies.
P44. Calculate B.M.I after entering height and weight.
P45. Convert interconvert seconds, minutes and hours.
P46. Calculate simple interest.
P47. Calculate compound interest.
P48. Calculate factors of given number.
P49. Determine if the given series are in A.P or G.P.
P50. Find the capital of a given state or vice versa.
Table 2: List of JAVA program used as a program under
test
5.2 Metrics for evaluation
The following metrics were considered for the evaluation
and comparison of the proposed work with the existing FA
and HC algorithms:
1. Number of Optimized Test data: This is the main fo-
cus of this research paper as the proposed algorithm
aims at reducing the number of test data for any given
path of any particular program. FA, HC and pro-
posed algorithm using CA were implemented on the
programs and it was observed that good results were
given by proposed algorithm. The number of opti-
mized test data was counted by embedding a counter
in the source code of each program. For each opti-
mized test data optCount (OC) can be defined as in
equation 2:
OC = OC + 1 (2)
2. Iterations: Number of iterations while applying the
algorithm and finding out test data was calculated for
FA, HC and proposed CA. A counter for counting a
number of lines of execution was embedded in each
program for all three algorithms. After the execution
of each program under test, the counter was incre-
mented for finding out a number of lines of execution.
For each optimized test data iterations Count (IC) can
be defined as in equation 3:
IC = IC + 1 (3)
3. Duration:This metric refers to the execution time of
the program. The code for calculating the time of ex-
ecution was embedded in every program. Initial time
(IT) and final time (FT) after completion were noted
and the two were subtracted for finding out the time
period of execution. Therefore, Time of execution/
Duration (ET) can be represented in equation 4:
ET = FT − IT (4)
4. Cyclomatic Complexity: The complexity of the codes
were calculated using a plugin of Eclipse IDE tool,
Metrics. It takes the programs for which the complex-
ity has to be generated as the input. It was found that
CA gave better results as compared to FA and HC.
McCabe Cyclomatic Complexity (CMC) where, num-
ber of edges (E) in control flow graph and number of
nodes (N) as in equation 5:
CMC = E −N + 2 (5)
6 Analytical evaluation and
comparison
6.1 Case study
In this section, we consider a case study of calculating
bonus of an employee depending upon the number of days
he has worked extra. The detailed results are generated
after evaluating a particular case study with the help of
CA. The example shows how CA can be employed to
find out optimized test data. The case study has three
paths; one in which the number of days is in between 1-
5, second in which the number of days is in between 6-
10, and the third path in which the number of days is in
between 11-15. The following sections show the detailed
view of each and every iteration involved while evaluat-
ing the case study. The initial list (initlist ) of test data
is given as 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16. There
are three paths in the program that are named as Path 1,
Path 2, and Path 3.The test data in terms of path are given
as: Path 1 = 1,2,3,4,5,6 ; Path 2 = 7,8,9,10 ; Path 3 =
11,12,13,14,15,16.
Now in the first iteration, the cuckoo bird picks up a test
data and a particular nest which is termed as a path. The
lists of the three paths are given as one, two and three. If
both of them are compatible then the chosen test data is put
into that path else it is kept in a nest called buff (buffer/
temporary list) which does not discard them completely.
The first iteration result is shown in the following Table 3,
which gives the chosen test data, chosen path and the path
to which the test data has been added.
Table 4 shows the lists after the first iteration along with
the objective function value.
After the first iteration, the test data / eggs with a min-
imum value of objective functions are removed from the
lists. The modified lists are mentioned in table 5.
Now the test data are again picked randomly from the
buff list and again a random path is chosen. If both are
compatible then the test data is added to the list of the cor-
responding path else it is kept in the buff list. This process
continues until the buff list becomes empty. Every time the
nest and the egg are compatible then the egg with the least
value of the objective function will be removed from the
nest or path. The egg was chosen, the nest was chosen, sta-
tus if it has been added or not in the list, and value removed
from the list has been shown in the following table 6.
Once the buff list is empty then the optimized test data
are obtained for each path. The number of optimized test
data for each and every path is 1. Table 7 shows the path-
wise total number of test data along with optimized test
370 Informatica 41 (2017) 363–377 M. Khari et al.
Chosen
egg/test data
Chosen
path/nest
Nest to which the
egg/test data has
been added
8 Two Two
11 One Buff
3 Three Buff
2 One One
4 Two Buff
13 Three Three
15 One Buff
5 One One
6 Two Buff
16 Three Three
12 One Buff
14 One Buff
9 One Buff
1 Two Buff
7 Two Two
10 Two Two
Table 3: Results after Iteration 1 of CA
data, Iteration and Complexity results obtained after apply-
ing FA, HC, and CA on the case study mentioned in section
6.
6.2 Results for 50 programs
Table 7 shows comparative results for JAVA programs after
applying FA, HC, and CA on them. The results for various
parameters have been depicted in Table 8. The obtained
results show that CA achieves best results in terms of opti-
mized test data, duration, iteration, and complexity. Fig1 -
Fig4 provides graphical comparisons of the algorithms us-
ing the aforementioned metrics.
Figure 1 shows the number of optimized test data using
FA, HC, and CA along with the total number of test data. In
this figure line plots the number of optimized test data and
a number of programs showing the maximum optimized
result of CA. The square line represents test data using FA,
triangle line shows the number of optimized test data using
HC, cross line represent a number of test data using CA and
the total number of test data are represented by diamond
lines.
Through statically analysis, we can check whether the
performance of all the algorithms is similar or not. As the
theory of ANOVA states that for the larger value of com-
puted value of F proves its better capability. To fulfill this
objective ANOVA two-factor without replication for that
we have to set null hypothesis, which states: H0: There is
not a significant difference between the various programs
among different algorithms. To check the significance of
the algorithms, authors are using ANOVA for optimization
of test suite. In Table 9, 10, 11 and 12 Rows are Numbers of
programs and Columns indicated all algorithms. The result
Figure 1: The comparison of FA, HC, and CA on number
of optimized test data
is shown in Table 9.
Since the computed values of F test statistic (21) is
greater than that of critical value (1.48) in terms of number
of programs and in terms of algorithms computed values of
F test statistic (15.7) is greater than that of critical value (3)
at 5 % level of significance, H0 is not accepted and hence
the result of optimized test suite differs significantly.
Similarly, other performance parameters like execution
time and H0 rejected their respective null hypothesis at 5%
level of significance. For execution time the computed val-
ues of F test statistics (3.14) is greater than that of critical
value (1.48) for number of programs and for different algo-
rithms computed values of F test statistics (18.5) is greater
than that of critical value (3) at 5% level of significance, so
H0 cannot accept and hence the duration of execution time
of the test suite differs significantly. Table 10 shows the
result.
Figure 2 line plots show the comparison between FA,
HC, and CA based on the time taken by the respective algo-
rithms to optimize the test suite of the JAVA programs. The
diamond line represents time taken by FA and the square
line shows time taken by HC and triangle line plots in their
presentation of CA.
Figure 2: The comparison of FA, HC, and CA on duration
of test data
Figure 3 gives the graphical representation of compared
result in terms of iteration according to the result is shown
in table 8. We analyze that the CA performs the best due
to its cumulative response. FA always calculate iteration
depending on each path separately. The diamond line rep-
resents FA, the square line represents HC, and the triangle
line plotting represents CA.
An Effective Meta-Heuristic Cuckoo Search Algorithm for. . . Informatica 41 (2017) 363–377 371
List
one
Objective
Function
value
List
two
Objective
function
value
List
three
Objective
function
value
Buff
2 4 8 64 13 169 11
5 25 7 49 16 256 3
- - 10 100 - - 4
- - - - - - 15
- - - - - - 6
- - - - - - 12
- - - - - - 14
- - - - - - 9
- - - - - - 1
Table 4: lists which include objective function
List one List two List three
5 8 16
- 10 -
Table 5: Updated lists after removing minimum value ob-
jective function test data.
Test
Data
Path Status Value re-
moved
9 One Not compatible -
4 Three Not compatible -
12 Three Compatible 12
9 Two Compatible 8
6 Two Not compatible -
3 Three Not compatible -
15 One Not compatible -
3 Two Not compatible -
6 Three Not compatible -
4 Two Not compatible -
14 Two Not compatible -
15 Two Not compatible -
15 Three Compatible 15
6 Three Not compatible -
11 Three Compatible 11
6 One Compatible 6
1 One Compatible 1
14 One Not compatible -
3 One Compatible 2
14 Three Compatible 14
Table 6: Status of egg and nest chosen
Figure 3: The comparison of FA, HC, and CA on Iteration
of test data
For the Iteration, statically the computed value of F test
statistics (5.5) is greater than that of critical value (1.48) for
Rows and in Columns the computed value of F test statis-
tics (71.7) is greater than that of critical value (3) at 5%
level of significance, H0 is not accepted and hence, the it-
eration differs significantly. Table 11 shows the result.
Figure 4 gives the graphical representation of running
time complexity of JAVA programs. CA run time complex-
ity is based on Big O notation. CA outperforms as com-
pared to rest of the algorithms. The diamond line repre-
sents time taken by FA, the square line shows time taken
by HC and triangle line plotting represents CA.
Figure 4: The comparison of FA, HC, and CA on complex-
ity of test data
Statically analyzing the cyclomatic Complexity, we find
372 Informatica 41 (2017) 363–377 M. Khari et al.
Main Iteration Optimized test data Iterations Complexity
Path number Test data HC CA FA HC CA FA HC CA FA
1 6 6 1 2 204 114 47 3.25 1.88 2.38
2 4 4 1 1 76 27 3.25 2.38
3 6 6 1 2 211 47 3.25 2.38
Table 7: Final compared result using the three algorithms on a case study
that, the computed value of F test statistics (3.8) is greater
than that of critical value (1.48) for Rows and in Columns
the computed value of F test statistics (113) is greater than
that of critical value (3) at 5% level of significance, H0 is
not accepted and hence, the iteration differs significantly.
Table 12 shows the result.
In Table 13, descriptive statistics for the result of test
data of FA, HC, and CA algorithm are been mentioned.
The given descriptive statistics include: mean, standard er-
ror, median, mode, and standard deviation. At 95 % confi-
dence, authors can say that the result of optimized test data
of CA which shows much-improved performance as com-
pared to FA and HC.
The descriptive statistics for the execution time of test
data using FA, HA, and CA algorithm is mentioned in Ta-
ble 14. At 95% confidence, authors can say that the dura-
tion time of CA is shown much-improved performance as
compared to FA and HA.
The descriptive statistics for the Iteration of test data us-
ing FA, HC, and CA algorithm is mentioned in Table 15. At
95% confidence, authors can say that the execution time of
CA is testifying much-improved performance as compared
to FA and HC.
The descriptive statistics for the Cyclomatic complexity
of test data using FA, HC, and CA algorithm is mentioned
in Table 16. At 95% confidence, authors can say that the
execution time of CA is testifying much-improved perfor-
mance as compared to FA and HC.
The performance of the algorithms has been analyzed
statistically. ANOVA two-factor without replication has
been implemented to check the significant difference
among the various algorithms. Descriptive statistics also
be performed and the confidence interval is built on the ba-
sis of average and standard error. Statistically, it was found
that the performance of CA is better than that of FA and
HA algorithms.
7 Threats
7.1 Internal threats
– The initial test data were generated by making use of
the worst case analysis.
– The optimized test data were produced although the
approach that we have adopted is purely randomized.
– This algorithm might take some amount of time to
make the buff list empty and hence, the process needs
to be continued till it becomes empty.
– It may happen that a test data which does not belong
to any path will be there in the buff list which would
mean that the desired results might not be obtained.
External threats
– This algorithm is only applied to Java based programs
and has also not been applied to any other application.
Construction threats
– The above program needs Java development kit and
Java runtime environment.
– The calculated computation time might change every
time as it could depend on the operating system and
other system features.
8 Conclusion and future scope
CA has been proved to be effective in producing optimized
test data. A large number of test data for all possible paths
in a program under test were reduced without any redun-
dancy. The reduced number of test data were effective in
minimizing the overall cost, effort, and time of the test-
ing phase in software development life cycle. In future,
attempts can be made to extend the work by applying the
algorithm to large datasets and for real-time applications
as well. This potentially powerful optimization technique
can be extended to study multi-objective optimization ap-
plications with various constraints, even to NP-hard prob-
lems. The studies can also focus on hybridization of CA
with other metaheuristic algorithms.
References
[1] Sommerville, I. (2004) Software engineering, Inter-
national computer science series. ed: Addison Wes-
ley.
[2] Mathur, A. P. (2008) Foundations of Software Testing,
Pearson Education India.
[3] Mall, R. (2014) Fundamentals of software engineer-
ing, PHI Learning Pvt. Ltd.
An Effective Meta-Heuristic Cuckoo Search Algorithm for. . . Informatica 41 (2017) 363–377 373
No. Total Data Optimization Duration(in ms) Iterations Complexity
- - FA HC CA FA HC CA FA HC CA FA HC CA
P1 27 6 83 4 266 395 3 183 1715 214 11.2 14.48 2.11
P2 12 4 11 5 33 93 1 106 264 249 5.42 6.75 1.5
P3 12 2 12 4 58 4 2 68 396 48 5.34 6.24 1.5
P4 50 18 50 24 197 443 3 962 3000 280 4.4 6.5 2
P5 75 29 75 8 247 324 10 764 3000 436 12.5 16.25 2.25
P6 27 9 89 5 109 137 6 246 1715 179 9.32 15 2.11
P7 18 10 18 4 29 43 4 109 618 85 6.57 10.74 1.89
P8 200 190 200 98 290 1729 8 19 3000 91 5.14 6.54 1.5
P9 15 10 16 8 10 7 1 119 400 70 5.34 6.5 1.5
P10 500 492 500 241 677 772 20 2420 3000 1840 5.37 7 1.62
P11 1000 977 1000 491 975 1446 620 1600 2500 53 5.5 7.22 1.86
P12 1000 994 1000 509 1056 1695 61 2300 3000 1001 6 6.54 2
P13 1000 994 1000 489 659 1694 47 2200 3000 1577 6.5 6.58 1.62
P14 100 86 100 49 406 202 4 2341 3000 812 5.2 7.82 1.62
P15 70 69 70 25 234 246 7 2141 3000 419 7.25 12 1.75
P16 8 2 8 2 117 8 1 38 172 24 4.58 7.5 1.5
P17 16 8 38 4 28 69 2 80 305 78 8.56 15 2.11
P18 27 11 157 7 34 267 3 434 3000 205 6.42 9.82 1.89
P19 27 6 126 11 48 503 2 137 3000 142 4.28 7.25 1.67
P20 50 35 50 22 384 304 4 1800 3000 267 4.76 6.76 1.62
P21 16 5 16 3 82 87 6 121 491 114 7.14 9.75 1.88
P22 16 16 52 4 51 299 2 104 264 83 9.16 14.48 2
P23 27 1 143 9 153 298 3 361 3000 148 6.87 10.86 1.89
P24 10 3 10 1 2 100 5 123 738 538 2.12 3.25 1.88
P25 13 7 13 3 3 5 5 124 1527 578 2.12 3.25 1.88
P26 39 16 13 11 62 173 5 377 3000 260 6.36 9.75 1.88
P27 6 6 6 2 2 3 1 39 210 29 2.14 3.62 1.56
P28 6 6 6 2 2 2 1 12 210 31 2.14 3.62 1.56
P29 450 442 450 253 681 599 17 2400 3000 2300 5.29 6.63 2
P30 600 593 600 297 1130 776 23 2200 3000 1300 4.76 6.76 1.88
P31 27 6 126 10 22 279 2 60 3000 125 4.28 7.24 1.67
P32 29 18 29 3 70 93 4 164 650 187 12.01 17.36 2.22
P33 29 18 29 3 70 93 4 164 650 187 12.01 17.38 2.22
P34 29 18 29 3 70 93 4 164 650 187 12.01 17.37 2.22
P35 60 22 60 24 128 158 7 682 3000 124 10.68 13.52 1.62
P36 14 12 6 6 117 24 2 145 1762 66 2.43 4 1.62
P37 10 74 100 18 264 255 6 1874 3000 786 9.26 13 2.25
P38 100 100 100 50 442 344 4 2975 3000 645 4.86 7.5 1.75
P39 118 77 120 13 489 602 49 1684 3000 1325 17.01 28 2.78
P40 7 4 7 2 49 11 7 36 162 125 4.86 8 1.62
P41 15 5 14 6 22 7 2 88 330 137 6.75 9.36 1.75
P42 8 3 16 4 3 3 2 82 464 137 2.33 3.62 1.56
P43 60 22 60 24 128 158 7 682 3000 124 10.68 13.52 1.62
P44 23 23 23 5 113 73 3 161 1126 141 8.56 14.48 2
P45 30 30 30 14 131 77 7 540 3000 130 5 7.04 1.62
P46 24 13 24 7 20 89 87 385 3000 1712 2.43 3.38 1.75
P47 24 13 24 7 20 89 87 385 3000 1712 2.43 3.38 1.75
P48 500 496 500 51 1007 193 60 2246 3000 1518 2.43 3.38 2
P49 20 4 20 7 8 51 7 154 1419 92 5 6.5 1.56
P50 14 10 14 4 65 24 6 94 624 61 4.86 8 1.67
Table 8: Comparison results for JAVA programs
374 Informatica 41 (2017) 363–377 M. Khari et al.
Source of Variation SS df MS F P-value F crit
Rows 6715135 49 137043.6 21 7.31 1.48
Columns 204886.9 2 102443.4 15.7 1.15 3.08
Error 636560.4 98 6495.515 - - -
Total 7556583 149 - - - -
Table 9: ANOVA results based on Optimization
Source of Variation SS df MS F P-value F crit
Rows 8844974 49 180509.7 3.14 7.18 1.48
Columns 2132012 2 1066006 18.56 1.46 3.08
Error 5627039 98 57418.77 - - -
Total 16604025 149 - - - -
Table 10: ANOVA Result Of Execution Time/ Duration
Source of Variation SS df MS F P-value F crit
Rows 88715082 49 1810512 4.52 1.08 1.48
Columns 57403410 2 28701705 71.74 6.43 3.08
Error 39204947 98 400050.5 - - -
Total 185000000 149 - - - -
Table 11: ANOVA Result Of Iteration
Source of Variation SS df MS F P-value F crit
Rows 1160.181 49 23.67 3.88 5.7 1.48
Columns 1386.396 2 693.19 113.63 2.96 3.08
Error 597.8448 98 6.1 - - -
Total 3144.421 149 - - - -
Table 12: ANOVA Result Of Cyclomatic Complexity
An Effective Meta-Heuristic Cuckoo Search Algorithm for. . . Informatica 41 (2017) 363–377 375
- FA HC CA
Mean 120.3 144.86 57.12
Standard Er-
ror
36.80852 36.27629 18.16194
Median 14.5 44 7
Mode 6 16 4
Standard De-
viation
260.2755 256.5121 128.4243
Minimum 1 6 1
Maximum 994 1000 509
Sum 6015 7243 2856
Count 50 50 50
Table 13: Descriptive Statistics For The Result Of Opti-
mized Test data
- FA HC CA
Mean 225.26 308.78 24.68
Standard Er-
ror
42.72953 62.64793 12.50474
Median 95.5 147.5 4.5
Mode 2 93 2
Standard De-
viation
302.1434 442.9877 88.42186
Minimum 2 2 1
Maximum 1130 1729 620
Sum 11263 15439 1234
Count 50 50 50
Table 14: Descriptive Statistics For The Duration Of Test
data
- FA HC CA
Mean 733.86 1887.24 459.44
Standard Er-
ror
127.2218 170.9619 82.45547
Median 173.5 2750 183
Mode 164 3000 187
Standard De-
viation
899.5941 1208.883 583.0482
Minimum 12 162 24
Maximum 2975 3000 2300
Sum 36693 94362 22972
Count 50 50 50
Table 15: Descriptive Statistics For The Iteration
- FA HC CA
Mean 6.3526 9.2098 1.8256
Standard Er-
ror
0.466785 0.705869 0.037717
Median 5.355 7.375 1.75
Mode 2.43 14.48 1.62
Standard De-
viation
3.300668 4.991246 0.266697
Minimum 2.12 3.25 1.5
Maximum 17.01 28 2.78
Sum 317.63 460.49 91.28
Count 50 50 50
Table 16: Descriptive Statistics For Complexity
[4] Ramamoorthy, C. V., and Ho, S. B. F. (1975) Test-
ing large software with automated software evalua-
tion systems, In ACM SIGPLAN Notices, ACM, pp.
382-394.
[5] Miller, W., and Spooner, D. L. (1976) Automatic gen-
eration of floating-point test data, IEEE Transactions
on Software Engineering, IEEE, pp. 223.
[6] Clarke, L. A. (1976) A system to generate test data
and symbolically execute programs, IEEE Transac-
tions on Software Engineering, IEEE, pp. 215-222.
[7] Prather, R. E., and Myers Jr, J. P. (1987) The path
prefix software testing strategy, IEEE Transactions on
Software Engineering, IEEE, pp. 761-766.
[8] Korel, B. (1990) The path prefix software testing
strategy, IEEE Transactions on Software Engineer-
ing, IEEE, pp. 870-879.
[9] Ferguson, R., and Korel, B. (1996) The chaining ap-
proach for software test data generation, IACM Trans-
actions on Software Engineering and Methodology
(TOSEM), ACM, pp. 63-86.
[10] Korel, B., and Al-Yami, A. M. (1996) Assertion-
oriented automated test data generation, 18th inter-
national conference on Software engineering, IEEE,
pp. 71-80.
[11] Fröhlich, P., and Link, J. (2000) Automated test case
generation from dynamic models, In ECOOP 2000
Object-Oriented Programming, Springer Berlin Hei-
delberg, pp. 472-491.
[12] Kim, S., Clark, J. A., and McDermid, J. A. (2000)
Class mutation: Mutation testing for object-oriented
programs, Proc. Net. ObjectDays, Citeseer, pp. 9-12.
[13] Xiao, M., El-Attar, M., Reformat, M., and Miller,
J. (2007) Empirical evaluation of optimization algo-
rithms when used in goal-oriented automated test data
generation techniques, Empirical Software Engineer-
ing, Springer, pp. 183–239.
376 Informatica 41 (2017) 363–377 M. Khari et al.
[14] Harman, M. (2007) The current state and future of
search based software engineering, Future of Soft-
ware Engineering, IEEE, pp. 342-357.
[15] Ernst, M. D., Cockrell, J., Griswold, W. G., and
Notkin, D. (2001) The current state and future of
search based software engineering, IEEE Transac-
tions on Software Engineering, IEEE, pp. 99-123.
[16] Papadakis, M., and Malevris, N. (2010) Automatic
mutation test case generation via dynamic symbolic
execution, IEEE 21st international symposium on
Software reliability engineering (ISSRE), IEEE, pp.
121-130.
[17] Fraser, G., and Arcuri, A. (2011) Evolutionary gener-
ation of whole test suites, 11th International Confer-
ence on Quality Software (QSIC), IEEE, pp. 31-40.
[18] Sofokleous, A. A., and Andreou, A. S. (2008) Auto-
matic, evolutionary test data generation for dynamic
software testing, Journal of Systems and Software,
IEEE, pp. 1883-1898.
[19] Fraser, G., and Arcuri, A. (2011) Evosuite: automatic
test suite generation for object-oriented software, In
Proceedings of the 19th ACM SIGSOFT symposium
and the 13th European conference on Foundations of
software engineering, ACM, pp. 416-419.
[20] McMinn, P. (2004) Search-based software test data
generation: A survey, Software Testing Verification
and Reliability, Wiley, pp. 105-156.
[21] Wappler, S., and Wegener, J. (2006) Evolutionary unit
testing of object-oriented software using strongly-
typed genetic programming, In Proceedings of the 8th
annual conference on Genetic and evolutionary com-
putation, ACM, pp. 1925-1932.
[22] Ribeiro, J. C. B. (2008) Search-based test case gener-
ation for object-oriented java software using strongly-
typed genetic programming, In Proceedings of the
10th annual conference companion on Genetic and
evolutionary computation, ACM, pp. 1819-1822.
[23] Tonella, P. (2005) Reverse engineering of object ori-
ented code, In Proceedings of the 27th international
conference on Software engineering, ACM, pp. 724-
725.
[24] Lakhotia, K., Harman, M., and McMinn, P. (2007)
A multi-objective approach to search-based test data
generation, In Proceedings of the 9th annual con-
ference on Genetic and evolutionary computation,
ACM, pp. 1098-1105.
[25] Yang, X. S., and Deb, S. (2010) Engineering op-
timisation by cuckoo search, International Journal
of Mathematical Modelling and Numerical Optimisa-
tion, Inderscience, pp. 330-343.
[26] Rajabioun, R. (2011) Cuckoo optimization algorithm,
Applied soft computing, Elsevier, pp. 5508-5518.
[27] Walton, S., Hassan, O., Morgan, K., and Brown,
M. R. (2011) Modified cuckoo search: a new gradi-
ent free optimisation algorithm, Chaos, Solitons and
Fractals, Elsevier, pp. 710-718.
[28] Yildiz, A. R. (2013) Cuckoo search algorithm for the
selection of optimal machining parameters in milling
operations, The International Journal of Advanced
Manufacturing Technology, Springer, pp. 55-61.
[29] Valian, E., Mohanna, S., and Tavakoli, S. (2011) Im-
proved cuckoo search algorithm for feedforward neu-
ral network training, International Journal of Artifi-
cial Intelligence and Applications, Elsevier, pp. 36-
43.
[30] Civicioglu, P., and Besdok, E. (2013) A concep-
tual comparison of the Cuckoo-search, particle swarm
optimization, differential evolution and artificial bee
colony algorithms, Artificial Intelligence Review,
Springer, pp. 315-346.
[31] Gandomi, A. H., Talatahari, S., Yang, X. S., and Deb,
S. (2013) Design optimization of truss structures us-
ing cuckoo search algorithm, The Structural Design
of Tall and Special Buildings, Wiley, pp. 1330-1349.
[32] Bulatovic, R. R., Ðordevic, S. R., and Ðordevic, V. S.
(2013) Cuckoo search algorithm: a metaheuristic ap-
proach to solving the problem of optimum synthesis
of a six-bar double dwell linkagem, Mechanism and
Machine Theory, Elsevier, pp. 1-13.
[33] Sur, C., and Shukla, A. (2014) Discrete Cuckoo
Search Optimization Algorithm for Combinatorial
Optimization of Vehicle Route in Graph Based Road
Network, In Proceedings of the Third International
Conference on Soft Computing for Problem Solving,
Springer, pp. 307-320.
[34] Swain, K. B., Solanki, S. S., and Mahakula, A. K.
(2014) Bio inspired cuckoo search algorithm based
neural network and its application to noise cancella-
tion, International Conference on Signal Processing
and Integrated Networks, IEEE, pp. 632-635.
[35] Prakash, M., Saranya, R., Jothi, K. R., and Vignesh-
waran, A. (2012) An optimal job scheduling in grid
using cuckoo algorithm, International Journal Com-
puter Science Telecomm, pp. 65-69.
[36] Wang, G. G., Gandomi, A. H., Zhao, X., and Chu,
H. C. E. (2016) Hybridizing harmony search algo-
rithm with cuckoo search for global numerical opti-
mization, Soft Computing, Springer, pp. 273-285.
[37] Bhandari, A. K., Singh, V. K., Kumar, A., and Singh,
G. K. (2014) Cuckoo search algorithm and wind
An Effective Meta-Heuristic Cuckoo Search Algorithm for. . . Informatica 41 (2017) 363–377 377
driven optimization based study of satellite image
segmentation for multilevel thresholding using Ka-
pur’s entropy, Expert Systems with Applications, El-
sevier, pp. 3538-3560.
[38] Mala, D. J., and Mohan, V. (2010) Quality improve-
ment and optimization of test cases: a hybrid genetic
algorithm based approach, ACM SIGSOFT Software
Engineering Notes, ACM, pp. 1-14.
[39] Lotem, A., Nakamura, H., and Zahavi, A. (1992) Re-
jection of cuckoo eggs in relation to host age: a pos-
sible evolutionary equilibrium, Behavioral Ecology,
ISBE, pp. 128-132.
[40] Yildiz, A. R. (2009) An effective hybrid immune-
hill climbing optimization approach for solving de-
sign and manufacturing optimization problems in in-
dustry, Journal of Materials Processing Technology,
Elsevier, pp. 2773-2780.
[41] Devadas, S., Ma, H. K. T., Newton, A. R., and
Sangiovanni-Vincentelli, A. (1988) Synthesis and op-
timization procedures for fully and easily testable se-
quential machines, In International Conference on
New Frontiers in Testing, IEEE, pp. 621-630.
[42] Chun, J. S., Jung, H. K., and Hahn, S. Y. (1998) A
study on comparison of optimization performances
between immune algorithm and other heuristic algo-
rithms, IEEE Transactions on Magnetics, IEEE, pp.
2972-2975.
[43] Ahmed, B. S., Abdulsamad, T. S., and Potrus, M. Y.
(2015) Achievement of minimized combinatorial test
suite for configuration-aware software functional test-
ing using the Cuckoo search algorithm, Information
and Software Technology, Elsevier, pp. 13-29.
[44] Gandomi, A. H., Yang, X. S., and Alavi, A. H. (2013)
Cuckoo search algorithm: a metaheuristic approach
to solve structural optimization problems, Engineer-
ing with computers, Springer, pp. 17-35.
[45] Elazim, S. A., and Ali, E. S. (2016) Optimal Power
System Stabilizers design via Cuckoo Search algo-
rithm, International Journal of Electrical Power and
Energy Systems, Elsevier, pp. 99-107.
[46] Papadakis, M., and Malevris, N. (2012) Mutation
based test case generation via a path selection strat-
egy, Information and Software Technology, 54(9), pp.
915-932.
[47] Papadakis, M., and Malevris, N. (2011) Automati-
cally performing weak mutation with the aid of sym-
bolic execution, concolic testing and search-based
testing, Software Quality Journal, 19(4), pp. 691-723.
[48] Papadakis, M., and Malevris, N. (2013) Search-
ing and generating test inputs for mutation testing,
SpringerPlus, 2(1), pp. 1.
378 Informatica 41 (2017) 363–377 M. Khari et al.
Informatica 41 (2017) 379–380 379
Classification of Vegetation in Aerial LiDAR Data
Denis Horvat
Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia
E-mail: denis.horvat@um.si, Web: https://gemma.feri.um.si/
Thesis summary
Keywords: LiDAR, classification, vegetation, mathematical morphology, surface fitting
Received: April 12, 2017
This contribution summarises a doctoral dissertation which proposes an algorithm for the classification of
vegetation points in aerial LiDAR data. The algorithm characterizes vegetated areas based on statistically
large dispersion in elevations of points, and the context in which the points are located. The algorithm is
able to classify vegetation in both rural and urban areas with an average F1 score of 97.9% and 91.0%,
respectively. The point-clouds can contain different types of vegetation and various degrees of canopy
densities.
Povzetek: V predlaganem prispevku povzamemo doktorsko disertacijo, ki predlaga algoritem za klasi-
fikacijo točk vegetacije iz podatkov LiDAR. Algoritem ovrednoti območja vegetacije na podlagi statistično
visoke razpršenosti višin točk v kombinacji s kontekstom v katerem se točke nahajajo. Algoritem klasifi-
cira točke vegetacije v urbanih in neurbanih področjih s 97% in 91% povprečnim rezultatom F1. Oblaki
točk lahko vsebujejo različne tipe vegetetacije z različno gostoto olistanosti.
1 Introduction
The potential of the data obtained by the aerial LiDAR
(Light Detection and Ranging) systems has been utilised
increasingly by a variety of scientific and industrial appli-
cations. Data acquisition using LiDAR produces a point-
cloud, in which the individual points are calculated based
on the time delay between the emitted and detected laser
beam. While the obtained point-clouds from a plane-
mounted LiDAR can represent the underlying Earth’s sur-
face accurately, the entity to which a given point belongs
(e.g. ground, building, vegetation) is not known. This con-
textual knowledge is crucial for a variety of applications,
such as environmental simulations, urban planning, or the
generation of a canopy height model. Thus, preprocess-
ing, using a classification algorithm is usually applied, in
which each point is correlated with one of the predeter-
mined classes.
This paper is a summary of a PhD thesis [1] (and the
corresponding paper [4]), which proposed an algorithm for
classification of vegetation points within LiDAR data. Veg-
etation can be particularly challenging to identify, as the
classification model should be universal enough to cover
the various sizes, shapes and canopy densities of differ-
ent vegetation but, at the same time, still differentiate it
from other surface objects (e.g. houses, cars, fences). The
following section outlines the proposed classification algo-
rithm, while section 3 evaluates it. Section 4 concludes the
paper.
2 Classification algorithm
Clusters of points that represent vegetation are, in most
cases, defined by statistically large dispersions in eleva-
tion. This is caused by the vegetation’s non-linear shape
and porosity, as the laser beam can usually penetrate the
canopy and capture many points within, or even under, the
vegetation. The mentioned properties can be character-
ized efficiently by modifying the LoFS (Local Fitting of
Surfaces) method [2]. Namely, by locally fitting planes on
the LiDAR-derived surface and evaluating the fitting error
using all points in the fitted area, a distinction can be made
between vegetation and most of other man-made objects.
Larger fitting errors are expected in the former case, while
the latter ones usually produce errors that are identifiably
smaller.
The remaining non-vegetation that does not conform to
this definition (i.e. also produces a larger fitting error) is
handled using contextual analysis which defines: 1) At-
tached objects 2) Overgrowing vegetation and 3) Small ob-
jects. Attached objects represents a transition between ar-
eas (e.g. a wall, balcony or chimney). Such objects can
be identified (and subsequently removed) using spatially-
variant morphological dilation [3] where all non-ground
areas with small fitting errors are dilated. The extent of
the dilation is controlled locally, using a structuring ele-
ment with the radius equal to the distance from the near-
est ground. Spatially-variant dilation is used similarly on
areas with high fitting errors to identify overgrown vegeta-
tion. However, the radius, in this case, is dependent on the
height difference between a given area and the nearest non-
380 Informatica 41 (2017) 379–380 D. Horvat
ground area with a small fitting error. Lastly, small objects
are removed using connected operators. The final result
that includes the described fitting error evaluation and con-
textual analysis is then mapped onto the individual LiDAR
points to get the classification.
3 Results
The algorithm was tested on multiple rural and urban
datasets which contained different types of vegetation and
degrees of canopy densities. The results were evaluated by
counting the false positives, false negatives, true positives
and true negatives which served as an input for the calcula-
tion of the well-established F1 score. An average F1 score
of 97.9% was achieved for rural and 91.0% for urban en-
vironments which, because of the more complex geometry,
tend to be more difficult to classify. The use of contextual
analysis improves the results in urban areas significantly.
Namely, by removing the attached object, small object and
handling overgrown vegetation, the F1 score is improved,
on average, by 8.8%, 1.1% and 1.0%, respectively.
4 Conclusion
The PhD thesis presented an algorithm for the classification
of LiDAR data, which classifies vegetated areas with high
accuracy in characteristically different point-clouds (rural
environment, urban environment, different types of vege-
tation, various leaf-on conditions). Additionally, the al-
gorithm can be used in most classification scenarios that
contain aerial point-clouds, as it relies only on geometrical
features of the point-cloud and the subsequent contextual
analysis.
Acknowledgement
The authors acknowledge the financial support from the
Slovenian Research Agency (research core funding No.
P2-0041 and project J2-6764)
References
[1] D. Horvat (2017) Algorithm for classification of veg-
etation in LiDAR data, Doctoral dissertation, Fac-
ulty of Electrical Engineering and Computer Science,
University of Maribor (in Slovene).
[2] D. Mongus, N. Lukač and B. Žalik (2014) Ground
and building extraction from LiDAR data based on
differential morphological profiles and locally fitted
surfaces, ISPRS Journal of Photogrammetry and Re-
mote Sensing, Vol. 93, pp. 145–156.
[3] N. Bouaynaya and Mohammed Charif-Chefchaouni
(2008) Theoretical Foundations of Spatially-Variant
Mathematical Morphology Part I: Binary Images,
IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 30, No. 5, pp. 145–156.
[4] D. Horvat, B. Žalik and D. Mongus (2016) Context-
dependent detection of non-linearly distributed points
for vegetation classification in airborne LiDAR, IS-
PRS Journal of Photogrammetry and Remote Sens-
ing, Vol. 116, pp. 1–14.
Informatica 41 (2017) 381
JOŽEF STEFAN INSTITUTE
Jožef Stefan (1835-1893) was one of the most prominent
physicists of the 19th century. Born to Slovene parents,
he obtained his Ph.D. at Vienna University, where he was
later Director of the Physics Institute, Vice-President of the
Vienna Academy of Sciences and a member of several sci-
entific institutions in Europe. Stefan explored many areas
in hydrodynamics, optics, acoustics, electricity, magnetism
and the kinetic theory of gases. Among other things, he
originated the law that the total radiation from a black
body is proportional to the 4th power of its absolute tem-
perature, known as the Stefan–Boltzmann law.
The Jožef Stefan Institute (JSI) is the leading indepen-
dent scientific research institution in Slovenia, covering a
broad spectrum of fundamental and applied research in the
fields of physics, chemistry and biochemistry, electronics
and information science, nuclear science technology, en-
ergy research and environmental science.
The Jožef Stefan Institute (JSI) is a research organisation
for pure and applied research in the natural sciences and
technology. Both are closely interconnected in research de-
partments composed of different task teams. Emphasis in
basic research is given to the development and education of
young scientists, while applied research and development
serve for the transfer of advanced knowledge, contributing
to the development of the national economy and society in
general.
At present the Institute, with a total of about 900 staff,
has 700 researchers, about 250 of whom are postgraduates,
around 500 of whom have doctorates (Ph.D.), and around
200 of whom have permanent professorships or temporary
teaching assignments at the Universities.
In view of its activities and status, the JSI plays the role
of a national institute, complementing the role of the uni-
versities and bridging the gap between basic science and
applications.
Research at the JSI includes the following major fields:
physics; chemistry; electronics, informatics and computer
sciences; biochemistry; ecology; reactor technology; ap-
plied mathematics. Most of the activities are more or
less closely connected to information sciences, in particu-
lar computer sciences, artificial intelligence, language and
speech technologies, computer-aided design, computer ar-
chitectures, biocybernetics and robotics, computer automa-
tion and control, professional electronics, digital communi-
cations and networks, and applied mathematics.
The Institute is located in Ljubljana, the capital of the in-
dependent state of Slovenia (or S♥nia). The capital today
is considered a crossroad between East, West and Mediter-
ranean Europe, offering excellent productive capabilities
and solid business opportunities, with strong international
connections. Ljubljana is connected to important centers
such as Prague, Budapest, Vienna, Zagreb, Milan, Rome,
Monaco, Nice, Bern and Munich, all within a radius of 600
km.
From the Jožef Stefan Institute, the Technology park
“Ljubljana” has been proposed as part of the national strat-
egy for technological development to foster synergies be-
tween research and industry, to promote joint ventures be-
tween university bodies, research institutes and innovative
industry, to act as an incubator for high-tech initiatives and
to accelerate the development cycle of innovative products.
Part of the Institute was reorganized into several high-
tech units supported by and connected within the Technol-
ogy park at the Jožef Stefan Institute, established as the
beginning of a regional Technology park "Ljubljana". The
project was developed at a particularly historical moment,
characterized by the process of state reorganisation, privati-
sation and private initiative. The national Technology Park
is a shareholding company hosting an independent venture-
capital institution.
The promoters and operational entities of the project are
the Republic of Slovenia, Ministry of Higher Education,
Science and Technology and the Jožef Stefan Institute. The
framework of the operation also includes the University of
Ljubljana, the National Institute of Chemistry, the Institute
for Electronics and Vacuum Technology and the Institute
for Materials and Construction Research among others. In
addition, the project is supported by the Ministry of the
Economy, the National Chamber of Economy and the City
of Ljubljana.
Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Tel.:+386 1 4773 900, Fax.:+386 1 251 93 85
WWW: http://www.ijs.si
E-mail: matjaz.gams@ijs.si
Public relations: Polona Strnad
Informatica 41 (2017)
INFORMATICA
AN INTERNATIONAL JOURNAL OF COMPUTING AND INFORMATICS
INVITATION, COOPERATION
Submissions and Refereeing
Please register as an author and submit a manuscript at:
http://www.informatica.si. At least two referees outside the au-
thor’s country will examine it, and they are invited to make as
many remarks as possible from typing errors to global philosoph-
ical disagreements. The chosen editor will send the author the
obtained reviews. If the paper is accepted, the editor will also
send an email to the managing editor. The executive board will
inform the author that the paper has been accepted, and the author
will send the paper to the managing editor. The paper will be pub-
lished within one year of receipt of email with the text in Infor-
matica MS Word format or Informatica LATEX format and figures
in .eps format. Style and examples of papers can be obtained from
http://www.informatica.si. Opinions, news, calls for conferences,
calls for papers, etc. should be sent directly to the managing edi-
tor.
SUBSCRIPTION
Please, complete the order form and send it to Dr. Drago Torkar,
Informatica, Institut Jožef Stefan, Jamova 39, 1000 Ljubljana,
Slovenia. E-mail: drago.torkar@ijs.si
Since 1977, Informatica has been a major Slovenian scientific
journal of computing and informatics, including telecommuni-
cations, automation and other related areas. In its 16th year
(more than twentythree years ago) it became truly international,
although it still remains connected to Central Europe. The ba-
sic aim of Informatica is to impose intellectual values (science,
engineering) in a distributed organisation.
Informatica is a journal primarily covering intelligent systems in
the European computer science, informatics and cognitive com-
munity; scientific and educational as well as technical, commer-
cial and industrial. Its basic aim is to enhance communications
between different European structures on the basis of equal rights
and international refereeing. It publishes scientific papers ac-
cepted by at least two referees outside the author’s country. In ad-
dition, it contains information about conferences, opinions, criti-
cal examinations of existing publications and news. Finally, major
practical achievements and innovations in the computer and infor-
mation industry are presented through commercial publications as
well as through independent evaluations.
Editing and refereeing are distributed. Each editor can conduct
the refereeing process by appointing two new referees or referees
from the Board of Referees or Editorial Board. Referees should
not be from the author’s country. If new referees are appointed,
their names will appear in the Refereeing Board.
Informatica web edition is free of charge and accessible at
http://www.informatica.si.
Informatica print edition is free of charge for major scientific, ed-
ucational and governmental institutions. Others should subscribe.
Informatica WWW:
http://www.informatica.si/
Referees from 2008 on:
A. Abraham, S. Abraham, R. Accornero, A. Adhikari, R. Ahmad, G. Alvarez, N. Anciaux, R. Arora, I. Awan, J.
Azimi, C. Badica, Z. Balogh, S. Banerjee, G. Barbier, A. Baruzzo, B. Batagelj, T. Beaubouef, N. Beaulieu, M. ter
Beek, P. Bellavista, K. Bilal, S. Bishop, J. Bodlaj, M. Bohanec, D. Bolme, Z. Bonikowski, B. Bošković, M. Botta,
P. Brazdil, J. Brest, J. Brichau, A. Brodnik, D. Brown, I. Bruha, M. Bruynooghe, W. Buntine, D.D. Burdescu, J.
Buys, X. Cai, Y. Cai, J.C. Cano, T. Cao, J.-V. Capella-Hernández, N. Carver, M. Cavazza, R. Ceylan, A. Chebotko,
I. Chekalov, J. Chen, L.-M. Cheng, G. Chiola, Y.-C. Chiou, I. Chorbev, S.R. Choudhary, S.S.M. Chow, K.R.
Chowdhury, V. Christlein, W. Chu, L. Chung, M. Ciglarič, J.-N. Colin, V. Cortellessa, J. Cui, P. Cui, Z. Cui, D.
Cutting, A. Cuzzocrea, V. Cvjetkovic, J. Cypryjanski, L. Čehovin, D. Čerepnalkoski, I. Čosić, G. Daniele, G.
Danoy, M. Dash, S. Datt, A. Datta, M.-Y. Day, F. Debili, C.J. Debono, J. Dedič, P. Degano, A. Dekdouk, H.
Demirel, B. Demoen, S. Dendamrongvit, T. Deng, A. Derezinska, J. Dezert, G. Dias, I. Dimitrovski, S. Dobrišek,
Q. Dou, J. Doumen, E. Dovgan, B. Dragovich, D. Drajic, O. Drbohlav, M. Drole, J. Dujmović, O. Ebers, J. Eder,
S. Elaluf-Calderwood, E. Engström, U. riza Erturk, A. Farago, C. Fei, L. Feng, Y.X. Feng, B. Filipič, I. Fister, I.
Fister Jr., D. Fišer, A. Flores, V.A. Fomichov, S. Forli, A. Freitas, J. Fridrich, S. Friedman, C. Fu, X. Fu, T.
Fujimoto, G. Fung, S. Gabrielli, D. Galindo, A. Gambarara, M. Gams, M. Ganzha, J. Garbajosa, R. Gennari, G.
Georgeson, N. Gligorić, S. Goel, G.H. Gonnet, D.S. Goodsell, S. Gordillo, J. Gore, M. Grčar, M. Grgurović, D.
Grosse, Z.-H. Guan, D. Gubiani, M. Guid, C. Guo, B. Gupta, M. Gusev, M. Hahsler, Z. Haiping, A. Hameed, C.
Hamzaçebi, Q.-L. Han, H. Hanping, T. Härder, J.N. Hatzopoulos, S. Hazelhurst, K. Hempstalk, J.M.G. Hidalgo, J.
Hodgson, M. Holbl, M.P. Hong, G. Howells, M. Hu, J. Hyvärinen, D. Ienco, B. Ionescu, R. Irfan, N. Jaisankar, D.
Jakobović, K. Jassem, I. Jawhar, Y. Jia, T. Jin, I. Jureta, Ð. Juričić, S. K, S. Kalajdziski, Y. Kalantidis, B. Kaluža,
D. Kanellopoulos, R. Kapoor, D. Karapetyan, A. Kassler, D.S. Katz, A. Kaveh, S.U. Khan, M. Khattak, V.
Khomenko, E.S. Khorasani, I. Kitanovski, D. Kocev, J. Kocijan, J. Kollár, A. Kontostathis, P. Korošec, A.
Koschmider, D. Košir, J. Kovač, A. Krajnc, M. Krevs, J. Krogstie, P. Krsek, M. Kubat, M. Kukar, A. Kulis, A.P.S.
Kumar, H. Kwaśnicka, W.K. Lai, C.-S. Laih, K.-Y. Lam, N. Landwehr, J. Lanir, A. Lavrov, M. Layouni, G. Leban,
A. Lee, Y.-C. Lee, U. Legat, A. Leonardis, G. Li, G.-Z. Li, J. Li, X. Li, X. Li, Y. Li, Y. Li, S. Lian, L. Liao, C. Lim,
J.-C. Lin, H. Liu, J. Liu, P. Liu, X. Liu, X. Liu, F. Logist, S. Loskovska, H. Lu, Z. Lu, X. Luo, M. Luštrek, I.V.
Lyustig, S.A. Madani, M. Mahoney, S.U.R. Malik, Y. Marinakis, D. Marinčič, J. Marques-Silva, A. Martin, D.
Marwede, M. Matijašević, T. Matsui, L. McMillan, A. McPherson, A. McPherson, Z. Meng, M.C. Mihaescu, V.
Milea, N. Min-Allah, E. Minisci, V. Mišić, A.-H. Mogos, P. Mohapatra, D.D. Monica, A. Montanari, A. Moroni, J.
Mosegaard, M. Moškon, L. de M. Mourelle, H. Moustafa, M. Možina, M. Mrak, Y. Mu, J. Mula, D. Nagamalai,
M. Di Natale, A. Navarra, P. Navrat, N. Nedjah, R. Nejabati, W. Ng, Z. Ni, E.S. Nielsen, O. Nouali, F. Novak, B.
Novikov, P. Nurmi, D. Obrul, B. Oliboni, X. Pan, M. Pančur, W. Pang, G. Papa, M. Paprzycki, M. Paralič, B.-K.
Park, P. Patel, T.B. Pedersen, Z. Peng, R.G. Pensa, J. Perš, D. Petcu, B. Petelin, M. Petkovšek, D. Pevec, M.
Pičulin, R. Piltaver, E. Pirogova, V. Podpečan, M. Polo, V. Pomponiu, E. Popescu, D. Poshyvanyk, B. Potočnik,
R.J. Povinelli, S.R.M. Prasanna, K. Pripužić, G. Puppis, H. Qian, Y. Qian, L. Qiao, C. Qin, J. Que, J.-J.
Quisquater, C. Rafe, S. Rahimi, V. Rajkovič, D. Raković, J. Ramaekers, J. Ramon, R. Ravnik, Y. Reddy, W.
Reimche, H. Rezankova, D. Rispoli, B. Ristevski, B. Robič, J.A. Rodriguez-Aguilar, P. Rohatgi, W. Rossak, I.
Rožanc, J. Rupnik, S.B. Sadkhan, K. Saeed, M. Saeki, K.S.M. Sahari, C. Sakharwade, E. Sakkopoulos, P. Sala,
M.H. Samadzadeh, J.S. Sandhu, P. Scaglioso, V. Schau, W. Schempp, J. Seberry, A. Senanayake, M. Senobari,
T.C. Seong, S. Shamala, c. shi, Z. Shi, L. Shiguo, N. Shilov, Z.-E.H. Slimane, F. Smith, H. Sneed, P. Sokolowski,
T. Song, A. Soppera, A. Sorniotti, M. Stajdohar, L. Stanescu, D. Strnad, X. Sun, L. Šajn, R. Šenkeřík, M.R.
Šikonja, J. Šilc, I. Škrjanc, T. Štajner, B. Šter, V. Štruc, H. Takizawa, C. Talcott, N. Tomasev, D. Torkar, S.
Torrente, M. Trampuš, C. Tranoris, K. Trojacanec, M. Tschierschke, F. De Turck, J. Twycross, N. Tziritas, W.
Vanhoof, P. Vateekul, L.A. Vese, A. Visconti, B. Vlaovič, V. Vojisavljević, M. Vozalis, P. Vračar, V. Vranić, C.-H.
Wang, H. Wang, H. Wang, H. Wang, S. Wang, X.-F. Wang, X. Wang, Y. Wang, A. Wasilewska, S. Wenzel, V.
Wickramasinghe, J. Wong, S. Wrobel, K. Wrona, B. Wu, L. Xiang, Y. Xiang, D. Xiao, F. Xie, L. Xie, Z. Xing, H.
Yang, X. Yang, N.Y. Yen, C. Yong-Sheng, J.J. You, G. Yu, X. Zabulis, A. Zainal, A. Zamuda, M. Zand, Z. Zhang,
Z. Zhao, D. Zheng, J. Zheng, X. Zheng, Z.-H. Zhou, F. Zhuang, A. Zimmermann, M.J. Zuo, B. Zupan, M.
Zuqiang, B. Žalik, J. Žižka,
Informatica
An International Journal of Computing and Informatics
Web edition of Informatica may be accessed at: http://www.informatica.si.
Subscription Information Informatica (ISSN 0350-5596) is published four times a year in Spring, Summer,
Autumn, and Winter (4 issues per year) by the Slovene Society Informatika, Litostrojska cesta 54, 1000 Ljubljana,
Slovenia.
The subscription rate for 2017 (Volume 41) is
– 60 EUR for institutions,
– 30 EUR for individuals, and
– 15 EUR for students
Claims for missing issues will be honored free of charge within six months after the publication date of the issue.
Typesetting: Borut Žnidar.
Printing: ABO grafika d.o.o., Ob železnici 16, 1000 Ljubljana.
Orders may be placed by email (drago.torkar@ijs.si), telephone (+386 1 477 3900) or fax (+386 1 251 93 85). The
payment should be made to our bank account no.: 02083-0013014662 at NLB d.d., 1520 Ljubljana, Trg republike
2, Slovenija, IBAN no.: SI56020830013014662, SWIFT Code: LJBASI2X.
Informatica is published by Slovene Society Informatika (president Niko Schlamberger) in cooperation with the
following societies (and contact persons):
Slovene Society for Pattern Recognition (Simon Dobrišek)
Slovenian Artificial Intelligence Society (Mitja Luštrek)
Cognitive Science Society (Olga Markič)
Slovenian Society of Mathematicians, Physicists and Astronomers (Marej Brešar)
Automatic Control Society of Slovenia (Nenad Muškinja)
Slovenian Association of Technical and Natural Sciences / Engineering Academy of Slovenia (Stane Pejovnik)
ACM Slovenia (Matjaž Gams)
Informatica is financially supported by the Slovenian research agency from the Call for co-financing of scientific
periodical publications.
Informatica is surveyed by: ACM Digital Library, Citeseer, COBISS, Compendex, Computer & Information
Systems Abstracts, Computer Database, Computer Science Index, Current Mathematical Publications, DBLP
Computer Science Bibliography, Directory of Open Access Journals, InfoTrac OneFile, Inspec, Linguistic and
Language Behaviour Abstracts, Mathematical Reviews, MatSciNet, MatSci on SilverPlatter, Scopus, Zentralblatt
Math
Volume 41 Number 3 September 2017 ISSN 0350-5596
Bipartivity Index based Link Selection Strategy to
Determine Stable and Energy-Efficient Data
Gathering Trees for Mobile Sensor Networks
N. Meghanathan 259
Power and Limitations of Formal Methods for
Software Fabrication: Thirty Years Later
E. Serna M., A. Serna A. 275
Improved Lane Departure Warning Method Based on
Hough Transformation and Kalman Filter
M. Liang, Z. Zhou, Q. Song 283
Decision Tree Based Data Reconstruction for Privacy
Preserving Classification Rule Mining
G. Kalyani 289
Accelerating XML Query Processing on Views Y.F. Huang, Y.-H. Cho 305
Optimization, Modeling and Simulation of
Microclimate and Energy Management of the
Greenhouse by Modeling the Associated Heating and
Cooling Systems and Implemented by a Fuzzy Logic
Controller using Artificial Intelligence
D. Faouzi, N. Bibi-Triki,
B. Mohamed, A. Abéne
317
Improving Visual Vocabularies: A More
Discriminative, Representative and Compact Bag of
Visual Words
L. Chang, A. Pérez-Suárez,
J. Hernández-Palancar,
M. Arias-Estrada,
L.E. Sucar
333
An Output Instruction Based PLC Source Code
Transformation Approach For Program Logic
Simplification
A. Ghosh, S. Qin, J. Lee,
G.-N. Wang
349
An Effective Meta-Heuristic Cuckoo Search
Algorithm for Test Suite Optimization
M. Khari, P. Kumar 363
Classification of Vegetation in Aerial LiDAR Data D. Horvat 379
Informatica 41 (2017) Number 3, pp. 259–381