Informatica 41 (2017) 47–58 47 
 
Distributed Fault Tolerant Architecture for Wireless Sensor Network 
Siba Mitra and Ajanta Das 
Department of Computer Science and Engineering, Birla Institute of Technology 
Mesra, Kolkata Campus, India 
E-mail: sibamitra@bitmesra.ac.in, ajantadas@bitmesra.ac.in  
 
Keywords: fault detection, fault recovery, fault tolerance, reliability, wireless sensor network 
Received: December 25, 2016 
Smart applications use wireless sensor networks to monitor physical
properties of a target area and so realize the vision of ambient
intelligence. Because a wireless sensor network is resource constrained
and is often deployed unattended, faults are common. The reliability and
dependability of the network depend on its fault detection, diagnosis and
recovery techniques. Detecting faults in a wireless sensor network is
challenging, and recovering faulty nodes is a crucial task. In this
research article, a distributed fault tolerant architecture is proposed,
together with fault recovery algorithms. Recovery actions are initiated
based on fault diagnosis notification. The novelty of this paper is that
recovery actions are performed using data checkpoints and state
checkpoints of the node, in a distributed manner. The data checkpoint
helps to recover the old data, and the state checkpoint gives the previous
trust degree of the node. Moreover, the results section shows that after
replacement of a faulty node, the topology and connectivity among the rest
of the nodes are maintained in the WSN.
Summary (Povzetek): The architecture of a wireless sensor network is described.
1 Introduction 
The use of wireless sensor networks (WSN) has grown
rapidly in the field of ambient intelligence. A WSN is
resource constrained in nature but can be integrated with
other systems by using the Dynamic Adaptive System
Infrastructure (DAiSI) proposed by Klus and Niebuhr
(2009) in [11], which supports component reconfiguration
and dynamic integration. Another interesting application,
which uses a WSN to detect and track human presence and
motion in an environment, is presented in the research of
Graham et al. (2011) in [7]. That work also shows that an
appropriate device placement scheme can improve
network performance.
The basic unit of a WSN is a tiny sensor node, which
communicates with other sensor nodes through radio
transmission. Small sensor nodes, each consisting of a
sensing unit, a small memory, a microcontroller, a
transceiver and an omni-directional antenna, are deployed
in the target area. Sensor nodes send relevant data to the
nearest base station (BS), where it is used for decision
making. Faults in a WSN are common because of its
resource constraints and unattended deployment.
Therefore, to make the WSN reliable and dependable,
fault tolerance must be implemented in it.
Various types of node faults are classified in
Figure 1. Faults in WSN can be permanent, transient or
intermittent in nature. Fault management is the process of
monitoring the nodes, detecting and diagnosing faults, and
performing the necessary recovery tasks to make the WSN
fault tolerant. Permanent failures generally offer no option
for recovery, but for transient and intermittent faults
recovery actions should be available. Proper recovery
schedules are needed for the faults that occur, so that the
network remains fault tolerant and the application can make
correct decisions even in the presence of faults. Some of
the critical factors in the recovery process are the residual
energy of the sensor node, the network traffic, connectivity
issues and the current topological structure of the WSN.
A distributed adaptive fault detection scheme for
WSN is proposed in [28], where each node detects any
unnatural event by querying the readings of its neighbor
sensor nodes. A three-bit control packet is exchanged
during the fault detection phase in order to reduce
communication overhead. A moving average filter is
employed to implement fault tolerance in the WSN.
The article claims to achieve high detection accuracy and
a low false alarm rate.
A WSN may comprise static or mobile sensor nodes.
Borawake-Satao and Prasad (2017) [4] present a study
of the effects of sensor node mobility on various
performance parameters of WSN. They also propose a
mobile sink with a mobile agent mobility model for WSN.
In [12] Kumar and Nagarajan (2013) proposed
Incorporated Network Topological control and Key
management (INTK) for relay nodes of WSN, for
privacy and security in the network. The proposed
scheme includes a hierarchical routing architecture in
WSN for better performance and security. Another
research proposal by Mukherjee et al. (2016) [22]
presents a model of a disaster-aware mobile Unmanned
Aerial Vehicle (UAV) for a flying ad-hoc network. The
nodes can work collaboratively by relaying useful
messages in a post-disaster situation of any ecosystem.
 
 
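Figure 1: Node Fault Classification. [Figure labels: Node Fault; Sensing Fault, Communication Fault, Processor Fault; DoS, Hardware failure, Software failure, Low Residual Energy, High Traffic Congestion.]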
An analytical comparison is presented in [3] by Bathla
& Jindal (2016), where two distributed self-healing
recovery techniques, Recovery by In-ward Motion (RIM)
and Least Disruptive Topology Repair (LeDiR), are
analyzed and compared with respect to their efficiency in
various applications.
The RIM method aims to replace a failed node with a
healthy node by moving the latter towards the former’s
location. Here all nodes must maintain a 1-hop neighbor
list and should be aware of their neighbors’ location and
proximity. The goal of LeDiR is to restore connectivity
among the sensor nodes, while ensuring that after the
recovery action the shortest path length among the nodes
is not extended compared to the pre-failure topology.
A fault recovery algorithm for WSN is proposed in
[13] by Lakamana et al. (2015), which enhances the
routing efficiency in the WSN. Battery depletion is a
major issue, and it is addressed by reducing the number
of node replacements and by reusing historic routing
paths. The authors claim that network longevity is
thereby increased.
In a WSN, sensor nodes that get disconnected from the
network for any reason may create partitions or isolated
nodes, which harms the reliability and dependability of
the network. Moreover, it is crucial to maintain
connectivity throughout the network's lifetime. So the
objective of this research is to design a distributed fault
tolerant architecture for WSN, which includes fault
detection, diagnosis and recovery. The architecture
proposed here is an improved version of the fault tolerant
framework proposed in our previous research work [18]
by Mitra & De Sarkar (2014). This article also proposes a
novel fault recovery model, which is integrated with the
proposed architecture, and some algorithms for
connectivity maintenance and for the recovery tasks to be
performed. The novelty of the proposed recovery
technique is that recovery actions are initiated after proper
diagnosis of the detected fault: recovery tasks start only
once the recovery process has been notified of the fault
type by the diagnosis layer. The recovery model has two
phases, namely set action and start recovery.
The remainder of the article is organized as follows:
related work in the field is presented first, followed by
the proposed Distributed Fault Tolerant Architecture and
the supporting fault recovery model and algorithms; the
results and discussions section is presented next. Finally,
the conclusion and the references are presented.
2 Related work 
This section presents some of the valuable research
carried out in the field of WSN. Many fault management
techniques are available for fault tolerance in WSN. A
review of these is presented in our previous research
work, Mitra, De Sarkar and Roy (2012) [20], and a few of
them are also mentioned here. This article additionally
presents a study of some of the existing recovery
schemes. Data communication is an important factor in
WSN, hence the routing decision is significant. Leskovec
et al. (2005) [14] proposed a novel link quality estimation
model for sensor networks, which uses a link quality map
to estimate a link. This work also optimizes the power
consumption of the radio transmission signal while
scheduling the communication task and taking routing
decisions.
An analytical study and comparison of various
recovery techniques is presented in our recent research
work [19] (Mitra, Das & Mazumdar 2016). Some of
those recovery schemes are also discussed briefly here.
Among them CRAFT (Checkpoint/Recovery-based 
scheme for Fault Tolerance) [26] for WSN, proposed by 
Saleh, Eltoweissy and Agbaria (2007) is studied; another 
scheme proposed by Ma, Lin, Lv & Wang (2009) [16] 
called ABSR, which recovers some compromised sensor
nodes in a heterogeneous sensor network. Various types
of sensor nodes, each playing a specific role, are used here.
Reghunath, Kumar & Babu (2014) proposed the Fault Node
Recovery (FNR) algorithm, which combines a genetic
algorithm with the Grade Diffusion algorithm. A
rank based replacement strategy for the sensor nodes is
presented in [25].
In [6] Chen, Kher & Somani (2006) proposed the DLFS
(distributed localized fault sensors) detection algorithm
for locating and identifying faulty nodes in WSN. Each
node is either in good health or faulty, depending on its
behavior. The technique uses a probabilistic approach.
The authors claim that the execution complexity of the
algorithm is low and its detection accuracy is high.
Haboush, Mohanty, Pattanayak and Al-Tarazi (2014) [8]
proposed a faulty node replacement algorithm for hybrid
WSN, in which mobile sensor nodes are considered.
Any node with low residual energy may seek a
replacement; after replacement, maintenance of the
topology is taken care of. Redundancy is used to avoid
faulty results, and an adaptive threshold policy is
employed to rectify faults and optimize the network
lifetime. The research in [2] by Akbari et al. (2010)
presents a survey of faults in WSN caused by energy
shortage and of the role of cellular architecture and
clustering in sustaining the network. The cluster-based
fault detection and recovery techniques were observed to
be efficient, robust and fast in sustaining the WSN and
extending its longevity. Another cluster maintenance
technique is designed by the same authors for nodes
running short of energy, as mentioned in [1] (Akbari,
Dana, Khademzadeh & Beikmahdavi 2011). Nodes with
the highest residual energy are selected as primary
cluster heads, and nodes second in residual energy
become secondary cluster heads. The technique is thus
energy aware and selects the cluster heads according to
the nodes’ residual energy.
An FNR algorithm is proposed by Brahme,
Gadadare, Kulkarni, Surana & Marathe (2014) in [5] for
fault recovery in WSN to enhance network lifetime.
The researchers employed a genetic algorithm and the
grade diffusion algorithm to design the scheme. Mishal,
Narke, Shinde, Zaware & Salve (2015) in [17] improved
FNR by performing fewer node replacements for fault
recovery and reusing old routing paths, and they claim
better results. A distributed fault detection algorithm for
detecting coverage holes in the WSN is presented by
Kang et al. (2013) [9]. This approach does not maintain
any node coordinates. The critical information of a node
is collected from its neighbors and used for detection and
recovery. An on-demand checkpoint based recovery
technique for WSN is proposed in [23] by Nithilan &
Renold (2015). In this scheme checkpoint coordination
and non-blocking checkpoints are used for consistency,
and some backup nodes check the health of a node by
monitoring the checkpoints. A localized tree based
method for fault detection is proposed by Wan, Wu & Xu
(2008) [27]. The recovery scheme elects a new parent to
avoid isolation of the children nodes of the tree. This
technique enhances the network lifetime.
The main objective of this research work is to design
a distributed fault tolerant architecture for WSN, with
intrinsic parts for fault detection, diagnosis and recovery.
We mainly concentrate on proposing a distributed fault
recovery model for WSN with a set of algorithms that
perform node, data and network recovery. For fault
detection, the detection algorithm proposed in our
previous work [18] is used. Thereafter the proposed
recovery technique is employed to maintain fault
tolerance. The main goal is to increase the reliability and
dependability of the WSN for correct decision making.
The novelty of this research work is that recovery actions
are performed using data checkpoints and state
checkpoints of the node, in a distributed manner.
Topology maintenance is also performed by each node
during the recovery process.
3 Proposed distributed fault tolerant 
architecture for WSN 
This section details the proposed fault tolerant
framework for WSN. Event detection is important for
implementing fault tolerance in WSN, where the event
can be the presence of a hole in the network. A
distributed, lightweight hole detection algorithm proposed
by Nguyen et al. (2016) in [24] monitors and reports any
hole in the network. The present proposal is an
improvement of the framework proposed earlier in
Mitra et al. (2014). The architecture shown in Figure 2
can be embedded in each sensor node of the WSN, so that
the node can independently perform fault management in
a distributed way.
In a centralized fault management scheme there is a
central manager, which monitors and controls the
network. Each node has to report relevant data to the
central manager for fault tolerance in the WSN. This
results in heavy communication and incurs a large
overhead in terms of energy and bandwidth, which may
affect network performance. In the centralized approach
the traffic flows towards a single central manager,
creating overheads and a bottleneck. This is not desirable
in WSN, since it is resource constrained and
infrastructure-less. The bottleneck problem can be
avoided with a distributed fault management
architecture. In distributed fault management, the
network is partitioned and self fault management is
implemented. Moreover, the communication cost is lower
in a distributed system than in a centralized one.
Therefore this research work aims for a distributed
architecture. The proposed architecture has three main
phases, viz. Fault Detection, Fault Diagnosis and Fault
Recovery.
3.1 Fault detection 
The detection phase has three significant tasks, namely
Node and Link Monitoring, Fault Isolation and Fault
Prediction. The fault detection algorithm was proposed
in our previous work, Mitra et al. (2014); a brief
discussion is presented here. All the tasks are computed
in an energy aware mode. In the monitoring stage the
sensor node listener monitors and examines the health of
the sensor node and detects whether any unnatural event
occurs; it then scans the attributes of the event and
evaluates the parameter values required for detecting
faults in the node. First, a neighbor table is created for
each node, and then the node performs self-checking.
Each sensor node evaluates its own tendency by
comparing the average of its neighbors’ readings with its
own reading. The node also compares its own previous
reading with its current reading. The tendency of the
node determines whether it is trustworthy or not. If a
node is not trustworthy then its trust degree (TD) value is
zero, and if it is trustworthy then its TD value is one. So
the TD value isolates, or rather detects, the fault in the
WSN.
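The self-checking step described above can be illustrated with a minimal Python sketch. It is not the detection algorithm of [18]: the tolerance parameters `spatial_tol` and `temporal_tol`, the function name `trust_degree`, and the rule that both comparisons must pass are assumptions introduced here for illustration only.

```python
from statistics import mean

def trust_degree(current, previous, neighbor_readings,
                 spatial_tol=5.0, temporal_tol=5.0):
    """Return 1 if the node looks trustworthy, 0 otherwise.

    spatial_tol and temporal_tol are hypothetical tolerances; the paper
    does not state the thresholds used.
    """
    # Spatial check: own reading against the average of the neighbors' readings.
    if neighbor_readings:
        spatial_ok = abs(current - mean(neighbor_readings)) <= spatial_tol
    else:
        spatial_ok = True  # assumption: no neighbors means no spatial evidence
    # Temporal check: own current reading against own previous reading.
    temporal_ok = abs(current - previous) <= temporal_tol
    # Assumption: the node is trusted only if both checks pass.
    return 1 if spatial_ok and temporal_ok else 0
```

For example, a node reading 25.0 whose neighbors read about 25 would receive TD = 1, while a node reading 80.0 among the same neighbors would receive TD = 0.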
In the Prediction module the residual energy of the
node is analyzed and any impending fault is forecast.
The forecast is based on a comparison of the fault
evaluation parameter, namely the residual energy of the
node, against a threshold. If the residual energy of the
node goes below the threshold, the built-in fault predictor
invokes two actions: first, it broadcasts the information
about its low energy state; second, it broadcasts query
packets asking for a node with high residual energy to
which it can offload its own responsibility. Finally the
node is sent to sleep mode.
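The prediction rule can be written as a small decision function. This is a sketch only; the action names and the function `predict_energy_fault` are hypothetical labels, and the threshold value is application specific and not given in the text.

```python
def predict_energy_fault(residual_energy, threshold):
    """Return the ordered actions the fault predictor would take.

    An empty list means the node keeps working normally.
    """
    if residual_energy >= threshold:
        return []
    return [
        "broadcast_low_energy_state",  # announce the low energy state
        "broadcast_offload_query",     # ask for a node with high residual energy
        "sleep",                       # finally the node goes to sleep mode
    ]
```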
 
 
 
 
Figure 2: Distributed Fault Tolerant Architecture for WSN.
[Figure labels: Phase 1: Fault Detection (Node and Link Monitoring, Fault Isolation, Prediction); Phase 2: Fault Diagnosis (Fault Analysis, Node Fault, Link Fault); Phase 3: Fault Recovery (Act as Relay Node, Off/Restart/Replace, Reconfiguration of Routing Path); further labels: Radio Fault, Radio OK.]

3.2 Fault diagnosis 
The second phase of the fault tolerant architecture is fault
diagnosis, which is done after analysis of the occurred
event. Fault analysis is a reactive process, and the fault
category in WSN can be either a node fault or a
communication fault. For diagnosing a node fault, the
assigned TD value is taken into consideration, as
described in [18]. The TD value of a node is evaluated on
the basis of self analysis and neighbor analysis over a
fixed number of iterations, and depending on the iteration
count the decision of a node fault is finalized.
For communication fault diagnosis, two critical
parameters, received signal strength (RSS) and link
utilization of the sensor nodes, are taken into account.
The average RSS of all the neighbors of the sensor node
is computed for communication fault analysis. The
sensor node also computes the average link utilization to
check its own performance. Once self-fault diagnosis is
completed, a notification is forwarded to the next phase,
i.e. Fault Recovery.
4 Proposed fault recovery scheme 
This section presents the fault recovery scheme for WSN.
This scheme can be integrated in each sensor node so
that distributed fault recovery is possible. The recovery
process is invoked when a sensor node is suffering from
some fault. The next sub-sections present the network
model and the fault recovery model.
4.1 Network model  
The WSN model in this research work can be represented
as a graph structure $G(S, E)$, where
$S = \{S_1, S_2, \ldots, S_n\}$ is a set of sensor nodes
deployed in a random or planned way in the target area.
The area can be represented as a two-dimensional plane
whose origin is $(x_0, y_0)$. $E = \{E_1, E_2, \ldots, E_n\}$
is a set of communication links, each between a pair of
nodes $S_i$ and $S_j$. A node transmits within its
communication range but has a sensing range smaller than
its communication range, given in [15] by $R_S < R_C$,
where $R_S$ and $R_C$ are the sensing range and the
communication range respectively. The necessary
condition for a node $S_i$ to transmit a signal to a node
$S_j$ is that the Euclidean distance between the two
nodes, written $\lVert S_i - S_j \rVert$, satisfies
Equation 1. Each node maintains a list of neighbors,
which may change dynamically with time according to the
availability of nodes in the communication process. Low
power transmission is common in WSN, where a node’s
transmission power is directly proportional to the
distance. To send data with good signal strength a node
may have to adjust its transmission power. For the current
problem scope, let $P_{i,j}$ be the transmission power for
communication between $S_i$ and $S_j$. Equation 2 is
satisfied if and only if Equation 3 is true. Moreover, the
maximum value of the transmission power is limited. The
assumption is that any node $S_i$ performs low power
transmission to nodes within $R_C$ and may sometimes,
as required, perform high power communication with
$S_j$ if and only if Equation 4 is true.

$\lVert S_i - S_j \rVert \le R_C$   Equation (1)

$P_{i,j} > P_{k,r}$   Equation (2)

$\lVert S_i - S_j \rVert > \lVert S_k - S_r \rVert$   Equation (3)

$R_S < \lVert S_i - S_j \rVert \le R_C$   Equation (4)
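The range conditions of the network model can be sketched in Python. The sketch reads Equation 1 as "distance at most the communication range" and Equation 4 as "distance above the sensing range but at most the communication range"; the helper names and the sensing range of 15 m in the example are assumptions made here, since the text does not give a sensing range value.

```python
import math

def distance(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def neighbors(nodes, i, r_c):
    """Nodes that node i can reach directly (Equation 1, as read here)."""
    return [j for j in nodes
            if j != i and distance(nodes[i], nodes[j]) <= r_c]

def needs_high_power(nodes, i, j, r_s, r_c):
    """True if j lies beyond the sensing range of i but still within its
    communication range (Equation 4, as read here)."""
    d = distance(nodes[i], nodes[j])
    return r_s < d <= r_c

# Hypothetical layout: four nodes on a line, positions in meters.
nodes = {1: (0, 0), 2: (10, 0), 3: (20, 0), 4: (40, 0)}
print(neighbors(nodes, 1, r_c=25))                      # [2, 3]
print(needs_high_power(nodes, 1, 3, r_s=15, r_c=25))    # True
```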
4.2 Connectivity issue 
Not all nodes can transmit data directly to the sink or
BS; any node unable to do so employs intermediary
forwarding parent nodes to send its data to the BS. At run
time one or more sensor nodes may not work properly due
to faults, and then recovery actions may be initiated to
recover the node, data or network. A recovering node
may need to stop its scheduled tasks for self-recovery. In
that case the affected neighbor nodes have to update their
own neighbor lists and exclude the recovering node from
any current activity. This scenario is illustrated in
Figure 3.
As the figure shows, the normal nodes need to
maintain connectivity even in the absence of the faulty
nodes, which are marked as black nodes. Hence each
normal node has to find suitable alternate nodes within
its communication range for forwarding data. If it is
unable to find one, it should perform a high power
transmission to nodes that are in its communication
range, rather than becoming isolated. In the figure the
dotted arrows mark the unstable or sometimes
unavailable links. To transmit at high power the node
should increase its transmission power by a multiplicative
factor, roughly proportionate to the increment in distance
from $D_{i,j}$ to $D_{i,k}$, where $D_{i,j}$ and $D_{i,k}$
are the Euclidean distances defined in Equations 5 and 6.
In these equations the i-th node transmits to the k-th node
in place of the recovering j-th node.

$D_{i,j} = \lVert S_i - S_j \rVert$   Equation (5)

$D_{i,k} = \lVert S_i - S_k \rVert$   Equation (6)
4.3 Fault recovery model 
The proposed fault recovery model is depicted in Figure
4. It consists of a Fault Recovery Process with two main
phases, viz. Set Action and Start Recovery. The Fault
Recovery Process gets a notification from the fault
diagnosis layer along with information on the fault type;
depending upon that, Set Action decides what kind of
recovery activity has to be invoked. Start Recovery then
begins the specific recovery task. Faults can be due to
hardware or software failure, bad link quality or power
depletion of the node.
 
 
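Figure 3: Connectivity Issue in WSN. [Legend: faulty node, normal node.]

Figure 4: Fault Recovery Model for WSN. [Figure labels: Notification from Fault Diagnoser; Recovery Process; Set Action; Start Recovery; Node Recovery; Network Recovery; Permanent Storage.]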
The Recovery Process performs and maintains node
recovery and network recovery, and it also communicates
with the permanent storage for any query. The permanent
storage contains the node status and the data checkpoint
saved before the occurrence of the fault. The Recovery
Process fetches the information necessary to perform the
recovery tasks smoothly. It also performs various types of
communication before the node is reinitialized. Node
recovery means recovering data from the node, checking
the node state and performing activities to preserve the
node functionality. Network recovery deals with
reconfiguring the network by performing the path quality
estimation proposed by Mitra, Roy & Das (2015) in [21].
The recovery jobs are explained in detail later in
Algorithm 3.
5 Proposed algorithms for fault 
recovery 
The proposed fault recovery algorithm consists of
various parts, in which each node carries out
self-checking tasks, some of which are already described
in [18]. Since the whole process is carried out in a
distributed environment, the nodes perform
self-evaluation and self-recovery. All the symbols and
notations used in the algorithms are listed in Table 1.
The proposed fault recovery scheme uses checkpointing
to perform the data recovery task. Each node maintains a
state checkpoint and a data checkpoint by using two
variables, TD (Trust Degree) and DCkpt (Data
Checkpoint) respectively, in the permanent storage of the
node, i.e. even if the node is restarted the data remains
intact for future reference, as mentioned in Algorithm 1
and presented in Figure 5. TD was proposed, explained
and used in our prior research [18]. That work also
presented a novel fault detection scheme, which is used
here to detect and diagnose faults. HF means a hardware
failure: a sensing unit fault, where the node is unable to
sense any ambient signal; a transceiver fault, which
occurs when the transmitter or receiver is not working; or
a microcontroller fault, where the node cannot perform its
computations properly.
Each node has a delivered packet counter, DPC,
which keeps the count of delivered packets.
Whenever a node has delivered 200 packets it stores TD,
as the state checkpoint, in permanent storage and stores
the current read value of the node in DCkpt, as the data
checkpoint, in permanent memory. After completion of
these steps the DPC is reinitialized to zero so that it can
count the next set of 200 delivered packets. The
checkpoint creation process continues for each node until
it goes to the recovery state. When a node gets a
notification from the fault diagnosis layer that a fault has
occurred, it fetches the fault type and performs the
necessary internal actions, as mentioned in Algorithm 2
and presented in Figure 6. As mentioned, the fault type
can be either hardware fault (HF) or software fault (SF).
Depending upon the fault type the recovery process is
initiated, as mentioned in Algorithm 3 and presented in
Figure 7.
Notation        Meaning
Si              i-th node
Sr              Node at Recovery mode
Sr.NBR          Neighbor list of Sr
Si.CURR-VAL     Current reading of Si
DCkpt           Data Checkpoint
DPC             Delivered packet count
TD              Trust degree
CR              Communication range
Pj,k            Power to transmit data from Sj to Sk
PRR             Packet reception ratio
PDR             Packet delivery ratio
Table 1: Symbols and Notations.
In all these hardware failure cases the node needs third
party or human intervention to get the problem fixed.
Software failure refers to logical or runtime faults in the
software, which again need third party intervention. If
the packet reception ratio (PRR) and packet delivery ratio
(PDR) are very low, there must be some disturbance in
data transmission and reception; hence a communication
failure may occur in the near future, so the node goes for
a self-recovery process. Finally, the node has to be shut
down if the residual energy is much less than the
threshold value, which may be specified as per the
application requirement. The node recovery module for
any faulty node starts with low-power transmission of
probe packets to the neighbors, which then go for
topology maintenance as given in Algorithm 4 (Figure 8).
The recovery activity takes place by reinitializing the
sensor node so that it releases all its resources and takes a
fresh start. The last state checkpoint and data checkpoint
are recovered from permanent memory. The data
checkpoint helps to recover the old data and the state
checkpoint gives the previous trust degree of the node.
 
Create Checkpoint ( )
{
    For each node Si do this
    {
        Initialize DPC=0;
        For each packet delivery
        {
            DPC++;
            If (DPC = 200)
            {
                Store TD in permanent storage;
                Store Si.CURR-VAL in DCkpt;
                Set DPC=0;
            }
        }
    }
}
Figure 5: Algorithm 1.
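Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the class `SensorNode`, the method `deliver_packet` and the use of a plain dictionary in place of the node's permanent storage are assumptions made for the example.

```python
CHECKPOINT_INTERVAL = 200  # delivered packets between checkpoints (Algorithm 1)

class SensorNode:
    def __init__(self, node_id, storage):
        self.node_id = node_id
        self.storage = storage  # assumption: a dict stands in for permanent storage
        self.dpc = 0            # delivered packet counter (DPC)
        self.td = 1             # trust degree (TD), set by the detection phase
        self.curr_val = None    # current reading (Si.CURR-VAL)

    def deliver_packet(self, reading):
        """Count one delivered packet and checkpoint every 200 packets."""
        self.curr_val = reading
        self.dpc += 1
        if self.dpc == CHECKPOINT_INTERVAL:
            # State checkpoint (TD) and data checkpoint (DCkpt) are stored together.
            self.storage[self.node_id] = {"TD": self.td, "DCkpt": self.curr_val}
            self.dpc = 0
```

After 199 deliveries nothing has been stored; the 200th delivery writes both checkpoints and resets the counter.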
For each node Sr with detected fault
{
    Get Notification (fault-type)
    If (fault-type=HF OR fault-type=SF)
    {
        Third party assistance needed
        Initiate Node Recovery ( )
    }
    If (PRR very low OR PDR very low)
        Initiate Node Recovery ( )
    If (Residual energy << Threshold)
        Shut down Sensor Node
}
Figure 6: Algorithm 2.
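The Set Action decision of Algorithm 2 can be expressed as a small Python function. This is a sketch under stated assumptions: the numeric cut-offs `ratio_floor` and `energy_threshold` are hypothetical stand-ins for "very low" and "much less than the threshold", which the paper leaves to the application, and the action names are illustrative labels.

```python
def set_action(fault_type=None, prr=1.0, pdr=1.0, residual_energy=1.0,
               ratio_floor=0.5, energy_threshold=0.1):
    """Return the recovery actions chosen for a diagnosed node."""
    actions = []
    if fault_type in ("HF", "SF"):
        # Hardware and software faults need outside help before recovery.
        actions.append("request_third_party_assistance")
        actions.append("initiate_node_recovery")
    if prr < ratio_floor or pdr < ratio_floor:
        # Poor reception or delivery ratio hints at a coming communication failure.
        if "initiate_node_recovery" not in actions:
            actions.append("initiate_node_recovery")
    if residual_energy < energy_threshold:
        actions.append("shut_down")
    return actions
```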
Initiate Node Recovery ( )
{
    Send probe packets to all of Sr.NBR
    For each node Sj ∈ Sr.NBR
    {
        Maintain Topology ( )
    }
    Start recovery action (Sr)
    {
        Reinitialize the sensor node;
        Fetch the last data from Sr.DCkpt;
        Get Sr.TD;
        Perform LQE;
    }
}
Figure 7: Algorithm 3.
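A minimal Python sketch of the bookkeeping in Algorithm 3 follows. It covers only two steps: neighbors dropping the recovering node from their tables after the probe packets, and the recovering node restoring its checkpoints. Topology maintenance and link quality estimation are left out, and the dictionary layouts and the function name are assumptions made for illustration.

```python
def initiate_node_recovery(recovering, neighbor_tables, storage):
    """Update the neighbors' tables and restore the node's last checkpoints.

    neighbor_tables maps a node id to the list of its neighbor ids;
    storage maps a node id to {"TD": ..., "DCkpt": ...} (hypothetical layout).
    """
    # Probe step: each neighbor learns that the node is unavailable
    # and removes it from its own neighbor table.
    for nbr in neighbor_tables.get(recovering, []):
        table = neighbor_tables.get(nbr, [])
        neighbor_tables[nbr] = [n for n in table if n != recovering]
    # Start recovery: after reinitialization, fetch the last checkpoints.
    ckpt = storage.get(recovering, {})
    return {"TD": ckpt.get("TD"), "DCkpt": ckpt.get("DCkpt")}
```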
 
Finally the node performs link quality estimation
(LQE), which was proposed in our work [21]. In the node
recovery algorithm any node in recovery mode sends
probe packets to its neighbors, stating its unavailability
for some period of time. The neighbors in turn update
their own neighbor tables and prepare to run the
topology maintenance schedule.
Maintain Topology ( )
{
    Get parent list of Sr
    For each parent Sk of Sr
    {
        If ($\lVert S_j - S_k \rVert \le CR$)
        {
            Update neighbor list
            Update routing table
            Transmit through Sk
        }
        Else
        {
            Estimate transmission power Pj,k
            If ($P_{j,k} < P_{j,k-1}$)
                Store current Pj,k
            Else
                Keep the previous Pj,k-1
        }
    }
    Select Sk with minimum Pj,k
    Update neighbor list and routing table
    Transmit through Sk
}
Figure 8: Algorithm 4.
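One possible reading of Algorithm 4 is sketched below in Python. It assumes, as the network model states, that transmission power is proportional to distance; the constant `power_per_meter`, the function name and the preference for in-range parents over out-of-range ones are assumptions of this sketch rather than details given in the paper.

```python
import math

def maintain_topology(affected, candidates, positions, comm_range,
                      power_per_meter=1.0):
    """Pick a replacement parent for an affected node.

    Returns (parent, estimated_power, high_power_needed), or None when
    there is no candidate parent at all.
    """
    options = []
    for k in candidates:
        d = math.dist(positions[affected], positions[k])
        # Assumption: power grows linearly with distance.
        power = power_per_meter * d
        # False sorts before True, so in-range parents are preferred;
        # ties are broken by the lower estimated power.
        options.append((d > comm_range, power, k))
    if not options:
        return None
    high_power_needed, power, parent = min(options)
    return parent, power, high_power_needed
```

With an affected node at the origin, a candidate 20 m away is chosen over one 30 m away when the communication range is 25 m; if only the 30 m candidate exists, it is chosen with the high power flag set.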
6 Results and discussions 
In this section the results are displayed and discussed.
MATLAB version 7.11.0.584 (R2010b) and Microsoft
Office Excel 2003 were used for the simulation and for
preparing the results. The sensor node specifications and
the simulation environment are given in Table 2 and
Table 3 respectively. Table 4 shows the computed energy
consumption for each task performed by each node.
Initially the nodes are deployed randomly; they are then
initialized and start their normal tasks. Thirty sensor
nodes were randomly deployed in an area of 100×100 m²,
with a uniform communication range of 25 meters.
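The deployment described above can be reproduced in outline with a short Python sketch. It is not the authors' MATLAB code; the uniform random placement, the seed and the function names are assumptions, while the node count, area and communication range are taken from the text.

```python
import math
import random

def deploy(n=30, side=100.0, seed=0):
    """Place n nodes uniformly at random in a side x side square (meters)."""
    rng = random.Random(seed)
    return [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]

def build_edges(nodes, comm_range=25.0):
    """Link every pair of nodes that lie within the communication range."""
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if math.dist(nodes[i], nodes[j]) <= comm_range:
                edges.append((i, j))
    return edges

nodes = deploy()
edges = build_edges(nodes)
density = len(nodes) / (100.0 * 100.0)  # nodes per square meter
print(len(nodes), len(edges), density)
```

The computed density, 30 nodes over 10,000 m², gives the 0.003 nodes/m² reported for the simulation environment.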
6.1 Fault detection and diagnosis 
The nodes were deployed randomly, and nodes in the WSN
then gradually become faulty. The faults are detected
and, consequently, recovery is carried out by the nodes.
The scenario just after deployment is presented in the
first quadrant of Figure 9. The second quadrant of
Figure 9 shows the edges formed between the nodes, which
depend on the transmission radius of the sensor nodes.
Faults are then detected using the fault detection
algorithm proposed by Mitra and De Sarkar (2014), and
the faulty nodes are marked in red in the third and
fourth quadrants of Figure 9. Five out of thirty nodes
were detected to be faulty. Figure 10 shows specifically
the faulty nodes together with the affected links.
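The deployment and edge-development steps described above can be sketched in Python. This is an illustrative sketch only: the helper names `deploy`, `build_edges` and `affected_links` and the fixed random seed are assumptions made here, and the random layout will not match the one shown in Figure 9. The faulty set used below contains only the two recovering nodes discussed in the cases of Section 6.2.

```python
import math
import random
from itertools import combinations
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int]


def deploy(n: int, side: float, seed: int = 0) -> Dict[int, Point]:
    """Place n nodes (IDs 1..n) uniformly at random in a side x side square."""
    rng = random.Random(seed)
    return {i: (rng.uniform(0, side), rng.uniform(0, side)) for i in range(1, n + 1)}


def build_edges(positions: Dict[int, Point], cr: float) -> List[Edge]:
    """Connect every pair of nodes whose distance is within the range cr."""
    return [
        (a, b)
        for a, b in combinations(sorted(positions), 2)
        if math.dist(positions[a], positions[b]) <= cr
    ]


def affected_links(edges: List[Edge], faulty: Set[int]) -> List[Edge]:
    """Return the links that touch at least one faulty node."""
    return [e for e in edges if e[0] in faulty or e[1] in faulty]


positions = deploy(n=30, side=100.0)
edges = build_edges(positions, cr=25.0)
faulty = {1, 4}  # illustrative: the two recovering nodes of Cases 1 and 2
broken = affected_links(edges, faulty)
density = len(positions) / (100.0 * 100.0)  # nodes per square meter
print(len(edges), len(broken), density)
```

With 30 nodes in a 100 × 100 m² area the node density evaluates to 0.003 nodes/m², which agrees with Table 3.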
 
Parameter          Value
Frequency Range    2.4 – 2.48 GHz
Data Rate          250 Kbps
Current Draw       16 mA @ Receive mode
                   17 mA @ Transmit mode
                   8 mA @ Active mode
                   8 µA @ Sleep mode
Table 2: Sensor Node Specifications.
Parameter                 Value
No. of Nodes Deployed     30
Area Covered              100 × 100 m²
Communication Range       25 m
Node Density (ρ)          0.003 nodes/m²
Table 3: Simulation Environment.
Task Performed        Energy Consumed (in mJ)
Data Sensing          0.0018
Data Processing       0.0513
Data Transmission     0.1864152
Data Receiving        0.0627456
Self Evaluation       0.12
Table 4: Energy Consumption for Various Tasks
Performed by Sensor Nodes [18].
6.2 Fault recovery 
After the faults are detected, the recovery activities
start. Faulty nodes enter recovery and are referred to
here as recovering nodes (RN); their neighbors are the
affected nodes (AN). In Figure 10 the red nodes are RNs
and the red links are affected links, which will become
defunct. A list of susceptible parents for each AN is
given in Table 5. Each AN has to select a suitable node
from this list to maintain connectivity in the absence
of its RN.
 
 
 
 
Figure 9: Node Deployments and Fault Scenario. 
Case 1 (Recovery for Node 1): Here node ID 1 is the
RN; after becoming faulty it goes into recovery mode and
sends probe packets to the neighboring nodes within its
communication range. The IDs of the ANs are 9, 12 and
20. These nodes update their neighbor lists, and each of
them tries to find a node that will act as its new
parent. Nodes 9, 12 and 20 each have multiple
susceptible parents, and they select node IDs 18, 23 and
6 respectively as their new parents. Node ID 9 has 5
susceptible parents, out of which it selects node ID 18
because its transmission power factor is the minimum
among all possible parents (as presented in Table 5).
Similarly, node IDs 12 and 20 have 5 and 6 susceptible
parents respectively, and node IDs 23 and 6 are selected
as the actual parents because they have the lowest
transmission power factor. In all three situations the
selection is made to minimize power consumption. The
results supporting the new parent selection from each
node's susceptible parent list are presented in Table 5.
Case 2 (Recovery for Node 4): In the second case
node ID 4 is the RN and its ANs are 7, 11, 28 and 29. As
in Case 1, every AN finds a suitable parent for
transmitting data. Nodes 7, 11, 28 and 29 each have
multiple susceptible parents (listed in Table 5), and
each of them selects one specific node as its new
parent. They also update their neighbor lists.
Node IDs 7 and 11 each have 6 susceptible parents;
node ID 7 selects node ID 10 as its immediate parent and
node ID 11 selects node ID 22 as its new parent. This
selection is based on the minimum power factor for these
nodes. A lower power factor means lower power
consumption for transmission. Similarly, node IDs 28 and
29 select node IDs 2 and 30 respectively as their new
parents. In all four situations the selections are made
to keep the node's power consumption low in comparison
to the other candidates. The results supporting the new
parent selection from each node's susceptible parent
list are presented in Table 5.
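The minimum-power-factor rule used in both cases can be checked directly against the published values. The short Python sketch below copies the susceptible parents and power factors from Table 5; the names `TABLE_5` and `select_parent` are introduced here for illustration and are not part of the paper.

```python
# Susceptible parents and their power factors, copied from Table 5.
# Outer key: affected node (AN) ID; inner key: susceptible parent ID.
TABLE_5 = {
    9: {3: 1.42, 5: 1.20, 13: 2.15, 18: 1.08, 19: 1.86},
    12: {5: 2.89, 6: 2.53, 14: 2.30, 18: 2.30, 23: 1.79},
    20: {3: 2.10, 5: 1.51, 6: 1.44, 13: 2.69, 17: 2.85, 19: 2.71},
    7: {2: 2.72, 10: 1.93, 15: 3.10, 16: 3.83, 26: 3.25, 30: 3.99},
    11: {2: 2.47, 3: 2.46, 13: 2.15, 17: 2.16, 22: 1.91, 29: 2.15},
    28: {2: 2.82, 3: 3.01, 15: 3.11, 16: 2.96, 26: 3.14},
    29: {11: 1.51, 16: 1.65, 28: 1.46, 30: 1.19},
}


def select_parent(power_factors):
    """Return (parent ID, power factor) for the lowest power factor."""
    parent = min(power_factors, key=power_factors.get)
    return parent, power_factors[parent]


new_parents = {an: select_parent(factors) for an, factors in TABLE_5.items()}
for an, (parent, factor) in new_parents.items():
    print(f"AN {an} -> new parent {parent} (power factor {factor})")
```

Running the sketch yields the parents 18, 23, 6, 10, 22, 2 and 30 for ANs 9, 12, 20, 7, 11, 28 and 29, the same selections reported in Table 6.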
7 Discussion 
Table 6 lists all the ANs along with their transmission
power factor and their distance from the new parent. The
simulation shows that, in the absence of the RN, the ANs
select as their new parent those nodes through which
they can forward data towards the BS.
Node 28 has to raise its power (multiplication) factor
as high as 2.82 in order to avoid isolation. For nodes
11, 28 and 29 the distance to the new parent is greater
than their distance to node 4, so they have to raise
their transmission power.
The activities are shown here for two of the RNs, with
IDs 1 and 4; the same activities are carried out for the
other RNs. All RNs send probe packets to their
neighbors, and the ANs in turn perform the topology
maintenance tasks. The RNs are then reinitialized, and
the values of the system variables DCkpt and TD are
fetched, since they contain the last data checkpoint and
state checkpoint of the node. After that, link quality
estimation is performed to check the traffic situation
in the vicinity, and finally the node rejoins the
network.
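The recovery sequence summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Node` class, its field names other than DCkpt and TD, and the `lqe` callable are hypothetical, and the link quality estimation of [21] is represented only by a placeholder. The parent re-selection performed by the ANs while the RN is away is omitted here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Node:
    node_id: int
    neighbors: Set[int] = field(default_factory=set)
    data: List[float] = field(default_factory=list)   # working data
    trust: float = 1.0                                # working trust degree
    dckpt: List[float] = field(default_factory=list)  # DCkpt: last data checkpoint
    td: float = 1.0                                   # TD: last state checkpoint


def recover(rn: Node, network: Dict[int, Node], lqe: Callable[[Node], None]) -> List[str]:
    """Run the recovery steps for one recovering node and return the step log."""
    steps: List[str] = []
    # 1. Probe packets: each affected neighbor drops the RN from its table.
    for an_id in rn.neighbors:
        network[an_id].neighbors.discard(rn.node_id)
    steps.append("probe")
    # 2. Reinitialize the RN from its data and state checkpoints.
    rn.data = list(rn.dckpt)
    rn.trust = rn.td
    steps.append("restore")
    # 3. Link quality estimation before rejoining (placeholder callable).
    lqe(rn)
    steps.append("lqe")
    # 4. Rejoin: the neighbors list the node again.
    for an_id in rn.neighbors:
        network[an_id].neighbors.add(rn.node_id)
    steps.append("rejoin")
    return steps


# Hypothetical three-node network: node 1 is faulty and holds corrupted values.
network = {
    1: Node(1, neighbors={9, 12}, data=[-1.0], trust=0.2, dckpt=[21.5, 21.7], td=0.9),
    9: Node(9, neighbors={1}),
    12: Node(12, neighbors={1}),
}
log = recover(network[1], network, lqe=lambda node: None)
print(log, network[1].data, network[1].trust)
```

After the call, node 1 holds the checkpointed data and trust degree again and reappears in the neighbor tables of nodes 9 and 12.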
 
 
 
Figure 10: Affected Links due to Node Fault. 
 
 
 
 
 
AN ID   Susceptible Parent IDs    Power Factor for Each Parent
9       3, 5, 13, 18, 19          1.42, 1.20, 2.15, 1.08, 1.86
12      5, 6, 14, 18, 23          2.89, 2.53, 2.30, 2.30, 1.79
20      3, 5, 6, 13, 17, 19       2.10, 1.51, 1.44, 2.69, 2.85, 2.71
7       2, 10, 15, 16, 26, 30     2.72, 1.93, 3.10, 3.83, 3.25, 3.99
11      2, 3, 13, 17, 22, 29      2.47, 2.46, 2.15, 2.16, 1.91, 2.15
28      2, 3, 15, 16, 26          2.82, 3.01, 3.11, 2.96, 3.14
29      11, 16, 28, 30            1.51, 1.65, 1.46, 1.19
Table 5: Susceptible Parent List for Affected Nodes.
RN ID   AN ID   New Parent Node ID   Power Factor   Distance of AN from New Parent (in meters)
1       9       18                   1.08           24.56
1       12      23                   1.79           28.39
1       20      6                    1.44           26.41
4       7       10                   1.93           17.45
4       11      22                   1.91           30.54
4       28      2                    2.82           38.95
4       29      30                   1.19           27.02
Table 6: New Parent Selections.
8 Conclusion 
WSNs are now widely used for surveillance in various
fields, and distributed fault tolerance is necessary for
their reliability and dependability. A novel fault
recovery architecture is designed and proposed in this
paper; it is intended to be integrated with a fault
tolerant framework for wireless sensor networks. This
paper also presented algorithms for fault recovery and
connectivity maintenance in WSN. These algorithms
explain in detail how the recovery tasks are carried
out. A brief discussion covers the detection of faults,
followed by different recovery cases.
The proposed recovery technique handles recovery
actions for faults due to hardware or software failure.
It also improves link quality or connectivity among the
nodes during recovery phases. However, noise-related
measurement, or error due to the presence of noise, is
outside the scope of this paper. In future, this
research will enhance the recovery scheme with
self-organization and noise-related measurement based
recovery. As future work, this research will also
present a result-interpretation based comparative study
of recovery schemes.
9 References 
[1] Akbari A., Dana A., Khademzadeh A. & 
Beikmahdavi N. (2011) “Fault Detection and 
Recovery in Wireless Sensor Network using 
Clustering” in International Journal of Wireless
& Mobile Networks (IJWMN) Vol. 3, Issue 1, 130-
138 
[2] Akbari A., Beikmahdavi N., Khosrozadeh A., 
Panah O., Yadollahi M. & Jalali S. V. (2010) “A 
Survey Cluster-Based and Cellular Approach to 
Fault Detection and Recovery in Wireless Sensor 
Networks” in World Applied Sciences Journal Vol. 
8 Issue 1 76-85 
[3] Bathla G. & Jindal S. (2016) “A Review of RIM 
and LeDiR recovery mechanism for node recovery 
in Wireless Sensor Actor Network” in International 
Journal of Engineering Development and Research 
Vol. 4, Issue 2, 2145-2147 
[4] Borawake-Satao R. & Prasad R. S. (2017) “Mobile 
Sink with Mobile Agents: Effective Mobility 
Scheme for Wireless Sensor Network” published in 
International Journal of Rough Sets and Data 
Analysis Vol. 4 Issue 2 24-35 
[5] Brahme C., Gadadare S., Kulkarni R., Surana P. & 
Marathe M.V. (2014) “Fault Node Recovery 
Algorithm for a Wireless Sensor Network” in 
International Journal of Emerging Engineering 
Research and Technology Vol. 2, Issue 9, 70-76 
ISSN 2349-4395 (Print) & ISSN 2349-4409 
(Online) 
[6] Chen J., Kher S. & Somani A. (2006) “Distributed 
Fault Detection of Wireless Sensor Networks” in 
Proc. of Workshop on Dependability issues in 
Wireless Ad hoc Networks and Sensor Networks 
(DIWANS), pp. 65-72 published by ACM, Los 
Angeles, CA, USA 
[7] Graham B., Tachtatzis C., Franco F. D., Bykowski 
M., Tracey D. C., Timmons N. F. & Morrison J. 
(2011) “Analysis of the Effect of Human Presence 
on a Wireless Sensor Network” published in 
International Journal of Ambient Computing and 
Intelligence (IJACI), Vol. 3 Issue 1, 1-13 
[8] Haboush A., Mohanty M. N., Pattanayak B. K. & 
Al-Tarazi M. (2014) “A Framework for Wireless 
Sensor Network Fault Rectification” published in 
International Journal of Multimedia and Ubiquitous 
Engineering Vol.  9 Issue 1 133-142 
[9] Kang Z., Yu H. & Xiong Q. (2013) “Detection and
Recovery of Coverage Holes in Wireless Sensor
Networks” in Journal of Networks, Vol. 8, Issue 4,
822-828
[10] Karl H. & Willig A. (2005) “Protocols and
Architectures for Wireless Sensor Networks” West 
Sussex, England, John Wiley & Sons Ltd.  
[11] Klus H. & Niebuhr D. (2009) “Integrating Sensor 
Nodes into a Middleware for Ambient Intelligence” 
published in International Journal of Ambient 
Computing and Intelligence (IJACI), IGI Global 
Vol. 1, Issue 4, 1-11 
[12] Kumar S. & Nagarajan N. (2013) “Integrated
Network Topological Control and Key Management 
for Securing Wireless Sensor Networks” published 
in International Journal of Ambient Computing and 
Intelligence (IJACI), Vol. 5 Issue 4, 12-24 
[13] Lakamana V. S. S. K. & Rani S. J. (2015) “Fault 
Node Prediction Model in Wireless Sensor 
Networks Using Improved Generic Algorithm” in 
International Journal of Computer Science and 
Information Technologies, Vol. 6 Issue 4, 3501-
3503 
[14] Leskovec J., Sarkar P. & Guestrin C. (2005)
“Modelling Link Qualities in a Sensor Network” 
published in Informatica Vol. 29 445–451 
[15] Liu X. (2006) “Coverage with Connectivity in 
Wireless Sensor Networks” in Proc. Of Basenet 
2006, in conjunction with BroadNets, San Jose, CA 
[16] Ma C., Lin X., Lv H. & Wang H. (2009) “ABSR: 
An Agent based Self-Recovery Model for Wireless 
Sensor Network” in Proc. Of Eighth IEEE 
International Conference on Dependable, 
Autonomic and Secure Computing, pp. 400-404, 
Chengdu, China 
[17] Mishal M.D., Narke V.A., Shinde S.P., Zaware 
G.B. & Salve S. (2015) “Fault Node Recovery For 
A Wireless Sensor Network” in Multidisciplinary 
Journal of Research in Engineering and 
Technology, Vol. 2, Issue 2, 476-479 
[18] Mitra S. & Sarkar A. D. (2014) “Energy Aware 
Fault Tolerant Framework in Wireless Sensor
Network” in Proc. Of AIMoC 2014 pp. 139-145 
published by IEEE, Kolkata, India 
[19] Mitra S., Das A. & Mazumdar S. (2016) 
“Comparative Study of Fault Recovery Techniques 
in Wireless Sensor Network” in Proc. Of WIECON-
ECE 2016 pp. 130-133, published by IEEE, 
AISSMS, Pune, India 
[20] Mitra S., Sarkar A. D. & Roy S. (2012) “A Review 
of Fault Management System in Wireless Sensor 
Network” in Proc. of International Information 
Technology Conference, CUBE pp. 144-148 
published by ACM, Pune India 
[21] Mitra S., Roy S. & Das A. (2015) “Parent Selection 
Based on Link Quality Estimation in WSN” in 
Advances in Intelligent Systems and Computing 
(AISC) Vol. 379,  Proc. of IC3T, pp. 629-637, 
published by Springer, Hyderabad, India
[22] Mukherjee A.,  Dey N.,  Kausar N., Ashour A. S.,  
Taiar R. & Hassanien A. E. (2016) “A Disaster 
Management Specific Mobility Model for Flying 
Ad-hoc Network” published in International Journal 
of Rough Sets and Data Analysis Vol. 3 Issue 3 72-
103 
[23] Nithilan N. & Renold A. P. (2015) “On-Demand 
Checkpoint And Recovery Scheme For Wireless 
Sensor Networks” in ICTACT Journal On 
Microelectronics, Vol. 1, Issue 1, 35-40, ISSN 
online (2395-1680) 
[24] Nguyen K. V., Nguyen P. L., Phan H. & Nguyen T. 
D. (2016) “A Distributed Algorithm for Monitoring 
an Expanding Hole in Wireless Sensor Networks” 
published in Informatica Vol. 40 181–195 
[25] Reghunath E. V., Kumar P. & Babu A. (2014) 
“Enhancing the Life Time of a Wireless Sensor 
Network by Ranking and Recovering the Fault 
Nodes” published in International Journal of 
Engineering Trends and Technology (IJETT) Vol. 
15 Issue 8, 410-413 
[26] Saleh I., Eltoweissy M., Agbaria A. & El-Sayed H. 
(2007) “A Fault Tolerance Management Framework 
for Wireless Sensor Networks” published in Journal 
of Communications, Vol. 2, Issue 4, 38-48  
[27] Wan J., Wu J. & Xu X. (2008) “A Novel Fault 
Detection and Recovery Mechanism for Zigbee 
Sensor Networks” in Proc. Of Second International 
Conference on Future Generation Communication 
and Networking, pp. 270-274, Hainan Island, China 
published by IEEE 
[28] Yim S. J., & Choi Y.H. (2010) “An Adaptive Fault-
Tolerant Event Detection Scheme for Wireless 
Sensor Networks” published in Sensors Journal, 
Vol. 10 Issue 3 2332-2347