https://doi.org/10.31449/inf.v42i3.1497 Informatica 42 (2018) 417–438 417 A Hybrid Particle Swarm Optimization and Differential Evolution Based Test Data Generation Algorithm for Data-Flow Coverage Using Neighbourhood Search Strategy Sapna Varshney and Monica Mehrotra Department of Computer Science, Jamia Millia Islamia, India E-mail: sapna_varsh@yahoo.com, drmehrotra2000@gmail.com Keywords: search based software testing, particle swarm optimization, differential evolution, data flow testing, dominance tree Received: January 15, 2017 Meta-heuristic search techniques, mainly Genetic Algorithm (GA), have been widely applied for automated test data generation according to a structural test adequacy criterion. However, it remains a challenging task for more robust adequacy criterion such as data-flow coverage of a program. Now, focus is on the use of other highly-adaptive meta-heuristic search techniques such as Particle Swarm Optimization (PSO) and Differential Evolution (DE). In this paper, a hybrid (adaptive PSO and DE) algorithm is proposed to generate test data for data-flow dependencies of a program with a neighbourhood search strategy to improve the search capability of the hybrid algorithm. The fitness function is based on the concepts of dominance relations and branch distance. The measures considered are mean number of generations and mean percentage coverage. The performance of the hybrid algorithm is compared with that of DE, PSO, GA, and random search. Over several experiments on a set of benchmark programs, it is shown that the hybrid algorithm performed significantly better than DE, PSO, GA and random search in data-flow test data generation with respect to the measures collected. Povzetek: Razvit je nov algoritem kot kombinacija hibridnega roja delcev in diferenčne evolucije z uporabo sosednje iskalne strategije. 1 Introduction Software testing aims at assessing the quality and reliability of software product by detecting as many defects as possible. The cost of software testing increases exponentially with the size of input search space, thereby making manual testing a difficult and tedious task. There are software testing tools available with capture and playback features to automate the execution of test scripts. However, the test cases are manually selected by the human tester and may not be optimal. It is therefore desirable to generate optimal test data that reveals as many errors as possible according to a test adequacy criterion [1]. Structural (white-box) testing tests software for its structure and has the inherent capability to expose faults. The structural test adequacy criteria can be statement coverage, branch coverage, or path coverage that aim at executing every statement, branch or path respectively at least once. Data-flow coverage, an effective and robust test adequacy criterion, focuses on the definition and usage of variables in a program. Data- flow testing, therefore, could lead to more efficient and targeted test suites. The attempts to reduce the cost of software testing by automating the process of software test data generation have been constrained by the ever increasing size and complexity of software. In the early period of automated test data generation, gradient descent and meta-heuristic search (MHS) algorithms such as Tabu Search, Hill Climbing and Simulated Annealing [2, 3, 4]. In the past two decades, evolutionary search-based algorithms such as Genetic Algorithm (GA) have been widely employed for test data generation as an effective alternative [5, 6, 7, 8, 9]. A search-based approach captures the test adequacy criteria as a fitness function that is used to guide the search. Due to an extensive application of search-based algorithms to test data generation problem, the approach has come to be known as Search Based Software Testing (SBST, coined by Harman and Jones). Recently, the focus is on the use of other highly adaptive search-based techniques such as Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO) and Differential Evolution (DE). It has been observed that GA and ACO have slow convergence towards the optimal solution. PSO and DE are conceptually very simple and the knowledge of previous good solutions is retained by all the members of the current population by means of constructive cooperation among them. PSO and DE have been found to be robust in solving optimization problems; however, the performance depends on control parameters. PSO has been shown to be well suited for test data generation with better performance than GA [10, 11, 12, 13, 14]. Hybridization of search-based algorithms for test data generation has also been reported in literature. GA with a local search algorithm [15] and more recently, GA with 418 Informatica 42 (2018) 417–438 S. Varshney et al. PSO has been applied for test data generation in some studies [16, 17, 18, 19, 20, 21]. In this study, we propose a hybrid global search algorithm by combining an adaptive PSO with DE mutation operator to automatically generate test data for data-flow dependencies of a program. In the proposed hybrid algorithm, a new term based on DE differential operator is included for velocity update in PSO for some additional exploration capability. The greedy selection scheme of DE is used wherein position of a particle is updated only if it yields a better fitness value. This results in movement of particles only to better locations in the input search space. A local neighborhood strategy is also included in the proposed hybrid algorithm to explore more promising candidate solutions and overcome the problem of boundary constraints. Design of the fitness function [22] is based on dominance concepts and branch distance that is used to guide the search for optimal test data for data-flow dependencies of a program. The performance of the proposed hybrid algorithm is compared with that of DE, PSO, GA and random search. It is demonstrated that the proposed hybrid algorithm outperformed DE, PSO, GA and random search in terms of mean percentage coverage achieved, and mean number of generations to produce the final test suite for data-flow coverage of a program. The rest of the paper is organized as follows: Section 2 provides a brief description of automated software test data generation process and related work. Section 3 provides an overview of data-flow analysis. Sections 4 and 5 provide a brief description of PSO and DE algorithms. Section 6 describes the proposed hybrid algorithm. Section 7 gives the experimental results. Section 8 provides the discussion and the detailed statistical analysis of the experimental results. Section 9 deals with threats to validity and limitations of the proposed hybrid algorithm. Finally, section 10 gives the conclusion. 2 Related work This section presents the methods to generate test data for software structural testing and the related literature. Symbolic execution, a static method, has been employed for test data generation [2]; however, the performance is constrained by programming constructs such as pointers, loop conditions with input variables, array subscripts and procedure calls [23]. Dynamic methods that have been employed for test data generation can be classified as random, path-oriented and goal-oriented techniques [9, 23]. A random test data generator arbitrarily selects test data from the input domain. Though easy to implement, it may fail to find optimal test data. Path-oriented test data generator [5] uses control flow information to identify a set of independent paths to generate test data. However, it does not work well with infeasible paths or paths that contain loops. A goal-oriented test data generator [9, 23, 24] generates test data for a selected goal such as a statement or a branch, irrespective of the path taken. The meta-heuristic search techniques guided by a fitness function have been adopted to generate optimal test data mainly according to a structural test adequacy criterion. From the literature on structural test data generation, it can be inferred that branch coverage and path coverage are the most often used and well- understood measures [25]. For branch coverage, fitness values are calculated by finding approximation level and branch distance for a target branch from control flow graph [8, 26]. Data-flow coverage criterion has not been used much [27] due to difficulty in writing test cases that satisfy data-flow dependencies of a program. Wegener et al. [28] defined different types of fitness functions for structural testing; data-flow test criteria being classified as node-node-oriented methods. Recently only there has been more work on search based test data generation for data-flow coverage using GA as the algorithm of choice [6, 7, 22, 24, 29, 30]. Now, other highly adaptive search- based techniques such as PSO [14, 18] and ACO [31] are also being applied to generate test data for data-flow coverage due to simplicity and faster convergence. ACO [32] and Harmony Search [33] has also been applied to generate structural test data for branch coverage. Vivanti et al. [30] have proposed a GA-based technique for data-flow coverage evaluated on open source Java applications. The results have indicated the scalability and applicability of data-flow criteria for test data generation. In our previous work [22], an elitist GA-based approach is proposed to generate test data for data-flow dependencies of a program using dominance concepts and branch distance. The fitness function is derived from the work by Ghiduk et al. [6]; it is augmented with branch distance to produce a smoother landscape for guiding the search and also takes into account that a definition may be killed by another definition before the associated use is reached. The performance of the proposed approach is compared with random search and earlier studies on test data generation for data-flow dependencies of a program by Girgis [7], Ghiduk et al. [6] and Girgis et al. [21]. The proposed GA-based approach guided by the novel fitness function outperformed random search and the earlier studies [6, 7, 21] to generate test data for data-flow coverage of a program. Windisch et al. [10] applied PSO to artificial and complex industrial test objects to generate test data for branch coverage. Their results showed efficiency and efficacy of PSO over GA for most code elements to be covered. Agarwal et al. [11] applied binary PSO, Agarwal and Srivastava [12] applied discrete quantum PSO and Mao [13] applied standard PSO to generate test data for branch coverage test adequacy criterion. Nayak and Mohapatra [14] proposed an algorithm to generate test cases using PSO for data flow coverage. This technique cannot rank test cases because the fitness function, as simply taken from Girgis [7], assigns the same fitness value to all the test cases that cover the same number of test requirements and a fitness value of 0 to all the test cases that do not cover any test requirement or A Hybrid Particle Swarm Optimization ... Informatica 42 (2018) 417–438 419 cover a partial aim. Here, the fitness function is unable to guide the search. Application of hybrid algorithms have also been studied for test data generation problem. Zhang et al. [16] proposed a hybrid algorithm (GA and PSO) to generate test data for path coverage. GA and PSO operations are applied to two population sets. Triangle classification problem is taken as the case study and the hybrid algorithm is compared with GA and PSO. The hybrid algorithm is shown to be better than GA and PSO with respect to number of iterations. The average time taken is found to be more than PSO but less than GA. Their hybrid technique is complicated and may generate redundant test cases for automatic test data generation. Li et al. [17] also proposed a hybrid algorithm (GA and PSO) to generate test data for path coverage. PSO equations to update particle’s velocity and position distance are used instead of mutation operator of GA. The algorithm is applied only to the triangle benchmark problem. Singla et al. [18] applied a hybrid algorithm (GA and PSO) to generate test data for data-flow coverage. The fitness function used is same as in [6]; it does not take into account the traversal of killing nodes as well as closeness of test data in case if only partial aim is covered. The strategy is tested only on some simple programs. Kaur and Bhatt [19] proposed a hybrid algorithm (GA and PSO) to prioritize test data in regression testing. The algorithm has been tested on few simple programs. Girgis et al. [21] proposed a hybrid Genetical Swarm Optimization (GSO) Technique to generate a set of test paths that cover the all-uses criterion for data-flow coverage. The authors have claimed that the set of paths generated by the proposed GSO can be passed to a test data generation tool to find program inputs that will execute them to complete the data flow paths testing of the program under test. The fitness function used is same as in [7]; it is not able to guide the search and results in loss of valuable information in case if only partial aim is covered. Chawla et al. [20] proposed a hybrid PSO and GA algorithm for automatic generation of test suites with branch coverage as the test adequacy criterion. The experiments are performed with ten Java container classes. The algorithm is shown to perform better than GA, PSO and existing hybrid strategies based on GA and PSO. Each optimization algorithm has its own advantages and disadvantages. Also, one optimization algorithm will not work well for all the optimization problems. DE, a meta-heuristic search-based algorithm, has been applied to several optimization problems [34, 35] to demonstrate its potential. Das et al. [36] has explored hybridization of PSO with DE applied to the design of digital filters. However, DE has not been applied for test data generation and optimization problem [25, 27, 37]. The proposed study will focus on the application of a hybrid adaptive PSO-DE algorithm to generate test data for data-flow dependencies of a program. The proposed hybrid global search algorithm combines the evolution scheme of both PSO and DE incorporating the best of both the algorithms in the context of test data generation. A new term based on DE differential operator is included for velocity update in PSO. The greedy selection scheme of DE is also used wherein position of a member is updated only if it yields a better fitness value. The hybridization scheme has resulted in movement of particles only to better locations in the input search space. The design of fitness function [22] is based on the dominance relations between the nodes of a program’s control flow graph augmented with branch distance which produces a smoother landscape for guiding the search. This leads to faster and better convergence of test data to achieve the desired coverage. A neighborhood search strategy is also incorporated into the proposed hybrid algorithm that further helps in overcoming the problem of boundary constraints and local optima by exploring more promising candidate solutions. This is the main contribution of this paper. The proposed hybrid algorithm generates test data for one test requirement at a time; other test requirements are also checked for coverage thereby reducing the overall number of fitness evaluations. 3 Data flow analysis In this study, data-flow coverage is used as the test adequacy criteria. Data-flow analysis [38] augments the control-flow testing criteria; the emphasis is on the definition and use of the variables in a program. The control flow of a program is represented by a directed graph G (V, E) also known as control flow graph (CFG), where V is the set of all the nodes and E is the set of all the edges in the graph. Each node corresponds to a program statement or group of sequential program statements and an edge represents flow of control from one node to another. There are two distinct nodes: an entry node n 0 and an exit node n end. Node n dominates node m (dominance relationship) if every path from entry node n 0 to m contains n. By applying dominance relationship to all the nodes of CFG, a tree can be obtained that is rooted at n 0. This tree is called the dominator tree [39]. For each node m in the CFG, Dom (m) is the set of all the nodes that dominate node m. Figure 2 gives the CFG of the example program as given in Figure 1. The dominator tree is shown in Figure 3. For example, Dom (12) = {1, 2, 6, 7, 12}. In a program, the definition and use occurrences of each variable are identified. A variable is said to be defined in a program statement (def-node) if a value is associated with the variable. A variable is said to be used in a program statement if its value is referenced for computational use (c-use node) or a predicate use (p-use node). Data-flow testing should cause the traversal of def-clear sub-paths from the variable definition to either some or all of the p-uses, c-uses, or their combination. Empirically, the all-uses criterion has been shown to be most effective compared to the other data-flow criteria [40]. A def-clear path does not include any intermediate nodes containing other definitions of that variable (killing nodes). A def-clear path can be further 420 Informatica 42 (2018) 417–438 S. Varshney et al. categorized as a dcu-path (c-use of the variable) or a dpu- path (p-use of the variable). For the example program, Table 1 provides definition and use nodes for each variable, Table 2 provides the list of all-def-use paths and Table 3 provides the dominance paths for the nodes of the program flow graph. #include #include 1 1 void main() { 2 1 int a, b, c, valid; 3 1 printf(“\nEnter the value of three sides: “); 4 1 scanf(“%d %d %d”, &a, &b, &c); 5 1 valid=0; 6 2 if((a>=0)&&(a<=100)&&(b>=0)&&(b<=100)&&(c>=0) &&(c<=100)) { 7 3 if(((a+b)>c)&&((c+a)>b)&&((b+c)>a)) { 8 4 valid=1; 9 5 } 10 5 } 11 6 if (valid==1) { 12 7 if ((a==b)&&(b==c)) 13 8 printf(“\nEquilateral triangle.”); 14 9 else if ((a==b)||(b==c)||(c==a)) 15 10 printf(“\nIsosceles triangle.“); 16 11 else 17 11 printf((“\nScalene triangle.“); 18 12 } else { 19 13 printf(“\n Invalid input ”).; 20 14 } 21 15 } Figure 1: Triangle classification program. Table 1: List of variables and def-use occurrences in the example program Variable def Node c-use Node p-use Edge a b c 1 None 2-3 2-6 3-4 3-5 7-8 7-9 9-10 9-11 valid 1,4 None 6-7 6-13 Table 2: List of def-use paths for the example program. Path No. def-use Path (Terminates with -1 for c-use) Killing Node(s) 1 1-2-3 None 2 1-2-6 None 3 1-3-4 None 4 1-3-5 None 5 1-7-8 None 6 1-7-9 None 7 1-9-10 None 8 1-9-11 None 9 1-6-7 4 10 1-6-13 4 11 4-6-7 None 12 4-6-13 None Figure 2: CFG of the example program. Figure 3: Dominator tree for the example Table 3: Dominance paths for the nodes of the CFG. Node No. Dominance Path 1 1 2 1-2 3 1-2-3 4 1-2-3-4 5 1-2-3-5 6 1-2-6 7 1-2-6-7 8 1-2-6-7-8 9 1-2-6-7-9 10 1-2-6-7-9-10 11 1-2-6-7-9-11 12 1-2-6-7-12 13 1-2-6-13 14 1-2-6-14 15 1-2-6-14-15 13 14 15 T Predicate Node 7 8 9 10 11 12 T T F F F Predicate Node Predicate Node Predicate Node Predicate Node 1 4 5 6 2 3 T T F F 1 2 3 4 5 6 7 13 14 8 9 12 10 11 15 A Hybrid Particle Swarm Optimization ... Informatica 42 (2018) 417–438 421 4 Particle swarm optimization In 1995, Kennedy and Eberhart [41] introduced Particle Swarm Optimization algorithm, a population-based search algorithm based on the social and cognitive behavior of different swarms such as flock of birds, herd of animals or school of fishes. The application of PSO for solving many continuous space problems in the field of Computer Science and Engineering has demonstrated its potential. Unlike GA, PSO does not use evolution operators such as crossover and mutation. Instead, each member of the swarm (called particle) attains optimal solution by learning from its own experience and the experience of other members of the swarm. Each particle maintains its current position, current velocity and the best position it has achieved so far, called pbest. The global best position of the swarm is called gbest. Both pbest and gbest are used by the particle in determining its next best position in the swarm. Thus, the knowledge of previous good solutions is retained by all the particles resulting in a faster convergence towards the optimal solution. Consider a swarm of n particles denoted as (p 1, p 2... p n). Position of the i th particle in the d-dimensional search space is denoted as X i = (Xi 1 , X i 2 …X i d ) and the associated velocity is denoted as V i = (V i 1 , V i 2 …V i d ). The personal best position of the i th particle in dimension d is denoted as pbest i d . The position of the best particle of the entire swarm in dimension d is denoted as gbest d . The velocity and position of the i th particle in dimension d can be updated by Equations 1 and 2 as given below. Vi d = w×Vi d + c1×r1×(pbesti d - Xi d ) + c2×r2×(gbest d – Xi d ) (1) Xi d = Xi d + Vi d (2) where, c 1 and c 2 are positive learning constants called cognitive and social scaling parameters chosen in such a way that their sum never exceeds 4, and r 1 and r 2 are two random numbers in the range [0,1]. The inertia weight w controls the impact of the previous history on the new velocity of the i th particle. A particle’s velocity in each dimension is clamped to a maximum magnitude V max. The position and velocity of each particle in the swarm are continuously updated until an optimal solution is achieved. 4.1 Adaptive inertia weight In PSO algorithm, a large value of inertia weight facilitates exploration (global search) of the input search space and a small value of inertia weight facilitates exploitation (local search) of the input search space for the optimal solution. Various inertia weighting strategies used in the literature have been categorized into constant, random, time varying and adaptive inertia weight strategies [42]. In constant and random inertia weight strategies, value of inertia weight is either constant or is chosen randomly during the search. In time varying inertia weight strategies, inertia weight is defined as a function of time or iteration number. Here, value of inertia weight is independent of the state of the particles in the search space. In adaptive inertia weight strategies, state of the particles in the search space (feedback mechanism) is used to adjust the value of the inertia weight. In this study, fitness value of the particles is used to adjust the inertia weight. Ratio α of the particle’s fitness to the average fitness of the swarm is calculated as shown in Equation 3 below: α = f i / fmax (3) Here, f i=fitness of i th particle and f max is the maximum fitness achieved by the particles in the swarm. The range of α is [0, 1]. For lower values of α, increasing inertia weight can strengthen the particle’s search capability. For values of α that are closer to 1, smaller inertia weight should be used. The inertia weight w i for the i th particle is therefore defined as a linear function of α and is calculated as follows: wi = 0.5×(1-α) + 0.5 (4) The range of the inertia weight is [0.5, 1]. PSO is computationally inexpensive. The ability of PSO to balance between local exploitation and global exploration of the search space enhances searching ability and avoids premature convergence towards the optimal solution. 5 Differential evolution Differential Evolution (DE) algorithm was given by Storn and Price [43] in 1995. It is a stochastic population-based global optimization algorithm that uses an evolutionary differential operator to create new offspring from parent chromosomes. Unlike GA, DE works upon real-valued chromosomes. The differential operator of DE replaces the classical crossover and mutation operators of GA. Let’s say, the initial population consists of n vectors denoted as (p 1, p 2... p n). Position of the i th vector in the d- dimensional space is denoted as X i = (Xi 1 , X i 2 …X i d ). These vectors are referred as chromosomes in DE. To change each chromosome (target vector), a difference vector V i is created. In the literature, there are various mutation schemes to create this vector. In this paper, DE/Rand/1 scheme is used. In this scheme, for each i th member X i of the current population, three other members (say r 1, r 2 and r 3) are randomly chosen from the current population. Next, the scaled difference (mutation scaling factor F) of any two of the three vectors is added to the third one to obtain the difference vector V i. The j th component of the difference vector is as given below: vi,j = xr1,j + F×(xr2,j ­xr3,j) (5) To increase the population diversity, a ‘crossover scheme’ is applied. The difference vector exchanges its components with the target vector X i to obtain the offspring/trial vector U i. The most common crossover in DE is ‘uniform crossover’ as given below: ui,j = vi,j if rand(0,1) < CR = xi,j else (6) CR is called the crossover constant. 422 Informatica 42 (2018) 417–438 S. Varshney et al. The final step in DE algorithm is the fitness-based selection of either target vector or trial vector in the next generation. F and CR are the control parameters of DE. The performance of DE depends on the manipulation of target vector and difference vector in order to obtain a trial vector. 6 Proposed hybrid algorithm In the proposed study, an adaptive PSO algorithm is hybridized with the DE algorithm incorporating local neighborhood search strategy. The synergy between PSO and DE algorithms has resulted in a more powerful global search algorithm. The local neighborhood search strategy helps in exploring more promising candidate solutions to overcome the problem of local optima. In the proposed hybrid (adaptive PSO and DE) algorithm, a differential velocity term inspired by the DE mutation scheme is computed by taking the difference of the position vectors of any two distinct particles randomly chosen from the swarm. A random number r is generated between 0 and 1. If r is less than DE crossover probability, Equation 7 (given below) is used to update the velocity of a particle. In Equation 7, the cognitive term (second term) in Equation 1 is replaced by the differential term scaled by DE mutation scaling factor. Vi d = w×Vi d + F×(xj d ­xk d ) + c2×r2×(gbest d – Xi d ) (7) Here, x j and x k denote the position of particles j and k respectively (i≠j≠k) that are randomly chosen from the swarm. A survival of the fittest mechanism is also followed by incorporating the greedy selection scheme of DE as given by Equation 6. Therefore, the particle either moves to a better location or remains at its previous position in the input search space. The current position of a particle will always be its best position. The steps of the proposed hybrid (adaptive PSO and DE) algorithm are given in Figure 5. The flowchart is given in Figure 6. Inputs to the algorithm are an instrumented program, dominator tree of the program, list of def-use paths to be traversed and the killing nodes if any, number of input variables, domain range of each input variable, and the algorithmic parameters: population size, PSO acceleration parameters, PSO maximum velocity, DE mutation scaling factor and DE crossover probability. Adaptive inertia weight is used as given by Equations 3 and 4. For data-flow coverage criterion, the design of fitness function is explained in Section 6.2 below. Initial value of pbest and gbest is 0. The algorithm is run once for each uncovered def-use path. If the selected path is not covered by any member of the current population, fitness value is computed for each member. Accordingly, for each particle, the personal best position pbest and the global best position gbest can be updated. During the evolution process, particle’s position and velocity is adjusted according to Equations 2 and 7 respectively. If the updated position of the particle is out of input domain range, a local neighbourhood strategy is applied. Then, the greedy selection scheme of DE is used to generate the new population. The evolution process continues until the termination criteria is met. The other uncovered paths are also checked for coverage. The output is an optimal test suite and a list of def-use paths marked as covered or uncovered, if any. A tool is developed for instrumenting programs and to generate def-use paths. Dominator tree is generated manually. Infeasible paths, if any, are determined by careful analysis of the program. 6.1 Neighbourhood search strategy Every meta-heuristic search algorithm suffers with the problem of local optima. Another issue related to meta- heuristic search algorithms is boundary constraints. There are no set mechanisms to deal with such problems. Hence, in this study, an effort is also made to handle the problems of local optima and boundary constraints and to improve the exploitation ability of the algorithm. A neighbourhood search strategy (Figure 4) is introduced to sample more promising candidate solutions to overcome these problems. It is summarized as follows: Step 1: For each particle, Euclidean distance is calculated from the other particles in the input search space using the position of particles. Accordingly, other particles within a threshold Euclidean distance (determined by preliminary study to fine-tune the algorithmic parameters) form the neighbourhood. Euclidean distance between two particles X i and X j in the n-dimensional search space is given by the following equation: d ij = √∑(x ik −x jk ) 2 n k=1 (8) Step 2: If a particle’s new position is out of range, other particles in the neighbourhood are evaluated. Step 3: The position of the particle is then replaced with that of the best particle in the neighbourhood instead of a random value. This helps in exploring more promising candidate solutions. 6.2 Design of Fitness Function Def-use associations can be represented as node-node fitness functions [28]. Def-use associations specify the node of definition and the node of use for the program variables in the CFG without specifying a concrete path between the nodes. This implies that the first objective to reach is the definition node and then the use node, without however, specifying a path through the CFG. The distance to a node is represented by the standard minimizing metric given below: node distance=approach level + v(branch distance) (9) A Hybrid Particle Swarm Optimization ... Informatica 42 (2018) 417–438 423 It evaluates to 0 if the target is covered. Approach level is the closest point (a node) of a given execution to the target node. A branch is said to be critical if it leads the program execution away from the target node in a path through the program structure [44]; branch distance is calculated at that particular predicate node using values of the variables according to the formulae given in Table 4 [3] below. Table 4: Branch distance measure for relational and logical predicates. S. No. Predicate (C) Branch Distance Formulae: f(C) 1 Boolean if true then 0 else K 2 x = y if (x-y)=0 then 0 else abs(x-y)+K 3 x ≠ y if abs(x-y)≠0 then 0 else K 4 x > y if (y–x)<0 then 0 else (y-x)+K 5 x ≥ y If (y–x)≤0 then 0 else (y-x)+K 6 x < y if (x–y)<0 then 0 else (x-y)+K 7 x ≤ y if (x–y)≤0 then 0 else (x-y)+K 8 C1 && C2 f(C1) + f(C2) 9 C1 || C2 min(f(C1), f(C2)) K is a failure constant that is added to branch distance if predicate is false Branch distance provides a measure of how close the program execution was to traverse the alternate edge of the critical branch. Branch distance is normalized in the range [0, 1] using a normalization function v, such that the approach level always dominates the branch distance. In our previous study [22], a novel maximizing fitness function is proposed for data-flow coverage adequacy criterion based on the standard metric (Equation 9) and dominator tree. Dominance relations between the nodes of the CFG are used to obtain path- cover for the nodes of the selected def-use path. The fitness function considers each def-use path as two objectives. For a dcu-path, the first objective is to cover the dominance path of the definition node and then to cover the dominance path of the use node. For a dpu- path, the first objective is to cover the dominance path of the definition node and then to cover the dominance paths of the nodes of the p-use edge (u 1, u 2). A dpu-path is formed for both the branches (T/F) of the predicate node. A test case is evaluated with respect to the selected def-use path by executing the program under test with it as an input and recording the nodes that are covered. If a killing node is traversed between the source node and the use node, a fitness value of 0 is assigned to the test case and it is discarded. The fitness value is 1 if all the nodes of the dominance paths of both the objectives are covered; otherwise closeness of the test case to the missed objective (branch distance) is computed. In this work, for fitness maximization, branch distance bch(x, t i) at the critical branch for test case t i and target node x is the reciprocal of the value returned by an appropriate formula from Table 4 i.e. the closer a test case is to cover the required branch, higher is its fitness value. The fitness function uses control-flow information (dominance relations between the nodes of the CFG) augmented with branch distance if a partial aim is achieved. This provides a smoother landscape/guidance to the search process towards the optimal solution. Branch distance is computed using Equation 10 and the Figure 4: Local Neighbourhood Strategy. 424 Informatica 42 (2018) 417–438 S. Varshney et al. fitness functions are given by Equations 11 and 12 as explained below. Branch distance bch (x, t i) for test case t i (i=1...p) and target node x, for fitness maximization, is calculated as follows: bch(x,t i ) = { 1 if the test case t i leads to the target node x 1 f(C) otherwise,using an appropriate formula from Table 4 for the predicate C at the critical branch (10) The fitness function to evaluate the fitness of a test case t i (i=1...p) w.r.t. a dcu-path (d, u, v), where d is the definition node and u is the c-use node of a variable v, is given below: ft(d, u, t i )= 1 2 ×( |cdom(d, t i )| |dom(d)| ×bch(d, t i )+ |cdom(u, t i )| |dom(u)| ×bch(u, t I )) (11) Similarly, the fitness function to evaluate the fitness of a test case t i (i=1...p) w.r.t. a dpu-path (d, (u 1, u 2), v), where d is the definition node and (u 1, u 2) is the p-use edge of a variable v, is given below: ft(d, (u 1 , u 2 ), t i )= 1 3 × ( |cdom(d, t i )| |dom(d)| ×bch(d, t i )+ |cdom(u 1 , t i )| |dom(u 1 )| ×bch(u 1 , t i )+ |cdom(u 2 , t i )| |dom(u 2 )| ×bch(u 2 , t i ) ) (12) In general, • dom(x): set of nodes in the dominance path of the target node x • cdom(x, t i): set of nodes in dom(x) that are covered by test case t i (i=1...p) • bch(x, t i): branch distance for test case t i (i=1...p) and target node x using Equation 9 If a killing node is traversed, a fitness value of 0 is assigned to the test case t i and it is discarded; otherwise Equation 11 or Equation 12 is used to compute the fitness value. Test case t i is said to be optimal if its fitness value is 1 i.e. the target is covered. Consider the def-use path# 5 (1, 7, 8) for coverage from Table 2. This is a dpu-path that tests for ‘Equilateral triangle’ condition. Node 1 (source) and the p-use edge (7, 8) (target) form the two objectives - their dominance paths to be covered by an input test case. There are three cases - if the dominance paths of both the nodes are covered, fitness value of the input test case is 1 and it is optimal. However, if a partial aim is covered (one of the two nodes) or none of the nodes is covered, fitness value of the input test case is computed using Equations 3.2 and 3.4. From Table 3, the dominance paths of the nodes are as given below: dom(d) = dom(1) = {1} dom(u1) = dom(7) = {1, 2, 6, 7} dom(u2) = dom(8) = {1, 2, 6, 7, 8} Case 1: Input test case t1 <2, 2, 2> Path traversed {1, 2, 3, 4, 5, 6, 7, 8, 12, 15} Dominance path of the definition node (node 1) is covered. Dominance path of the first node of the p-use edge (node 7) is covered. Dominance path of the second node of the p-use edge (node 8) is covered. As the dominance paths of both the objectives are covered, the fitness value of the input test case using Equation 3.4 is 1; the input test case t 1 is therefore optimal. Case 2: Input test case t2 <2, 2, 1> Path traversed {1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 15} Dominance path of the definition node (node 1) is covered. Dominance path of the first node of the p-use edge (node 7) is covered. Dominance path of the second node of the p-use edge (node 8) is not covered; the critical node is node 7. The branch distance at node 7 using Equation 3.2 is bch (8, t 2) = 0.91 The fitness value of the input test case using Equation 3.4 is ft (1, (7, 8), t 2) = 0.91 Case 3: Input test case t3 <1, 2, 4> Path traversed {1, 2, 3, 5, 6, 12, 13, 14, 15} Dominance path of the definition node (node 1) is covered. Dominance path of the first node of the p-use edge (node 7) is not covered; the critical node is node 6. The branch distance at node 6 using Equation 3.2 is bch (7, t 3) = 0.91 Dominance path of the second node of the p-use edge (node 8) is not covered; the critical node is node 7. The branch distance at node 6 using Equation 3.2 is bch (8, t 3) = 0.91 The fitness value of the input test case using Equation 3.4 is ft (1, (7, 8), t 3) = 0.74 This case study shows that the input test case t 1 covers the selected def-use path# 5. The input test case t 2 covers the def node and the first node of the selected def- use path# 5 (partial aim). The input test case t 3 does not cover any of the two objectives for the selected def-use path# 5. Accordingly, ft (1, (7, 8), t 1) > ft (1, (7, 8), t 2) > ft (1, (7, 8), t 3). Thus, the input test cases are also ranked according to their fitness values. 7 Experimental setup In this section, research questions, algorithmic parameters settings, details of the subject programs, and experimental results are provided. DE, PSO, GA and random search techniques are also implemented for comparison with the proposed hybrid (adaptive PSO and DE) algorithm. 7.1 Research questions The following research questions are formulated to evaluate the performance of the proposed hybrid algorithm: A Hybrid Particle Swarm Optimization ... Informatica 42 (2018) 417–438 425 RQ1: How effective is the proposed hybrid (adaptive PSO and DE) algorithm for optimal test data generation to achieve 100% data-flow coverage of a program? Algorithm ATDG_Hybrid_PSO_DE Input: P : Instrumented version of the program under test arg = (a 1,a 2,…,a d) : Argument list of P encoded into a d-dimension position vector DT : Dominator tree for the program P Paths : List of test requirements i.e. def-use paths Pop init : Initial random population of n particles X i = [X i1, X i2…X id] and their velocities V = [V i1, V i2…V id] for i=1, 2…n c 1, c 2, V max : Algorithmic parameters of Particle Swarm Optimization (PSO) algorithm F, CR : Algorithmic parameters of Differential Evolution (DE) algorithm Output: TestSuite : Set of optimal test cases Pathstat : List of test requirements marked as ‘covered’ and ‘could not be covered’ (if any) Begin 1. Pop old = Pop init 2. Pop cur = Pop init 3. while some path i in Paths is not marked { 4. while (termination criterion is not met) { //Either path i is covered or MaxAttempts 5. for each particle i of Pop cur { 6. Decode position vector X i into a test case t i 7. if path i is not marked { 8. Check path i for coverage w.r.t. t i and calculate fitness value using Eq. 10 or Eq. 11 9. if path i is covered { 10. Mark path i as ‘covered’ (update Pathstat) 11. Add t i to TestSuite 12. } 13. } 14. for each path j of TestReq other than path i that is not marked { 15. Check path j for coverage with respect to t i 16. if path j is covered 17. Mark path j as ‘covered’ (update Pathstat) 18. } 19. } 20. if path i is covered 21. Go to line 3 22. else { 23. Update gbest i j 24. for each particle i of Pop cur { //Generate a new population Pop new 25. Calculate inertia weight w using Equations 3 and 4 26. Randomly choose two distinct particles k and l from Pop cur (i≠k≠l) 27. for each dimension j (1≤j≤d) of particle i{ 28. Update pbest i j 29. Randomly generate r between 0 and 1 30. if r