Informatica 26 (2002) 433-438

A List Scheduling Heuristic for Allocating the Task Graph to Multiprocessors

Janez Brest and Viljem Žumer
University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia
E-mail: janez.brest@uni-mb.si, http://marcel.uni-mb.si/janez

Keywords: parallel processing, compiler, static scheduling

Received: July 2, 2002

In this paper we propose a new static scheduling algorithm for allocating a task graph without communication costs to fully connected multiprocessors. A global comparison is carried out between the proposed algorithm and three previously reported scheduling algorithms. Using the Standard Task Graph set, the proposed algorithm outperforms the previous algorithms in terms of the generated schedule length.

1 Introduction

To execute a program efficiently on a multiprocessor system [11, 19, 20, 8, 18], it is essential to solve the minimum execution time multiprocessor scheduling problem [16, 13, 14, 2, 4, 5], which determines how to assign a set of tasks to processors and in what order those tasks should be executed to obtain the minimum execution time. The tasks can be scheduled to the processors for execution by using a suitable scheduling algorithm, either static at compile time or dynamic at run time [7, 9, 3]. Optimal static scheduling is, except for a few highly simplified cases, an NP-complete problem; thus, heuristic approaches are generally sought to tackle it. Traditional static scheduling algorithms attempt to minimize the schedule length through iterative local minimization of the start times of individual tasks. Other algorithms, such as the Dynamic Level Scheduling (DLS) algorithm, instead select tasks dynamically during the scheduling process [15]. Since optimal scheduling of tasks is a strongly NP-hard problem, many heuristic algorithms have been introduced in the literature [6].
In this paper we propose a low time complexity multiprocessor static scheduling algorithm for task graphs without communication costs, called MCP/CLR, which is based on a critical path (CP) algorithm, namely the MCP algorithm [21]. It generates high quality scheduling solutions. The remainder of the paper is organized as follows: in the next section, we present a brief overview of various approaches that have been proposed for the DAG scheduling problem. In Section 3, we present the proposed algorithm and discuss its design principles. We present the experimental results in Section 4 and conclude the paper with some final remarks in Section 5.

2 The Multiprocessor Scheduling Problem

In static scheduling, a parallel program is represented by a directed acyclic graph (DAG) [19]. In a DAG G = (V, E), V is a set of v nodes, representing the tasks, and E is a set of e directed edges, representing the communication messages. Edges in a DAG are directed and thus capture the precedence constraints among the tasks. The cost of a node n_i, denoted w(n_i), represents the computation cost of the task. The cost of an edge emerging from the source node n_i and incident on the destination node n_j, denoted c_ij, represents the communication cost of the message. The source node of an edge is called a parent node, while the destination node is called a child node. A node with no parent is called an entry node and a node with no child is called an exit node. A node can only start execution after it has gathered all of the messages from its parent nodes. The b-level of a node is the length (sum of the computation costs only) of the longest path from this node to an exit node. The t-level of a node is the length of the longest path from an entry node to this node (excluding the cost of this node).
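As a minimal sketch of these two definitions, the b-level and t-level of every node in a DAG without communication costs can be computed by memoized traversal. The graph encoding below (per-node weights plus successor lists) is an illustrative assumption, not a representation used in the paper.

```python
def b_levels(weights, succ):
    """b-level: longest computation-cost path from a node down to an
    exit node, including the node's own cost."""
    memo = {}
    def bl(n):
        if n not in memo:
            memo[n] = weights[n] + max((bl(c) for c in succ.get(n, ())), default=0)
        return memo[n]
    return {n: bl(n) for n in weights}

def t_levels(weights, succ):
    """t-level: longest path from an entry node down to a node,
    excluding the node's own cost."""
    preds = {n: [] for n in weights}
    for n, children in succ.items():
        for c in children:
            preds[c].append(n)
    memo = {}
    def tl(n):
        if n not in memo:
            memo[n] = max((tl(p) + weights[p] for p in preds[n]), default=0)
        return memo[n]
    return {n: tl(n) for n in weights}

# A chain a -> b -> c with an extra edge a -> c:
w = {'a': 2, 'b': 3, 'c': 1}
s = {'a': ['b', 'c'], 'b': ['c']}
```

For this small graph, the b-level of a is 2 + 3 + 1 = 6 (its longest path to the exit node c), while the t-level of c is 2 + 3 = 5 (the longest path reaching it, excluding its own cost).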
The objective of scheduling is to minimize the schedule length, defined as the maximum finish time over all the nodes, by properly assigning tasks to processors such that the precedence constraints are preserved. The existing scheduling algorithms are classified into four categories by Ahmad and Kwok [2, 14]:

1. Bounded Number of Processors (BNP) Scheduling: A BNP algorithm schedules a DAG to a limited number of processors directly. The processors are assumed to be fully connected, without any regard to link contention or the scheduling of messages. The proposed algorithm belongs to this class.

2. Unbounded Number of Clusters (UNC) Scheduling: A UNC algorithm schedules a DAG to an unbounded number of clusters. The clusters generated by these algorithms may be mapped onto the processors using a separate mapping algorithm. These algorithms assume the processors to be fully connected.

3. Arbitrary Processor Network (APN) Scheduling: An APN algorithm performs scheduling and mapping on an architecture in which the processors are connected via a network topology. An APN algorithm also explicitly schedules communication messages on the network channels, taking the link contention factor into account.

4. Task-Duplication-Based (TDB) Scheduling: A TDB algorithm duplicates tasks in order to reduce the communication overhead. Duplication, however, can be used in any of the other three classes of algorithms.

For our purposes, we compare the proposed algorithm with three other BNP scheduling algorithms. In a traditional scheduling algorithm, the scheduling list is statically constructed before node allocation begins and, more importantly, the sequencing in the list is not modified. The Earliest Task First (ETF) algorithm [10] uses static node priorities and assumes only a bounded number of processors [16, 17].
The Highest Level First with Estimated Times (HLFET) algorithm [1] assigns the nodes in a DAG to the processors level by level. Similar to the ETF and HLFET algorithms, the Modified Critical Path (MCP) algorithm [21] constructs a list of tasks before the scheduling process starts. The MCP algorithm uses the ALAP (As-Late-As-Possible) start time of a node as the scheduling priority: it first computes the ALAP times of all the nodes and then constructs a list of nodes in ascending order of ALAP times. Ties are broken by considering the ALAP times of the children of a node. The MCP algorithm then schedules the nodes on the list one by one, so that each node is scheduled to a processor that allows the earliest start time, using the insertion approach: the algorithm looks for an idle time slot for the given node. The algorithm is briefly described in Figure 1 [21, 16, 14]. The complexity of the MCP algorithm is O(v^2 log v).

(1) Compute the ALAP time of each node.
(2) For each node, create a list which consists of the ALAP times of the node itself and of all its children, in descending order.
(3) Sort these lists in ascending lexicographical order. Create a node list according to this order.
Repeat
(4) Schedule the first node in the node list to a processor that allows the earliest execution, using the insertion approach.
(5) Remove the node from the node list.
Until the node list is empty.

Figure 1: The MCP algorithm.

3 The Heuristic Algorithm

In this section we discuss some of the principles used in the design of the proposed algorithm. To minimize the final schedule length, we select a node in the same way as the MCP algorithm: at each step of the scheduling process, the first node is removed from the list of nodes (sorted in increasing lexicographical order of the latest possible start times) and is scheduled to a processor. While this identifies the selected node, we still need a method for selecting an appropriate processor, i.e., one that contains the most suitable idle time slot for the node. The MCP algorithm schedules the selected node to a processor that allows the earliest start time. The proposed algorithm uses a different processor selection criterion, described below.

3.1 The MCP/CLR Algorithm

Build_ALAP();
Sort_ALAP();
// v is the number of tasks
for (i = 0; i < v; i++) {
    t_i = EST(ALAP(n_i));
    if a processor j exists where SL_j(i) <= t_i then
        schedule node n_i to a processor j where SL_j(i) - t_i is minimal
    else
        schedule node n_i to a processor that allows the earliest execution
}

Figure 2: The MCP/CLR algorithm.

The function Build_ALAP() computes the ALAP time of each node and creates a list consisting of the ALAP times of the node itself and of all its children, in descending order. The function Sort_ALAP() sorts these lists in ascending lexicographical order, as in the MCP algorithm. Assume that i - 1 nodes have already been scheduled and that the next selected node is n_i. SL_j(i) denotes the schedule length on processor j at step i of the scheduling process. The MCP/CLR (MCP/Close-Left-Right) algorithm (see Fig. 2) tries to find a processor j for the selected node n_i, and two cases of the processor selection step must be distinguished. If a processor j exists such that SL_j(i) is less than or equal to the earliest start time (EST) t_i of the selected node n_i, our algorithm assigns n_i to the processor j with the smallest value of SL_j(i) - t_i. Otherwise it assigns n_i to a processor that allows the earliest execution (like the MCP algorithm), using the non-insertion approach.
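The selection rule above can be sketched as follows. This is not the authors' implementation: the graph encoding and helper names are invented for illustration, the tiebreak that orders the MCP list by a node's own ALAP time first is an assumed reading of the list construction, the criterion "SL_j(i) - t_i is minimal" is read as "schedule length closest to t_i from the left", and the fallback branch uses the non-insertion approach throughout.

```python
from collections import defaultdict

def b_levels(weights, succ):
    # Longest computation-cost path from each node to an exit node.
    memo = {}
    def bl(n):
        if n not in memo:
            memo[n] = weights[n] + max((bl(c) for c in succ.get(n, ())), default=0)
        return memo[n]
    return {n: bl(n) for n in weights}

def mcp_order(weights, succ):
    # ALAP(n) = critical-path length - b-level(n); priority is the node's
    # own ALAP, with the children's ALAP times (descending) as tiebreak.
    bl = b_levels(weights, succ)
    cp = max(bl.values())
    alap = {n: cp - bl[n] for n in weights}
    key = {n: [alap[n]] + sorted((alap[c] for c in succ.get(n, ())), reverse=True)
           for n in weights}
    return sorted(weights, key=lambda n: key[n])

def mcp_clr(weights, succ, m):
    pred = defaultdict(list)
    for n, children in succ.items():
        for c in children:
            pred[c].append(n)
    finish, proc_of = {}, {}
    sl = [0] * m                       # SL_j: current schedule length of processor j
    for n in mcp_order(weights, succ):
        est = max((finish[p] for p in pred[n]), default=0)
        fits = [j for j in range(m) if sl[j] <= est]
        if fits:
            # Processor whose schedule length is closest to the EST from the left.
            j = min(fits, key=lambda j: est - sl[j])
            start = est
        else:
            # Fallback: processor allowing the earliest execution (non-insertion).
            j = min(range(m), key=lambda j: sl[j])
            start = max(est, sl[j])
        finish[n] = start + weights[n]
        proc_of[n] = j
        sl[j] = finish[n]
    return max(finish.values()), proc_of

# A diamond-shaped DAG: a precedes b and c, both precede d.
w = {'a': 2, 'b': 2, 'c': 3, 'd': 2}
s = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d']}
```

On this diamond graph with two processors the sketch places a, c and d on one processor and b on the other, reaching the critical-path length of 7 time units.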
The complexity of the MCP/CLR algorithm is also O(v^2 log v).

3.2 Scheduling Example

In this section we present an example to demonstrate the operation of the proposed algorithm, using the task graph shown in Fig. 3. The task graph was drawn using the Graphlet tool (http://www.fini.uni-passau.de/Graphlet). The entry and exit nodes are dummy nodes.

Figure 3: An example of a task graph with 12 nodes.

The schedules produced by the algorithms are shown in Fig. 4. The MCP algorithm creates a list of nodes and schedules the task graph onto a multiprocessor machine with 2 processors (processing elements) in the order: n1, n2, n5, n3, n8, n7, n4, n11, n10, n9, n6, n12. The HLFET and MCP/CLR algorithms schedule the nodes in the same order as the MCP algorithm. The ETF algorithm schedules the nodes in a different order: n1, n5, n2, n4, n7, n8, n10, n9, n11, n12. The different positions of nodes n4, n7 and n9, together with the processor selection during the scheduling process, caused different schedules of the task graph and therefore different schedule lengths.

Figure 4: The schedules of the task graph in Fig. 3 generated by: (a) the ETF algorithm (schedule length = 67 time units); (b) the HLFET and MCP algorithms (schedule length = 64 time units); and (c) the MCP/CLR algorithm (schedule length = 63 time units).

4 Results

In this section we present the performance results of the proposed algorithm and compare them with the results of the HLFET, ETF and MCP algorithms. We implemented the scheduling algorithms in C/C++ on a SUN workstation and evaluated them using the Standard Task Graph set (http://www.kasahara.elec.waseda.ac.jp/schedule/), which contains 900 task graphs with 50 to 2700 tasks. The results obtained in our experiments are shown in Table 1.

Table 3: Number of times the optimal schedule is found, and the global error

Algorithm   Optimal schedule      %    Global error
ETF                33          12.94        5577
HLFET              50          19.61        3189
MCP                57          22.53        1531
MCP/CLR           167          65.49         257
The second and third columns of Table 1 give the name of the task graph instance and the number of nodes, respectively. The next four columns show the schedule lengths produced by the four algorithms, and the last column gives the optimal schedule length. Where the optimal schedule is found, the schedule length is printed in boldface. For some problem instances the optimal schedule length is not known.

In order to rank all the algorithms in terms of the schedule lengths, we made a global comparison [17]. We counted the number of times each algorithm performed better, worse, or the same compared to each of the other algorithms. This comparison is presented in Fig. 5, where each box has a left and a right side. The left side of a box compares two algorithms: the one on the left and the one on the top. It contains three numbers preceded by ">", "<", and "=" signs, which indicate the number of times the algorithm on the left performed better, worse, or the same, respectively, as the algorithm shown on the top. Each comparison is based on a total of 300 task graphs. The right side of a box contains the number of times at least one of the two algorithms found the optimal schedule length.

Table 1: Schedule results of 50 task graph instances

    Graph          #Nodes   ETF   HLFET   MCP   MCP/CLR   Optimum
 1  proto000.stg     452    537    537    537     537       537
 2  proto001.stg     473   1191   1179   1179    1178      1178
 3  proto002.stg     499    357    363    355     343       341
 4  proto003.stg     164    556    556    556     556       556
 5  proto004.stg     457    267    238    234     222        —
 6  proto005.stg     404    758    749    742     742       742
 7  proto006.stg     273    171    154    149     142        —
 8  proto007.stg     499    492    489    489     489       489
 9  proto008.stg     399    578    582    579     572       571
10  proto009.stg     438    625    625    625     625       625
11  proto010.stg     539    351    338    338     334       334
12  proto011.stg     759    513    496    494     484        —
13  proto012.stg     939   1804   1795   1795    1793      1793
14  proto013.stg     799    698    688    685     682       681
15  proto014.stg     636    523    520    516     509        —
16  proto015.stg     712    513    513    501     491       491
17  proto016.stg     641   1016   1026   1022    1006        —
18  proto017.stg     722    487    475    473     463        —
19  proto018.stg     730    704    706    704     701       700
20  proto019.stg     617    683    682    674     668       667
21  proto020.stg    1104   1523   1514   1511    1505      1504
22  proto021.stg    1145    644    632    616     605       605
23  proto022.stg    1189   1625   1620   1617    1610      1609
24  proto023.stg    1353   1619   1628   1624    1614      1612
25  proto024.stg    1218   1295   1291   1289    1283      1281
26  proto025.stg    1258   1193   1198   1194    1191      1188
27  proto026.stg    1239   1509   1502   1501    1500      1500
28  proto027.stg    1055   2003   2001   2001    2001      2000
29  proto028.stg    1424   1538   1506   1506    1504      1504
30  proto029.stg    1341    845    830    830     830       830
31  proto280.stg    1668   2821   2809   2806    2800      2800
32  proto281.stg    1622    977    970    921     897       896
33  proto282.stg    1793   3131   3131   3127    3123      3123
34  proto283.stg    1591   2479   2429   2428    2425      2422
35  proto284.stg    1703   4499   4458   4447    4444      4444
36  proto285.stg    1766   1333   1304   1285    1268      1268
37  proto286.stg    1703    945    940    930     918       918
38  proto287.stg    1615   1423   1422   1381    1360      1353
39  proto288.stg    1672   1938   1944   1935    1926      1925
40  proto289.stg    1642   1968   1922   1917    1915      1915
41  proto290.stg    2133   1453   1450   1438    1428      1428
42  proto291.stg    2122   2611   2605   2597    2591      2591
43  proto292.stg    2333   8022   8012   8011    8009      8009
44  proto293.stg    2089   1516   1477   1474    1472      1472
45  proto294.stg    2014   1336   1307   1273    1259      1257
46  proto295.stg    2168   1349   1329   1321    1318      1318
47  proto296.stg    2162   2589   2590   2575    2563      2563
48  proto297.stg    2136   7708   7710   7710    7703      7703
49  proto298.stg    2399   2492   2476   2474    2471      2471
50  proto299.stg    2205   1171   1153   1147    1142      1141

Table 2: Schedule length with respect to the optimal solution

   Quality of the solution (Error)   ETF   HLFET   MCP   MCP/CLR
1  0% (optimum)                       33     50     57     167
2  < 5%                             178    182    195      88
3  5% - 10%                          30     21      3       0
4  10% - 15%                         11      2      0       0
5  15% - 20%                          3      0      0       0
6  Optimum not known                 45     45     45      45
   Total                            300    300    300     300

Figure 5: A global comparison of four algorithms in terms of better, worse, and equal performance.

Optimal schedule lengths are known for 255 of the 300 task graphs; they were computed on a parallel machine using the ISH algorithm [13, 12]. For example, the MCP/CLR algorithm performed better than the MCP algorithm in 238 cases, never performed worse, and performed the same in 62 cases. The MCP/CLR algorithm, the MCP algorithm, or both found the optimal schedule length in 167 cases. An additional box for each algorithm compares that algorithm with all the other algorithms combined.

The experimental results on the quality of the schedule lengths are summarized in Table 2. For example, the MCP/CLR algorithm found the optimal schedule length in 167 cases and, additionally, a solution within 5% of the optimum in 88 cases. Table 3 shows the number of times each algorithm found the optimal schedule, together with the global error, defined as the difference between the sum of all the schedule lengths generated by the algorithm and the sum of the corresponding optimal schedule lengths. The proposed MCP/CLR algorithm clearly outperformed the three other well-known algorithms. Based on these experiments, the algorithms can be ranked in the following order: MCP/CLR, MCP, HLFET and ETF. The same relative order of the MCP and ETF algorithms can be found in [17], where communication among the tasks is also assumed.
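As an illustration of the global-error measure: it is simply the sum, over the instances with a known optimum, of the gap between the generated and the optimal schedule length. The two sample instances below reuse the ETF values of proto001.stg and proto002.stg from Table 1.

```python
def global_error(generated, optimal):
    """Sum of (generated - optimal) schedule lengths over the
    instances whose optimal schedule length is known."""
    return sum(g - o for g, o in zip(generated, optimal))

# ETF on proto001.stg (1191 vs. optimum 1178) and
# proto002.stg (357 vs. optimum 341):
err = global_error([1191, 357], [1178, 341])   # 13 + 16 = 29
```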
5 Conclusion

This paper presented a static task scheduling algorithm which schedules directed acyclic graphs (DAGs) with a complexity of O(v^2 log v), where v is the number of tasks in the DAG. The algorithm is suitable for graphs with arbitrary computation costs and without communication costs, and is applicable to systems with homogeneous, fully connected processors. The performance of the proposed algorithm was evaluated by comparing it with other existing bounded number of processors (BNP) scheduling algorithms in terms of the schedule length.

References

[1] T. L. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, 17(12):685-690, December 1974.

[2] I. Ahmad and Y.-K. Kwok. On parallelizing the multiprocessor scheduling problem. IEEE Transactions on Parallel and Distributed Systems, 10, 1999.

[3] J. Brest, V. Zumer, and M. Ojstersek. Dynamic scheduling on a network heterogeneous computer system. LNCS 1557, pages 584-585, 1999.

[4] J. Brest and V. Zumer. A performance evaluation of list scheduling heuristics for task graphs without communication costs. Proceedings of the International Workshop on Parallel Processing (ICPP'00), pages 421-428, 2000.

[5] J. Brest, J. Jejcic, A. Vreze, and V. Zumer. An approximation algorithm for the static task scheduling on multiprocessors. VECPAR'2000 4th International Meeting on Vector and Parallel Processing, Vol. 1, pages 46-56, 2000.

[6] D. Darbha and D. P. Agrawal. Optimal scheduling algorithm for distributed-memory machines. IEEE Transactions on Parallel and Distributed Systems, 9, 1998.

[7] M. M. Eshagian, editor. Heterogeneous Computing. Artech House, Inc., Norwood, MA 02062, ISBN 0-89006-552-7, 1996.

[8] I. Foster. Designing and Building Parallel Programs. Addison-Wesley, ISBN 0-201-57594-9, 1995.

[9] E. Haddad.
Load Balancing and Scheduling in Network Heterogeneous Computing. In M. M. Eshagian, editor, Heterogeneous Computing, pages 224-276. Artech House, Inc., Norwood, MA 02062, ISBN 0-89006-552-7, 1996.

[10] J. J. Hwang, Y.-C. Chow, F. D. Anger, and C.-Y. Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM Journal on Computing, 18(2):244-257, April 1989.

[11] K. Hwang and Z. Xu. Advanced Computer Architecture: Technology, Architecture, Programming. McGraw-Hill, New York, 1998.

[12] H. Kasahara, H. Honda, and S. Narita. Parallel processing of near fine grain tasks using static scheduling on OSCAR (optimally scheduled advanced multiprocessor). In Proceedings, Supercomputing '90: November 12-16, pages 856-864. IEEE Computer Society Press, 1990.

[13] H. Kasahara and S. Narita. Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Transactions on Computers, 33(11):1023, November 1984.

[14] Y.-K. Kwok. High-Performance Algorithms for Compile-Time Scheduling of Parallel Processors. PhD thesis, The Hong Kong University of Science and Technology, 1997.

[15] Y.-K. Kwok and I. Ahmad. FASTEST: A practical low-complexity algorithm for compile-time assignment of parallel programs to multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 10(2):147-159, February 1999.

[16] Y.-K. Kwok and I. Ahmad. Parallel program scheduling techniques. In R. Buyya, editor, High Performance Cluster Computing: Architectures and Systems. Prentice Hall PTR, NJ, USA, 1999.

[17] Y.-K. Kwok and I. Ahmad. Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 7(5):506-521, May 1996.

[18] M. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill, 1994.

[19] R. Buyya, editor. High Performance Cluster Computing: Architectures and Systems. Prentice Hall PTR, NJ, USA, 1999.

[20] B. Wilkinson and M.
Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice-Hall, Englewood Cliffs, NJ 07632, USA, 1998.

[21] M.-Y. Wu and D. D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1(3):330-343, July 1990.