Informatica 18 (1994) 71-79

SCHEDULING STRATEGIES IN HIGH-LEVEL SYNTHESIS

Jurij Šilc
Jožef Stefan Institute, Laboratory for Computer Architectures
Jamova 39, Ljubljana, Slovenia
E-mail: jurij.silc@ijs.si

Keywords: high-level synthesis, scheduling, allocation

Edited by: Matjaž Gams
Received: May 3, 1993  Revised: March 11, 1994  Accepted: March 13, 1994

The paper describes objectives of high-level synthesis. It concentrates on operation scheduling strategies and their interaction with resource allocation. Some transformational and iterative/constructive scheduling algorithms are described. Moreover, a new scheduling/allocation approach is presented and compared with other known algorithms. Finally, some open problems of high-level synthesis are given.

1 Introduction

The high-level synthesis task is to take a specification of the behavior required of a system and a set of constraints and goals to be satisfied, and to find a structure that implements the behavior while satisfying the goals and constraints. In recent years there has been a trend toward automating synthesis at higher and higher levels of the design hierarchy. There are a number of reasons for this: shorter design cycle, fewer errors, the ability to search the design space, documenting the design process, and availability of IC technology to more people [28].

The roots of high-level synthesis can be traced back to the 1960s [15]. During the 1970s most of the effort went into automating tasks at lower levels of the design hierarchy, such as layout. Great progress was made in the development of algorithms and techniques [4, 51]. In the 1980s work on high-level synthesis started to spread from the academic community to industry. High-level synthesis systems are now producing manufacturable chip designs for applications such as signal processing [10], pipelined processors [32], and interfaces [7]. However, there are still many unanswered questions related to such issues as specification, input/output, designer intervention, complex timing constraints, and the relation of synthesis to the overall design and fabrication process.

The paper starts with the description of the high-level synthesis structure and then concentrates on scheduling, which seems to be the most important step of the synthesis. In particular, some of the basic scheduling techniques are discussed.

2 High-Level Synthesis

Given a system, its structural description is a specification of a set of components and their interconnections. More recently, however, behavioral descriptions of systems are used. Such a description specifies what the system needs to do, i.e., the way that each of the system's components interacts with its environment. High-level synthesis transforms a behavioral description into a structural one.

A typical way of describing behavior is to write a program in an ordinary computer language or in a special hardware description language. The first step in high-level synthesis is usually the compilation of the hardware description language into an internal representation. Most approaches use graph-based representations that contain both the data flow and the control flow implied by the specification. Control dependencies are derived directly from the explicit order given in the program and from the compiler's choice of parsing the arithmetic expressions. Data dependencies show the essential ordering of operations, as the small example below illustrates.
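As a concrete sketch of such a graph-based representation, consider the statement x = (a + b) * (c - d). The dictionary encoding below is ours, chosen only for illustration; it is not the representation of any cited system.

```python
# Illustrative flow-graph encoding (our own, not from any cited system):
# each operation records its type and the operations whose results it uses.
flow_graph = {
    "op1": {"type": "+", "preds": []},             # a + b
    "op2": {"type": "-", "preds": []},             # c - d
    "op3": {"type": "*", "preds": ["op1", "op2"]}, # (a + b) * (c - d)
}
# op1 and op2 are data-independent, so they may share a control step;
# op3 consumes both results and must be scheduled in a later step.
```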
Some important tasks that should be performed by the compiler at this stage include variable disambiguation, taking care of the scope of variables, converting complex data structures into simple types, and type checking. Moreover, some optimizing transformations may be done at this stage, such as expression simplification. These graphs are given different names in different synthesis systems (e.g., value trace [47], data dependency graph [2], directed acyclic graph [14], control and data flow graph [17]) but are simply different adaptations of a similar basic concept. In many systems, the control and data flow graphs are integrated into one structure. In this paper we will use the term flow graph. Before proceeding to the second step it is desirable to do some initial optimization of the flow graph, such as dead code elimination, constant propagation, common subexpression elimination, and loop unrolling.

The second step of high-level synthesis, which is the core of transforming behavior into structure, includes operation scheduling and hardware allocation. Since these two tasks are essential in high-level synthesis, they have been studied extensively and a variety of algorithms have been published. An excellent overview of the different schools of thought has been given in [28]. Scheduling and allocation are closely interrelated. In order to obtain an optimal design, both tasks should be performed simultaneously [19]. However, due to the time complexity, many systems perform them separately [10, 23, 27, 30, 48, 50] or introduce iteration loops between the two subtasks [17, 33, 35, 45].

Scheduling involves assigning the operations to so-called control steps. A control step is the fundamental sequencing unit in synchronous systems; it corresponds to a clock cycle. (Different methods for scheduling will be examined in detail in the following sections.)

Allocation involves assigning the operations and values to hardware, i.e., providing storage, function units, and communication paths, and specifying their usage. To minimize them together is usually too complex, so in many high-level synthesis systems they are minimized separately. Therefore, allocation is usually further divided into three subtasks: variable binding, operation assignment, and data transfer binding. Variable binding refers to the allocation of registers to data, i.e., values that are generated in one control step and used in another must be assigned to registers. Some systems have a one-to-one correspondence between variables and registers [42], while others allow register sharing for those variables which have disjoint lifetimes [34, 50] (a minimal such check is sketched below). Operation assignment binds operations (e.g., addition) to function units (e.g., an adder or an ALU). Of course, operations can share a function unit only if they are mutually exclusive, that is, they are assigned to different control steps. The problem is then to form the minimum number of groups consisting of mutually exclusive operations, since this will minimize the number of function units. Data transfer binding represents the allocation of connections (e.g., busses, multiplexers) between hardware components (i.e., registers and function units) to create the necessary information paths as required by the specification and the schedule. Connections consist of busses and/or multiplexers. Busses offer the advantage of requiring less wiring, but they may be slower than multiplexers. A combination of both is often the best solution.
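The disjoint-lifetime condition for register sharing can be stated in a few lines. The sketch below is our own illustration; representing a lifetime as a pair of control-step indices is an assumption, not the representation used in [34, 50].

```python
def can_share_register(lifetime_a, lifetime_b):
    """Two values may share one register iff their lifetimes are disjoint.
    A lifetime is a (first_control_step, last_control_step) pair."""
    (s1, e1), (s2, e2) = lifetime_a, lifetime_b
    return e1 < s2 or e2 < s1

# Example: a value alive in steps 0-1 and one alive in steps 2-4 can share.
assert can_share_register((0, 1), (2, 4))
assert not can_share_register((0, 2), (2, 4))  # both alive in step 2
```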
Once the schedule and allocation have been accomplished, it is necessary to synthesize a controller (hardwired or microcoded) that will drive the allocated resources as required by the schedule. Finally, the design has to be converted into real hardware. Lower level tools such as logic synthesis and layout synthesis complete the design.

3 Scheduling Strategies

As noted earlier, a good scheduler is very important to a high-level synthesis system. There are three dimensions along which scheduling algorithms may differ:

1. the objective function and constraints that the algorithms use;
2. the interaction between scheduling and allocation;
3. the type of scheduling algorithm used.

3.1 Constraints

Roughly speaking, operation scheduling determines the cost-speed tradeoffs of the design. A time-constrained scheduling problem can be defined as follows: given the maximum number of control steps, find a minimal cost schedule that satisfies the given set of constraints. Here the cost may consist of the costs of function units, connections, and registers. Some systems that perform time-constrained scheduling are HAL [34, 35], MAHA [33], and Sehwa [32].

A resource-constrained scheduling problem is stated as follows: given the maximum number of resources, find the fastest schedule that satisfies the given set of constraints. Until recently, the resources included only function units. Lately, connections and registers are also taken into consideration. Some systems that perform resource-constrained scheduling are CMUDA [12, 18, 50], MIMOLA [27, 51], MAHA [33], and Sehwa [32].

The previous two formulations can be combined into a feasible scheduling problem [22]: given a fixed amount of resources and a specified number of time steps, decide if there is a schedule which satisfies all the constraints, and output the solution if it exists. A system that performs feasible scheduling is BUD [29].

If the design is subject to a time constraint, the scheduling algorithm will attempt to parallelize the operations to meet the timing constraint. Conversely, if there is a limit on the cost of resources, the scheduler will serialize operations to meet the resource constraint.

3.2 Interaction with Allocation

In order to know whether two operations can be scheduled in the same control step, one must know whether they use common hardware resources. Moreover, finding the most efficient possible schedule for the real hardware requires knowing the delays of the different operations, and those can only be found after the details of the function units and their interconnections are known. On the other hand, in order to make a good allocation, one must know what operations will be done in parallel, which comes from the schedule. Therefore, scheduling and allocation are strongly interdependent tasks.

The most straightforward approach to this problem is to set some limit (or no limit at all) on the resource cost and then to schedule, as is done in the systems CMUDA [12, 18, 50], Flamel [48], and V [5]. A more flexible approach is to iterate the whole process, changing the resource limits until a satisfactory design has been found; a sketch of such an iteration loop is given at the end of this subsection. This approach is used in MIMOLA [27, 51] and Sehwa [32]. Another approach is to develop the schedule and allocation simultaneously, as in the systems HAL [34, 35] and MAHA [33]. Some recent approaches formulate scheduling and allocation together as an optimization problem to be solved by general optimization techniques such as simulated annealing [3, 11, 41] or integer programming [22]. Finally, the allocation can be done first, followed by scheduling, as is the case in the BUD system [29].
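As a deliberately simplified sketch of the iterate-until-satisfactory strategy, the following fragment raises a single resource limit until a greedy resource-constrained schedule meets a given time constraint. All names are ours, every function unit is treated as interchangeable, and each operation takes one control step; no cited system is implied.

```python
def greedy_schedule(ops, preds, units):
    """Resource-constrained sketch: place each operation into the earliest
    control step in which its predecessors are done and a unit is free.
    `ops` must be topologically ordered; `preds[op]` lists op's predecessors."""
    step_of, load = {}, {}
    for op in ops:
        t = max((step_of[p] + 1 for p in preds[op]), default=0)
        while load.get(t, 0) >= units:   # all units busy: defer one step
            t += 1
        step_of[op] = t
        load[t] = load.get(t, 0) + 1
    return step_of

def iterate_resource_limits(ops, preds, max_steps, max_units):
    """Iteration loop between allocation and scheduling: raise the resource
    limit until the schedule fits into `max_steps` control steps."""
    for units in range(1, max_units + 1):
        schedule = greedy_schedule(ops, preds, units)
        if max(schedule.values()) + 1 <= max_steps:
            return schedule, units
    return None   # infeasible within the given resource budget
```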
3.3 Scheduling Algorithms

The simplest way to perform scheduling is to relegate the task to the user, which is the approach favored by the Silc system [ft]. There is, however, a trend toward automated scheduling. Such algorithms can be classified into transformational and iterative/constructive algorithms.

A transformational type of algorithm starts with an initial schedule (e.g., maximally serial or maximally parallel) and applies transformations to it to obtain other schedules. These algorithms differ in how they choose transformations (e.g., using exhaustive search [4], branch-and-bound [19], or some heuristics [37]).

The other type of algorithms, the iterative/constructive ones, build up a schedule by adding operations one at a time until all the operations have been scheduled. These algorithms differ in how the next operation to be scheduled is chosen and into which control step it is put. The simplest way is to schedule operations as soon as possible (ASAP), as is done in the Facet [50], early CMUDA [18], MIMOLA [27, 51], and Flamel [48] systems. ASAP assigns each operation to the earliest possible control step such that data and control dependencies allow it to execute. A similar approach is to schedule operations as late as possible (ALAP). The problem with ASAP and ALAP scheduling is that when there are limits on resource usage, no priority is given to operations on critical paths. Hence, less critical operations can be scheduled first and thus block critical ones [39].

Continuing along the scale of increasing complexity, there are algorithms that use list scheduling. For each control step, the operations available to be scheduled into that step are kept in a list, which is ordered by some priority function. Each operation on the list is scheduled if the resources it needs are still free in that step; otherwise it is deferred to the next step. In some cases, this form of scheduling works nearly as well as branch-and-bound. Schedulers differ in the priority function they use. A priority function may use the length of the longest path from the operation to the end of the graph [39, 40, 43]; this is the approach taken in the BUD system [29]. The Elf system [17] uses the urgency of the operation, i.e., the length of the shortest path from the operation to the nearest local time constraint. In the Slicer system [30] the priority function is based on increasing operation mobilities, i.e., differences between ASAP and ALAP times of operations (see the sketch at the end of this subsection). A composite priority is used in the MAHA system [33], where the operations on critical paths are scheduled first (and also assigned to function units); then the other operations are scheduled (and assigned) one at a time according to the least mobility. The HAL system [34, 35] does list scheduling with force as a priority. The force between an operation and a particular control step is proportional to the number of operations of the same type that could be scheduled into that step. To conclude, in list scheduling, operations that might present more difficult scheduling problems are taken care of first.
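The ASAP and ALAP start times, and the mobilities derived from them, can be computed in a few lines. The sketch below uses our own names, assumes one control step per operation and no resource limits, and is meant only to illustrate the Slicer-style priority mentioned above.

```python
def asap(ops, preds):
    """Unconstrained ASAP start times: each operation begins as soon as
    all of its predecessors have finished (one control step per operation)."""
    start = {}
    for op in ops:                       # ops in topological order
        start[op] = max((start[p] + 1 for p in preds[op]), default=0)
    return start

def alap(ops, preds, last_step):
    """Unconstrained ALAP start times for a schedule ending at last_step."""
    succs = {op: [q for q in ops if op in preds[q]] for op in ops}
    start = {}
    for op in reversed(ops):             # reverse topological order
        start[op] = min((start[q] - 1 for q in succs[op]), default=last_step)
    return start

def mobility(ops, preds):
    """Mobility = ALAP - ASAP; zero mobility marks the critical path."""
    s = asap(ops, preds)
    l = alap(ops, preds, max(s.values()))
    return {op: l[op] - s[op] for op in ops}
```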
In what follows we will briefly describe some known scheduling algorithms. First we give some common definitions.

Let G(V, A) be a flow graph, where V is the set of operations and A is the set of dependencies (arcs), which is to be scheduled into s control steps. Let n = |V| and a = |A|. Each of the operations is labeled o_i, 1 ≤ i ≤ n. A precedence relation between operations o_i and o_j is denoted by o_i → o_j, where o_i is an immediate predecessor of o_j. The earliest possible start time and the latest possible start time of o_i are S_i and L_i, respectively. There are m types of function units available. A function unit of type t is denoted by F_t. The relation between an operation o_i and a function unit F_t is denoted by o_i ∈ F_t if F_t can perform o_i.

Integer Linear Programming Algorithm

In [22] integer linear programming (ILP) is used to formulate the feasible scheduling problem. Let the cost of a function unit of type t be c_t and let M_t be integer variables denoting the number of function units of type t needed. Finally, let x_iτ be 0-1 integer variables, where x_iτ = 1 if o_i is scheduled into control step τ; otherwise, x_iτ = 0. Assuming a one-cycle propagation delay for each operation and a nonpipelined execution, the feasible scheduling problem can finally be stated as follows:

Σ_{o_i ∈ F_t} x_iτ ≤ M_t,  1 ≤ τ ≤ s, 1 ≤ t ≤ m;

Σ_{τ=S_i}^{L_i} x_iτ = 1,  1 ≤ i ≤ n;

Σ_{τ=S_i}^{L_i} τ·x_iτ − Σ_{τ=S_k}^{L_k} τ·x_kτ ≤ −1,  for each o_i → o_k.

The first constraint states that no schedule should have a control step containing more than M_t function units of type t. It is clear that o_i can only be scheduled into a step between S_i and L_i, which is reflected in the second constraint. The third constraint ensures that precedence relations of the flow graph will be preserved (the sketch below spells these constraints out in code). The objective function is a combination of the time-constrained objective function min Σ_t c_t·M_t and the resource-constrained objective function min C_step, where C_step is the total number of control steps required. This approach allows the user to control the resource-time tradeoff.
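The three constraint families translate directly into code. The following fragment is only a verification sketch with names of our own choosing, not the ILP solver used in [22]; it checks a candidate 0-1 assignment against each constraint in turn.

```python
def is_feasible(x, M, ops, arcs, unit_type, S, L, steps):
    """Check a candidate 0-1 schedule against the three ILP constraints.
    x[i][tau] == 1 iff operation i is scheduled into control step tau;
    M[t] is the number of units of type t; arcs holds pairs (i, k) for i -> k.
    All data-structure names are illustrative assumptions."""
    # Constraint 1: at most M[t] operations of type t in any control step.
    for tau in range(steps):
        busy = {}
        for i in ops:
            if x[i][tau]:
                busy[unit_type[i]] = busy.get(unit_type[i], 0) + 1
        if any(busy[t] > M.get(t, 0) for t in busy):
            return False
    # Constraint 2: each operation scheduled exactly once, inside [S_i, L_i].
    for i in ops:
        chosen = [tau for tau in range(steps) if x[i][tau]]
        if len(chosen) != 1 or not (S[i] <= chosen[0] <= L[i]):
            return False
    # Constraint 3: start(i) <= start(k) - 1 for every arc i -> k.
    start = {i: next(tau for tau in range(steps) if x[i][tau]) for i in ops}
    return all(start[i] <= start[k] - 1 for (i, k) in arcs)
```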
More explicit resource-time tuning is the advantage of the next algorithm to be presented.

Simulated Annealing Based Algorithm

Another type of transformational feasible scheduler, based on the simulated annealing idea, is given in [3]. The simulated annealing algorithm can be used for combinatorial optimization problems specified by a finite set of configurations and a cost function defined on all the configurations. The algorithm randomly generates a new configuration, which is then accepted or rejected according to a random acceptance rule governed by a parameter analogous to temperature in the physical annealing process [26]. The algorithm starts with an initial configuration, which is the schedule obtained by applying the ASAP strategy, i.e., the start time of o_i is S_i for each o_i ∈ V. The function Cost evaluates how good a configuration is. It is defined as

Cost(X) = αArea(X) + βTime(X),

where Area(X) is the estimated total area of the resources used and Time(X) is the total execution time corresponding to the given configuration X. The tuning of the algorithm is performed by taking different values for α and β. For example, if α < β the algorithm is closer to a resource-constrained scheduler (since solutions efficient in speed become more important), while α ≫ β makes the algorithm more time-constrained. Initially, a high temperature T_initial is given in order to accept most new configurations even if they increase the cost. As the temperature decreases, fewer configurations are accepted unless they have improved cost. Given a configuration X, a new configuration Y is generated either by insertion or removal of a register, by scheduling an operation to the next or previous control step, or by shrinking/expanding a control step.

A similar algorithm appears in [11], where it is also reported that the algorithm achieves excellent results. However, it performs scheduling and allocation simultaneously. This is also a characteristic of the approach to be presented next.

Force Directed Algorithm

Let us first describe the force-directed scheduling algorithm, which is based on list scheduling with force as a priority function. The first step consists of determining the time frames [S_i, L_i] of each operation o_i ∈ V. Let p_iτ denote the probability that o_i will be scheduled into control step τ ∈ [S_i, L_i]. A useful heuristic is to assume a uniform probability, i.e.,

p_iτ = 1/(L_i − S_i + 1).

The next step is the summation of the probabilities of each type t of operation for each control step τ:

P(t, τ) = Σ_{o_i ∈ F_t} p_iτ.

The final step is to calculate the force F_i associated with each operation o_i bound to a temporarily reduced time frame [S'_i, L'_i]:

F_i = Σ_{τ=S'_i}^{L'_i} P(t, τ)/(L'_i − S'_i + 1) − Σ_{τ=S_i}^{L_i} P(t, τ)/(L_i − S_i + 1),

where t is the type of the operation o_i. Once all the forces are calculated, the operation-control step pair with the largest negative force (or least positive force) is scheduled. Then the P and F values are updated. The process is repeated until all operations are scheduled.

The scheduling process described above is a part of the HAL system [34, 35]. In particular, the scheduling/allocation is performed simultaneously by stepwise refinement in an iterative loop consisting of four phases. The first phase (default allocation) allocates a single-function processor to each type of operation. The second phase (preliminary schedule) balances the distribution of similar operations using force-directed scheduling. The third phase (refined allocation) allocates single- and multi-function processors. The last, fourth phase (final schedule) balances the distribution of operations requiring similar processor types.

4 Global Arc Minimization Algorithm

In the last section we briefly described three approaches to the scheduling/allocation problem. In this section we present a new algorithm named Global Arc Minimization (GAM). The algorithm was developed by the author [44, 45, 46], where both time- and resource-constrained scheduling algorithms were described. In this paper, however, we will concentrate only on time-constrained scheduling. We consider the situation with m = 1, i.e., all function units are of the same type (as is the case in the preliminary scheduling in the HAL system). The first step consists of determining the time frames [S_i, L_i] of each operation o_i ∈ V. ... > 0 then f(τ) = f(τ) − |O_fin(τ)| end if; f(τ) = f(τ) + |O_urgent(τ)|; if f(τ) < f_min then let o_i → o_j be a global arc if ...