HIGH-LEVEL SYNTHESIS BASED UPON DEPENDENCE GRAPH FOR MULTI-FPGA Mohamed Akil Laboratoire A2SI, Groupe ESIEE, Cite Descartes, Noisy Le Grand cedex, France INVITED PAPER MIDEM 2003 CONFERENCE 01.10.2003-03.10.2003, Grad Ptuj Abstract: The increasing complexity of signal, image and control processing algorithms in real-time embedded applications requires efficient system-level design methodology to help the designer to solve the specification, validation and synthesis problems. Indeed the real-time and embedded constraints may be so strong that the available high performant processors are not so enough. That leads to use, in complement of processor, the specific component like ASIC or FPGA. Several projects have developed high-level design flow that translates higlvlevel algorithm specification to an efficient implementation for mapping onto multi-component architecture. In this paper, we present: 1. a unified model for hardware/software codesign, based on the AAA methodology (Algorithm-Architecture Adequation). In order to exhibit the potential parallelism of algorithm to be implemented, the AAA methodology is based on conditioned (conditional execution of computations) factorized (loop) data dependence graph, 2.Some simple rules that allow synthesizing both the data path and the control path of a circuit corresponding to an algorithm specified as a Conditioned and Factorized Data Dependence Graph (CFDOG), 3. the optimized implementation of CFDDG algorithm onto FPGA circuit and Multi-FPGA (partionning), by using simulated annealing approach, 4, the resources and time delay estimation method. This method allows us to have a performance analysis for the implementation. The obtained results: resource (gates, lO) and latency estimation are used by the optimization step to decide which implementation respects the constraints (real-time implementation which minimises the resource utilisation), 5. the results of the implementation of the matrix-vector product algorithm onto a Xilinx Multi FPGA and the software tool SynDEx which implements the AAA methodology. Visokonivojska sinteza FPGA vezij na osnovi odvisnih grafov Izvleček: Naraščajoča zapletenost algoritmov za obdelavo signalov, slike in opravljanje nadzora v vgrajenih sistemih v realnem času zahteva učinkovite metodologije načrtovanja na nivoju sistema z namenom pomagati načrtovalcu reševati probleme pri specifikaciji, validaciji in sintezi sistema. V resnici se lahko zgodi, da so omenjene zahteve tako zahtevne, da omejijo uporabo obstoječih zmogljivih procesorjev To navede na uporabo dodatnih specifičnih komponent, kot so ASIC vezja ali FPGA vezja. Pri nekaj projektih se je že zgodilo, da smo uspeli zahteve algoritmov na visokem nivoju prevesti na implementacijo arhitekture zasnovane na sistemu z večimi komponentami. V tem prispevku predstavimo : 1. poenoten mode! za sočasno načrtovanje programske in strojne opreme zasnovane na metodologiji AAA (Algorithm-Architecture-Adequation), ki je zasnovana na pogojnem podatkovno odvisnem grafu s čimer lahko izkoristimo potencialno paralelno izvajanje algoritma. 2. nekaj osnovnih pravil, ki omogočajo sintezo podatkovnih in kontrolnih poti za vezje, ki odgovaria algoritmu definiranemu kot CFDDG graf. 3. optimizirano implementacijo CFDDG algoritma z enim ali več FPGA vezji z uporabo metode simuliranega ohlajanja. 4. sredstva in metodo za oceno časovnih zakasnitev. Ta metoda omogoča analizo delovanja za določeno implementacijo. Dobljeni rezultat: sredstva (vrata, 10) in oceno latentnosti uporabimo pri koraku optimizacije za odločanje, katera implementacija spoštuje omejitve (implementacija v realnem času, ki minimizira uporabo sredstev). 5. rezultate implementacije algoritma matričnega produkta na Xilinx Multi FPGA vezje in programsko orodje SynDex s katerim je izvedena AAA metodologija. 1. Introduction As the size and complexity of high performance signal, image and control processing algorithms is increasing continuously, the implementations cost of such algorithms is becoming an important factor. This paper addresses this issue and presents an efficient rapid prototyping methodology to implement such complex algorithms using recon-figurable hardware. The proposed methodology is based on an unified model of conditioned factorized data dependence graphs vi/here both data and control flow are represented as well to specify the application algorithm, than to deduce the possible implementations onto reoonfigurable hardware, in terms of graphs transformations. This work is part of the AM methodology and has been implemented in SynDEx (CAD software tool that support AAA, a system level CAD software tool. To fulfill the ever increasing requirements of embedded real-time applications, system designers usually require mixed implementation that blends different types of programmable components (RISC or CISC processors, DSP,..)corresponding to software implementation, with specific non-programmable components (ASIC, FPGA,...) corresponding to hardware implementation. This makes the implementation task a complicated and challenging problem, which implies a strong need for sophisticated CAD tools based on efficient system-level design methodologies to cope with these difficulties and so to simplify the implementation task from the specification to the final prototype. In this field, several system-level design methodologies and their associated tools have been suggested during the last years. SPADE /1/ methodology enables modelling and exploration of signal heterogeneous processing systems. The result is the definition of a heterogeneous architecture able to execute these applications with respect realtime constraints. SPARK /2/ is high-level synthesis framework that provides a number of code transformations techniques. The back-end of the SPARK system generates synthesizable RTL VHDL (control synthesis is a finite sate machine controller). GRAPE-II /3/ is a system-level development environment for specifying, compiling, debugging, simulating and emulating digital-signal processing applications on heterogeneous target platforms consisting of DSPs and FPGAs. After specification, resources requirement, mapping architecture, the last phase generates 0 or VHDL code for each of the processing devices. POLIS /4/ system implements a HW/SW codesign using the CFSM (the Codesign Finite State Machine formal model). A complete codesign environment, based on POLIS system, which combines automatic partitioning and reuse of a module database is presented in polls. The SPARGS design system /5/ is an integrated design environment for automatically partitioning and synthesizing behavioural specifications (in the form of task graphs) on multi-FPGA architecture. The SPARCS contains a temporal partitioning tool to temporally divide and schedule the tasks on the architecture and high-level synthesis tool to synthesize register_transfer level designs for each set of tasks. Each of the above tools has its own features (for example several models can be used for application and architecture specification) and innovative aspects but none of them support the entire implementation process onto mixed architecture using an unified model as well to specify the application algorithm, as to deduce the possible implementation onto multicomponent architecture. To achieve this goal, we have developed, in the one hand, the AAA (Algorithm-Architecture Adequation) rapid prototyping methodology /6/ which helps the real-time application designer to obtain rapidly an efficient implementation of his application algorithm onto his heterogeneous multiprocessor architecture and to generate automatically the corresponding distributed executive /7/. This methodology uses an unified model of graphs as well to modelize the application algorithm, the available architecture as to deduce the implementation which is formalized in terms of transformations applied on the previous graphs. In the other hand we aim to extend our AAA methodology to the hardware implementation onto specific integrated circuits in orderte finally provide a methodology allowing to automate the implementation of complex application onto multicomponent architecture using an unified approach. This paper presents the design methodology based upon graph transformation from algorithm specification to hardware implementation. This methodology automates the hardware implementation of an application algorithm specified as a Conditioned Factorized Data Dependence graph in the case of reconfigurable integrated circuits (FPGA). This methodology is illustrated through all the sections with a condi- tioned matrix-vector product case study that involve a moderately complex control flow involving both conditioning and loops. We first present the conditioned factorized data dependence graph model proposed to specify the application algorithm in section 2. In section 3 we present the implementation model describing the result obtained by applying a set of rules that allows to automate the synthesis of data and control paths from the algorithm specification. Following that, the principles of optimization by defactorization are described in section 4. In this section we present the using of the simulated annealing technique to obtain an optimized implementation on mono and multi circuit architecture. The proposed algorithms guided by the cost functions find the best solution that respects the real time constraint while minimizing the resources consumption. 2. AAA methodology: Algorithm model According to the AAA methodology, the algorithm model is an extension of the directed data dependence graph, where each node models an operation (more or less complex, e.g. an addition or a filter), and each oriented hyper-edge models a data dependence, where the data produced as output of a node is used as input of an other node or several other nodes (data diffusion). The set of data dependences defines a partial order relation on the execution of the operations, which may be interpreted as a "potential parallelism". This extended data dependence graph, called Conditioned Factorized Data Dependence Graph (CFDDG) allows to specify loops through factorization nodes, and conditioned operations (operation executed, or not, depending on its conditioning input) through conditioning edge. In this CFDDG graph, each oriented dependence edge is either a data dependence ora conditioning dependence, and each node is either a computation operation, an input-output operation, a factorization operation or a selection operation. This algorithm graph may be specified directly by the user using the graphical or textual interface of the SynDEx software / 7/ or it may be generated by the compiler from high level specification languages, such as the synchronous languages, which perform formal verifications in terms of events ordering in order to reject specifications including deadlocks /8/. 2.1 Conditioned Factorized Data Dependence Graph Typically an algorithm specification based on data dependence contains regular parts (repetitive subgraph) and non-regular parts. As described in /9/, these spatial repetitions of operation patterns (identical operations that operate on different data) are usually reduced by a factorization process to reduce the size of the specification and to highlight its regular parts. Graph factorization consists in replacing a repeated pattern, i.e. a subgraph, by only one instance of the pattern, and in marking each edge crossing the pattern frontier with a special "factorization" node, and the factorization frontier itself by a doted line crossing these nodes. The type of factorization nodes depends on the way the data are managed when crossing a factorization frontier: 1. A Fork F node factorizes array partition in as manysubarrays as repetitions of the pattern. 2. A Join J node factorizes array composition from results of each repetition of the pattern. 3. A Diffusion D node factorizes diffusion of a data to all repetitions of the pattern. 4. An Iterate I node factorizes inter-pattern data dependence between iterations of the pattern. Moreover, the user may want to specify that some operations will be executed depending on some condition. In our CFDDG model, we provide a conditioning process such that the execution of operations of the algorithm graph may be conditioned by a conditioning dependence, which is represented on the algorithm graph by a dashed edge. In this case, the conditioned operation is executed only if its inputs data are present and its condition of activation is satisfied. In order to indicate the end of the conditioned sub-graph in the algorithm graph that corresponds to the 'Endlf of the typical control primitive IF-THEN-ELSE, we need a specific node 'select'. It allows to select among the data it receive the one that will be sent to its output. The input data of a select node correspond to the data produced by the conditioned operations with their condition of activation satisfied. As the parallel execution of these conditioned operations, that are not necessarily exclusive, can lead to simultaneous presence of several input data at the select node, we introduced priorities between its data which will be specified in an explicit way with labels on the input edges (pi, P2, ... ,Pn). The input data having the highest priority pi will be selected and sent to its output. 2.2 Specification of Conditioned Matrix-Vector by using Conditioned Factorized Data Dependence Graph Figure 1 represents use a Conditioned Matrix-Vector Product example (G-MVP) specifying by CFDDG model: de- pending on the value of the input data C, this algorithm will compute either the product of the matrix Me R'" x R" by the vector Ve R" and will return the resulting vector or will directly return the input vector V. The computation of the product of the matrix M {composed of m vectors M,-.- M= (M,)|<,.<,„ by the vector V can be decomposed into m scalar products PS ={MiV)\&T (3) Where: T indicates the time constraint, t is the execution time for a given defactorization, S is the surface of the hardware implementation for a given defactorization, k defines the penalty factor. Based on the technique of simulated annealing, the algorithm that we propose starts with the calculation of the control parameter Co, then one starts by gradually decreasing his value and for each one of these values one carries out a certain number of changes of states of the system (defactorization). With each reduction of the control parameter of control, one increases the number of changes of states of the system by a factor ß. In the algorithm below: Xo presents the initial solution of the system, Lo is the initial value of a number of changes of system state, iter defines the number of change of a control parameter, L/( is the modification number of system state for a given control parameter Ck, X/ is a given solution or given state vector of the system, X/ is state vector after a defactorization process, it is a solution close to X,. Ck is indicates the value of a control parameter during the kth iteration, is the initial value of a control parameter, F(X) indicates the cost function for a system state X. Algorithm: simulated annealing for mono circuit architecture 1. initialize (co,Lo,Xo) 2. for iter =1 to num_iter do 3. 4. 5. 6. 7. for n-modif to Lk do Xj=V(X) if (F(Xi)- T tut — conli- and and if t > T IUI ~ conlr and and 3//0, >//0,,„,„, lis.-S,:„„,) +f/0„„+k{t„„-T^.„„„.) if t > T ''101 — ^uo/Ur and and lis, - +gii 10,-11 )+~ r,,,,,,) if t > T to! — co/ilr and and '/ ifui ^ comr and and if i 101 ^ '^cojitr and and if ifot ^ fo'ilr- and and 3//0,>//0„„„, (4) example, if one defactorises FF2 by 2 then X22 is equal to 2 and X21 and X23 are equal to 0. X22 contains a set of frontiers which their edges. To each frontier, one associates in random way a given circuit of the architecture by applying the simulated anneal- We applied this algorithm to an example which describes a five Conditioned Product Matrix Vector calculation (the CFDD graph contains 60 edges). The hardware implementation uses two circuits: XilinxXC4013E/X2. The parameters bellow defines the main characteristics of XG 4012 E/ X2 circuit: Device Logic Cells Max Logic Gates Max RAM bits Typical Gate Range CLB Matrix Total CLBs Number of Flip -Flops Max User I/O XC4013E/X2 1,368 13,000 18,432 10,000 - 30,000 24x24 576 1,536 192 ing method, in order to estimate the partionning we defined a cost function: Where : ttot is the execution time of the algorithm after applying defactorization, Slot is the area occupied by all the circuits after the defactorization process Si defines the area used in the circuit i, k,l,g are the penalty coefficients (varying from 10 to 100) used if respectively the time, area and I/O constraint are not respected, Tcontr is the time constraint, Scontr and l/Ooontr define respectively the time and I/O constraints for every implementation, l/Otot is the total number of the I/O used for all the circuits used by a given implementation, l/Oi is the I/O number for the circuit/. The proposed algorithm is similar than the mono circuit case, we only change the cost function. This algorithm takes the neighbourhood graph as an input and returns the partionning graph. This partinoning graph corresponds to muli-circuit implementation which respects the real time constraints and uses as less possible the hardware resources (i.e. area and I/O). To compare the optimisation results, we applied the simulated annealing algorithm for 5 different time constraints: 1000, 1500, 2000, 2500 and 3000 ns. The table below presents the results for each time constraint. The hardware implementation needs two circuits: circuit 1 and circuit 2. For each circuit table 1 gives the CLB and I/O used at different time constraint. 5. Conclusion We showed that from an application algorithm specified with a conditioned factorized data dependence graph it is possible to obtain a hardware implementation onto a reconfigurable integrated circuit following a set of graphs transformations, leading to a seamless design flow. These transformations allow to automatically generate the datapath and the control-path for designs with moderately complex control flow involving both conditioning and loops. The proposed delocalized control approach allows the CAD tools used for the synthesis to place the control units closer to the operators to control. We have presented an optimization heuristic based upon simulated annealing and we applied this technique for Conditioned and factorized Data Dependence Graph by using a defactorization process guided by cost function. We defined two cost functions for mono and multi Circuit architectures. We used these two algorithms to obtain automati- TIME Constraint [ns] 1000 1500 2000 2500 3000 circuit 1 CLB used 60 433 298 202 306 Circuit 2 CLB used 458 114 210 317 172 circuit 1 I/0 used 34 30 41 35 43 Circuit 2 I/O used 37 27 39 30 47 Total latency [nsl 397 1231 985 1280 1231 Total CLB 518 547 508 519 478 Total I/O 71 57 80 65 90 Table 1: number of circuit, CLB and i/0 used at different time constraint. 274 cally the best solution at the given time constraint. The optimization heuristics will address both defactorization and partitioning issues. Moreover, this extension of the AAA methodology to the hardware implementation of algorithm onto integrated circuit, provides a global methodology in order to tackle complex hardware/software co-design problems involved by multicomponent architecture. 6. References /1 / P, Lieverse, P. van detrWolf, Ed Deprettere, K. Vissers, "A Methodology for architecture exploration of heterogeneous signal processing systems", Proc. 1999 IEEE Workshop on Signal Processing Systems. /2/ S. Gupta, N, Dutt, R. Gupta, A. Nicolau, "SPARK, High-level Synthesis Framework For Applying Parallelizing Compiler Transformations", 7th Intl.Conference on VLSI design, January 5-9, 2004, Mumbai, India. /3/ R, lauwereins, M. Engels, M. Ade, J. Peperstraete, "Grape-ll: A system-level Prototyping Environment For DSP applications", IEEE Computer, Vol. 28,No 2, pp. 35-43, Feb. 1995. /4/ S, Edwards, L. Lavagno, E.A. Lee, A. Sangiovanni-Vincenteili, "Design of embedded systems: formal models, validation, and synthesis", Proc. of IEEE, v85, n.3, March 1997. /5/ M. Kaul , R. Venuri, "Optimal Temporal Partitioning and Synthesis for Reconfigurable Architectures", Design, Automation and Test in Europe, February 1998. /6/ T. Grandpierre, 0. Lavarenne, Y. Sorel, "Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors", CODES'99 7"' Intl. Workshop on Hardware/Software Co-De-sign, Rome, May 1999. /7/ T. Grandpierre, Y. Sorel, "From algorithm and architecture specifications to automatic generation of distributed real-time executives: a seamless flow of graphs transformations", MEMOCO-DE03, Intl. Conference on Formal Ivlethodsand Models for Code-sign, Mont Saint-Michel, France, June 2003. /8/ /9/ /10/ /11/ /12/ N. Halbwachs, "Synchronous programing of reactive systems", Kluwer Academic Publishers), Dordrecht Boston, 1993. L.Kaouane, M. Akil, Y. Sorel, T. Grandpierre, "From algorithm graph specification to automatic synthesis of FPGA circuit: a seamless flow of graphs transformations", FPL03, Intl.Confrence on Field Programmable Logic and Applications), Lisbon, Portugal, September 1-3, 2003. L.Kaouane, M, Akil, Y. Sorel, T. Grandpierre, "A methodology to implement real-time applications on reconfigurable circuits", ERSA'03, Intl. Conference on Engineering of Reconfigurable Systems and Algorithms), Las vegas, USA, June 2003. R. Vodisek, M. Akil, S.Gailhard, A.Zemva, "Automatic Generation of VHDL code for SynDEx v6 software", Electro technical and Computer Science conference), Portorož, Slovenia, September 2001. S. Kirkpatrick, C.D. GelattJr, M. P Vecchi, „Optimization by Simulated Annealing", Science, 220(4598): 671-680, May 1983. Mohamed Akil Laboratoire A2SI, Groupe ESIEE Cite Descartes, 2 Bid Blaise Pascal - BP 99 93162 Noisy Le Grand cedex, France akilm@esiee.fr Prispelo (Arrived): 15.09.2003 Sprejeto (Accepted): 03.10.2003