A SIMULATION OF DATAFLOW COMPUTER MODELS AND A MODEL OF FAULT TOLERANT DATAFLOW COMPUTER INFORMATICA 4/87 UDK 681.3.068 Milan OjsterSek, Viljem burner, Peter Kokol, Anton Zorman TehnicaJ Faculty Maribor ABSTRACT: Dataflow architecture is accepted as the architecture for future computers because it can' exploit potentialy all the parallelism in a prograa. This architecture is assuaed to execute dataflow graphs. The nodes in the dataflow graphs represent asynchronous tasks. The arcs connecting nodes represent coaaunication paths for the aessages (tokens) generated by nodes or suplied froa the external environaent. Each node is executed (fired) when the required input becoaes available. The dataflow arhitecture proposals can be classified as static and dynaaic architectures. In a static architecture the nodes of a prograa graph are loaded into aeaory before the coaputation begins and at aost ona instance of a node is enabled for firing at a tiae. A dynaaic architecture facilitates the firing of several instances of a node at a tiae and these nodes can be created at runtiae. Me have developed prograa language DFL1, siaulator for the siaulation datafloa cuaputer aodels and their key aechanisas, iaproved hardware, software and set of instructions needed to support an efficient fault-tolerant dataflow coaputer. 1 INTRODUCTION Over the last few years data flow coaputer architectures have been propose'd and soae coaputers have been built. A siaple data flaw execution eodel is based on a 'pure* data flow prograa organisation: with an instruction being enabled when all its input tokens are available. Each instruction aay have any nuaber of inputs and any nuaber of outputs. Discussing the U-interpreter, Arvind and 6ostelow suggest that aaxiaua parallelisa can only be extracted froa a prograa if an arc is a!lowed to carry aore than a single token - a process achieved by carrying a label with each token that identifies the context of that particular token (ARVI 82). Packet coaaunication aachine organisation and high level coaputer language with exploiting inherent parallelisa in data, flow graph is the aost significant for the data flow architecture. Me have developed prograa language DFL1 (KOK B4) and siaulator for the siaulation of five various data flow coaputer aodels, which art suited for todays technology. During the siaulation, siaultaneous execution of a data flow prograa in distinct units (token transaission and •itching, foraing activities) and soae iapartant statistical paraaeters (i.e. an input stream, a business period, an idle period, a service tiae ...) are observed (OJST 84, 2UH 83-1, ZUH 83-2, OJST 86, OJST 97-2). 2 DESCRIPTION OF MODELS Our aodels are based on the packet coaaunication eachine organisation with token watching and consist of an input section, aeaory sections, a global store, processor sections, an output section and units for coaaunication aaang sections. The input section decomposes the data flow prograa and supports aeaory sections with input data. Me have used two decoaposition aethods: - Each instruction into its own aeaary section. If there are aore instructions than aeaory sections, then aore instructions are stored in one aeaory section. Instructions tre randoaly distributed. - Each block into its own aeaory section. If there are aore blocks than aeaory sections, then aore blocks are stored in one aeaory section. The aeaory section aatches tokens into sets of tokens. Hhen all of its input tokens having the saae context are available, it foras activity (executable instruction which consists of a set of tokens with the saae context and a copy of instruction) is foraed and sent into an processor section. The processor section executes the activity and sends output results into aeaary sections. The output section collects final results of coaputation. The global store is used to prevent deadlock of the systea. Coaaunication units (networks, busses) transait aessages (tokens, sets of tokens, activities). They use post office interconnection principle for transaission aessages. Coaaunication units in our siaulation are siaulated as delay units. 2.1 Model 1 input isotlo global star* 1 J proceesor section 1 netiory section 1 laotlvltyl Inetwork | , nroceasor (section R >. ^-" JJSti ^ne tworkj -> <^. |n«mory (section S L \ Figure l:The aodel 1 The aodel 1 uses two networks: one for transaitting activities and one for transaitting data. It has separate aeaary sections and separate processor sections. The aeaory section consists of an input FIFO 60 queue, a latching store, an instruction store, an associative logic and an output FIFO queue. The processor section consists af an input FIFO queue,a processor, an output FIFO queue. This configuration is not suitable for super-systeas, because networks cause too such delay tiae in transmitting aessages. An advantage of this aodel is autonoaic nark of units. The •odel is suitable for saall systeas and for a subsystea of a large systea. 2.2 Models Connected In Clusters Input 1 section J cluster i| - s global store J global network •^_ . - Icluster II t i * ] Figure 2:Rodels connected in clusters Models 2,3,4,5 are very siailar. Sections of aodels are connected in clusters. 2.2.1 Model 2 The aeaory section and the processor are associated in one eleeent (figure 3). The aeaory section Batches tokens into sets of tokens, fares and executes activities. Output tokens nhich don't have their own instructions in the cluster are transaitted into butput FIFO queue. Other tokens are aatched in the cluster. Higher speed of foraing and executing activities is the advantage of this aodel. There is no additional transaitting betwen aeaory section and processor. This aodel is very suitable for saall and aediua systeas if the adequate decoeposition aethad is used. 2.2.2 Model 3 The processor and the aeaory section are separated (figure 4). If the processor is idle and the aeaory section hasn't any activities, it can receive activities froa other clusters. If the aeaory section has activities and the processor isn't idle, activities are transaitted into the output FIFO queue. The global network transaits it to the idle processor section. J input PIPO 1 nemory sectior • processor 1 butput FIFO [jueue 1 i Input FlfOl *]ueue I } global 1 Input PIFOl luftue 1 1 roeraory • ansoclatlve logic J output PIKO founue 1 1 1 i global store bus "I Ibuffel procefl output queue 1 BOr FIFO 1 1 2.2.3 Models 4 And S Models 4 and S are siailar to aodel 3. In each cluster they have one or aore processors. The aeaory logic establishes which processor is idle and adresses the activity to it. Model 4 is slightly better than aodel 5, because it needs less tiae for writing in FIFO queues. Figure 3 Figure 4 Figure 3: The cluster of aodel 2 Figure 4i The cluster of aodel 3 llniiut FIF'Jl n"<"» I iput FIFO) leue ' I 1 memory + aseoc1 at 1v« logic i butput FIFOI hueue 1 global I L store uffer 1 Suffer Mj J output FIFO] f # > mtput PIFCl queue 1 I ' " ' tueue H | T Figure StThe cluster of aodel 4 llnput PIFOI hueue Input FIFO] lueue I memory • associative loglo output FIFO buffer 1 J— output FIPC In memory section 1 tutput FIFO in global network 1 global store bus ibuffe output PIf( In aeaory section H output fir: in global network H Figure 6iThe cluster of •odel S 61 3 SIMULATION RESULTS 98 different parameters can be varied in our simulation but it is reasonable to vary just soae of thea. Constant values *re given to all others. Our aodels are siaulated Kith four programs, the blocks of which are coaposed of data flo« graphs shown on figures 7,8,9,If. Prograa blocks are independent and they have no coaaon flow of data tokens. Figure 7 Figure 8 Figure 9 Figure If All data flow graphs have IS instructions. Sraphs shown on Figures 7 and 8 have aany inherent parallelism and need aany input data tokens. The graph shoan on Figure 8 consists of instructions /, which are executed 31 tiaes slower than instructions +. Such an instruction delays the execution of following instructions. The graph with few inherent parallelises is shown on Figure 9. On Figure If the graph with aany inherent parallelises and three input data tokens is shown. Each instruction in the graph shown on Figure II, has one input operand and one constant. This increases the speedup of firing enabled instructions. How are algorithas influenced to our aodels is described and evaluated. Our siaulation is executed with these constant input paraaetersi an instruction + is executed in 2 cycles an instruction / is executed in 6f cycles aatching one token in a set of tokens is executed in 1 cycle activity forming is executed in 1 cycle FIFO queues have 3 eleaents aritting or reading tiae to FIFO queues is negleable a network can transmit an infinite nuaber of aessages blocks of prograa are independent - a aodel 1 has S aeaory sections In the file of input parameters He have varied the following paraaetersi - the selection of system ( 1 - 5 ) - the decoaposition aethod the delay tiae in transaitting aessages through the network (f, 1, 2 cycles) - the capacity of aeaory section and a capacity of global store - the input algoritha The prograa consists of an equal nuaber of blocks and aeaory sections. He have reduced the nuaber of observed output paraaeters to the ainiaua following ones: a prograa execution tiae requested for execution of 2fl instructions, an average nuaber of busy processors, an utilization of processors, an average nuaber of busy aeaory logic in aeaary sections, an average nuaber of sets of tokens, which are aatched in aeaory section, an utilization of a aatching store, an average nuaber of sets of tokens, which are aatched in global store, and an utilization of a global store. The following abbreviations are used in diagrams: UP - the utilization of processors (I) DN - the delay tiae in transaitting aessages through the network (cycles) MP - nuaber of processors TE - the prograa execution tiae (cycles) algoritha shown on Figure 7 algoritha shown on Figure 8 — • algoritha shoan on Figure 9 algoritha shoan on Figure If For the aodel 1 the decoaposition aethod "each instruction into its own aeaory section" is better than 'each block into its oan aeaory section* , so we have aade diagraas which evaluate the aodel 1, only for this aethod. For aodels 2,3,4 Me have aade diagraas only for "each block into its own aeaory section* decoaposition aethod because this method is better. Th« program execution time dependent on the delay time in transmitting messages through tbe network TE naa.a teea.e sea.a isea.a sea.a TE uea.e ura.e sea .a uea.a " 1S&. sea.a 2.DK 2. DX MODEL 1 •Fii I.Di MODEL 4 IIP .3 6 DJI 62 Tfta utllltatlon of prooaaaora depond«nt on tti* delay ti«« In tranaaltttoa MI«|M through Mi* n.txork 0 2. 0" construction of a larga and fast aatching acaory at a reasonable cost, construction and aanageaent of a structure ataory, network construction, interruption, error and exception handling, inefficiency due to insufficient parallelise, developing efficient control, partitioning and scheduling algorithas. Me have developed key aechanlses, iaproved hardware, software and set of instructions needed to support an efficient fault-tolerant dataflow coaeuter IOJST B7-1, OJST 87-31. Proposed datafloM architectures »r» very inefficient on regular structures because of fine granularity of their operations. When data is structured (vectors, aatriccs, records) the control and data flOH is very regular and predictable and there is no need to pay high overhead for scheduling. These architectures don't have aechanisas for interruption, error and exception handling, a aechanisa which reassigns nodes to another unit if faults have been detected, a aechanisa Hhlch stops sending tqkans to the faulty processor and a aechanisa Mhich destroying read/Mrite requests in eeeory after fault. Contents of data buffer in the output port of the fail processor are lost. 2. DN Figure 111 Task Invocation Structure control and information «lo» The algoritha shown on Figure if is the fastest one (it has a aaxiaua degree of inherent parallelise and a ainiaua nuaber of input data). Proportions in executing tiee of prograa and an utilization of units are changed by increasing the delay tiae in the transeission through the network. An executing tiae of prograa is dependent on an inherent parallelisa of algoritha and a nuaber of input data needed for the execution of this algoritha. The executing tiae of algoritha is decreased to soae liait and then it keeps constant value (all inherent parallelises are used), if the nuaber of aeaory sections or the nuaber of processor sections are increased or delay tiae in transaission through the coaaunication units is decreased. Owing to the fact that the processor section and the aeaory section of aodel 2 fora one eleaent, execution tiae of aodel 2 is alaoust t«ice as long coapared to aodel 3; but its advantage is the best utilization of units. Because delays of reading and writing into FIFO queues Here not considered, the results of aodels 4 and 5 are identical. Kith this considerations He have found that aodel 4 is better than aodel 3. The biggest average nuaber of sets of tokens in the aeaory section is in the algoritha shown on Figure 9, because instruction / delays the execution of following instructions. On aodel 1 Me have siaulated the deadlock of a systea. If aatching stores are full, then other tokens aust aatch in the global store. This causes bad utilization of units in a aodel (a lot of tiae is Hasted for coaaunication betwen aeaory sections and the global store). He have siaulated the deadlock with decreasing capacity of aatching stares. A good decoaposition aethod can prevent deadlock or bad utilization of a systea. nttructlor tore Tobal lock U global •tructuri global - -(control jnlj r Utor«[ - 1—tt/O clumtai Figure 12: Global level of Fault Tolerant Dataflow Machine (FTDFH) Figure 13: Cluster level of FTDFH Jlockl1 t«bl«l I Btructura jproc—mor box •— r~ —1~ |ln«tr.1 fetor, ll 4 FAULT TOLERANT DATAFLOW COMPUTER Results of siaulation and our knowledge about datafloH coaputers are shown that there are aany probleas to be solved. In the design of an actual dataflow coaputer the aain probleas arei Figure 14: Processing Eleaent of FTDFH In our coaputer aodel (figures 12,13,14) the coabination of control flow, data flow and deaand driven are used. He adapt to the granularity of the data structure and treate large structures as one object. He reduce scheduling overhead by coabining together as eany scalar operations as possible and executing thea as one 63 object. Hierarchical decoaposition aethod (ARVI BS) are used. This aithod is based on the concept of resource bounded graphs (RBGs). The RB6 is a prograa fragaent for Mhich a bound on the resource requireaent of the fragaent can be derived at compile tiae as a siaple function of a collection of parallelisa paraaeters. A prograa is viewed as a collection of RBBs. The eiecution of a prograa can be represented by a tree of task invocations ( figure 11). This aethod use breadh-first partitioning algoritha until sufficient parallelisa is generated and then it use depth-first partitioning algoritha. Kith depth-first partitioning algoritha deadlock are avoided. Our systea is a dynaaic architecture for task level dataflow with facilities to support node reassignaent when processors fail. It uses tagging scheae siailar to the HIT dynaaic architecture. It have three hierarchical levels. The purpose of using three hierarchical levels is to execute prograas efficiently by utilizing principles of locality. Each hierarchical level have an instruction store, a block table, a structure store and a control unit. The instruction store holds copies of the instructions which are executed in this level. The structure store holds input data structures (DS) of RB6 which are executed in this level. The block table contains adresses of units where RBBs are executed. The control unit adresses RBGs and its OS to units in this level and generates block table. If hardware faults occur in one of the units, control unit reassigns RBGs and its DS to healthy units. Kith this aechanisa Me achieve that only the RB6 and their DS which is assigned in the faulty unit aust be reexecuted. Control unit also deletes RBSs and DS when the computations of thea Mrt finished. It also controls the aaount of available aeaory and the utilization of units. If deadlock is occured the control unit reassigns RBGs, its sets of tokens and DS froa too busy units to idle units. Our proposed aodel has high bandwidth (HB) and low bandwidth (LB) coaaunication paths. HB paths are intended for transsaiting RBBs, DS, activities and tokens betwen units. The LB paths are intended for transsaiting status, diagnostics, control and aeasureaent inforaation. Our aodel has I/O instructions, instructions which are executed in exception condition, instructions iapleaenting the special operators for dynaaicaly creating instances of node resulting froa recursive nodes in dataflow graphs, loop and streaas, table-oriented instructions (for readind an entry, writing into an entry and aadifying parts of an entry), structure-oriented instructions (for selecting an eleaent froa a structure, appending an eleaent to a structure and testing for eaptiness), string aanipulation instructions (to search for a substring and a lenght of a string to coapare strings, and concatenate strings), streaa oriented instructions, fixed-point instructions, logical/shift instructions, floating-point instructions, coapound instructions (for reducing token aaveaent). 5 CONCLUSION Today a vast collection of single-board computers *re available which offers roughly 1 HIPS at low cost: these are touted as building blocks for aultiprocessors. Can dataflow aachines coapete? It is not clear if a single dataflow processor can achieve the perforaance of a von Neuaann processor at the saae hardware cost. The dataflow instruction-scheduling aechanisa is clearly aore coaplex than increaenting a prograa counter. An engineering effort substantially beyond any of the current dataflow projects is required to aake fair coaparison. The SI6HA-1 (SHIM 86) project is an iaportant step in this direction. The question becoaes aore interesting when we consider aachines with aultiple processors, where the dataflow scheduling aechanisa yields significant benefits. The dataflow approach can be viewed as an extreae solutions to the aeaory latency problea - the processor never waits for responses froa aeaoryi it continues processing other instructions. Instructions are scheduled based on the availability of data, so aeaory responses are siaply routed along with the tokens produced by processors. It is our belief, that dataflow architectures together with iaproved hardware, software, set of instructions needed to support an efficient fault tolerant dataflow coaputer and with powerfull high level functional languages, will show the prograaaing generality, perforaance and cost effectiveness nedded to aake parallel aachines widely applicable. 6. References (ARVI 82) Arvind, Gostelow K. P.: •The U-interpreter' IEEE Coaputer, February 19B2, pp. 42-48. (ARVI 85) Arvind, Culler D. E.: 'Managing Resources in a Parallel Machine', Proc. of the IFIP TC-U., Conference an Fifth-generation Coaputer Architecture, Manchester U.K., July 198S, pp. 113-121. (KOK 84) Kokol P., Stiglic B., zuaer V.i Pretvorba aplikativnega jezika v grafe pretoka podatkov', ETAN, XXVIII Jugoslavenska konferencija Split, 4.- 8. juna 1984 Split, pp. IV. S47 - IV SS4, - in Slovene. (OJST B4) Ojstersek H.: "Siaulacija podatkavno vodenega racunalnika* Tehniska fakulteta Naribor, Maribor 1984 - diploasko delo -in Slovene. (OJST B6> Ojstersek H., 2uaer V., Kokol P.: 'Data Flow Coaputer Siaulation", in Proceedings of the 8th International syaposiua Coaputer at the University, Cavtat, pp. 2.87 1 - 2.17 1», Hay 1986. (OJST 87-1) Ojstersek H., 2uaer V., Kokol P., Zaraan A.: "Izboljsana strojna in prograaska opreaa ter nabor instrukcij, ki oaogaCajo delovanje ucinkayite in na napake neobtutljive podatkovno vodene arhitekture", XI Bosanskohercegovacki siapoziua iz inforaatike 'Jahorina 87*, zbornik radova, knjiga I, Sarajevo, april 1987, pp. IS* 1-8 - in Slovene. (OJST 87-2) Ojstersek N., 2uaer V., Kokol P.: 'Dataflow Coaputer Models', COUP EURO 87, VLSI and Computers, Haaburg, aay 1987, pp. 884-GBS. (OJST 87-3) Ojstersek H., turner V., Kokol P., Zoraan A.: "An Overview and Coaparison of Data Flow Coaputer Systeas*, Proceedings of the 9th International Syaposiua Coaputer at the University, Cavtat, Yugoslavia, pp. 2R.»6 1 - 2R.I6 6, Hay 1987. (SHIM 86) Shiaada T., Hiraki K., Nishida K., Sekiguchi S.: "Evaluation of a Prototype Data Flow Processor of the SI6HA-1 for Scientific Computations', Proc. of the 13th Annual International Syaposiua on Coaputer Architecture, Nashington, USA, 1986, pp. 226-234. (2UN 8S-1) zuaer V., Kokol P., Ojstersek M.: "Modeli podatkovno vodenih sisteaov in ocena zaogljivosti", IX Bosanskohercegovacki siapoziua iz inforaatike "Jahorina 85", zbornik radova, knjiga I, Sarajevo, april 198S, pp. 168-1-8, - in Slovene. (2UM 86) Zuaer V.,Kokol P., Ojstersek H., Zoraan A.: "Parallel Coaputer Architectures, Prograaaing Languages and Algorithms," Tehniska Fakulteta Haribor, Haribor 1986, Technical Report - in Slovene.