THE REVIEVV OF SOME DATA FLOW COMPUTER ARCHITECTURES INFORMATICA 1/87 UDK 681.32.02 Jurij šile and Borut Robič Jožef Štefan Institute Department of Computer Science and Informatics Ljubljana The article reviews same selected data flow camputer arhitectures. Ali the architectures are designed far VLSI implementation to provide large throughput, law power consunptloni and reduced size and weight. Uhile some are in the phase of sloulation and VLSI chip floor-plan contruction the others already ekhibit real VLSI implementation. PREGLED IZBRANIH PODATKOVNO PRETOKOVNIH RAČUNALNIŠKIH ARHITEKTUR - V eianku podajava pregled izbranih podatkovna pretokovnih računalniških arhitektur. Nekatere arhitekurs so bodisi v fazi simuliranja oziroma izdelovanja logifinih natirtov za VLSI vezja, druga' pa so ie implementi­ rane v VLSI tehnologiji. Inbroduction In spite of the conceptual break uith previous coaputers, the harduare of the fifth generation computers will be based on VLSI of seaiconductor cooponents. Vet it is to be expected, that the hardware of each type of the fifth generation camputer will be much more closely tailured to the application area than it is the čase at present. For a nuober of reasons, one of the most promising architectural models is data flow architecture. It is (lexlble and ektensible, it has the potential for very high data thro- ughputs, and it reflects, at harduare level, the inherent parallelism of the processing. Thus, the potential realm of use includes problem solving i inference machine and intel- ligent interface machine as it was proposed in the JIPOEC project for fifth generation Compu­ ter systems. The presence of some real data flow compu- ters indicates that the state of the art in data flow conputing has already passed initi- al, purely.academic discussions. In the article ue review some evisting data llow computers from the architectural view point. The presentation is not intended to be thorough. Instead, we concentrate on similari- ties and differences among the selected airchi- tectures. HanchBster data (low coaputar The machine org data llou ooaputar nication organizati bie-86D and is s structure is a ring to a host system vi modules operate ind iashion with pack rate of 4.37 M pack ned for the same i her in the aatching storage capacity, anization o Ceurd-85] i on with tok houn in F of four a an I/O BW ependently ets transfe ets/second. nstruction unit. T so that an f the Hanchastar s a packet commu- en roatching CHo- ig.1 . The basic odules connected itch module. The in a pipelined rred at a maxiiiiuiii Paokets desti- are paired toget- his has limited overtlov unit is required for programs with large data sets. Paired packets and those destined for unary instructions, fetch the appropriate instruction from the instruction stara, which contains the nachine-code for the executing program. The instruction is forwarded togethec with its input data to the prooassing unit, where it is executed. Output packets are eventually produ- ced and transnit.ted back to toward to the matching unit to enable subsequent instructi­ ons. The return path passas through the' I/O switch module, which connects the 5ysteiii to a host proosssor, and to the token queue, which is FIFO buffer for smoothing out an even rates of generation and consumption of packets. 10 Host (168 Kbvtes/second maM ) (14 Klokefis/second niax.) token packcis MalChing Unil l/OS*ilch lohen pa^r packets InsifuctlOr StOfO executable pacitets PfOcoss"tgL'i Ifom tHost (168 KbytM/socond max ) (14 Ktol^ens/second tna«) Fig.1. Manchester dataflow system structure. 62 The I/O switoh oiodule gives prlority to input froffl the ring and selects the output route by perlorming a deoode of certain marker bits. It is organized as a simple two-by-tuo cooaon bus swltch. The token queue comprises three plpellne bufier registers and a circular buffer memory. The later has a capaclty of 32K packets, each beeing 96-bit wide. The aatching unit is based on a 1,2S MU pseudoassociative oieoory uith six pipeline re- gistere in the nain ring and two buffere interfacing with the overflo« unit, The roemory is used to store oatched packets while auaiting their partners. Its associative operation is achieved by accessing a parallel store using an appropriate hash function. Recall, that pac­ kets destined for unary instructions do not need to oatoh uith partners; instead, they pass straight through the unit. The overflo« unit handles packets that cannot be placed in the parallel hash table becouse they encounter a (ull hash entry. Overflow packets are stored in linked lists in the ovarflou unit, uhich contains a oicrocoded processor together uiith data and pointer nemories. The instruction store coraprises two pipeli­ ne buffar registers, a segment lookup table, and a random-access instruction store to hold the program, The segment field of the inconing packet is used to. access the instruction froo the store. The instruction is 70 bits uide, The instruction is conblned uith a destination fiald and the data iield of the inconing packet and Is sent to the perocassing unit as a 'tA6-bit eiiecutable instruction packet, The processing unit comprises five pipeli- ned buffer registers, a special purpose prepro- cessor, and a parallel array of up to 20 hoaogeneous oicrocoded funotion units uith lo- cal buffer registers and cominon buses for input and output, A soall nuober of instructions are executed in the preprocessor but the iiiaJorlty ara passed into one of the funotion units vla the distributlon bus. Each funotion unit con­ tains a oicrocoded bit-slice processor with input and output buffering, 51 internal regi­ sters, and 4 KU of uritable nicrocode inemory. Instructions are enecued independently in their allotted funotion unit, and the output is oerged onto the arbitration bus and thence out of the processing unit touard the I/O suitch. In the nanohester archltacture a haruare nashing schene is used to simulate the associa­ tive neaory uhich turns out to be Isss exBpen- cive, Unfortunately, this scheme does not produce very good results in terns of uaiting tiso. In order to reduce the uaiting tirne, a •ultiple aatching units scheoe is incorporated in the EIHAN - EXtended MANchester data flou coaputer CPatnaik-863, NIT data flou ooaputor The HIT data ilav ooaputar bases on a static concept of data flou architecture CDen- nis-B03 in uhich the instructions of machine leval prograo are loaded into specific iiiermory location in the nachine before conputation begins, and only one instance of an instruction is active at a tine. Instructions are held in the local niesories of the proovsalng alaaant PE, Each instruction includes an ooeration code, spaces for operand values, and destination fields that spacify uhere resuls should be sent. Each PE is equlpped to recognise uhich of the instructions it holds havs been enabled for execution by arrival of needed operand values, If an ena­ bled instruction calls for a scalar arithoetic operation, the instruction, including its ope- rands, is sent to a functlonal unit FU capable of perforaing that operation. The array •a*ory units AH are provided to hold arrays of data PC 1 A__ _ ..r n. s o 0 PE 1 Hr T A ^ ^ fu AM 0 o o AM FU 1 A v V ^ D(5ni«LincN ROUTNC fit ruMIomtUMt Fig.2o The HIT data flou computer. making up the data base of coiaputation, and are accessible through the seaorv routing network. Instruction eKecution in FU or AM ylelds result packets each of uhich consiets of a data value and 8 destination field that specifies the targat instruction for the result packet. The result packets are sent to PEs that hold the target instruction through the distributlon routing natuork. Other instructions, such as those calling (or duplicata data values, for boolean operations, and for simple tests, are performed uithin the PE. The current status of the HIT data flou projeot ie that hardueire for the above conputer architecture is under development. For this sake, a data flou engineering model C0ennis-S33 consisting of eight processing units coupled by a paket coaouniation nstuork built of tuo-by-t- uo routers is designed for aoulating the de- scribed architecture. Data llou coaputar SIGHA-l 8ieHA-1 is a data flou oultiprocessor sy- steo for soisntific conputations CShifflada-86]. The configuration o( the 5ysteci is depicted in Fig.3. Four processing elenents PE and four structure eleoents SE are connected by local netuork and oalled a group. Sroups are connec­ ted by global nstuork. The purpose of using hiorarchlcal netuork is to executQ prograns afficiently by utillzlng prinoiples of locali- ty. Fig.3o Global configuratlon of the SIGHA-l, The procassing slammnb consists of five units, uith the units organized as a tuo-stage plpellne as shoun in Fig.4. PE enecutes ali Instructions except those that sianipulate struoture Beaory. The buffar unit (B KU of AO bits) is an Interface betueen the netuork and the PE. The length of the incoaing packet is 88 bits. It is divided into tuo parts (top 63 48-bit and bottom *0-bit) and passes through' the network as aonsecutive parts. The most signifficant 8 blts are a network address, next 40 bits are tag, and the remalning 40 bits are data type and value. Hhen there is no gaiting packet in the buffer iiieinQry and the nent units are not dealing with an other packet^ the incooing packet bypasses this unit and proceeds to the subsequent units. The (stoh unit is 16 KM, 40-bit-wide program memory. The link num- ber carried by an incotning packet is used to access the address o( an Instruotlon to be letched. The operation field of the fetched instruction /indicates an operation code and is sent to the Bxecution unit. The destination field of the fetched instruction gives addree- ses of destination instructions Cwaiting for the result) and is sent to the destination' unit. The matching flags from the destination field are sent to the matching unit. The •atohlng unit is a 16KH, 80-bit-wide associati- ve fflemory to find a partner packet of an incooiing packet. The matching-flag indicates whether the operation is unary or binary. When it is a unary, the incoraing data packet is bypassed to the execution unit. If the in­ struction is binary operation the incoming packet is stored in the associative memory if it is a first arrived packet of the two operands. Otheruisei the matching unit succe- eds to find a partner packet in the matching aemory and sends both data of packet pair to the executlon unit. The •Ksoutlon unit con- sists of an ALU, a shift unit and a floatlng point arithffletio unit. The word length is 32 bits. It receives an operation code from the fetch unit and data from matching unit. The destination unit nakes output packets by combi- ning the destination addresses and results from the execution unit. JL PBTCH OIIT KATCailS DiriT DESTimiOl UIIT "^V PIRST STACe eiECtmoii DIIT SECOID STAGB used for the module at each stage of the global netuork. Judging from the performanpe of 1.35 MIPS of the pratotype harduare for the benchoark prograns, the performance of the next version of a processor with CMOS LSI technology should be about 1.? MIPS. |iP07281-baaed data flow architaotura The pPD7281 is the first VLSI device on Silicon using data flou architecture CNEC-8S:]. ThB pPD72ai image pipeline processor is desi- gned to be used as a peripheral processor with a mini- or aicrocomputer serving as the host. Fig.S shows a general system configuration exafflple of uhich four pP07281s are used connec- ted to the neaory in a ring shape with the entlre ring interfacing with the host computer Via a standard bus. For the above architecture, NEC is develo- ping a support chip HAGIC, nefflory Access and General bus Interface Chlp. It handlee aH packet flow between the pP07281s, the image ffleaory, and the host processor. / \ I 8 I I V I I I I 8 I I I I T I I I • • • • I I I I I IHA6E I I HENORV I I I •_—+•_—4. II •I -f-i • •—• i I H A 6 I C I I I I I I I I B I I U • !• HOST CPU ir 8 I •- I I I \ / No. 4 PPD7281 II No. 3 |iPD72ai II No. 2 PPD72S1 II No. 1 )iPD72ai II —+1 Fi4.4. Structure of the processing element. Fig.S. )iPD7281-based data flcv architecture. The atruotur« aleaent comprises 64KU, 35-bit-wide memory te store arrav data and a control unit to nanage free «emory uords and uaiting queue8. When an array is deolared in a program, a contiguous area corresponding to the array size is allooated in the structure memo- ry. Once the uord is allooated, the used bit of each uord in the area is turned on, Each word has two other special bita. The presence bit aeans that data has already been vritten in the word. The waiting bit indicates that at least one read request packet exists in the uaiting queue. Hhen data is written in the word the data is sent to the instructions indicated by the walting packets. A 10 by 10 crossbar is adopted for a looal natuork. This is realized by bit slice ohip. The global nctvork is organlzed as a nultistage netuork CHavrie-863. The same orossbar chip is The pPb72ai uses an internal circular pipe­ line and the powerful instruction set Ceilc-86tl to allow high end immage processing. A data flow architecture allows the processor to maxi- Bize efficiency in a variety of multiprocessing applications. As shoun in the block diagram in Fig.6, the nPD7281 is formed by ten functional blocksi the Input oontrollcr IC, the link table LT, tha funotion tabl« FT, the address genera­ tor and flow oontraller A6&FC, the data acvir/ DH, the queue G, the processing unit PU, the output fueue 08, the output controller OC, and the refreeh eontroller RC. Before any processing occurs, the host processor down-loads the object code into the LT and FT by using specially fornated input packets. The contents of the LT and FT are closely related to a data flou graph. The arcs represent the entries in the LT uhlle the nodes represent the entries in the FT. 64 Fig.6. Block diagram of tha pPD72B1, prooaaaing aleaants (PEs). The PEs are allowed to be functianally nonldentical in order to capitalize the eKisting high speed architectu- res for fixed signal processing algorithas. Frequently usad operations inay be executed in dedicated PEs having the appropriate hardware structure. The I/O functions take plače in speoial PEs called I/O processors. This is oonvenient in signal processing applications, beoouse signal sources and sinks lend to inlro- duca specialized requirenents. The control section schedules instruotions lor the PEs using fiiced-format messages. Re- oall, that initially ali the informatlon about tha data flou graph of the application program resides in the local menories of the PEs. Each result packet carries the neccessary part of this information to the control section where it is temporarily stored in the activity store until the destination operation may be schedu- led for the execution. The eKecution is per- foraed by sending an operation packet to one of the PEs. The aotlvlty atora contains the activlty tenplates of those operations which ^-.Ave reoei- ved at least one of the operands, Di:t, whi.ch are not Echeduled for the eKecution. Conceptually, tha activity store contains a representation of tha active part of the data flow graph. The contains of the result packet are used by the updata unit for locating the activity teaplate (of the destination operation). It also contains the the address of a block in the When a data packet enters the pPD7281, it fetches fro« the LT the address of the instruc- tion in FT, uaiting for the incoming data. After the destination instruotion has been fetched, the ASFC unit deteraines whether the instruotion is unary or binary. If it is unary, the operation packet, consisting of the instruotion and the data is oomposed and sent Via 9 to the PU. If it it binary, the ASFC Stores the inooaing data to the 011 if it is the first arrived operand for the instruotion. Otheruise, it fetohes tha flrst operand from the on and sends it together uith the incoming packet and the instruotion to PU via a. The result packet froa PU can either be sent out of the pPD7281 , Informatlca 10 (4) (1986) 44-50 (in Slovene). CPatnaik-S63 L.n.Patnaik, R.GovindaraJan, and N.SiRanadoss, Design and perforaance evaluation of EXnANi an EXtended MANchester data flow ooeputer, IEEE Trans. Comp. 3S (3) (1986) 229-244. CRabie-863 B.Robiti and J.Sile, Classif ication of new generation computer architectures, In- formatioa 10 (4) (1986) 18-32 (in Slovene). C8hiaada-863 T.Shlmada, K.Hiraki, K.Nishida, and S.Sekiguohi, Evaluation of prototype data flo« processor of the SIGMA-I for sclentiflc computation, Proč. 13th Int'1 Syfflp. Comp. Aroh., IEEE (1986) 226-234. Ceilc-863 J.eilc and B.Robifi, Data flow archi­ tecture based processor, Informatica 10 (4) (1986) 74-80 (in Slovene).