A PARADIGM OF TRANSPUTER SYSTEM IMPLEMENTATION INFORMATICA 4/87 UDK 681.3.06 Branko Mihovilovic, Peter Kolbezen, Jurij Jozef Stefan Institute, Ljubljana • Che implementation of • raytracing algorithm on an array of transputers Is presented. 36ucD system l>as some important properties. Processing speed is directly proportional to the number of transputers used. • Cl)e system is remarkably robust namely, Individual transputers can be removed from tl)e system while the program is running, and trie system will continue to function afthought with reduced performance and some loss of data. KEY WORDS - Transputer, occam, r&ytr&clng, performances. 1. INTRODUCTION This paper describes the implementation of a computer graphics program on an array of tran- sputers. This program was written to provide a denonstratlon of the performance obtainable by using large numbers of transputers. We used a technique Known as raytracing wlch can generate very realistic Images but requires massive amounts of computer power. This is an Ideal application for transputers as the calculation for each picture element on the screen are Independent of one another an so can be done in parallel on separate transputers. The entire program Is written in occam, however, the main part of the program could have been written in any suitable language. Only those parts of the program which deal exspllcitly with concurrency and the distribution of work are easier to write in occam. 2. BACKGROUND 2.i. Transputer Concurrent systems can be constructed from a collection of microcomputers which operate concurrently and communicate through point to point serial communication Jinks. The INMOS transputer /1-4/ is the ideal building block of high performance multiprocessor systems which provide maximum speed with minimum hardware. A transputer contains memory, a processor and a number of standard point-to-point communica- tion links which allow direct connection to other transputers. The processing capability may be general purpose or may be optimized to a specific purpose. . The on-chlp memory may be extended of chip by suitable interface. A tran- sputer may also have special purpose Interfaces for connection to specific types of hardware. The first transputer available was the IMS T414, a 32 bit RISC microprocessor with a throughput of to mips. It has 2 Kbytes of fast SRAM (SO ns) and four serial links. The 32 bit multiplexed bus allows up to 4 Obytes of exter- nal memory to be accessed. Compatible with the T414 is IMS T600 wlch Includes 4 Kbytes SRAM and floating point arithmetic. 2.2. Occam Occam /5,6/ is a small and elegant language. It combines the best of contemporary thought about control structures and variable scoping with some radical new structures to handle concur- rency. It is based on an model of computation that is different from conventional languages in that It includes the notion of communica- tion, parallel execution, and synchronization in its very structure. The basic unit of occam programming is a pro- cess that performs a sequence of actions and then terminates. There may be more than one proccess executed at any given moment. Occam programs are constructed from three primitive processes: assignment, Input and output. The assignment v := e sets the variable v to the value of the expres- sion e. The output c t e outputs the value of expression e to the chan- nel c. Similarly, the Input c ? v sets the variable v to a value input from the channel c. Constructors are used to combine processes to form larger processes. The sequen- tial constructor, SBQ, causes its components to executed one after another. The parallel con- structor, PAR, causes its components to be 55 executed concurrently. Finally the alternative constructor, ALT, chooses one component process for execution which is earliest ready. It is cleare that IF and WHILE constructors are also provided in occam program. 2.3. Ka.ytra.dng Raytraclng is a now well introduced technique for realistic image synthesis from three dimen- sional geometric scenes. The basic raytraclng algorithm Is described in /7/ and is briefly given here. Simplifying somewhat, for each picture element of the rastered image, a ray Is trased from the viewpoint into the three dimen- sional scene to calculate the first Intersec- tion with an object. If the object Is reflec- ting or refracting, an appropriate ray is dete- rmined by the law of reflection and refraction. These new rays are traced analogously. To cal- culate shadows, the ray-object intersection points are connected by line segments to the point light sources illuminating the scene. If there Is an object intersecting the line seg- ment, the intersection point lies in the shadow of this light source, and its intensity is not taken into account for intensity calculations. 3. LOGICAL ARCHITECTURE The key problem with raytraclng is the relati- vely high unavoidable basic amount of computa- tion. In the past, two main strategies were followed to process sets of elementary primi- tives: image decomposition and scene decomposi- tion. In a former case a subset of picture elements of the image are assigned to each of several processors. Every processor has access to the relevant primitives of the scene. In the later case a subset of primitives are assigned to each processor. A processor has access to the relevant rays of the scene. The calculations performed for each picture element on the screen are completely Indepen- dent so they can be performed in any ordered and on any number of processors. The example of distributing the work to a number of processors Is given in Fig. i. from host to display hardware Fig.l: Logical architecture This solution requires three different proces- ses running concurrently on three, or more, processors: - control process (controller), intersect and shading calculation process (calculator), - display process (display). A controller Interfaces with the user or host computer to provide a description of the scene beeing viewed and allocate work to processors. A calculator can be replicated any number of times, to render the picture elements. A dis- play collect the results from each calculator and drives the graphic display. Every calculator is first given the description of the scene and then processing work can be allocated by the controller giving each calcu- lator picture element to evaluate. When the calculations have been completed the results are passed out to the display. The display then informs the controller that there is now a free processor and another picture element is sent out for evaluation. The amount of computation required varies from picture element to element and this method automatically balances the load amongst the processors and ensures they are all kept busy. An interesting idea here is that the picture elements do not need to be generated in sequen- ce and, if they are generated In some pseudo- random order, a good impression oof the final picture can be obtained well before every pic- ture element has been evaluated. This could be particularly useful in a CAD system where the user wishes to change his view of an object very rapidly. Note, that this structure is not related to the raytracing algorithem and Is suitable for any problem which can be broken into independent parts. 4. PHYSICAL ARCHITECTURE *./. Basic design It appears, a first sight, that the above ar- chitecture cannot be maped directly into a network of transputers because of the fixed number of communication links available. There is a partitioning which can aid the un- derstanding and implementation of the structure of parallelism. The data processed on calcula- tors consist a sequence of values (picture element), then all of the processes can be executed concurrently, even those which process the data In sequence. Alternatively the con- structed process can be replicated over a num- ber of calculating transputers each of which will execute the construct on a subset of the data structure as illustrated in Fig. 2. This hardware realisation named proces replication is mapped onto a network of transputers, active data structure is mapped onto the reconflgurab- le processor array (RPA). Both are Ideal for occam process visualisation. The occam model adopted in the RPA system, uses point to point communications to synchronise processes. A processprocessor mapping is Implemented by providing a physical network' of transputers, which Is isomorfic to the process structure, but only at the chosen point in the hierarchy of the occam program. It Is very simple to arrange for the controller to communicate with any transputer in a network by passing messages through the Intervening 56 host-j control/ display calculator patches display picture elements Fig. 2: Physical architecture transputers. For simplicity, the raytraclng algorithm was mapped on to a linear array of transputers /&/. In Fig. 2 where the basic architecture is shown we see that each transputer link implements two occam channels, one- in each direction. There- fore, this mapping uses only two of the four links available on a transputer. Control and display processes are executed in parallel on one transputer (control/display transputer) while the rest of the transputers do the inter- section and shading calculation processes (cal- culating transputers). The control/display tra- nsputer also does these calculations and the same, parameterlsed, program is loaded into every calculating transputer. Such method of mapping processes requires that each calcula- ting transputer also execute routing processes, i.e. commands and data are passed from the controller along the array and results are passed back for display. This linear connection of transputers Implies some sort of command protocol for identifying the nature and desti- nation of data, consequently the routing pro- cess on each transputer only needs to decide whether a message is to be accepted locally or passed on to be dealt with elsewhere. +.2. The control/display transputer There are two processes executed by the con- trol/display transputer (Fig. 3): sendPatches, loadBalance. screen, called patches, rather than individual picture element, are given to each transputer for two resons: a) to give better ratio of calculation to com- munication; b) to enable segments of data to be transmit- ted. A segment communication transmits an array of words as a single operation. This has two big advantages over the transmission of individual words. Firstly, there is the same processor overhead for setting up' the links to transmit a single word as for a million words. This better exploits the autonomous transputer links and allows the processor to continue calculating at very nearly full speed. Secondly, it gives a better ration of communication to computation. LoadBalance coordinates the sending of data to the other transputers and the display of the generated picture elements. The first thing it does is to determine the number of transputers in the system. This is done by sending a count, Incremented by each transputer, around the loop. If there are n transputers then loadBala- nce passes on primitives containing 2n picture elements from the process sendPatches and then waits until a result is returned before passing out another request. 4.3. The calculating transputer The work of each of calculating transputers is organised as three processes (Fig. 4): thTougput, - render, - feedback. from previous calculating transputer to next from next host display control/display transputer ^, Bend •*\ Patches patches. picture elements Fig. 3: Processes on control/display transputer SendPatches interfaces to the host computer to receive the description of the scene being modelled and other commands. It passes the world model out to all the other transputers and then sends out requests for picture ele- ments to be evaluated. Square areas of the Fig. 4: Processes on the calculating transputer The throughput process receives patch requests from the previous transputer and either pass them on to the next or keep them for processing locally. It is also able to hold one request in an Internal buffer so, Initially, it accepts two requestes. The first Is passed immediately to render process for evaluation and the second is held until needed. Any further patches re- ceived are passed on to be evaluated elsewhere. As soon as render has finished the computation of its patch, throughput passes it the buffered patch to work on and is then ready to accept another. Since the time taken to compute a patch is less than the time before throughput receives the next patch request, the render process is kept busy. The render process is given patches to eva- luate, i.e. it does all the calculations to find intersections, build the bundle of rays and then traverse this bundle to get the final picture element value. When the picture ele- 57 ments In the patch are evaluated then another patch Is requested from throughput and the picture elements are passed out to the feedback process. This is a completely sequential part of program which can be written in any standard programming language. The feedback process multiplexes the local result and those received from other transpu- ters and passes them back towards the display- transputer via the shortest route. •••*. Occam Implementation As we have said the transputer architecture simplyfles system design by the use of proces- ses as standard software and hardware building blocks. The ability to specify a hardwired function as an Occam process provides the ar- chitectural framework for transputers with specialized capabilities (e.g. graphics). The required function (e.g. graphics drawing and display engine) is definied as an occam pro- cess, and implemented in hardware with a stan- dard occam channel interface. The occam program of the proposed architecture (see Fig. 2) contain: the description of the entire transputer system, - the control/display transputer program, - the calculating transputer program. For simplicity, only the essential outline of the mentioned above occam programs Is given here. The system description is as follows: VAL number.transputers IS 84: VAL last IS number.transputers - 1: CHAN Host, display, loop back, forwardfnumber.transputers), returnt number.transputers]: PLACED PAR — transputer 0 : control/display transputer PROCESSOR 0 T4 — data from host PLACE host AT HnkOin : — to display PLACE display AT linXOout: — patches out PLACE forward[O] AT linklout: — picture element value back PLACE returnfOJ AT linxiin : control ( host, display, forward[OJ, returnfO} ) -- the main body of the pipeline of calculators PLACED PAR 1:1 FOR number.transputers - S PROCESSOR 1 T4 — patches in PLACE forward[i] AT llnKOln : — picture elements out PLACE returnfij AT linklout: — patches out PLACE forwardll+l] AT linklout: — picture elements in PLACE return[l+lj AT HnkOin : calculate ( forward[i], forward[i*l), return[i+l), return(l) ) — the last transputer is a special case as it — has no one else to talk to. The fact that — the channel 'loopback' is not placed means — that an internal ("soft") channel rill be — created. In fact this channel is never used -- but is required as a parameter. PROCESSOR last T4 PLACE forward[last) AT HnkOin : PLACE return[last) AT llnkOout: calculate ( forwardflast), loopback, loopback, return[last) ) The program running on the control/display transputer Is: PROC control ( CHAN fromHost, toDlsplay, toCalculators, plxelsln ) ... definition of sendPatches procedure ... definitions of loadBalance procedure CHAN data: PAR sendPatches ( fromHost, data ) loadBalance ( data, toCalculators, plxelln, toDlsplay )' Finally, each of the calculating transputers runs the following program: PROC calculate ( CHAN fromPrev, toNext, fromNext, toPrev ) ... definition of the throughput procedure ... definition of the render procedure ... definition of the feedback procedure CHAN toLocal, fromLocal, requestMore: PRI PAR — run these at high priority for — fastes response to messages PAR throughput ( fromPrev, toNext, toPrev, toLocal, requestHore ) feedback ( fromLocal, fromNext, toPrev ) — and this is at low priority render ( toLocal, fromLocal, requestHore ) S. CONCLUDING REMARKS S.I. Performances Whithout doubt, the processing speed of the system is directly related to the number of transputers used. A number of factors contri- bute to this aspect of the system. Most of these were carefully worked out design deci- sions but one had to be determined empirically. The transputers require only two words of data to specify the position of the all picture elements in the patch. If the work were distri- buted on a picture element by element basis then two words of data would be required for every element. This would mean a worst ratio of communication to processing. A more Intelligent approuch use of segment communication for data (paragraph 4.2.). These means less processor overhead per word sent and allows a greater amount of concurrency between the link engines and the processor. It is important to say that the. message routing processes are had to be run at high priority to ensure that incoming mes- sages can be examined and forwarded immediately It is received. Carefully ordered priority in the ALT constructs of these processes are en- 58 sure that patches are returned to the control transputer as quickly as possible. Wlthholdlngs in the work in throughput and feedback proces- ses, are reduced by introducing software buffers into two Input channels of these processes. Channel buffers are frequently used, and easy to Implement, in occam programs. Buffers intro- ducing made a significant difference to the performance from speed - (transputers + 1) » K/2 to very nearly speed = transputers • K, where K is the performance of a single transpu- ter. 5.2. Robustness and reliability The system described above is already remarkab- ly robust. It should be possible to exploit the number of transputers with some degree of redu- ndancy. If a calculating transputer falls then the system will progressively deadlock only if the neighbour, on the controller side, attempts to communicate with it. In order to make the system more robust it must be possible to de- tect when a failure has occured. This reqire the - using a timeout on all communications. Secondly It must be possible to ensure that, if a communication does fall, all the Input and output processes will terminate. These desired function are performed by a number of prede- fined procedures which allow an input or output to be attempted within a time limit, an recove- ry from a failed communication /9/. The use of these procedures means that failure of a tran- sputer can be detected by its neighbour. The controlling transputer could then be Informed and so take action to recover or regenerate the lost data. Detection of the failure of a tran- sputer Implies that facilities could be added to allow the defective transputer to be bypas- sed. As we remember, on each transputer two communication links are unused. So this can be done with no extra hardware In such a way that the precedent of fall transputer switches It to the other link to communicate with the next transputer along. 6. REFERENCES /!/ Whltby-Strevens, C, The Transputer, Proc. 12th Annual Int'l Symp. Computer Arch., Boston, Massachusetts, 19A5, pp. 292-300. /2/ Taylor, R., Transputer Communication Link, Microprocessors and Microsystems, Vol.10, No.4, 1986, pp. 211-215. /3/ Mlhovllovie, B.. S. Mavrlc, P. Kolbezen, Transputer - The Basic Componnent of Multi- processor Systems, Informatlca, Vol.10, No.4, 1966, pp. A1-A4. /4/ Mihovllovlc, B., P. Kolbezen, J. Sllc, Communicating Processes In Transputer Sys- tems, Informatlca, Vol.11, No.2, 1967, pp. 74-77. /S/ May, D., R. Taylor, Occam - An Overview, Microprocessors and Microsystems, Vol.A, No.2, 19S4, pp. 73-79. /6/ Curry, B. J., Occam Solves Classical Operating System Problems, Microprocessors and Microsystems, Vol.A, No.6, 1984, pp. 280-283. /7/ Whitted, T., An Inproved Illumination Model for Shaded Dlspley, Conn, ACM, Vol. 23, No.6, 1980, pp. 343-349. /A/ Packer, J., Exploiting Concurrency; A Ray Tracing Example, Tech. Note 7, INMOS Ltd, Bristol. /9/ Shepherd, R., Extraordinary Use of Transpu- ter Links, Tech. Note 1, INMOS Ltd, Bris- tol.