A PARADIGM OF TRANSPUTER SYSTEM IMPLEMENTATION INFORMATICA 4/87
UDK 681.3.06
Branko Mihovilovic, Peter Kolbezen, Jurij
Jozef Stefan Institute, Ljubljana
• Che implementation of • raytracing algorithm on an
array of transputers Is presented. 36ucD system l>as some important
properties. Processing speed is directly proportional to the
number of transputers used. • Cl)e system is remarkably robust
namely, Individual transputers can be removed from tl)e system
while the program is running, and trie system will continue to
function afthought with reduced performance and some loss of
data.
KEY WORDS - Transputer, occam, r&ytr&clng, performances.
1. INTRODUCTION
This paper describes the implementation of a
computer graphics program on an array of tran-
sputers. This program was written to provide a
denonstratlon of the performance obtainable by
using large numbers of transputers. We used a
technique Known as raytracing wlch can generate
very realistic Images but requires massive
amounts of computer power. This is an Ideal
application for transputers as the calculation
for each picture element on the screen are
Independent of one another an so can be done in
parallel on separate transputers. The entire
program Is written in occam, however, the main
part of the program could have been written in
any suitable language. Only those parts of the
program which deal exspllcitly with concurrency
and the distribution of work are easier to
write in occam.
2. BACKGROUND
2.i. Transputer
Concurrent systems can be constructed from a
collection of microcomputers which operate
concurrently and communicate through point to
point serial communication Jinks. The INMOS
transputer /1-4/ is the ideal building block of
high performance multiprocessor systems which
provide maximum speed with minimum hardware.
A transputer contains memory, a processor and
a number of standard point-to-point communica-
tion links which allow direct connection to
other transputers. The processing capability
may be general purpose or may be optimized to a
specific purpose. . The on-chlp memory may be
extended of chip by suitable interface. A tran-
sputer may also have special purpose Interfaces
for connection to specific types of hardware.
The first transputer available was the IMS
T414, a 32 bit RISC microprocessor with a
throughput of to mips. It has 2 Kbytes of fast
SRAM (SO ns) and four serial links. The 32 bit
multiplexed bus allows up to 4 Obytes of exter-
nal memory to be accessed. Compatible with the
T414 is IMS T600 wlch Includes 4 Kbytes SRAM
and floating point arithmetic.
2.2. Occam
Occam /5,6/ is a small and elegant language. It
combines the best of contemporary thought about
control structures and variable scoping with
some radical new structures to handle concur-
rency. It is based on an model of computation
that is different from conventional languages
in that It includes the notion of communica-
tion, parallel execution, and synchronization
in its very structure.
The basic unit of occam programming is a pro-
cess that performs a sequence of actions and
then terminates. There may be more than one
proccess executed at any given moment. Occam
programs are constructed from three primitive
processes: assignment, Input and output. The
assignment
v := e
sets the variable v to the value of the expres-
sion e. The output
c t e
outputs the value of expression e to the chan-
nel c. Similarly, the Input
c ? v
sets the variable v to a value input from the
channel c. Constructors are used to combine
processes to form larger processes. The sequen-
tial constructor, SBQ, causes its components to
executed one after another. The parallel con-
structor, PAR, causes its components to be
55
executed concurrently. Finally the alternative
constructor, ALT, chooses one component process
for execution which is earliest ready. It is
cleare that IF and WHILE constructors are also
provided in occam program.
2.3. Ka.ytra.dng
Raytraclng is a now well introduced technique
for realistic image synthesis from three dimen-
sional geometric scenes. The basic raytraclng
algorithm Is described in /7/ and is briefly
given here. Simplifying somewhat, for each
picture element of the rastered image, a ray Is
trased from the viewpoint into the three dimen-
sional scene to calculate the first Intersec-
tion with an object. If the object Is reflec-
ting or refracting, an appropriate ray is dete-
rmined by the law of reflection and refraction.
These new rays are traced analogously. To cal-
culate shadows, the ray-object intersection
points are connected by line segments to the
point light sources illuminating the scene. If
there Is an object intersecting the line seg-
ment, the intersection point lies in the shadow
of this light source, and its intensity is not
taken into account for intensity calculations.
3. LOGICAL ARCHITECTURE
The key problem with raytraclng is the relati-
vely high unavoidable basic amount of computa-
tion. In the past, two main strategies were
followed to process sets of elementary primi-
tives: image decomposition and scene decomposi-
tion. In a former case a subset of picture
elements of the image are assigned to each of
several processors. Every processor has access
to the relevant primitives of the scene. In the
later case a subset of primitives are assigned
to each processor. A processor has access to
the relevant rays of the scene.
The calculations performed for each picture
element on the screen are completely Indepen-
dent so they can be performed in any ordered
and on any number of processors. The example of
distributing the work to a number of processors
Is given in Fig. i.
from host
to display
hardware
Fig.l: Logical architecture
This solution requires three different proces-
ses running concurrently on three, or more,
processors:
- control process (controller),
intersect and shading calculation process
(calculator),
- display process (display).
A controller Interfaces with the user or host
computer to provide a description of the scene
beeing viewed and allocate work to processors.
A calculator can be replicated any number of
times, to render the picture elements. A dis-
play collect the results from each calculator
and drives the graphic display.
Every calculator is first given the description
of the scene and then processing work can be
allocated by the controller giving each calcu-
lator picture element to evaluate. When the
calculations have been completed the results
are passed out to the display. The display then
informs the controller that there is now a free
processor and another picture element is sent
out for evaluation.
The amount of computation required varies from
picture element to element and this method
automatically balances the load amongst the
processors and ensures they are all kept busy.
An interesting idea here is that the picture
elements do not need to be generated in sequen-
ce and, if they are generated In some pseudo-
random order, a good impression oof the final
picture can be obtained well before every pic-
ture element has been evaluated. This could be
particularly useful in a CAD system where the
user wishes to change his view of an object
very rapidly.
Note, that this structure is not related to the
raytracing algorithem and Is suitable for any
problem which can be broken into independent
parts.
4. PHYSICAL ARCHITECTURE
*./. Basic design
It appears, a first sight, that the above ar-
chitecture cannot be maped directly into a
network of transputers because of the fixed
number of communication links available.
There is a partitioning which can aid the un-
derstanding and implementation of the structure
of parallelism. The data processed on calcula-
tors consist a sequence of values (picture
element), then all of the processes can be
executed concurrently, even those which process
the data In sequence. Alternatively the con-
structed process can be replicated over a num-
ber of calculating transputers each of which
will execute the construct on a subset of the
data structure as illustrated in Fig. 2. This
hardware realisation named proces replication
is mapped onto a network of transputers, active
data structure is mapped onto the reconflgurab-
le processor array (RPA). Both are Ideal for
occam process visualisation. The occam model
adopted in the RPA system, uses point to point
communications to synchronise processes. A
processprocessor mapping is Implemented by
providing a physical network' of transputers,
which Is isomorfic to the process structure,
but only at the chosen point in the hierarchy
of the occam program.
It Is very simple to arrange for the controller
to communicate with any transputer in a network
by passing messages through the Intervening
56
host-j
control/
display
calculator
patches
display picture
elements
Fig. 2: Physical architecture
transputers. For simplicity, the raytraclng
algorithm was mapped on to a linear array of
transputers /&/.
In Fig. 2 where the basic architecture is shown
we see that each transputer link implements two
occam channels, one- in each direction. There-
fore, this mapping uses only two of the four
links available on a transputer. Control and
display processes are executed in parallel on
one transputer (control/display transputer)
while the rest of the transputers do the inter-
section and shading calculation processes (cal-
culating transputers). The control/display tra-
nsputer also does these calculations and the
same, parameterlsed, program is loaded into
every calculating transputer. Such method of
mapping processes requires that each calcula-
ting transputer also execute routing processes,
i.e. commands and data are passed from the
controller along the array and results are
passed back for display. This linear connection
of transputers Implies some sort of command
protocol for identifying the nature and desti-
nation of data, consequently the routing pro-
cess on each transputer only needs to decide
whether a message is to be accepted locally or
passed on to be dealt with elsewhere.
+.2. The control/display transputer
There are two processes executed by the con-
trol/display transputer (Fig. 3):
sendPatches,
loadBalance.
screen, called patches, rather than individual
picture element, are given to each transputer
for two resons:
a) to give better ratio of calculation to com-
munication;
b) to enable segments of data to be transmit-
ted.
A segment communication transmits an array of
words as a single operation. This has two big
advantages over the transmission of individual
words. Firstly, there is the same processor
overhead for setting up' the links to transmit a
single word as for a million words. This better
exploits the autonomous transputer links and
allows the processor to continue calculating at
very nearly full speed. Secondly, it gives a
better ration of communication to computation.
LoadBalance coordinates the sending of data to
the other transputers and the display of the
generated picture elements. The first thing it
does is to determine the number of transputers
in the system. This is done by sending a count,
Incremented by each transputer, around the
loop. If there are n transputers then loadBala-
nce passes on primitives containing 2n picture
elements from the process sendPatches and then
waits until a result is returned before passing
out another request.
4.3. The calculating transputer
The work of each of calculating transputers is
organised as three processes (Fig. 4):
thTougput,
- render,
- feedback.
from
previous
calculating transputer
to
next
from
next
host
display
control/display
transputer
^, Bend
•*\ Patches
patches.
picture
elements
Fig. 3: Processes on control/display transputer
SendPatches interfaces to the host computer to
receive the description of the scene being
modelled and other commands. It passes the
world model out to all the other transputers
and then sends out requests for picture ele-
ments to be evaluated. Square areas of the
Fig. 4: Processes on the calculating transputer
The throughput process receives patch requests
from the previous transputer and either pass
them on to the next or keep them for processing
locally. It is also able to hold one request in
an Internal buffer so, Initially, it accepts
two requestes. The first Is passed immediately
to render process for evaluation and the second
is held until needed. Any further patches re-
ceived are passed on to be evaluated elsewhere.
As soon as render has finished the computation
of its patch, throughput passes it the buffered
patch to work on and is then ready to accept
another. Since the time taken to compute a
patch is less than the time before throughput
receives the next patch request, the render
process is kept busy.
The render process is given patches to eva-
luate, i.e. it does all the calculations to
find intersections, build the bundle of rays
and then traverse this bundle to get the final
picture element value. When the picture ele-
57
ments In the patch are evaluated then another
patch Is requested from throughput and the
picture elements are passed out to the feedback
process. This is a completely sequential part
of program which can be written in any standard
programming language.
The feedback process multiplexes the local
result and those received from other transpu-
ters and passes them back towards the display-
transputer via the shortest route.
•••*. Occam Implementation
As we have said the transputer architecture
simplyfles system design by the use of proces-
ses as standard software and hardware building
blocks. The ability to specify a hardwired
function as an Occam process provides the ar-
chitectural framework for transputers with
specialized capabilities (e.g. graphics). The
required function (e.g. graphics drawing and
display engine) is definied as an occam pro-
cess, and implemented in hardware with a stan-
dard occam channel interface.
The occam program of the proposed architecture
(see Fig. 2) contain:
the description of the entire transputer
system,
- the control/display transputer program,
- the calculating transputer program.
For simplicity, only the essential outline of
the mentioned above occam programs Is given
here.
The system description is as follows:
VAL number.transputers IS 84:
VAL last IS number.transputers - 1:
CHAN Host, display, loop back,
forwardfnumber.transputers),
returnt number.transputers]:
PLACED PAR
— transputer 0 : control/display transputer
PROCESSOR 0 T4
— data from host
PLACE host AT HnkOin :
— to display
PLACE display AT linXOout:
— patches out
PLACE forward[O] AT linklout:
— picture element value back
PLACE returnfOJ AT linxiin :
control ( host, display,
forward[OJ, returnfO} )
-- the main body of the pipeline of calculators
PLACED PAR 1:1 FOR number.transputers - S
PROCESSOR 1 T4
— patches in
PLACE forward[i] AT llnKOln :
— picture elements out
PLACE returnfij AT linklout:
— patches out
PLACE forwardll+l] AT linklout:
— picture elements in
PLACE return[l+lj AT HnkOin :
calculate ( forward[i], forward[i*l),
return[i+l), return(l) )
— the last transputer is a special case as it
— has no one else to talk to. The fact that
— the channel 'loopback' is not placed means
— that an internal ("soft") channel rill be
— created. In fact this channel is never used
-- but is required as a parameter.
PROCESSOR last T4
PLACE forward[last) AT HnkOin :
PLACE return[last) AT llnkOout:
calculate ( forwardflast), loopback,
loopback, return[last) )
The program running on the control/display
transputer Is:
PROC control ( CHAN fromHost, toDlsplay,
toCalculators, plxelsln )
... definition of sendPatches procedure
... definitions of loadBalance procedure
CHAN data:
PAR
sendPatches ( fromHost, data )
loadBalance ( data, toCalculators,
plxelln, toDlsplay )'
Finally, each of the calculating transputers
runs the following program:
PROC calculate ( CHAN fromPrev, toNext,
fromNext, toPrev )
... definition of the throughput procedure
... definition of the render procedure
... definition of the feedback procedure
CHAN toLocal, fromLocal, requestMore:
PRI PAR
— run these at high priority for
— fastes response to messages
PAR
throughput ( fromPrev, toNext, toPrev,
toLocal, requestHore )
feedback ( fromLocal, fromNext, toPrev )
— and this is at low priority
render ( toLocal, fromLocal, requestHore )
S. CONCLUDING REMARKS
S.I. Performances
Whithout doubt, the processing speed of the
system is directly related to the number of
transputers used. A number of factors contri-
bute to this aspect of the system. Most of
these were carefully worked out design deci-
sions but one had to be determined empirically.
The transputers require only two words of data
to specify the position of the all picture
elements in the patch. If the work were distri-
buted on a picture element by element basis
then two words of data would be required for
every element. This would mean a worst ratio of
communication to processing. A more Intelligent
approuch use of segment communication for data
(paragraph 4.2.). These means less processor
overhead per word sent and allows a greater
amount of concurrency between the link engines
and the processor. It is important to say that
the. message routing processes are had to be run
at high priority to ensure that incoming mes-
sages can be examined and forwarded immediately
It is received. Carefully ordered priority in
the ALT constructs of these processes are en-
58
sure that patches are returned to the control
transputer as quickly as possible. Wlthholdlngs
in the work in throughput and feedback proces-
ses, are reduced by introducing software buffers
into two Input channels of these processes.
Channel buffers are frequently used, and easy
to Implement, in occam programs. Buffers intro-
ducing made a significant difference to the
performance from
speed - (transputers + 1) » K/2
to very nearly
speed = transputers • K,
where K is the performance of a single transpu-
ter.
5.2. Robustness and reliability
The system described above is already remarkab-
ly robust. It should be possible to exploit the
number of transputers with some degree of redu-
ndancy. If a calculating transputer falls then
the system will progressively deadlock only if
the neighbour, on the controller side, attempts
to communicate with it. In order to make the
system more robust it must be possible to de-
tect when a failure has occured. This reqire
the - using a timeout on all communications.
Secondly It must be possible to ensure that, if
a communication does fall, all the Input and
output processes will terminate. These desired
function are performed by a number of prede-
fined procedures which allow an input or output
to be attempted within a time limit, an recove-
ry from a failed communication /9/. The use of
these procedures means that failure of a tran-
sputer can be detected by its neighbour. The
controlling transputer could then be Informed
and so take action to recover or regenerate the
lost data. Detection of the failure of a tran-
sputer Implies that facilities could be added
to allow the defective transputer to be bypas-
sed. As we remember, on each transputer two
communication links are unused. So this can be
done with no extra hardware In such a way that
the precedent of fall transputer switches It to
the other link to communicate with the next
transputer along.
6. REFERENCES
/!/ Whltby-Strevens, C, The Transputer, Proc.
12th Annual Int'l Symp. Computer Arch.,
Boston, Massachusetts, 19A5, pp. 292-300.
/2/ Taylor, R., Transputer Communication Link,
Microprocessors and Microsystems, Vol.10,
No.4, 1986, pp. 211-215.
/3/ Mlhovllovie, B.. S. Mavrlc, P. Kolbezen,
Transputer - The Basic Componnent of Multi-
processor Systems, Informatlca, Vol.10,
No.4, 1966, pp. A1-A4.
/4/ Mihovllovlc, B., P. Kolbezen, J. Sllc,
Communicating Processes In Transputer Sys-
tems, Informatlca, Vol.11, No.2, 1967,
pp. 74-77.
/S/ May, D., R. Taylor, Occam - An Overview,
Microprocessors and Microsystems, Vol.A,
No.2, 19S4, pp. 73-79.
/6/ Curry, B. J., Occam Solves Classical
Operating System Problems, Microprocessors
and Microsystems, Vol.A, No.6, 1984, pp.
280-283.
/7/ Whitted, T., An Inproved Illumination Model
for Shaded Dlspley, Conn, ACM, Vol. 23,
No.6, 1980, pp. 343-349.
/A/ Packer, J., Exploiting Concurrency; A Ray
Tracing Example, Tech. Note 7, INMOS Ltd,
Bristol.
/9/ Shepherd, R., Extraordinary Use of Transpu-
ter Links, Tech. Note 1, INMOS Ltd, Bris-
tol.