A SIMULATION OF DATAFLOW COMPUTER MODELS AND
A MODEL OF FAULT TOLERANT DATAFLOW COMPUTER
INFORMATICA 4/87
UDK 681.3.068
Milan OjsterSek, Viljem burner, Peter Kokol, Anton Zorman
TehnicaJ Faculty Maribor
ABSTRACT: Dataflow architecture is accepted as the architecture for future
computers because it can' exploit potentialy all the parallelism in a
prograa. This architecture is assuaed to execute dataflow graphs. The
nodes in the dataflow graphs represent asynchronous tasks. The arcs
connecting nodes represent coaaunication paths for the aessages (tokens)
generated by nodes or suplied froa the external environaent. Each node is
executed (fired) when the required input becoaes available. The dataflow
arhitecture proposals can be classified as static and dynaaic
architectures. In a static architecture the nodes of a prograa graph are
loaded into aeaory before the coaputation begins and at aost ona instance
of a node is enabled for firing at a tiae. A dynaaic architecture
facilitates the firing of several instances of a node at a tiae and these
nodes can be created at runtiae. Me have developed prograa language DFL1,
siaulator for the siaulation datafloa cuaputer aodels and their key
aechanisas, iaproved hardware, software and set of instructions needed to
support an efficient fault-tolerant dataflow coaputer.
1 INTRODUCTION
Over the last few years data flow coaputer
architectures have been propose'd and soae coaputers have
been built. A siaple data flaw execution eodel is based
on a 'pure* data flow prograa organisation: with an
instruction being enabled when all its input tokens are
available. Each instruction aay have any nuaber of
inputs and any nuaber of outputs. Discussing the
U-interpreter, Arvind and 6ostelow suggest that aaxiaua
parallelisa can only be extracted froa a prograa if an
arc is a!lowed to carry aore than a single token - a
process achieved by carrying a label with each token
that identifies the context of that particular token
(ARVI 82). Packet coaaunication aachine organisation
and high level coaputer language with exploiting
inherent parallelisa in data, flow graph is the aost
significant for the data flow architecture. Me have
developed prograa language DFL1 (KOK B4) and siaulator
for the siaulation of five various data flow coaputer
aodels, which art suited for todays technology. During
the siaulation, siaultaneous execution of a data flow
prograa in distinct units (token transaission and
•itching, foraing activities) and soae iapartant
statistical paraaeters (i.e. an input stream, a
business period, an idle period, a service tiae ...) are
observed (OJST 84, 2UH 83-1, ZUH 83-2, OJST 86, OJST
97-2).
2 DESCRIPTION OF MODELS
Our aodels are based on the packet coaaunication
eachine organisation with token watching and consist of
an input section, aeaory sections, a global store,
processor sections, an output section and units for
coaaunication aaang sections. The input section
decomposes the data flow prograa and supports aeaory
sections with input data. Me have used two
decoaposition aethods:
- Each instruction into its own aeaary section. If
there are aore instructions than aeaory sections,
then aore instructions are stored in one aeaory
section. Instructions tre randoaly distributed.
- Each block into its own aeaory section. If there
are aore blocks than aeaory sections, then aore
blocks are stored in one aeaory section.
The aeaory section aatches tokens into sets of
tokens. Hhen all of its input tokens having the saae
context are available, it foras activity (executable
instruction which consists of a set of tokens with the
saae context and a copy of instruction) is foraed and
sent into an processor section. The processor section
executes the activity and sends output results into
aeaary sections. The output section collects final
results of coaputation. The global store is used to
prevent deadlock of the systea. Coaaunication units
(networks, busses) transait aessages (tokens, sets of
tokens, activities). They use post office
interconnection principle for transaission aessages.
Coaaunication units in our siaulation are siaulated as
delay units.
2.1 Model 1
input
isotlo
global
star*
1
J
proceesor
section 1
netiory
section 1
laotlvltyl
Inetwork |
, nroceasor
(section R
>. ^-"
JJSti ^ne tworkj
-> <^.
|n«mory
(section S
L
\
Figure l:The aodel 1
The aodel 1 uses two networks: one for
transaitting activities and one for transaitting data.
It has separate aeaary sections and separate processor
sections. The aeaory section consists of an input FIFO
60
queue, a latching store, an instruction store, an
associative logic and an output FIFO queue. The
processor section consists af an input FIFO queue,a
processor, an output FIFO queue. This configuration is
not suitable for super-systeas, because networks cause
too such delay tiae in transmitting aessages. An
advantage of this aodel is autonoaic nark of units. The
•odel is suitable for saall systeas and for a subsystea
of a large systea.
2.2 Models Connected In Clusters
Input 1
section
J
cluster i| -
s
global
store
J
global
network
•^_ . -
Icluster II
t
i * ]
Figure 2:Rodels connected in clusters
Models 2,3,4,5 are very siailar. Sections of
aodels are connected in clusters.
2.2.1 Model 2
The aeaory section and the processor are associated
in one eleeent (figure 3). The aeaory section Batches
tokens into sets of tokens, fares and executes
activities. Output tokens nhich don't have their own
instructions in the cluster are transaitted into butput
FIFO queue. Other tokens are aatched in the cluster.
Higher speed of foraing and executing activities is the
advantage of this aodel. There is no additional
transaitting betwen aeaory section and processor. This
aodel is very suitable for saall and aediua systeas if
the adequate decoeposition aethad is used.
2.2.2 Model 3
The processor and the aeaory section are separated
(figure 4). If the processor is idle and the aeaory
section hasn't any activities, it can receive activities
froa other clusters. If the aeaory section has
activities and the processor isn't idle, activities are
transaitted into the output FIFO queue. The global
network transaits it to the idle processor section.
J
input PIPO
1
nemory sectior
• processor
1
butput FIFO
[jueue
1
i
Input FlfOl
*]ueue I
}
global
1
Input PIFOl
luftue 1
1
roeraory •
ansoclatlve logic
J
output PIKO
founue
1 1
1
i
global store
bus
"I
Ibuffel
procefl
output
queue
1
BOr
FIFO
1
1
2.2.3 Models 4 And S
Models 4 and S are siailar to aodel 3. In each
cluster they have one or aore processors. The aeaory
logic establishes which processor is idle and adresses
the activity to it. Model 4 is slightly better than
aodel 5, because it needs less tiae for writing in FIFO
queues.
Figure 3 Figure 4
Figure 3: The cluster of aodel 2
Figure 4i The cluster of aodel 3
llniiut FIF'Jl
n"<"» I
iput FIFO)
leue ' I
1 memory +
aseoc1 at 1v« logic
i
butput FIFOI
hueue 1
global
I
L
store
uffer 1 Suffer Mj
J
output FIFO] f # > mtput PIFCl
queue 1 I ' " ' tueue H |
T
Figure StThe cluster of aodel 4
llnput PIFOI
hueue
Input FIFO]
lueue I
memory •
associative loglo
output FIFO
buffer 1
J—
output FIPC
In memory
section 1
tutput FIFO
in global
network 1
global store
bus
ibuffe
output PIf(
In aeaory
section H
output fir:
in global
network H
Figure 6iThe cluster of •odel S
61
3 SIMULATION RESULTS
98 different parameters can be varied in our
simulation but it is reasonable to vary just soae of
thea. Constant values *re given to all others. Our
aodels are siaulated Kith four programs, the blocks of
which are coaposed of data flo« graphs shown on figures
7,8,9,If. Prograa blocks are independent and they have
no coaaon flow of data tokens.
Figure 7
Figure 8
Figure 9 Figure If
All data flow graphs have IS instructions. Sraphs
shown on Figures 7 and 8 have aany inherent parallelism
and need aany input data tokens. The graph shoan on
Figure 8 consists of instructions /, which are executed
31 tiaes slower than instructions +. Such an
instruction delays the execution of following
instructions. The graph with few inherent parallelises
is shown on Figure 9. On Figure If the graph with aany
inherent parallelises and three input data tokens is
shown. Each instruction in the graph shown on Figure
II, has one input operand and one constant. This
increases the speedup of firing enabled instructions.
How are algorithas influenced to our aodels is described
and evaluated.
Our siaulation is executed with these constant input
paraaetersi
an instruction + is executed in 2 cycles
an instruction / is executed in 6f cycles
aatching one token in a set of tokens is executed
in 1 cycle
activity forming is executed in 1 cycle
FIFO queues have 3 eleaents
aritting or reading tiae to FIFO queues is
negleable
a network can transmit an infinite nuaber of
aessages
blocks of prograa are independent
- a aodel 1 has S aeaory sections
In the file of input parameters He have varied the
following paraaetersi
- the selection of system ( 1 - 5 )
- the decoaposition aethod
the delay tiae in transaitting aessages through the
network (f, 1, 2 cycles)
- the capacity of aeaory section and a capacity of
global store
- the input algoritha
The prograa consists of an equal nuaber of blocks
and aeaory sections.
He have reduced the nuaber of observed output
paraaeters to the ainiaua following ones: a prograa
execution tiae requested for execution of 2fl
instructions, an average nuaber of busy processors, an
utilization of processors, an average nuaber of busy
aeaory logic in aeaary sections, an average nuaber of
sets of tokens, which are aatched in aeaory section, an
utilization of a aatching store, an average nuaber of
sets of tokens, which are aatched in global store, and
an utilization of a global store.
The following abbreviations are used in diagrams:
UP - the utilization of processors (I)
DN - the delay tiae in transaitting aessages through the
network (cycles)
MP - nuaber of processors
TE - the prograa execution tiae (cycles)
algoritha shown on Figure 7
algoritha shown on Figure 8
— • algoritha shoan on Figure 9
algoritha shoan on Figure If
For the aodel 1 the decoaposition aethod "each
instruction into its own aeaory section" is better than
'each block into its oan aeaory section* , so we have
aade diagraas which evaluate the aodel 1, only for this
aethod. For aodels 2,3,4 Me have aade diagraas only for
"each block into its own aeaory section* decoaposition
aethod because this method is better.
Th« program execution time dependent on the delay time
in transmitting messages through tbe network
TE
naa.a
teea.e
sea.a
isea.a
sea.a
TE
uea.e
ura.e
sea .a
uea.a
" 1S&.
sea.a
2.DK
2. DX
MODEL 1
•Fii
I.Di
MODEL 4
IIP .3 6
DJI
62
Tfta utllltatlon of prooaaaora depond«nt on tti* delay
ti«« In tranaaltttoa MI«|M through Mi* n.txork
0 2. 0"
construction of a larga and fast aatching acaory at
a reasonable cost,
construction and aanageaent of a structure ataory,
network construction,
interruption, error and exception handling,
inefficiency due to insufficient parallelise,
developing efficient control, partitioning and
scheduling algorithas.
Me have developed key aechanlses, iaproved
hardware, software and set of instructions needed to
support an efficient fault-tolerant dataflow coaeuter
IOJST B7-1, OJST 87-31.
Proposed datafloM architectures »r» very
inefficient on regular structures because of fine
granularity of their operations. When data is
structured (vectors, aatriccs, records) the control and
data flOH is very regular and predictable and there is
no need to pay high overhead for scheduling. These
architectures don't have aechanisas for interruption,
error and exception handling, a aechanisa which
reassigns nodes to another unit if faults have been
detected, a aechanisa Hhlch stops sending tqkans to the
faulty processor and a aechanisa Mhich destroying
read/Mrite requests in eeeory after fault. Contents of
data buffer in the output port of the fail processor are
lost.
2. DN
Figure 111 Task Invocation Structure
control and information «lo»
The algoritha shown on Figure if is the fastest one
(it has a aaxiaua degree of inherent parallelise and a
ainiaua nuaber of input data). Proportions in executing
tiee of prograa and an utilization of units are changed
by increasing the delay tiae in the transeission through
the network. An executing tiae of prograa is dependent
on an inherent parallelisa of algoritha and a nuaber of
input data needed for the execution of this algoritha.
The executing tiae of algoritha is decreased to soae
liait and then it keeps constant value (all inherent
parallelises are used), if the nuaber of aeaory sections
or the nuaber of processor sections are increased or
delay tiae in transaission through the coaaunication
units is decreased. Owing to the fact that the
processor section and the aeaory section of aodel 2 fora
one eleaent, execution tiae of aodel 2 is alaoust t«ice
as long coapared to aodel 3; but its advantage is the
best utilization of units. Because delays of reading
and writing into FIFO queues Here not considered, the
results of aodels 4 and 5 are identical. Kith this
considerations He have found that aodel 4 is better than
aodel 3. The biggest average nuaber of sets of tokens
in the aeaory section is in the algoritha shown on
Figure 9, because instruction / delays the execution of
following instructions. On aodel 1 Me have siaulated
the deadlock of a systea. If aatching stores are full,
then other tokens aust aatch in the global store. This
causes bad utilization of units in a aodel (a lot of
tiae is Hasted for coaaunication betwen aeaory sections
and the global store). He have siaulated the deadlock
with decreasing capacity of aatching stares. A good
decoaposition aethod can prevent deadlock or bad
utilization of a systea.
nttructlor
tore
Tobal
lock
U
global
•tructuri
global
- -(control
jnlj
r
Utor«[ -
1—tt/O clumtai
Figure 12: Global level of Fault Tolerant Dataflow
Machine (FTDFH)
Figure 13: Cluster level of FTDFH
Jlockl1
t«bl«l I
Btructura
jproc—mor
box
•— r~ —1~
|ln«tr.1
fetor, ll
4 FAULT TOLERANT DATAFLOW COMPUTER
Results of siaulation and our knowledge about
datafloH coaputers are shown that there are aany
probleas to be solved. In the design of an actual
dataflow coaputer the aain probleas arei
Figure 14: Processing Eleaent of FTDFH
In our coaputer aodel (figures 12,13,14) the
coabination of control flow, data flow and deaand driven
are used. He adapt to the granularity of the data
structure and treate large structures as one object. He
reduce scheduling overhead by coabining together as eany
scalar operations as possible and executing thea as one
63
object.
Hierarchical decoaposition aethod (ARVI BS) are
used. This aithod is based on the concept of resource
bounded graphs (RBGs). The RB6 is a prograa fragaent
for Mhich a bound on the resource requireaent of the
fragaent can be derived at compile tiae as a siaple
function of a collection of parallelisa paraaeters. A
prograa is viewed as a collection of RBBs.
The eiecution of a prograa can be represented by a
tree of task invocations ( figure 11). This aethod use
breadh-first partitioning algoritha until sufficient
parallelisa is generated and then it use depth-first
partitioning algoritha. Kith depth-first partitioning
algoritha deadlock are avoided.
Our systea is a dynaaic architecture for task level
dataflow with facilities to support node reassignaent
when processors fail. It uses tagging scheae siailar to
the HIT dynaaic architecture. It have three
hierarchical levels. The purpose of using three
hierarchical levels is to execute prograas efficiently
by utilizing principles of locality. Each hierarchical
level have an instruction store, a block table, a
structure store and a control unit. The instruction
store holds copies of the instructions which are
executed in this level. The structure store holds input
data structures (DS) of RB6 which are executed in this
level. The block table contains adresses of units where
RBBs are executed. The control unit adresses RBGs and
its OS to units in this level and generates block table.
If hardware faults occur in one of the units, control
unit reassigns RBGs and its DS to healthy units. Kith
this aechanisa Me achieve that only the RB6 and their DS
which is assigned in the faulty unit aust be reexecuted.
Control unit also deletes RBSs and DS when the
computations of thea Mrt finished. It also controls the
aaount of available aeaory and the utilization of units.
If deadlock is occured the control unit reassigns RBGs,
its sets of tokens and DS froa too busy units to idle
units.
Our proposed aodel has high bandwidth (HB) and low
bandwidth (LB) coaaunication paths. HB paths are
intended for transsaiting RBBs, DS, activities and
tokens betwen units. The LB paths are intended for
transsaiting status, diagnostics, control and
aeasureaent inforaation. Our aodel has I/O
instructions, instructions which are executed in
exception condition, instructions iapleaenting the
special operators for dynaaicaly creating instances of
node resulting froa recursive nodes in dataflow graphs,
loop and streaas, table-oriented instructions (for
readind an entry, writing into an entry and aadifying
parts of an entry), structure-oriented instructions (for
selecting an eleaent froa a structure, appending an
eleaent to a structure and testing for eaptiness),
string aanipulation instructions (to search for a
substring and a lenght of a string to coapare strings,
and concatenate strings), streaa oriented instructions,
fixed-point instructions, logical/shift instructions,
floating-point instructions, coapound instructions (for
reducing token aaveaent).
5 CONCLUSION
Today a vast collection of single-board computers
*re available which offers roughly 1 HIPS at low cost:
these are touted as building blocks for aultiprocessors.
Can dataflow aachines coapete? It is not clear if a
single dataflow processor can achieve the perforaance of
a von Neuaann processor at the saae hardware cost. The
dataflow instruction-scheduling aechanisa is clearly
aore coaplex than increaenting a prograa counter. An
engineering effort substantially beyond any of the
current dataflow projects is required to aake fair
coaparison. The SI6HA-1 (SHIM 86) project is an
iaportant step in this direction. The question becoaes
aore interesting when we consider aachines with aultiple
processors, where the dataflow scheduling aechanisa
yields significant benefits. The dataflow approach can
be viewed as an extreae solutions to the aeaory latency
problea - the processor never waits for responses froa
aeaoryi it continues processing other instructions.
Instructions are scheduled based on the availability of
data, so aeaory responses are siaply routed along with
the tokens produced by processors. It is our belief,
that dataflow architectures together with iaproved
hardware, software, set of instructions needed to
support an efficient fault tolerant dataflow coaputer
and with powerfull high level functional languages, will
show the prograaaing generality, perforaance and cost
effectiveness nedded to aake parallel aachines widely
applicable.
6. References
(ARVI 82) Arvind, Gostelow K. P.:
•The U-interpreter'
IEEE Coaputer, February 19B2, pp. 42-48.
(ARVI 85) Arvind, Culler D. E.:
'Managing Resources in a Parallel Machine',
Proc. of the IFIP TC-U., Conference an
Fifth-generation Coaputer Architecture,
Manchester U.K., July 198S, pp. 113-121.
(KOK 84) Kokol P., Stiglic B., zuaer V.i
Pretvorba aplikativnega jezika v grafe
pretoka podatkov', ETAN, XXVIII
Jugoslavenska konferencija Split,
4.- 8. juna 1984 Split,
pp. IV. S47 - IV SS4, - in Slovene.
(OJST B4) Ojstersek H.:
"Siaulacija podatkavno vodenega racunalnika*
Tehniska fakulteta Naribor,
Maribor 1984 - diploasko delo -in Slovene.
(OJST B6> Ojstersek H., 2uaer V., Kokol P.:
'Data Flow Coaputer Siaulation",
in Proceedings of the 8th International
syaposiua Coaputer at the University,
Cavtat, pp. 2.87 1 - 2.17 1», Hay 1986.
(OJST 87-1) Ojstersek H., 2uaer V., Kokol P.,
Zaraan A.: "Izboljsana strojna in
prograaska opreaa ter nabor instrukcij,
ki oaogaCajo delovanje ucinkayite in na
napake neobtutljive podatkovno vodene
arhitekture", XI Bosanskohercegovacki
siapoziua iz inforaatike 'Jahorina 87*,
zbornik radova, knjiga I, Sarajevo,
april 1987, pp. IS* 1-8 - in Slovene.
(OJST 87-2) Ojstersek N., 2uaer V., Kokol P.:
'Dataflow Coaputer Models',
COUP EURO 87, VLSI and Computers, Haaburg,
aay 1987, pp. 884-GBS.
(OJST 87-3) Ojstersek H., turner V., Kokol P.,
Zoraan A.: "An Overview and Coaparison
of Data Flow Coaputer Systeas*, Proceedings
of the 9th International Syaposiua Coaputer
at the University, Cavtat, Yugoslavia,
pp. 2R.»6 1 - 2R.I6 6, Hay 1987.
(SHIM 86) Shiaada T., Hiraki K., Nishida K.,
Sekiguchi S.: "Evaluation of a Prototype
Data Flow Processor of the SI6HA-1 for
Scientific Computations', Proc. of the 13th
Annual International Syaposiua on Coaputer
Architecture, Nashington, USA, 1986,
pp. 226-234.
(2UN 8S-1) zuaer V., Kokol P., Ojstersek M.:
"Modeli podatkovno vodenih sisteaov in ocena
zaogljivosti", IX Bosanskohercegovacki
siapoziua iz inforaatike "Jahorina 85",
zbornik radova, knjiga I, Sarajevo,
april 198S, pp. 168-1-8, - in Slovene.
(2UM 86) Zuaer V.,Kokol P., Ojstersek H., Zoraan A.:
"Parallel Coaputer Architectures,
Prograaaing Languages and Algorithms,"
Tehniska Fakulteta Haribor,
Haribor 1986, Technical Report - in Slovene.