INFORMATICA 3/89
DESIGN AND TESTING OF HOMOGENOUS SINGLE BUS TIGHTLY COUPLED MULTIPROCESSOR SYSTEM FOR REAL TIME SIMULATION
Krešimir Ćosić, Ivan Miler and Igor Rašeta, wtS KoV jna Zagreb
Keywords: multiprocessor system, real time simulation, single bus, parallelization of mathematical models
ABSTRACT - Real time simulation for hardware-in-the-loop or man-in-the-loop testing is a cost effective way to design, develop, modify and test complex and sophisticated weapons and industrial systems. This simulation technology is a constant challenge for the most powerful computer systems. Therefore a short chronological review of computer architectures for time critical real time simulation is given. For this kind of application a homogenous single bus tightly coupled multiprocessor system based on 8086/8087 single board computers has been designed. Furthermore, this article presents a concept for the parallelization of mathematical models given by ordinary differential equations in a real time environment. The design of a simulator for a spinning missile system, following the accepted procedure, illustrates the abilities of the realized multiprocessor simulator system.
SAŽETAK (SUMMARY) - Real time simulations for testing real hardware or operators in a closed loop are an efficient way to design, develop, modify and test complex, sophisticated military and industrial systems. This simulation technology is a constant challenge for the most powerful computer systems. For that reason a short chronological review of computer architectures for time critical real time simulations is given. For this kind of application, a homogenous, tightly coupled multiprocessor system with a single bus and a series of processor boards based on 8086/8087 processors has been realized within this work. In addition, a concept for the parallelization of mathematical models given by ordinary differential equations is derived and presented. The design of a simulator for a spinning missile, following the adopted procedure, shows the capabilities of the realized multiprocessor simulator.
1. INTRODUCTION
The increasing complexity and sophistication of modern process control and weapon systems has established a category of real-time simulation which uses hardware components integrated into the simulation process. This kind of simulation, so-called hardware-in-the-loop (HIL) simulation technology, has proved to be a very cost effective method in the design, development, modification, and testing of complex weapons and industrial systems [1,2,3,4]. In HIL simulations, for example, adequate computer equipment can be used to simulate the aerodynamics and flight equations of the missile, while real hardware subsystems such as RF sensors, IR sensors, fin actuators, autopilots, and guidance and homing on-board computers can be embedded and used for the design and testing of a closed-loop system. In this case, HIL simulation reproduces what the real hardware subsystem (missile seeker) actually processes in a real environment, on the basis of simulated missile aerodynamics and flight equations and of target movement simulated by a suitable pseudo-target generator. In this way it is possible to perform nondestructive testing, verification and validation of an actual missile or process control subsystem in a near realistic environment. This preflight check, and similar industrial testing in the early phase of design and development, provides an effective way to analyze the overall performance capability of the closed loop system and to predict its performance at minimal cost.
2. COMPUTER ARCHITECTURE FOR TIME CRITICAL REAL TIME SIMULATIONS
Hardware-in-the-loop and man-in-the-loop simulations, which require time critical real-time simulation, have proved to be a constant challenge for the most powerful computer systems. Demands for higher fidelity and more accurate simulation of dynamic systems characterized by ordinary differential equations significantly increase the demands for additional speed and power from simulation computers. These demands have increased at least as fast as computer technology has developed and improved. Thus, available simulation capabilities today have the same relationship to the requirements as 10 years ago [3]. Furthermore, the speed requirements of the more challenging simulation applications very frequently exceed the capabilities of even the most powerful mainframe computers.
During the 1960's hardware-in-the-loop simulation of time critical processes depended on analog computers, such as the EAI 231 and EAI 781. In analog computers the parallel operation of many computing elements provides a very high processing speed, which is the most significant property for time critical real-time simulation. Programming of these processors was a manual process using patch boards and fixed point scaled equations. In the early 1970's hardware-in-the-loop simulation shifted predominantly to hybrid computation, i.e. a combination of analog and digital hardware such as the EAI PACER 100, to provide the required computational capabilities. In the middle of the 1970's, with the advent of fast mainframe digital computers, the emphasis in applications moved exclusively to digital hardware. But at that time the speed requirements of the more challenging simulations frequently exceeded the capabilities of even the most powerful mainframe computers, such as the IBM 360, CDC 7600, Univac 1108 and so on. Today's mainframe supercomputers, such as the CRAY-1, CRAY X-MP-1, CRAY X-MP-2, IBM 3090/VF-200, NEC SX-1E, NEC SX-2, CDC CYBER 205, Amdahl 1200, Hitachi S-810/20 and so on, provide the necessary computational power in the majority of cases, but very often do not provide a cost effective approach for such applications.
Therefore, in the late 1970's the trend in the architecture of digital computer systems for time critical real-time simulation was toward more specialized architectures. The first of these devices was the peripheral array processor AD-10 (1979), manufactured by Applied Dynamics Inc. It was a simulation-oriented peripheral processor intended primarily for the simulation of systems of ordinary differential equations. The performance of this system is given in Table 1, which relates to an estimate of the computer power necessary for the development of a helicopter simulator.
The impressive speed of the machine when applied to ordinary differential equations results from its advanced technology, which provides very high processing speeds, from extensive pipelining and parallelism, and from specialized computing and memory units suitable for solving nonlinear ODEs [7]. A new version of this system, the AD-100, which is characterized by a significant improvement in performance (Table 2), appeared on the market in 1984.
TABLE 1
HELICOPTER SIMULATORS [5]
Computer        Achieved frame time
CYBER 175       45 ms
CDC 7600        15 ms
FPS AP-120B     4.5 ms
AD-10           0.8 ms
TABLE 2
MODEL OF A WHIRLING FLEXIBLE BEAM [6]
Computer        AD-100 advantage
ADI AD-100      1.00
CRAY 1-S        3.35
IBM 3081        5.40
IBM 3033        7.45
FPS 164         17.95
HEP H-1000      36.65
But very often hardware-in-the-loop real-time simulation requires simulator systems that are more cost effective and portable. The first cost effective way of attaining analog computer speed is the parallel operation of multiple digital microprocessors. The price/performance ratio of multimicroprocessor systems was very attractive, and therefore a number of attempts have been made over the past years to interconnect relatively inexpensive general-purpose microcomputers into complex simulator systems. With the increasing availability of very fast and very economical single board computers, it has become feasible to construct networks of microprocessors that form a special-purpose simulator. This approach was very attractive and offered more favorable speed/cost ratios than other solutions. Therefore a number of microprocessor networks were developed for real-time simulation throughout the 1980's. The real-time multiprocessor simulator (RTMPS) project at the NASA Lewis Research Center for the simulation of jet engines (1984) was one of the first and most significant [8]. The recent introduction of powerful multiprocessor systems by a large number of vendors (1985-1987), such as Ametek Computer Research Division, Alliant, BBN Advanced Computers, Elxsi, Encore Computer, Flexible Computer, Intel Scientific Computers, Ncube, Thinking Machines and so on, has increased the interest of engineers and scientists in this approach to high speed real-time scientific computation. But it must be noted that this approach is not cost effective, or particularly attractive, for the majority of customers and simulator vendors. Today, a very cost effective approach to parallel digital real-time simulation is also based on networks of transputers [9,10]. The T800 Transputer contains a 10 MIPS 32-bit processor, on-chip RAM, a timer and I/O interfaces which are based on serial communication channels.
The T800 has four links per chip, and the serial links use a clock rate of 20 MHz, so that a communication channel between two Transputers requires only two wires. Two T800 Transputers with floating point hardware and a speed of 1-2 MFLOPS are approximately twelve times faster than the Intel 80386/80287 and six times faster than the Motorola 68020/68881 in the simulation of a 2nd order system with RK-4 integrators [10]. Further evolution of transputer networks by Inmos Inc. and Micro Way makes it possible to design and build arbitrarily large parallel processing machines. A 32-transputer array has been used by Rolls-Royce at Derby in a simulator for modelling the flow through a jet engine's turbine-blade cascade [11]. This model required ten minutes to run on the 32-transputer array and two minutes on a Cray XMP-48. Since the Cray costs over 125 times as much as the 32-transputer array while being only five times faster, the price/performance ratio is 25:1 in favor of the transputer network.
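The 25:1 figure follows from dividing the cost ratio by the speed ratio. The short sketch below only restates that arithmetic with the numbers quoted in the text; the function name and the definition of performance as the inverse of run time are assumptions made for illustration.

```python
def price_performance_ratio(time_a_min, time_b_min, cost_b_over_cost_a):
    """Return how many times better the performance per unit cost of A is than that of B.

    Performance is taken as 1 / run time (an assumption of this sketch);
    cost_b_over_cost_a is cost(B) / cost(A).
    """
    slowdown_of_a = time_a_min / time_b_min   # how many times slower A is than B
    return cost_b_over_cost_a / slowdown_of_a


# Figures quoted in the text: ten minutes on the 32-transputer array (A),
# two minutes on the Cray (B), and the Cray costing 125 times as much.
print(price_performance_ratio(10.0, 2.0, 125.0))
```

Because the text says the Cray costs "over" 125 times as much, the result is best read as a lower bound on the ratio.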
But parallel operation does not guarantee performance and may, in fact, limit it. An example is the successful replacement of the 64-processor Illiac IV system with the higher performance serial processor Cray-1. The number of processors, interprocessor communications, memory organization and numerous other factors interact to limit or to enhance processor performance. The effective utilization of a network of processing elements or microcomputers poses difficult scheduling and allocation problems. This means that the major difficulty in using parallel processors is effective software support, so that the total performance improvement over serial processing depends both on the numerical procedures used, i.e. the techniques of discretization and decomposition, and on the power of the simulation hardware.
3. ARCHITECTURE OF THE REAL-TIME MPS-AMS MULTIPROCESSOR SIMULATOR
The real-time Multiprocessor Simulator (MPS), organized around a shared single bus (AMS) and shared dual-port memories, i.e. as a tightly coupled multiprocessor topology [12], is shown in Figure 1. The modular design of this system provides a number of benefits related to its flexibility in modification, reconfiguration and maintenance [13]. Four Siemens 8086/87 based single board computers (SBC) AMS-M6-A8 are used to realize the simulator hardware on the principle of a master/slave relationship. The SBC boards are connected through the 16-bit data bus AMS-M (a European realization of the IEEE 796 Multibus I), which supports real-time processing features. Access to the analog I/O world is provided by the 16/32 channel, 12-bit analog to digital input board AMS-230-A1, and by four 12-bit digital to analog channels realized on an AMS-M596 standard bus interface board. Such a system is characterized by low functional complexity, physical compactness, and relatively low cost. Communication among the processors is performed via message passing in "mailboxes" that reside in distributed dual port memory. Access to this memory occurs via a single time shared bus.
In the configuration phase of the real-time multiprocessor simulator, development of the real-time simulation software requires selection of the modules and their editing, compiling and linking on the host PC XT system. The next step is to assemble the modules according to the block diagram of the simulated process or system and to test, evaluate and validate them. After that the tested modules are downloaded onto the master processor board, and finally the created modules are mapped from the master board to the slave boards of the MPS according to the accepted decomposition scheme [14].
Furthermore, the host system provides overall control of the simulation activities through a suitable graphic language. With the additional multiuser microprocessor development system Tektronix 8560, i.e. through the separate integration station Tektronix 8540, this system enables efficient development of custom designed hardware and software modules [15]. The two processors, host and master, communicate using the interrupt system via a PC bus window, through the high speed SMP-PC interbus SMP-E570-A1. This technique was selected as the fastest available communication between the two processors; it allows one system to access an address on the companion system's bus as though it were an address on its own bus. To prevent conflicts in sending and receiving data between the PC-XT and the AMS system through PC memory, the synchronization mechanism uses the flag test-and-set procedure (semaphore) [16].
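The flag test-and-set idea can be sketched in software as follows. This is only an illustrative model, not the authors' implementation: the class `SemaphoreFlag`, the functions `exchange` and `run_demo`, and the use of two Python threads in place of the host and master processors are all assumptions, and the internal lock merely stands in for the indivisible bus cycle that a hardware test-and-set relies on.

```python
import threading
import time


class SemaphoreFlag:
    """Software model of a one-word flag with an atomic test-and-set.

    The internal lock only stands in for the indivisible bus cycle that a
    real test-and-set needs; it is an assumption of this sketch.
    """

    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()

    def test_and_set(self):
        """Set the flag to 1 and return the value it had before."""
        with self._atomic:
            previous = self._value
            self._value = 1
            return previous

    def clear(self):
        """Release the flag so the other side may claim the window."""
        with self._atomic:
            self._value = 0


def exchange(flag, window, n_messages):
    """Claim the shared window, update it, release it; repeat n_messages times."""
    for _ in range(n_messages):
        while flag.test_and_set() == 1:   # window busy: owned by the other side
            time.sleep(0)                 # give the other thread a chance to run
        count = window["count"]           # critical section: read ...
        window["count"] = count + 1       # ... then modify and write back
        flag.clear()


def run_demo(n_messages=500):
    """Let a 'host' and a 'master' thread share one window; return the final count."""
    flag = SemaphoreFlag()
    window = {"count": 0}
    sides = [threading.Thread(target=exchange, args=(flag, window, n_messages))
             for _ in ("host", "master")]
    for side in sides:
        side.start()
    for side in sides:
        side.join()
    return window["count"]


print(run_demo())
```

Because every read-modify-write of the window happens only while the flag is held, no update is lost and the final count equals twice `n_messages`.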
[Figure 1: block diagram. Recoverable labels: Cyber CDC 170/960; development system; host computer PC/XT; Tektronix 8560; PC-bus/SMP-bus interface; SMP real-time monitoring bus; master processor board 8086/87; slave processor boards 8086/87; prototype hardware 8086/87; AMS real-time processing bus; converters.]
Figure 1. Multiprocessor architecture for Hardware-in-the-loop testing
The AMS-M bus protocol defines master-slave communication and multiprocessor arbitration, which is strongly problem dependent. The real-time monitoring SMP-M bus is an 8/16-bit data bus with its own bus interface on the SBC board, but without multiprocessor arbitration. The local bus on the AMS boards connects the processor to all on-board input/output devices (24-line PIO 8255, RS 232 SIO 8251), local memory (EPROM 2764, SRAM 6116) and communication memory (dual port memory SRAM 6116), as shown in Figure 2. This bus permits independent execution of on-board activities.
[Figure 2: block diagram. Recoverable labels: peripheral connector; wait cycle generator; 8086-2 processor; timer 8253; parallel I/O 8255A; serial I/O 8251A; RAM 6116; board-internal bus; SMP bus interface; AMS-M bus.]
Figure 2. Single board computer AMS-M6-A8
A backplane provides the physical connections for the AMS-M and SMP-M bus signals and for the priority resolver lines that set the priorities. Each board has a fixed priority, which can be changed through jumpers on the backplane.
[Figure 3: memory map. Recoverable labels: dual port RAM; program memory; monitor memory; AMS memory; shared memory; SMP memory (PC window); dual-port memory; data memory; local address; AMS address; on-board master.]
Figure 3. Multiprocessor memory organization
In this tightly coupled multiprocessor configuration, the processors communicate over the parallel bus, AMS-M, through a common, i.e. shared, memory. Generally speaking, the common memory can be concentrated or distributed. In the accepted configuration, the common memory is partitioned across the processor boards as dual-port memory (DPM) to reduce bus occupation. Additionally, each processor board has a local memory, which is especially useful for loosely coupled decomposition algorithms that frequently use only local memory and seldom the common memory. At the same time, this type of memory organization allows parallel access to shared memory through the local bus, without using the AMS-M real-time bus. In addressing the DPM, all read accesses are local and do not use the common bus. Write accesses use two different kinds of addresses. On-board addresses are used by a processor to address its local memory and its local dual-port RAM. The DPM on other boards is addressed through the AMS bus controller in the memory space 0x000-0xFFF, depending on the selected board. The memory organization of this system is shown in Figure 3. To access the shared memory it is necessary to gain the AMS-M bus, which is then locked through the standard protocol.
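The effect of this memory organization on bus traffic can be illustrated with a small model: a message is written once, over the shared bus, into the receiver's dual-port memory, and every later read by the receiver stays on its own board. The classes `Board` and `SharedBus`, the function names and the mailbox layout are hypothetical and chosen only for this sketch; real address decoding and bus locking are not modelled.

```python
class Board:
    """A processor board with its own partition of dual-port memory (DPM)."""

    def __init__(self, name, mailbox_slots=4):
        self.name = name
        self.dpm = [None] * mailbox_slots


class SharedBus:
    """Model of the single shared bus; it only counts the transfers it carries."""

    def __init__(self, boards):
        self.boards = {board.name: board for board in boards}
        self.transfers = 0

    def write(self, destination, slot, value):
        """Remote write into another board's DPM; this occupies the shared bus."""
        self.transfers += 1
        self.boards[destination].dpm[slot] = value


def local_read(board, slot):
    """Local read over the on-board bus; the shared bus is not involved."""
    return board.dpm[slot]


def demo():
    master, slave = Board("master"), Board("slave1")
    bus = SharedBus([master, slave])
    bus.write("slave1", 0, ("frame", 42))                  # one bus transfer
    values = [local_read(slave, 0) for _ in range(3)]      # three local reads
    return bus.transfers, values


print(demo())
```

In this model the receiver can poll its mailbox as often as it likes without adding to bus occupation, which is the benefit the distributed DPM arrangement is meant to provide.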
The most important aspect of the bus interconnection topology used in this MPS is the bus arbitration technique. The priority level is determined by the user through a wrapping technique on the backplane, as shown in Figure 4.
Figure 4. AMS-M bus arbitration
Each SBC has two arbitration lines, bus request (BREQ) and bus priority in (BPRN), which are used to gain access to the AMS bus. The BREQ line runs from an SBC to the priority resolver and indicates a request for control of the AMS bus. The BPRN signal runs from the priority resolver to an SBC and indicates that the processor may go ahead and use the bus, since there is no higher priority request for the AMS bus.
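The behaviour of a fixed-priority resolver can be sketched as a simple function from request lines to grant lines. The function name and the convention that index 0 is the highest priority are assumptions of this sketch; signal polarity and timing are ignored.

```python
def resolve_priority(breq):
    """Return the BPRN grant lines for a list of BREQ request lines.

    breq[i] is True when board i requests the bus. Index 0 is assumed to be
    the highest priority. At most one grant is asserted.
    """
    bprn = [False] * len(breq)
    for board, requesting in enumerate(breq):
        if requesting:
            bprn[board] = True   # highest-priority requester wins
            break
    return bprn


# Boards 1 and 2 request at the same time; board 1 has the higher priority.
print(resolve_priority([False, True, True, False]))
```

A lower-priority board keeps its request asserted and is granted the bus only once no higher-priority request remains.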
Figure 5. Interrupt system of MPS-AMS
To optimize communication efficiency and to provide real-time processing of the real-time clock request, the MPS system is predominantly interrupt driven. Interprocessor communication begins by passing an interrupt request signal from one processor to another. The priority assignment in the interrupt system is problem dependent and is shown in Figure 5 for a typical closed loop guided missile system.
The interconnection strategy is adapted to this structure, so that the request line of each processor is wrapped to the corresponding interrupt level.
The AMS system bus uses the non-bus-vectored mode for interrupts. When an interrupt request line is activated, the interrupt controller generates an interrupt vector address and transfers it to the processor over the local bus [17].
4. ONE CONCEPT FOR PARALLELIZATION OF MATHEMATICAL MODELS GIVEN BY ODEs IN DIGITAL REAL TIME SIMULATION
4.0. Real time simulation in multiprocessor environment
The abilities of parallel real time simulation depend predominantly on the performance and architecture of the parallel multiprocessor system, on the types of interaction between the parallel processors, on the problem under consideration, i.e. its inherent level of parallelism, on the numerical methods for numerical integration, on the strategy of task allocation and, finally, on the cooperation between the parallel architecture and the parallel discrete time model of the original continuous system. But it is important to note that the main contribution to improved efficiency in multiprocessor real time simulation comes from the decomposition of the model, the process of discretization and the mapping of the given problem onto the parallel multiprocessor architecture. This process usually involves several phases. It starts with the decomposition of a mathematical model given by algebraic equations (AEs) and ordinary differential equations (ODEs) or by partial differential equations (PDEs). Since these equations are defined over a continuous time domain, in the second step some kind of discretization or numerical approximation must be employed to enable digital implementation. Finally, it is necessary to perform a suitable partitioning of the derived discrete time model onto the multiprocessor environment. The mapping of the decomposed and discretized model onto parallel architectures can be performed in different ways.
In dynamic load balancing, discrete time models are allocated to microprocessors at run time. Discrete models automatically migrate from heavily loaded microprocessors to lightly loaded ones. By attaining a well-balanced load, better processor utilization can be achieved and thus higher performance of the complete system.
In a static load balancing method, models are allocated to microprocessors after compile time, i.e. before run time starts. Such static techniques require fairly accurate predictions of the resource utilization of each model.
For programs with unpredictable run-time resource utilization, dynamic load balancing is more desirable because it allows the system to adapt continuously to rapidly changing run time conditions. Popular measures of load balance include CPU time utilization, communication time, the number of concurrent microprocessors in active operation, the number of concurrent models, etc.
In real time simulation, the computational requirements and resource demands of each software module, i.e. of each program, must be correctly predicted in advance to enable real time implementation. This estimate can be derived after the process of discretization. On the basis of this information, the algorithm for task allocation provides a good load balance and real time execution. Therefore the concept of static load balancing is a natural one in real time simulation. In simulator design, which is the primary interest of our research, the objective function must provide real time simulation of the given problem with the desired accuracy and with a minimal number of microprocessors. A system which enables the development of a simulator in this way can be considered a development system for simulator design and realization. The accepted objective function is natural in the design and realization of digital simulators for operator training or hardware-in-the-loop testing.
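A static allocation with the stated objective (real time execution on as few processors as possible) can be sketched with a first-fit-decreasing heuristic. This is a swapped-in textbook heuristic, not the allocation algorithm of the paper; the module names, execution times and frame time are invented for illustration, and communication time and precedence between modules are ignored.

```python
def allocate_static(module_times_ms, frame_time_ms):
    """First-fit-decreasing assignment of modules to processors.

    module_times_ms maps a module name to its predicted execution time per
    frame. Each processor's total load must not exceed frame_time_ms.
    Returns a list of processors, each a dict with 'load' and 'modules'.
    """
    processors = []
    ordered = sorted(module_times_ms.items(), key=lambda item: item[1], reverse=True)
    for name, time_ms in ordered:
        if time_ms > frame_time_ms:
            raise ValueError(f"module {name!r} alone exceeds the frame time")
        for processor in processors:
            if processor["load"] + time_ms <= frame_time_ms:
                processor["load"] += time_ms
                processor["modules"].append(name)
                break
        else:
            # No existing processor has room: add one more board.
            processors.append({"load": time_ms, "modules": [name]})
    return processors


# Hypothetical per-frame execution times (ms) and a 10 ms frame.
modules = {"aero": 6.0, "seeker": 5.0, "autopilot": 4.0, "kinematics": 3.0, "actuator": 2.0}
for index, processor in enumerate(allocate_static(modules, 10.0)):
    print(index, processor["modules"], processor["load"])
```

The heuristic does not guarantee the minimum number of processors in general, but it shows why accurate execution time predictions are a precondition for static allocation: a module whose predicted time is wrong can silently break the frame-time constraint.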
The real time simulator described in this article, hosted on an IBM PC/AT, has been realized in order to provide the user with such abilities and, furthermore, to allow the generation of real time machine code for a target processor based on the Intel 8086 microprocessor and the 8087 arithmetic coprocessor. Through the attached peripheral homogenous tightly coupled multiprocessor system based on single board computers, such a simulator system provides the user with different experimental abilities in operator training or control system design and testing. The software support of such a simulator development system provides extensive non real time simulation, which leads to the decomposition technique, numerical integration and task allocation strategy that guarantee the minimal number of parallel processing units, i.e. the minimal hardware complexity necessary for the realization of different kinds of simulator systems. This hardware complexity depends strongly on the desired level of accuracy of the simulation, so that as the level of approximation increases the number of microprocessors can increase significantly. The choices of decomposition technique and of procedure for numerical integration have a substantial influence on this. Therefore special attention will be dedicated to the relationship between hardware complexity on one side and decomposition techniques, numerical methods for discretization and the algorithm for task allocation on the other.
Now we can define the procedure for real time simulation in a multiprocessor environment with the following steps:
Step 1. Structural and dynamic decomposition of the system, which corresponds to physical partitioning of the problem into a sequence of modules with a lower level of complexity.
Step 2. Tuning of the numerical methods and sample rates for discretization to the mathematical modules according to their nature: linear or nonlinear, time variant or time invariant, stiff or nonstiff, with or without discontinuities, spectral characteristics of the input signals, and so on.
Step 3. Task allocation, which follows the physical arrangement of the system and, in cooperation with the numerical methods for discretization, determines the desired level of granularity in order to achieve equal load balancing between processors, minimal hardware requirements and real time execution.
The presented procedure is not straightforward; it may be iterative, and all of these steps must be taken into account in order to achieve optimal real time simulation of the given problem in accordance with the accepted objective function.
4.1. System decomposition
The first step in the above procedure requires the decomposition of the mathematical model of a complex dynamic system into a number of hierarchical functional modules or blocks of different complexity. With such a modular or block approach a realistic complex problem can be subdivided into a sequence of smaller modules or blocks. By applying this formalism, complex mathematical models can easily be transformed into a block level distributed structure, which enables the isolation of standard mathematical models and their further efficient processing. The decomposition scheme which we prefer is heuristic and is mainly based on physical partitioning of the complex system. We now define the parallelism degree of a model at the block diagram level as the maximum number of functional blocks that can be executed at the same time on different processors, if necessary, in real time execution. For very fast systems, i.e. for time critical real time simulation, the inherent parallelism degree can be further increased by partitioning from the block level down to the equation level or even the arithmetic operation level. If necessary, it is possible to determine the longest execution time, i.e. the critical data flow trajectory, through the block diagram. The efficiency of modular partitioning is problem dependent, but only with such an approach is it possible to adapt the numerical methods of discretization optimally to each module of the distributed system. Such a decomposition of a complex system into multiple low level computational modules, which enables the numerical method and the period of discretization to be adapted to each module, is most important for digital real time simulation. After the structural decomposition, which corresponds to the physical topology of the system, a suitable dynamic decomposition is necessary in order to adapt the periods of discretization or integration to each module. The concept of multirate sampling enables different sampling rates in different modules or in different loops, and leads to a significant reduction of computational load, i.e. CPU time savings, and to improved numerical conditioning. Such structural and dynamic decomposition detects the level of parallelism which is very often inherent in the complex original continuous system. Separation of the differential equations according to their type, particularly linear differential equations from nonlinear ones (e.g. the nominal trajectory from the perturbed one), separation according to spectral characteristics, i.e. of fast portions from slow ones, and separation according to the frequency content of the input signal are essential for achieving highly efficient digital real time simulation.
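The critical data flow trajectory and a rough estimate of the parallelism degree can both be read off a block diagram once it is written as a directed acyclic graph. The sketch below assumes that any feedback loop has been cut (for example at a one-frame delay) so that the graph within a frame is acyclic; the block names, execution times and the use of the widest depth level as a stand-in for the parallelism degree are assumptions made for illustration.

```python
from functools import lru_cache


def critical_path_time(exec_time, preds):
    """Longest accumulated execution time along any path of the block diagram."""

    @lru_cache(maxsize=None)
    def finish(block):
        earliest_start = max((finish(p) for p in preds.get(block, ())), default=0.0)
        return earliest_start + exec_time[block]

    return max(finish(block) for block in exec_time)


def level_width(exec_time, preds):
    """Largest number of blocks at the same depth: a simple parallelism estimate."""

    @lru_cache(maxsize=None)
    def depth(block):
        return 1 + max((depth(p) for p in preds.get(block, ())), default=0)

    counts = {}
    for block in exec_time:
        counts[depth(block)] = counts.get(depth(block), 0) + 1
    return max(counts.values())


# Hypothetical blocks of a guided missile loop with execution times in ms.
exec_time = {"seeker": 2.0, "guidance": 1.0, "autopilot": 1.5,
             "actuator": 0.5, "atmosphere": 1.0, "aero": 3.0}
# preds[b] lists the blocks whose outputs b needs within the same frame.
preds = {"guidance": ["seeker"], "autopilot": ["guidance"],
         "actuator": ["autopilot"], "aero": ["actuator", "atmosphere"]}

print(critical_path_time(exec_time, preds), level_width(exec_time, preds))
```

If the critical path time exceeds the frame time, adding processors at the block level cannot help, which is the situation in which partitioning down to the equation level or a change of integration method becomes necessary.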
The heuristic method for system decomposition used in this concept is based on the fact that the modules that follow from partitioning the overall system closely resemble the physical topology of the system. The main advantage of this approach is that the processing blocks are associated with physical sections, so model implementation and verification can easily be done. In addition, the variables used in interprocessor communication have physical meaning, which aids the understanding and interpretation of the model. Furthermore, by exchanging only the output variables between blocks it is possible to reduce the communication requirements between processors significantly.
Because the decomposed system is similar to the physical system, this method is problem dependent. However, the advantages of the method far outweigh this disadvantage. The main advantages of this method are:
- Decomposition is easy to make, since it follows the physical arrangement of the system.
- The highly modular approach is very flexible in the case of structural or module modifications.
- Interprocessor communication requirements are minimized.
- Program design is simplified, which is a direct consequence of the modular structure.
- Program coding is straightforward. It is easier to code the small modules that result from the decomposition than complex ones.
- Checking, testing and debugging are also easier, especially for message interchange between microprocessors during interprocessor communication.
This decomposition is applied to the original continuous models, and its abilities are determined by the nature of the problem. Modules or blocks that are independent enable simultaneous or concurrent calculation and provide a parallel decomposition. Modules or blocks that are sequential lead to a cascade decomposition. This level of parallel or sequential computation is predominantly determined by the nature of the problem, but in the following steps it will be shown that this inherent level of parallel or sequential properties can be significantly modified by the choice of numerical methods for discretization or integration.
4.2. Numerical integration
This step transforms the original continuous mathematical model into an equivalent discrete one, which must guarantee the desired level of approximation with minimal arithmetic complexity, normalized to some common frame time. By this transformation, using different techniques of numerical integration, it is possible to increase the degree of model parallelism from the inherent one to some desired level which enables real time simulation of the given problem on a given multiprocessor configuration. This means that the complete parallelism of the discrete time model for implementation depends not only on the efficiency of the functional decomposition, i.e. the topology of the system, but also on the numerical method for discretization. The level of parallelization added by the choice of numerical methods determines the final level of granularity of the discrete time model for implementation. The level of granularity in the equivalent discrete time model can range from the instruction and statement level to the level of a program or a group of programs. This depends on the complexity of the mathematical model and its dynamic characteristics, i.e. the real time constraints, and on the architecture of the multiprocessor system that would be used for implementation. For example, if a model of some problem is represented by a sequence of sequential modules or blocks which require sequential computation, then by using any explicit numerical method for integration they can easily be transformed into parallel procedures until the desired level of granularity is reached. The assignment of a suitable integration or discretization method, and a suitable period of discretization, to each module or block requires extensive non real time analysis, checking, testing and verification. In the following part of this article we shall focus our attention on numerical methods that can be considered suitable for digital real time integration of ordinary differential equations, and on the analysis of the dependence between the parallelism of simulation models and the numerical methods used for discretization or numerical integration.
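The remark that an explicit method turns a cascade of blocks into parallel procedures can be illustrated with two first-order blocks in series. With explicit Euler, each block's new state depends only on values from step n, so the two updates are independent within a frame. The block equations, coefficients and function names below are invented for this sketch.

```python
def f1(x1, u):
    """First block: a first-order lag driven by the external input u."""
    return -2.0 * x1 + u


def f2(x2, x1):
    """Second block: a first-order lag driven by the output of the first block."""
    return -1.0 * x2 + x1


def euler_step_parallel(x1, x2, u, T):
    """Both updates use only step-n values, so they do not depend on each other."""
    x1_next = x1 + T * f1(x1, u)     # could run on one processor
    x2_next = x2 + T * f2(x2, x1)    # could run on another processor at the same time
    return x1_next, x2_next


def euler_step_whole(x1, x2, u, T):
    """Explicit Euler applied to the undivided two-state model."""
    dx1, dx2 = f1(x1, u), f2(x2, x1)
    return x1 + T * dx1, x2 + T * dx2


def simulate(step, steps=2000, T=0.01, u=1.0):
    x1, x2 = 0.0, 0.0
    for _ in range(steps):
        x1, x2 = step(x1, x2, u, T)
    return x1, x2


print(simulate(euler_step_parallel) == simulate(euler_step_whole))
```

The block-wise update reproduces the single-processor result exactly, at the cost of exchanging one output variable per frame. An implicit method would instead need the new value of the first block before the second could finish, which restores the sequential dependence.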
For a linear or linearized problem given by

$$\dot{x}(t) = A\,x(t) + B\,u(t), \qquad x(0) = x_0 \tag{1}$$

some modification of the standard discretization methods can be considered optimal.
The modification of the standard step invariant method [20,21] is given by

$$x^*_{k+1} = \Phi(A,T,W)\,x^*_k + \Gamma(A,B,\lambda,\gamma,T,W)\,u_k \tag{2}$$

where the state vector and system matrices are determined by

$$x^*_k = W^{-1}x_k, \qquad \Phi = W^{-1}e^{AT}W, \qquad \Gamma = W^{-1}\lambda\,[A^{-1}(e^{AT}-I) + \gamma T(e^{AT}-I)]\,B \tag{3}$$

The state space version of the T-integrator is given by [22]

$$x^*_{k+1} = A^*(A,L,\Gamma,T,W)\,x^*_k + B_1(A,L,\Gamma,T,B,W)\,u_{k+1} + B_2(A,L,\Gamma,T,B,W)\,u_k \tag{4}$$

where

$$x^*_k = W^{-1}x_k, \qquad A^* = [I - LT\Gamma A]^{-1}[I + LT(I-\Gamma)A], \qquad B_1 = [I - LT\Gamma A]^{-1}LT\Gamma B, \qquad B_2 = [I - LT\Gamma A]^{-1}LT(I-\Gamma)B \tag{5}$$
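As an illustration of how the T-integrator matrices can be evaluated, the following is a minimal Python sketch, not the authors' implementation. It assumes a 2x2 system so that a closed-form inverse suffices, takes W = I, and writes the compensation matrices as `L` and `G`; the helper names and the numerical values of A, B and T are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_add(X, Y, sign=1.0):
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_scale(X, s):
    return [[s * x for x in row] for row in X]

def inv2(M):
    # Closed-form inverse, valid for 2x2 matrices only (assumption of this sketch).
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def t_integrator(A, B, L, G, T):
    """Return (A*, B1, B2) of x[k+1] = A* x[k] + B1 u[k+1] + B2 u[k]."""
    I = identity(2)
    LT = mat_scale(L, T)
    LTG = mat_mul(LT, G)                           # L T Gamma
    LT_rest = mat_mul(LT, mat_add(I, G, -1.0))     # L T (I - Gamma)
    M_inv = inv2(mat_add(I, mat_mul(LTG, A), -1.0))  # [I - L T Gamma A]^-1
    A_star = mat_mul(M_inv, mat_add(I, mat_mul(LT_rest, A)))
    B1 = mat_mul(M_inv, mat_mul(LTG, B))
    B2 = mat_mul(M_inv, mat_mul(LT_rest, B))
    return A_star, B1, B2

# Hypothetical diagonal test system with frame time T = 0.1.
A = [[-1.0, 0.0], [0.0, -2.0]]
B = [[1.0], [1.0]]
I2 = identity(2)
A_star, B1, B2 = t_integrator(A, B, I2, mat_scale(I2, 0.5), 0.1)
```

With L = I and Gamma = I/2 the two input matrices coincide, so the update reduces to the bilinear (trapezoidal) form with a single input matrix multiplying the sum of the current and previous inputs.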
For a general nonlinear problem given by

$$\dot{x}(t) = f(x(t),t,u(t)), \qquad x(0) = x_0 \tag{6}$$

the following numerical integration algorithms can be considered suitable for real time digital simulation [23]:

* Single-pass integration algorithms
- Euler explicit

$$x_{n+1} = x_n + T f(x_n,t_n,u_n) \tag{7}$$

- AB - 2

$$x_{n+1} = x_n + \frac{T}{2}\,[\,3f(x_n,t_n,u_n) - f(x_{n-1},t_{n-1},u_{n-1})\,] \tag{8}$$

- AB - 3

$$x_{n+1} = x_n + \frac{T}{12}\,[\,23f(x_n,t_n,u_n) - 16f(x_{n-1},t_{n-1},u_{n-1}) + 5f(x_{n-2},t_{n-2},u_{n-2})\,] \tag{9}$$

* Real time Runge-Kutta versions
- RK - 2

$$x^p_{n+1/2} = x_n + \frac{T}{2} f(x_n,t_n,u_n), \qquad x_{n+1} = x_n + T f(x^p_{n+1/2},t_{n+1/2},u_{n+1/2}) \tag{10}$$

- RK - 3

$$x^p_{n+1/3} = x_n + \frac{T}{3} f(x_n,t_n,u_n), \qquad x^p_{n+2/3} = x_n + \frac{2T}{3} f(x^p_{n+1/3},t_{n+1/3},u_{n+1/3}),$$
$$x_{n+1} = x_n + \frac{T}{4}\,[\,f(x_n,t_n,u_n) + 3f(x^p_{n+2/3},t_{n+2/3},u_{n+2/3})\,] \tag{11}$$

* Adams-Moulton serial predictor-corrector
- AM - 2

$$x^p_{n+1} = x_n + \frac{T}{2}\,[\,3f(x_n,t_n,u_n) - f(x_{n-1},t_{n-1},u_{n-1})\,], \qquad x_{n+1} = x_n + \frac{T}{2}\,[\,f(x^p_{n+1},t_{n+1},u_{n+1}) + f(x_n,t_n,u_n)\,] \tag{12}$$

- AM - 3

$$x^p_{n+1} = x_n + \frac{T}{12}\,[\,23f(x_n,t_n,u_n) - 16f(x_{n-1},t_{n-1},u_{n-1}) + 5f(x_{n-2},t_{n-2},u_{n-2})\,],$$
$$x_{n+1} = x_n + \frac{T}{12}\,[\,5f(x^p_{n+1},t_{n+1},u_{n+1}) + 8f(x_n,t_n,u_n) - f(x_{n-1},t_{n-1},u_{n-1})\,] \tag{13}$$
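To make two of the listed algorithms concrete, here is a small Python sketch of the AB-2 and real time RK-2 updates. The test model (a first-order lag with zero input), the step size and the Euler start-up step for AB-2 are assumptions chosen for illustration only.

```python
import math

def f(x, t, u):
    # Hypothetical test model: first-order lag x' = -x + u.
    return -x + u

def u_of(t):
    # Hypothetical input signal (identically zero here).
    return 0.0

def ab2(x0, T, steps):
    """Second-order Adams-Bashforth; one explicit Euler step is used for start-up."""
    f_prev = f(x0, 0.0, u_of(0.0))
    x = x0 + T * f_prev
    for n in range(1, steps):
        t = n * T
        f_n = f(x, t, u_of(t))
        x, f_prev = x + T / 2 * (3 * f_n - f_prev), f_n
    return x

def rk2_real_time(x0, T, steps):
    """Real time RK-2: the second pass uses the input sampled at mid-frame."""
    x = x0
    for n in range(steps):
        t = n * T
        x_half = x + T / 2 * f(x, t, u_of(t))
        x = x + T * f(x_half, t + T / 2, u_of(t + T / 2))
    return x

exact = math.exp(-1.0)                                # x(1) for x(0) = 1
err_ab2 = abs(ab2(1.0, 0.01, 100) - exact)
err_rk2 = abs(rk2_real_time(1.0, 0.01, 100) - exact)
```

AB-2 needs one derivative evaluation per frame and only past data, while the real time RK-2 needs two evaluations but samples the input at a time that has already passed when the evaluation is made.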
For complex nonlinear modules or blocks that allow neither further physical decomposition nor single processor real time implementation, it is necessary to use some technique for equation segmentation or numerical methods for parallel integration. The standard approach to this problem [24] requires partitioning the set of n equations and then allocating a certain number of the n equations to each of the microprocessors to achieve the speed needed for real time simulation. Each microprocessor is responsible for performing the function evaluations and integrations associated with its assigned equations. Such subsets of equations can operate in parallel; only the values that are needed in the other equations are transferred periodically between microprocessors. Since the function evaluations, i.e. the computation of derivatives, are done in parallel, a good deal of time can be saved in comparison with a single microprocessor implementation. Just how much time is saved depends on:
-	how closely coupled the system of equations is;
-	how the equations are allocated to the microprocessors;
-	what integration algorithm is used for discretization of each subset of equations;
-	what types of communication channels and protocols are available between microprocessors.
An alternative to the above standard approach is the use of parallel predictor-corrector methods. In this case partitioning is performed in the sense of the algorithm, i.e. the numerical method, and not in the sense of the system of equations as in the previous approach. One possible parallel predictor-corrector method has been presented by Miranker and Liniger [24] for the system $y' = f(x,y)$ and is given by

$$y^p_{i+1} = y^c_{i-1} + \frac{h}{3}\,(8f^p_i - 5f^c_{i-1} + 4f^c_{i-2} - f^c_{i-3}),$$
$$y^c_i = y^c_{i-1} + \frac{h}{24}\,(9f^p_i + 19f^c_{i-1} - 5f^c_{i-2} + f^c_{i-3}) \tag{14}$$
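The data flow of this parallel predictor-corrector pair can be sketched in Python as follows. The code runs the two updates one after the other; in a two-processor realization they could be evaluated concurrently, because both use only values available at the start of the step. The test problem y' = -y, the step size and the use of exact start-up values are assumptions made for this illustration.

```python
import math

def f(x, y):
    # Hypothetical test problem: y' = -y, y(0) = 1.
    return -y

def parallel_pc(h, n_steps):
    xs = [i * h for i in range(n_steps + 2)]
    yc = [math.exp(-x) for x in xs[:3]]          # exact start-up values y^c_0 .. y^c_2
    fc = [f(x, y) for x, y in zip(xs, yc)]
    yp_i = math.exp(-xs[3])                      # exact start-up predictor y^p_3
    for i in range(3, n_steps + 1):
        fp_i = f(xs[i], yp_i)
        # Predictor for step i+1 and corrector for step i share the same inputs,
        # so the two lines below are independent of each other.
        yp_next = yc[i - 1] + h / 3 * (8 * fp_i - 5 * fc[i - 1] + 4 * fc[i - 2] - fc[i - 3])
        yc_i = yc[i - 1] + h / 24 * (9 * fp_i + 19 * fc[i - 1] - 5 * fc[i - 2] + fc[i - 3])
        yc.append(yc_i)
        fc.append(f(xs[i], yc_i))
        yp_i = yp_next
    return yc[n_steps]

y_end = parallel_pc(0.01, 100)                   # approximation of y(1)
pc_error = abs(y_end - math.exp(-1.0))
```

The predictor runs one step ahead of the corrector, which is what removes the serial dependence present in the Adams-Moulton pairs (12) and (13).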
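For parallel real time simulation, parallel block implicit methods can also be very useful. One version of a fourth-order block method presented by Shampine and Watts [24] is given by:
-	Predictor equations

$$y^p_{i+1} = \frac{1}{3}\,(y^c_{i-2} + y^c_{i-1} + y^c_i) + \frac{h}{6}\,(3f^c_{i-2} - 4f^c_{i-1} + 13f^c_i),$$
$$y^p_{i+2} = \frac{1}{3}\,(y^c_{i-2} + y^c_{i-1} + y^c_i) + \frac{h}{12}\,(29f^c_{i-2} - 72f^c_{i-1} + 79f^c_i) \tag{15}$$

-	Corrector equations

$$y^c_{i+1} = y^c_i + \frac{h}{12}\,(5f^c_i + 8f^p_{i+1} - f^p_{i+2}),$$
$$y^c_{i+2} = y^c_i + \frac{h}{3}\,(f^c_i + 4f^p_{i+1} + f^p_{i+2}) \tag{16}$$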
The choice of the optimal discretization method, i.e. tuning the method to each separate module, requires a good understanding of the character and dynamics of each module and of the numerical characteristics of all the methods mentioned above. This can be done only if linear differential equations are separated from nonlinear ones, fast portions of the problem from slow ones, and time invariant portions from time variant ones. The decomposition of the problem therefore has a great influence on the efficiency of the complete procedure. With such an approach the solution bandwidth of the problem, which is the most important quantity in real time simulation, can be increased significantly.
4.3. Task allocation
The last step in the presented procedure distributes the derived discretized modules among the microprocessors of the multiprocessor system, and it can be used iteratively with the previous two steps. This process of task allocation, i.e. of assigning the software modules that constitute the distributed discretized mathematical model of the original continuous system to each processing element, requires an understanding of the data or block dependencies that exist among the problem variables. There are two different task allocation strategies, depending on the application.
In the first case, the architecture of the multiprocessor system is completely defined by the type and number of available processors, their performance and the way they communicate. The computing tasks involved in solving the simulation problem must then be partitioned among the available processors so as to minimize the idle time of each processor and the time lost in communication between program segments. Furthermore, no special real time constraints are assumed, i.e. the solution can be generated faster or slower than real time, depending on the problem and on the performance of the multiprocessor system. In such an environment it is necessary to attain load balancing that leads to better utilization of multiprocessor resources. A well-balanced load provides higher performance of the complete system, i.e. minimal total run time. Measures of load balance in this case may include the CPU time utilization of each processor, the communication time in relation to the computation time, the number of concurrent processes or modules processed simultaneously, and so on. The combination of these objectives can be expressed by the speed improvement achieved on the multiprocessor system. This improvement can be characterized by the factor S, the ratio between the solution times using one processor and M processors [27]:
$$S = \frac{T_S}{T_M} \tag{17}$$

where $T_S$ is the single processor solution time and $T_M$ is the multiprocessor solution time.
The efficiency of the complete multiprocessor implementation can be defined by

$$E = S/N \tag{18}$$

where N denotes the number of microprocessors and S is given by (17). Efficiency is expressed as a percentage by multiplying the above expression by 100. The assignment must be made so as to optimize relations (17) and (18), i.e. to minimize execution time.
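As a worked example of relations (17) and (18), the short Python sketch below evaluates the speed up factor and the efficiency. The solution times and the processor count are hypothetical numbers, not measurements from this work.

```python
def speedup(t_single, t_multi):
    """Speed up factor S: single processor time over multiprocessor time."""
    return t_single / t_multi

def efficiency(s, n_processors):
    """Efficiency E = S / N, as a fraction (multiply by 100 for percent)."""
    return s / n_processors

# Hypothetical timings: 10 s on one processor, 4 s on four processors.
S = speedup(10.0, 4.0)
E = efficiency(S, 4)
print(f"S = {S:.2f}, E = {100 * E:.1f} %")
```

In this example four processors give a speed up of 2.5 and an efficiency of 62.5 %, the remainder being lost to idle time and communication.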
The second approach to the problem of task allocation is related to the design and development of a digital real time simulator for operator training or hardware in the loop testing, i.e. it is restricted to real time simulation. In such applications it may be assumed that the architecture of the multiprocessor system is not given in advance and must be determined by this procedure. In this context, the task allocation strategy, based on the results generated in the first two steps, must provide high fidelity real time simulation with minimal hardware requirements, since for some kinds of simulator systems the computer configuration must be duplicated in a large number of copies. As already pointed out, in this system all program segments and the times required for their execution are known in advance, and therefore a static allocation concept is appropriate. If the execution time of each module and the communication times between processors are known, the problem of task allocation is to determine a distribution of modules over parallel processors that supports real time simulation with minimal hardware requirements. The number of software modules that form one task assigned to one microprocessor depends on the arithmetic complexity of the modules and on the real time requirements. Possible assignments are several modules to one microprocessor, or several microprocessors to one module. This relationship depends predominantly on the processing power of each processing element, the complexity of the discretized modules and their dynamics, the intensity of interprocessor communication, and so on. The procedure starts with the allocation of a task to the first microprocessor on the basis of the block level critical path method [28].
For some common frame time T, modules are added to the first microprocessor as long as the sum of the periods of implementation of these modules is less than or equal to T. The assignment of tasks then continues with the following microprocessors, so that for each microprocessor the following must be valid:
$$\sum_{i=1}^{n} T^{i}_{imp} \le T \tag{19}$$

where $T^{i}_{imp}$ denotes the period of implementation of module i, normalized on the common frame time T, and n denotes the number of modules assigned to each microprocessor. The process of task allocation continues in this way for each succeeding microprocessor until all modules have been allocated. If (19) is valid for all modules together, a single processor implementation of the problem is possible. With such an approach it is possible to reduce the idle time of each microprocessor in the multiprocessor configuration. On the other hand, this approach increases the time delay due to finite computation time. This delay may cause instabilities in closed loop simulation with external hardware included, or in operator training applications.
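The packing rule described above, in which modules are added to a microprocessor while the sum of their implementation periods does not exceed the common frame time, can be sketched in Python as follows. The sketch assumes the modules are already ordered (for instance by a critical path analysis) and ignores communication time; the module names and timings are hypothetical.

```python
def allocate(modules, frame_time):
    """Assign ordered (name, implementation_time) pairs to processors.

    A new processor is opened whenever adding the next module would make
    the summed implementation time exceed the common frame time.
    """
    processors = [[]]
    load = 0.0
    for name, t_imp in modules:
        if t_imp > frame_time:
            # Such a module would need several processors (not handled here).
            raise ValueError(f"{name} does not fit into one frame")
        if load + t_imp > frame_time:
            processors.append([])
            load = 0.0
        processors[-1].append(name)
        load += t_imp
    return processors

# Hypothetical implementation times in units of the frame time.
modules = [("servo", 0.9), ("pitch", 0.45), ("yaw", 0.45), ("sensor", 0.3)]
plan = allocate(modules, 1.0)
```

For the values above the sketch opens three processors: one for `servo`, one shared by `pitch` and `yaw`, and one for `sensor`. A module that exceeds the frame time on its own is rejected, since it would have to be split across several microprocessors.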
In the case of fast and complex modules it is very often necessary to assign several microprocessors to one module. By using equation segmentation methods, some form of parallel predictor-corrector algorithm or block implicit methods, it is possible to achieve the speed of real time processing.
The level of granularity in the process of task allocation depends predominantly on the architecture of the multiprocessor system, the decomposition techniques, the complexity of the modules, their dynamic real time constraints, the numerical methods used for discretization, and so on. On the one hand, very fast modules lead to a fine level of granularity, for example the statement or instruction level. A massively parallel processor needs a fine level of granularity in order to utilize the available processing power efficiently. On the other hand, relatively slow modules lead to the coarse level of granularity preferred in a single bus multiprocessor configuration. Different levels of granularity lead to different intensities of interprocessor communication.
Communication time is not negligible, especially at a fine level of granularity, where it is comparable with computation time. In such circumstances a task allocation optimized only for maximal parallel operation is not optimal once communication times are taken into account, especially if a large amount of data is to be shared. Parallel multiprocessor simulation therefore requires detailed consideration and analysis in order to minimize interprocessor communication. Waiting for intermediate results and delays during communication decrease the throughput per processor as the number of processors grows.
With careful decomposition schemes that follow the physical topology of the problem, and with suitable choices of numerical discretization methods and task allocation strategy, the hardware requirements for real time simulation can be significantly reduced. It is therefore desirable to perform extensive non real time simulation and analysis in order to determine the best manner of allocating program modules and to achieve the most cost effective solution.
4.4. Simulator design for one spinning missile system
The procedure presented above can be illustrated by the following example, which requires the design of a simulator that provides real time simulation of one spinning missile with the desired level of accuracy.
After linearization of the 6-DOF model about the corresponding nominal trajectory and suitable physical partitioning, the block diagram of the spinning missile takes the form shown in Figure 6.
[Figure 6: block diagram with the servo group (microprocessor no. 1) and the missile dynamics, consisting of the pitch channel and the yaw channel]
Figure 6. Block diagram of spinning missile in nonrotating coordinate system
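The servo group mathematical model can be defined by the following system of ordinary differential and algebraic equations [29]:

$$
\begin{aligned}
\dot{x}_1 &= x_2 \\
\dot{x}_2 &= -\frac{1}{T_p}\,x_1 - \frac{2\xi}{T_p}\,x_2 + \frac{1}{T_p}\,U_z \\
\delta_{m1} &= K_p C_1 x_1 + K_p C_2 x_2 \\
\delta_{m2} &= K_p D_1 x_1 + K_p D_2 x_2 \\
\dot{x}_3 &= x_4 \\
\dot{x}_4 &= -\frac{1}{T_p}\,x_3 - \frac{2\xi}{T_p}\,x_4 + \frac{1}{T_p}\,U_y \\
\delta_{n1} &= K_p C_1 x_3 + K_p C_2 x_4 \\
\delta_{n2} &= K_p D_1 x_3 + K_p D_2 x_4
\end{aligned} \tag{20}
$$

or in state space form with

$$\dot{x}_{pk} = A_{pk}\,x_{pk} + B_{pk}\,U_{pk}, \qquad y_{pk} = C_{pk}\,x_{pk} \tag{21}$$

In this case the state, input and output vectors correspond to

$$x_{pk} = [x_1 \;\; x_2 \;\; x_3 \;\; x_4]^T, \qquad U_{pk} = [U_z \;\; U_y]^T, \qquad y_{pk} = [\delta_m \;\; \delta_n]^T \tag{22}$$

while the system matrices $A_{pk}$, $B_{pk}$ and $C_{pk}$ follow directly from (20).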
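The dynamics equations for the pitch channel are given by

$$
\begin{aligned}
\dot{x}_5 &= x_6 \\
\dot{x}_6 &= -\frac{1}{T_0}\,x_5 - (\ldots)\,x_6 + \frac{1}{T_0}\,\delta_z \\
\delta_{mn} &= K_\sigma x_5 + (\ldots)\,x_6 \\
\dot{x}_7 &= x_8 \\
\dot{x}_8 &= x_9 \\
\dot{x}_9 &= x_{10} \\
\dot{x}_{10} &= -a_1 x_7 - a_2 x_8 - a_3 x_9 - a_4 x_{10} + (\ldots)\,\delta_{mn} \\
(\ldots) &= K_q x_7 + K_q P_1 x_8 + K_q P_2 x_9 + K_q P_3 x_{10} \\
\dot{x}_{11} &= -\frac{1}{T_2}\,x_{11} + (\ldots) \\
f_{zk} &= \frac{V}{T_2}\,(\ldots)
\end{aligned} \tag{23}
$$

The state space form of equation (23) is determined by

$$x_{dz} = [x_5 \;\; x_6 \;\; x_7 \;\ldots\; x_{11}]^T, \qquad U_{dz} = [\delta_m \;\; \delta_n]^T,$$
$$\dot{x}_{dz} = A_{dz}\,x_{dz} + B_{dz}\,U_{dz}, \qquad f_{zk} = C_{dz}\,x_{dz} + D_{dz}\,U_{dz} \tag{24}$$

The dynamics of the yaw channel are identical and are given by

$$x_{dy} = [x_{12} \;\; x_{13} \;\; x_{14} \;\ldots\; x_{18}]^T, \qquad U_{dy} = [\delta_m \;\; \delta_n]^T,$$
$$\dot{x}_{dy} = A_{dy}\,x_{dy} + B_{dy}\,U_{dy}, \qquad f_{yk} = C_{dy}\,x_{dy} + D_{dy}\,U_{dy} \tag{25}$$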
Non real time simulation of the above equations has been performed on a Cyber 170/850, using the standard routine for numerical integration of ODEs from the IMSL library (DVERK), in order to generate a reference solution. The reference solutions, obtained with a common integration period of 1 ms and a sixth-order Runge-Kutta method, are shown in Figure 7.
Derivation of the discrete time model for real time simulation requires a careful choice of the numerical integration method and of the period of discretization. For this linearized problem, one suitable discretization method is the approach given by equations (4) and (5). After a detailed dynamic analysis of the original continuous system, based on inspection of the eigenvalues and their distribution, the concept of multi-rate processing has been accepted. Real time processing in the servo group is performed with a 1 ms period, while the period of discretization for the rigid body dynamic equations is 5 ms. The complete discrete time model has been tested through extensive non real time simulations, whose results are shown in Figure 7.
The program for testing the discrete time model, based on equations (21), (24), (25) and (4), is given by

    for i = 1 to ....
        * servo block
        for j = 1 to 5
            input  u_pk,k
            x_pk,k = A*_pk x_pk,k-1 + B*_pk (u_pk,k + u_pk,k-1)
            y_pk,k = C_pk x_pk,k
            output y_pk,k
            x_pk,k-1 = x_pk,k
            u_pk,k-1 = u_pk,k
        next j
        * dynamic of pitch and yaw channel
        x_dz,k = A*_dz x_dz,k-1 + B*_dz (y_pk,k + y_pk,k-1)
        x_dy,k = A*_dy x_dy,k-1 + B*_dy (y_pk,k + y_pk,k-1)
        f_zk,k = C_dz x_dz,k + D_dz y_pk,k
        f_yk,k = C_dy x_dy,k + D_dy y_pk,k
        output f_zk, f_yk
        x_dz,k-1 = x_dz,k
        x_dy,k-1 = x_dy,k
        y_pk,k-1 = y_pk,k
    next i
Comparative analysis of the results presented in Figure 7 shows a good tuning of the discrete time model, which has been obtained with the compensation matrices L = I and Γ = 1/2 I (bilinear transformation) and the corresponding periods of discretization.
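The multi-rate structure of the test program, with five servo frames of 1 ms for every 5 ms frame of the dynamics block, can be sketched in runnable Python as follows. The sketch replaces the servo group and the channel dynamics by hypothetical scalar first-order lags; only the loop structure and the bilinear update with the sum of the current and previous inputs are taken from the text.

```python
def tustin(a, T):
    """Bilinear (L = I, Gamma = I/2) discretization of x' = -a x + a u."""
    den = 1.0 + a * T / 2.0
    return (1.0 - a * T / 2.0) / den, (a * T / 2.0) / den

def simulate(n_frames, u=1.0):
    As, Bs = tustin(100.0, 0.001)   # hypothetical fast block, 1 ms period
    Ad, Bd = tustin(10.0, 0.005)    # hypothetical slow block, 5 ms period
    x = u_prev = 0.0
    z = y_prev = 0.0
    for _ in range(n_frames):       # one pass per 5 ms frame
        for _ in range(5):          # five 1 ms servo frames
            x = As * x + Bs * (u + u_prev)
            u_prev = u
        y = x                       # servo output seen by the slow block
        z = Ad * z + Bd * (y + y_prev)
        y_prev = y
    return x, z

x_end, z_end = simulate(400)        # 2 s of simulated time, unit step input
```

Both stand-in blocks have unit static gain, so for a unit step input the two states settle at 1, which gives a quick check that the nested loops and the previous-value bookkeeping are consistent.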
The strategy of static task allocation, which follows the physical decomposition of the problem and the choice of numerical methods for discretization, must meet the requirements for real time simulation with minimal hardware. The computational load, i.e. the speed up factor, of the first module (servo group) is determined by the period of discretization (1 ms) divided by the execution time of the arithmetic operations needed to implement this discrete time model. The speed up factor computed for this module is $K_{servo} = 1.0013$. This means that real time simulation of this module is possible on a single microprocessor board and that the load balancing for this microprocessor is near ideal. The speed up factor for the following group, which represents the dynamics of the pitch and yaw channels, is $K_{dynamic} = 1.00124$. This means that real time simulation of this module is also possible and that it also needs only one microprocessor with near ideal load balancing. It follows from the obtained results that real time simulation of the model given in Figure 6 is possible with two microprocessors whose loads are nearly ideal and nearly equal. The assignment of tasks is determined by allocating the discrete time model of the servo group to microprocessor no. 1, and the pitch and yaw dynamic equations to microprocessor no. 2.

[Figure 7 panels: UZ = 10, UY = 0 and UZ = 0, UY = 10]
Figure 7. Time responses of continuous and equivalent discrete time models ($T_{servo}$ = 1 ms, $T_{dynam}$ = 5 ms, L = I, Γ = 1/2 I)

The intermediate results computed by the first microprocessor must be transferred to the second one. However, the second microprocessor must check that the intermediate results or data have been received before using them. In this example an implicit integration method has been used for reasons of accuracy and stability. Implicit integration methods, however, are not well suited to parallel simulation, since they require careful consideration to ensure correct synchronization between microprocessors. In the execution of iteration k, the first microprocessor must compute the output of the servo block $y_{pk,k}$ and broadcast the result to the second microprocessor as soon as possible. In the corresponding iteration, the second microprocessor must wait for the values $y_{pk,k}$ from the first microprocessor before computing its state vector and output equation. These data dependencies between the servo block and the dynamic block determine the sequence of computation and require synchronization between microprocessor no. 1 and microprocessor no. 2. To ensure correct synchronization and acceptance of correct results, the source microprocessor also broadcasts a ready flag with its results. The destination microprocessor, i.e. the second microprocessor, waits for the ready flag before using the result broadcast by the first microprocessor. It then resets the ready flag after detecting that the flag is set. The procedures executed by the source and destination microprocessors in transferring and waiting for exchanged data and variables are standard in such cases.

Figure 8. Real time simulation of one spinning missile system
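The ready flag handshake described above can be imitated in Python with two threads standing in for the two microprocessors and a dictionary standing in for the dual port memory. This is only a sketch of the protocol, not of the assembly implementation. The second event `taken` is an assumption added so that the sketch cannot overwrite unread data; in the real system the real time clock provides that spacing. The transferred values are hypothetical.

```python
import threading

def run(n_frames):
    shared = {"y": None}            # stands in for the dual port memory
    ready = threading.Event()       # ready flag set by the source
    taken = threading.Event()       # assumption: acknowledgement from the destination
    taken.set()
    received = []

    def source():                   # plays the role of microprocessor no. 1
        for k in range(n_frames):
            taken.wait()            # previous result has been consumed
            taken.clear()
            shared["y"] = 2 * k     # hypothetical servo output for frame k
            ready.set()             # broadcast the ready flag with the result

    def destination():              # plays the role of microprocessor no. 2
        for _ in range(n_frames):
            ready.wait()            # wait for the ready flag before using the data
            received.append(shared["y"])
            ready.clear()           # reset the flag once it has been detected
            taken.set()

    threads = [threading.Thread(target=source), threading.Thread(target=destination)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

result = run(5)
```

The destination clears the ready flag before it acknowledges, so a new result can never be signalled while the old flag is still set.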
The program for parallel real time simulation on these two microprocessors is coded in assembly language according to the following relations.

    * microprocessor no. 1 - servo block
    * interrupt routine (real time clock - 8253 - 1 ms)
    input  u_pk,k                  (from a/d converter over AMS-bus)
    x_pk,k = x^p_pk,k + B*_pk u_pk,k
    y_pk,k = C_pk x_pk,k
    output y_pk,k                  (to microprocessor no. 2 over dual port memory)
    x^p_pk,k = A*_pk x_pk,k + B*_pk u_pk,k
    idle                           (wait for next real time clock)

    * microprocessor no. 2 - dynamic of pitch and yaw channel
    * interrupt routine (real time clock - 8253 - 5 ms)
    input  y_pk,k                  (from microprocessor no. 1 over dual port memory,
                                    wait for semaphore synchronization)
    x_dz,k = x^p_dz,k + B*_dz y_pk,k
    x_dy,k = x^p_dy,k + B*_dy y_pk,k
    f_zk,k = C_dz x_dz,k + D_dz y_pk,k
    f_yk,k = C_dy x_dy,k + D_dy y_pk,k
    output f_zk,k, f_yk,k          (to d/a converters over AMS-bus)
    x^p_dz,k = A*_dz x_dz,k + B*_dz y_pk,k
    x^p_dy,k = A*_dy x_dy,k + B*_dy y_pk,k
    idle                           (wait for next real time clock)
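One reading of the two-step state update in the interrupt routines is that the part of the bilinear update depending only on past values is prepared after the output has been sent, so that only a short computation separates input from output. The Python sketch below checks, for a scalar stand-in model, that this split form reproduces the direct recursion with the sum of the current and previous inputs. The coefficient values and the input sequence are hypothetical, and the interpretation itself is an assumption of this sketch.

```python
def direct(a_star, b_star, inputs):
    """Direct form: x[k] = a* x[k-1] + b* (u[k] + u[k-1])."""
    x, u_prev, out = 0.0, 0.0, []
    for u in inputs:
        x = a_star * x + b_star * (u + u_prev)
        u_prev = u
        out.append(x)
    return out

def split(a_star, b_star, inputs):
    """Split form: finish the state from a prepared partial value, then prepare the next one."""
    x_p, out = 0.0, []
    for u in inputs:
        x = x_p + b_star * u                # short step between input and output
        out.append(x)                       # the output would be sent at this point
        x_p = a_star * x + b_star * u       # prepared for the next clock tick
    return out

inputs = [1.0, 0.5, -0.25, 2.0, 0.0]        # hypothetical input samples
d = direct(0.9, 0.05, inputs)
s = split(0.9, 0.05, inputs)
max_diff = max(abs(p - q) for p, q in zip(d, s))
```

Both forms start from zero initial conditions and agree to rounding error, so the split changes only when the work is done within the frame, not the result.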
The obtained real time solutions, which satisfy the accepted objective function, are shown in Figure 8 and can be considered near optimal.
5. CONCLUSION
The presented methodology enables highly efficient real time integration of complex dynamic systems described by ordinary differential equations on a multiprocessor system. In simulator design it is therefore most important to find the optimal combination of decomposition techniques, discretization algorithms and task allocation strategy, leading to the minimal number of microprocessors necessary for simulator realization in agreement with the desired accuracy specifications. Through the peripherals attached to the homogenous single bus tightly coupled multiprocessor system, the realized digital simulator provides the user with various experimental capabilities in control system design, testing and modification, and in operator training.
6. REFERENCES
[1]	V. H. Christal, D. C. Mackey: Guidance and Control Simulations for Laser Guided Weapons, Proceedings of the Conference on Aerospace Simulation, SCS, February 1984.
[2]	K. Ćosić, S. Deskovski: Digitalna simulacija prostornog kretanja rotirajuće rakete u realnom vremenu, XXVIII konferencija ETAN-a, Split, 1984.
[3]	K. L. Hall: A simulation system architecture for real time applications, The Proceedings of the Summer Computer Simulation Conference, 1986.
[4]	K. Ćosić, S. Deskovski, I. Miler, M. Slamić: Digitalni simulator za razvoj i testiranje PO sistema vođenja, XXXI konferencija ETAN-a, Bled, 1987.
[5]	J. W. Karplus: Computer hardware in the simulation, Lecture notes, ETH Zürich, 1986.
[6]	R. J. Hickman: Hardware-in-the-loop simulation of dynamics system with high performance multi-purpose digital computers, The Proceedings of the Summer Computer Simulation Conference, 1987.
[7]	A. C. Watts: Aerospace system simulation at Sandia National Laboratories, The Proceedings of the Summer Computer Simulation Conference, 1987.
[8]	R. A. Bleck, D. J. Arpasi: Hardware for a real-time multiprocessor simulator, The Proceedings of the Summer Computer Simulation Conference, 1985.
[9]	R. Gluck: Hope for simulating flexible spacecraft, Aerospace America, November 1986.
[10]	J. O. Hamblen: Parallel continuous system simulation using the Transputer, Simulation, December 1987.
[11]	K. Owen: Rise of the supercomputer, Aerospace America, July 1988.
[12]	D. Ghosal, L. M. Patnaik: SHAMP: An Experimental Shared Memory Multiprocessor System for Performance Evaluation of Parallel Algorithms, Microprocessing and Microprogramming 19, 1987.
[13]	E. Pearse O'Grady, C. H. Wang: Multibus-based parallel processor for simulation, Proceedings of the Conference on Aerospace Simulation, SCS, February 1983.
[14]	K. Ćosić, I. Miler, D. Hadžiomerović: Koncept multiprocesorskog simulatora za testiranje sistema vođenja u realnom vremenu, Zbornik MIPRO 88, 1988.
[15]	K. Ćosić, S. Deskovski, I. Miler, M. Slamić, I. Kopriva: Razvoj multiprocesorskog ekspertnog sistema za projektiranje sistema vođenja i upravljanja, XXXII konferencija ETAN-a, 1988.
[16]	M. C. Gilliland, B. J. Smith, W. Calvert: HEP: A semaphore-synchronized multiprocessor with central control, Proceedings of the Conference on Aerospace Simulation, SCS, February 1976.
[17]	H. Kirrmann: Events and Interrupts in Tightly Coupled Multiprocessors, IEEE Micro, February 1985.
[18]	K. Ćosić, S. Deskovski: PC-based development system for simulator design, 3rd European Simulation Congress, Edinburgh 1989 (to be published).
[19]	J. R. Pimentel: Real time simulation using multiple microcomputers, Simulation, March 1983.
[20]	K. Ćosić, I. Kopriva: Sinteza diskretnih modela za digitalne simulacije u realnom vremenu, Automatika, br. 5-6, 1987.
[21]	K. Ćosić, I. Kopriva: Design of the optimal discrete time model for digital real time simulation, European Simulation Multiconference, Roma 1989.
[22]	K. Ćosić: Design and implementation of discrete time model for real time digital simulation, All About Simulators, Simulators series, vol. 14, No. 1, 1984.
[23]	R. M. Howe: Special considerations in real time digital simulation, Summer Computer Simulation Conference, 1983.
[24]	M. Franklin: Parallel solution of ordinary differential equations, IEEE Transactions on Computers, vol. C-27, No. 5, May 1978.
[25]	O. A. Palusinski: Simulation of dynamic systems using partitioning and multirate integration techniques, Summer Computer Simulation Conference, 1983.
[26]	O. A. Palusinski: Simulation methods for combined linear and nonlinear systems, Simulation, March 1978.
[27]	E. Pearse O'Grady, Chang-Hsein Wang: Parallel processor performance in a jet engine simulation, Summer Computer Simulation Conference, 1984.
[28]	A. Makol, W. J. Karplus: ALI: A CSSL/multiprocessor software interface, Simulation, August 1987.
[29]	S. Deskovski, M. Slamić: Matematički modeli i simulacije sistema upravljanja rotirajućih raketa, Naučno-tehnički pregled, vol. XXXIV, br. 1, 1984.