Linear Algebra in One-Dimensional Systolic Arrays

Gregor Papa and Jurij Šilc
Computer Systems Department, "Jožef Stefan" Institute, Jamova 39, 1001 Ljubljana, Slovenia
email: gregor.papa@ijs.si, jurij.silc@ijs.si

Keywords: systolic array, QR, LU, decomposition, Gauss elimination, matrix multiplication

Edited by: Rudi Murn
Received: June 1, 1999    Revised: July 20, 1999    Accepted: April 5, 2000

Frequently used problems of linear algebra, such as the solution of linear systems, triangular decomposition and matrix multiplication, are computationally extensive. To increase the speed, these problems can be solved with systolic structures, where many processors compute the result concurrently. Since a two-dimensional array of processors consumes a lot of space and resources, it is often better to use a one-dimensional array of processors. This requires reallocating the operations and causes unequal utilization of the processors, but it is much easier to implement, since there is only a single linear array of processors.

1 Introduction

Many scientific problems can be solved by linear algebraic computations, but even some basic operations are computationally extensive. Computation time can be shortened by synchronous data processing, which is enabled by a systolic structure. A systolic solution is represented by a processor structure in which data flows through a net of specialized processors that are locally connected and work synchronously. This approach has some disadvantages, since there are many connections: it is difficult to monitor all processors and to read data from them. Besides, the processors are poorly utilized, since they mostly wait for their data. It is possible to compose a structure with higher utilization, suitable timing and lower complexity [3], which removes the mentioned disadvantages. To achieve that, we can merge some processors, i.e., one processor performs the tasks of several processors, and we can put them into one straight array, to reduce the number of connections and to ease access to the processors. This work presents the linearization of different matrix transformation algorithms, such as elimination, decompositions and multiplication; some comparisons of two-dimensional and linear arrays are also given.

2 Linear system of equations

Systolic arrays can be used to solve a system of linear equations [2] of the form

    A · x = b.

A suitable triangular systolic array for the realization of Gauss elimination and of various decompositions (QR and LU) [4, 9] is presented in Fig. 1. The two cell shapes represent the two types of processors (diagonal and inner), each performing its own instructions; diagonal operations are executed in the diagonal processors and inner operations are executed inside the structure. The inputs of the structure are the matrix coefficients a_ij; at the end, the coefficients of the upper-triangular matrix are inside the structure and the coefficients of the lower-triangular matrix are on the outputs. A dotted square represents a delay τ. For a matrix of size n × n, the number of required processors n* is

    n* = n(n + 1) / 2,

where n diagonal processors and (n* − n) inner processors are required.

Figure 1: Triangular systolic array (n = 4)
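As a quick check of the formula above, the following short Python sketch (our own illustration, not part of the paper) enumerates the cells of the triangular array in Fig. 1 and verifies the processor counts for n = 4.

```python
# A minimal sketch: enumerate the cells of the triangular array in Fig. 1
# and check that n* = n(n+1)/2, with n diagonal and n* - n inner processors.

def triangular_array(n):
    """Return (row, col, kind) for every cell of the triangular array.

    The cell in position (i, i) is a diagonal processor; the cells to its
    right in the same row are inner processors.
    """
    cells = []
    for i in range(1, n + 1):          # row index
        for j in range(i, n + 1):      # columns i..n of row i
            kind = "diagonal" if i == j else "inner"
            cells.append((i, j, kind))
    return cells

if __name__ == "__main__":
    n = 4
    cells = triangular_array(n)
    n_star = len(cells)
    n_diag = sum(1 for _, _, k in cells if k == "diagonal")
    assert n_star == n * (n + 1) // 2   # n* = 10 for n = 4
    assert n_diag == n                  # n diagonal processors
    print(n_star, n_diag, n_star - n_diag)   # 10 4 6
```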
The two-dimensional array in Fig. 1 can be transformed into a one-dimensional array [11, 6] in several directions: a horizontal linear array (Fig. 2), a vertical linear array (Fig. 4), a diagonal linear array (Fig. 6) and an interweaved linear array (Fig. 8). In the linearized arrays, a merged processor symbol represents a processor that performs the tasks of both the diagonal and the inner processors.

Next, the operations of the diagonal and inner processors are presented. All mentioned operations [1] are executed in one systolic cycle (step), but, of course, more cycles are needed to finish a transformation, i.e., those operations are repeated (the operations present only the set of the processor's instructions).

Gauss elimination [5] and LU decomposition [7]:

    o_d:  1. y = x_i / r
          2. r = x_i

    o_i:  1. x_o = x_i - y · r
          2. r = x_i

In such a structure there is a similarity between Gauss elimination and LU decomposition (the results of the LU decomposition are just transformed Gauss coefficients) [7].

QR decomposition [5]:

    o_d:  if x_i = 0 then c = 1, s = 0
          else t = sqrt(r² + x_i²), c = r / t, s = x_i / t, r = t

    o_i:  x_o = c · x_i - s · r
          r  = c · r + s · x_i
          (the pair (c, s) is passed on unchanged)

The input or output pair (c, s) of the QR decomposition will be treated as y in the following sections.
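Read as pseudocode, the two cell types can be written down directly. The following Python sketch is our own illustration (class and method names are not from the paper) of the operations o_d and o_i given above, for Gauss elimination / LU decomposition and for QR decomposition; in the QR case the pair (c, s) plays the role of y.

```python
import math

# Each cell keeps one stored value r; x_i is the input, y the value travelling
# along the row, x_o the output passed on to the next cell.

class DiagonalCell:
    """o_d for Gauss elimination / LU decomposition."""
    def __init__(self):
        self.r = 0.0
    def store(self, x_i):          # operation 2: r = x_i (store the pivot)
        self.r = x_i
    def multiplier(self, x_i):     # operation 1: y = x_i / r
        return x_i / self.r

class InnerCell:
    """o_i for Gauss elimination / LU decomposition."""
    def __init__(self):
        self.r = 0.0
    def store(self, x_i):          # operation 2: r = x_i (store a row element)
        self.r = x_i
    def update(self, x_i, y):      # operation 1: x_o = x_i - y * r
        return x_i - y * self.r

class QRDiagonalCell:
    """o_d for QR decomposition (generation of a Givens rotation)."""
    def __init__(self):
        self.r = 0.0
    def rotate(self, x_i):
        if x_i == 0:
            return 1.0, 0.0                       # c = 1, s = 0
        t = math.sqrt(self.r ** 2 + x_i ** 2)
        c, s = self.r / t, x_i / t
        self.r = t
        return c, s

class QRInnerCell:
    """o_i for QR decomposition (application of a Givens rotation)."""
    def __init__(self):
        self.r = 0.0
    def rotate(self, x_i, c, s):
        x_o = c * x_i - s * self.r
        self.r = c * self.r + s * x_i
        return x_o                                # (c, s) travel on unchanged
```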
Because of the transformations, the instruction sets of the processors are changed, as described in the following sections.

2.1 Horizontal array

Figure 2: Transformation into horizontal array

As presented in Fig. 2, processor 1 is mapped into processor A; processors 2 and 5 into processor B; processors 3, 6 and 8 into processor C; processors 4, 7, 9 and 10 into processor D. So, processor A takes over the tasks of one processor and performs operation o_d, while processor D takes over the tasks of four processors and performs operations o_d and o_i. The processors work in different modes:

- mode 1: operation o_d with one input x_i,
- mode 2: operation o_i with two inputs (x_i, y),
- mode 3: operation o_i with one input y and one input taken from its own output (x_o fed back to x_i).

Each processor works in these modes:

- processor A always in mode 1,
- processor B in modes 2 and 1,
- processor C in modes 2, 3 and 1,
- processor D in modes 2, 3, 3 and 1,
- additional processors would work in modes 2, 3, ..., 3 and 1.

The occupation of the processors is presented in Table 1.

Table 1: Processor occupation in horizontal array

    step   modes executed by A-D
      1    1
      2    2
      3    1 2
      4    3 2
      5    1 1 3
      6    2 3
      7    1 2 1
      8    3 2
     ...
     17    1 3
     18    3
     19    1

Input values a_11, a_12, a_13 and a_14 are delayed for one τ each, and values a_21, a_22, a_23 and a_24 are delayed for (n - 1)τ with respect to values a_11, a_12, a_13 and a_14, where n is the number of processors, as presented in Fig. 3.

Figure 3: Data inputs in horizontal array

2.2 Vertical array

As can be seen in Fig. 4, processors 1, 2, 3 and 4 are mapped into processor A; processors 5, 6 and 7 into processor B; processors 8 and 9 into processor C; processor 10 into processor D. Processor A is the most loaded, while processor D takes over the tasks of only one processor.

Figure 4: Transformation into vertical array

Processors A, B and C perform operations o_d and o_i, while processor D performs only operations o_d. They work in different modes:

- mode 1: operation o_d with one input x_i,
- mode 2: operation o_i with one input x_i and one input taken from its own output (y fed back to y).

Each processor works in these modes:

- processor A in modes 1, 2, 2 and 2,
- processor B in modes 1, 2 and 2,
- processor C in modes 1 and 2,
- processor D always in mode 1,
- additional processors would work in modes 1, 2, ..., 2 and 2.

The occupation of the processors and their work modes are presented in Table 2.

Table 2: Processor occupation in vertical array

    step   modes executed by A-D
      1    1
      2    2
      3    2 1
      4    2 2
      5    1 2 1
      6    2 2
      7    2 1 1
      8    2 2
     ...
     17    2 1
     18    2
     19    1

Values a_11, a_12, a_13 and a_14 follow each other without delay, and values a_21, a_22, a_23 and a_24 are immediate successors of values a_11, a_12, a_13 and a_14, as presented in Fig. 5.

Figure 5: Data inputs in vertical array

When transformed into a horizontal or vertical array, the processors' occupation and their instruction sets are equal. The only difference can be noticed in the data inputs.

2.3 Diagonal array

Fig. 6 presents the diagonal contraction, where processors 1, 5, 8 and 10 are mapped into processor A; processors 2, 6 and 9 into processor B; processors 3 and 7 into processor C; processor 4 into processor D. Here, too, the most loaded is processor A and the least loaded is processor D, but all processors execute only one type of operation (processor A performs only diagonal operations, the others only inner operations). The processor occupation and their operations are presented in Table 3.

Figure 6: Transformation into diagonal array

Table 3: Processor occupation in diagonal array

    step    A     B     C     D
      1    o_d
      2    o_d   o_i
      3    o_d   o_i   o_i
      4    o_d   o_i   o_i   o_i
      5    o_d   o_i   o_i   o_i
      6    o_d   o_i   o_i
      7    o_d   o_i
      8    o_d
     ...
     14    o_d   o_i   o_i
     15    o_d   o_i
     16    o_d

Values a_11, a_12, a_13 and a_14 are delayed by one τ and are followed by values a_21, a_22, a_23 and a_24. Values a_31, a_32, a_33 and a_34 are delayed by 2(n - 1)τ, where n is the number of processors, as presented in Fig. 7.

Contraction of the array in the direction of the other diagonal is not reasonable, since there would be too many delays and inputs/outputs on each processor.

Figure 7: Data inputs in diagonal array

2.4 Processor mirroring

To decrease the number of processors and to enhance the performance of the transformations, mirroring can be used. A processor can be mirrored into another processor, so that its tasks are executed while the other processor would otherwise be idle. An example of processor mirroring in the horizontal linear array is presented in Table 4. Processor A is mapped into processor B, and the merged processor A+B executes the tasks of both processors. Other mirrorings can be used similarly.

Table 4: Processor mirroring: a) original array, b) array with mapped processor

    a) original array (A B C D)      b) mapped array (A+B C D)
    step   modes                     step   modes
      1    1                           1    1
      2    2                           2    2
      3    1 2                         3    1 2
      4    3 2                         4    3 2
      5    1 1 3                       5    1 1 3
      6    2 3                         6    2 3
      7    1 2 1                       7    1 2 1
      8    3 2                         8    3 2

2.5 Interweaved array

When there is an odd number of processors in the first line of the triangular array, the interweaved method can be used, as presented in Fig. 8 [11], where an isomorphic embedding of the graph is employed. Processors in Fig. 8a are mapped into the processor array in Fig. 8b: processors 1, 6, 10, 13 and 15 are mapped into processor A; processors 2, 5, 7, 11 and 14 into processor B; processors 3, 4, 8, 9 and 12 into processor C. All processors (A, B, C) are evenly loaded, since each of them takes over the tasks of five processors.

Figure 8: Transformation into interweaved array

The method is similar to processor mirroring, but it occupies the processors almost completely and evenly. Instead of n = 5 processors, only n* = (n + 1)/2 = 3 are needed, which are fully utilized. Processor A performs operations o_d, while B and C perform operations o_i. The processor occupation and their operations are presented in Table 5.

Table 5: Occupation of interweaved array

    step     A      B     C
    1-10    o_d
     ...
    27-29   o_d

Values a_11, a_12, a_13, a_14 and a_15 are delayed by one τ, and values a_21, a_22, a_23, a_24 and a_25 are delayed by (n* + 1)τ with respect to values a_11, a_12, a_13, a_14 and a_15, as presented in Fig. 9.

Figure 9: Data inputs in interweaved array
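The groupings behind Figs. 2, 4 and 6 follow a simple pattern: the horizontal array merges the columns, the vertical array the rows, and the diagonal array the diagonals of the triangular array. The following Python sketch (our own illustration; the function names are not from the paper) reproduces the groupings listed in Sections 2.1-2.3 for n = 4; the interweaved grouping of Section 2.5 does not follow such a one-line rule and is given explicitly in the text above.

```python
# Cells are numbered row by row as in Fig. 1: row 1 holds cells 1..n,
# row 2 the next n-1 cells, and so on.

def triangular_numbering(n):
    """Map (row, col) -> cell number for the triangular array."""
    num, cells = 0, {}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            num += 1
            cells[(i, j)] = num
    return cells

def horizontal_groups(n):
    cells = triangular_numbering(n)
    # column j of the triangular array becomes linear processor j
    return [[cells[(i, j)] for i in range(1, j + 1)] for j in range(1, n + 1)]

def vertical_groups(n):
    cells = triangular_numbering(n)
    # row i of the triangular array becomes linear processor i
    return [[cells[(i, j)] for j in range(i, n + 1)] for i in range(1, n + 1)]

def diagonal_groups(n):
    cells = triangular_numbering(n)
    # the k-th diagonal (j - i = k) becomes linear processor k + 1
    return [[cells[(i, i + k)] for i in range(1, n - k + 1)] for k in range(n)]

if __name__ == "__main__":
    print(horizontal_groups(4))  # [[1], [2, 5], [3, 6, 8], [4, 7, 9, 10]]
    print(vertical_groups(4))    # [[1, 2, 3, 4], [5, 6, 7], [8, 9], [10]]
    print(diagonal_groups(4))    # [[1, 5, 8, 10], [2, 6, 9], [3, 7], [4]]
```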
3 Matrix multiplication

Systolic arrays can also be used for matrix multiplication [2] of the form

    C = A · B + C0.

A square array of processors for the multiplication of two square matrices is presented in Fig. 10 [8]. The inputs of the structure are the coefficients a_ij and b_ij of the matrices, and at the end of the process the coefficients c_ij are inside the structure. For a matrix of size n × n, the number of required processors n* is

    n* = n².

Figure 10: Square systolic array (n = 4)

All processors in the square array in Fig. 10 perform the same operations [8]: each processor accumulates the product of its two inputs into its stored value and passes the inputs on to its neighbours:

    1. r = r + x_i · y_i
    2. x_o = x_i, y_o = y_i

3.1 Horizontal array

The horizontal array is obtained when all processors of the first column are merged into processor A, the processors of the second column into processor B, etc., as presented in Fig. 11. The processors perform the same operations as before the transformation, except that there is an additional input taken from one of their outputs.

Figure 11: Transformation into horizontal array

The occupation of the processors is presented in Table 6, where the numbers represent the processor of the corresponding (square) array that would be used at that moment.

Table 6: Processor occupation in horizontal array

    step    A     B     C     D
      1     1
      2     5     2
      3     9     6     3
      4    13    10     7     4
      5     1    14    11     8
      6     5     2    15    12
      7     9     6     3    16
     ...
     17          14    11     8
     18                15    12
     19                      16

Due to the processor merging, the data inputs are changed as presented in Fig. 12.

Figure 12: Data inputs in horizontal array

3.2 Vertical array

The vertical array is made when we merge the processors of the first row into processor A, the processors of the second row into processor B, etc., as presented in Fig. 13. The processors perform the same operations as when they are transformed into a horizontal array.

Figure 13: Transformation into vertical array

The occupation of the processors is presented in Table 7 and the changed data inputs are presented in Fig. 14.

Table 7: Processor occupation in vertical array

    step    A     B     C     D
      1     1
      2     2     5
      3     3     6     9
      4     4     7    10    13
      5     1     8    11    14
      6     2     5    12    15
      7     3     6     9    16
     ...
     17           8    11    14
     18                12    15
     19                      16

Figure 14: Data inputs in vertical array

Actually, there is no significant difference between the horizontal and the vertical transformation, since all processors in the two-dimensional array perform the same operations. Thus, it is insignificant in which direction we contract the array; however, we can choose which coefficients are delayed when entering the array.
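To see that the cell operation of Section 3 indeed computes C = A · B + C0 in 3n - 2 steps (the step count quoted in the conclusions), here is a small self-contained simulation of the untransformed square array. It is our own sketch, with the input skewing chosen in the spirit of Fig. 12, not code from the paper.

```python
def systolic_matmul(A, B, C0):
    """Simulate the n x n square systolic array computing C = A*B + C0.

    a-coefficients flow to the right along the rows, b-coefficients flow
    downwards along the columns, and every cell accumulates r = r + x_i * y_i.
    Row i (column j) of the input is skewed by i (by j) steps, so that a_ik
    meets b_kj exactly in cell (i, j).
    """
    n = len(A)
    C = [row[:] for row in C0]                 # r registers, preloaded with C0
    a_reg = [[0] * n for _ in range(n)]        # a-value currently held by each cell
    b_reg = [[0] * n for _ in range(n)]        # b-value currently held by each cell

    for t in range(3 * n - 2):                 # 3n - 2 steps for the untransformed array
        # shift: feed new (skewed) boundary values, move the old ones on
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                a_reg[i][j] = a_reg[i][j - 1] if j > 0 else (
                    A[i][t - i] if 0 <= t - i < n else 0)
                b_reg[i][j] = b_reg[i - 1][j] if i > 0 else (
                    B[t - j][j] if 0 <= t - j < n else 0)
        # every cell performs r = r + x_i * y_i
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C0 = [[1, 0], [0, 1]]
    print(systolic_matmul(A, B, C0))   # [[20, 22], [43, 51]] = A*B + C0
```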
3.3 Diagonal array

Figure 15: Diagonal transformation (n = 4, n* = 4)

Due to the square array structure, the diagonal transformation is a bit more complicated. Depending on the merging process, there can be different linear solutions, but only some typical ones are presented in this paper.

If there is an even number of processors (e.g., n = 4) in the two-dimensional array, we can choose between two possibilities. In the first one, as presented in Fig. 15, the processor array is transformed as follows: processors 9, 13 and 14 are merged into processor A; processors 1, 5, 10, 11 and 15 into processor B; processors 2, 6, 7, 12 and 16 into processor C; and processors 3, 4 and 8 into processor D. So there is an even number of processors (n* = 4) in the linear processor array. Table 8 represents the occupation of the processors, while the data inputs are set as presented in Fig. 16.

Table 8: Processor occupation (n = 4, n* = 4)

    step  1: 1
    step  2: 5 2
    step  3: 9 1 6 3
    step  4: 13 5 2 4
    step  5: 9 1 6 3
    step  6: 10 2
    step  7: 14 5 7
    step  8: 13 1 6
    step  9: 9 10 2 4
    step 10: 14 5 7 3
    step 11: 13 6 8
    step 12: 9 10 12 4
    step 13: 15 7 3
    step 14: 11 16
    step 15: 14 10 12 8
    step 16: 13 15 7 4
    step 17: 14 11 16 8
    step 18: 15 12
    step 19: 11 16

Figure 16: Data inputs (n = 4, n* = 4)

In the second case, there is an odd number of processors (e.g., n* = 5) in the linear array. According to Fig. 15, the processors are merged as follows: processors 9, 13 and 14 are merged into processor A; processors 5, 10 and 15 into processor B; processors 1, 6, 11 and 16 into processor C; processors 2, 7 and 12 into processor D; and processors 3, 4 and 8 into processor E. The processor occupation is shown in Table 9, while Fig. 17 presents the data inputs.

Table 9: Processor occupation (n = 4, n* = 5)

    step  1: 1
    step  2: 5 1 2
    step  3: 9 5 6 2 3
    step  4: 13 10 1 7 4
    step  5: 9 5 6 2 3
    step  6: 14 10 1 7 8
    step  7: 13 5 11 2 4
    step  8: 9 5 6 3
    step  9: 14 11 8
    step 10: 13 15 6 12 4
    step 11: 9 10 16 7 3
    step 12: 14 15 11 12 8
    step 13: 13 10 16 7 4
    step 14: 14 15 11 12 8
    step 15: 15 16 12
    step 16: 16
     ...
    step 20: 15 12
    step 21: 16

Figure 17: Data inputs (n = 4, n* = 5)

But when there is an odd number of processors (n = 5) in the two-dimensional array, the linear array consists of an odd number of processors (n* = 5). The situation is presented in Table 10. Here processors 11, 16, 21, 22 and 23 are merged into processor A; processors 6, 12, 17, 18 and 24 into processor B; processors 1, 7, 13, 19 and 25 into processor C; processors 2, 8, 9, 14 and 20 into processor D; and processors 3, 4, 5, 10 and 15 into processor E. The data inputs have to be set according to the new processor occupation, as presented in Fig. 18.

Figure 18: Data inputs (n = 5, n* = 5)

Table 10: Processor occupation (n = 5, n* = 5)

    step  1: 1
    step  2: 6 1 2
    step  3: 11 6 1 2 3
    step  4: 11 6 7 2 3
    step  5: 16 12 1 8 4
    step  6: 11 6 7 2 3
    step  7: 16 12 1 8 4
    step  8: 21 17 13 9 5
    step  9: 11 6 7 2 3
    step 10: 16 12 7 8 4
    step 11: 21 17 13 9 5
    step 12: 22 18 14 10
    step 13: 11 13 3
    step 14: 16 12 7 8 4
    step 15: 21 17 13 9 5
    step 16: 22 18 19 14 10
    step 17: 23 15
    step 18: 16 12 8 4
    step 19: 21 17 13 9 5
    step 20: 22 18 19 14 10
    step 21: 23 24 13 20 15
    step 22: 21 17 19 9 5
    step 23: 22 18 25 14 10
    step 24: 23 24 19 20 15
    step 25: 22 18 25 14 10
    step 26: 23 24 19 20 15
    step 27: 23 24 25 20 15
    step 28: 24 25 20
    step 29: 25

4 Conclusions

According to the results, there are important differences when transforming the original triangular array in different directions and with different mirrorings. The difference is in the execution time, the processor utilization and the complexity of the processors' operations. Table 11 represents the characteristics of the n = 4 and n = 5 arrays. Different transformations are considered (horizontal, vertical, diagonal, interweaved) and different mirrorings (processor A mirrored into processor B, processors A and B mirrored into processor C, ...). The number of steps is the number of systolic cycles needed to perform the algorithm. The number of processors is the number of needed processors, and utilization is their use according to the number of steps (min and max utilization represent the smallest and largest utilization of a single processor).
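As a small illustration of these metrics (our own sketch, not from the paper), utilization can be computed directly from an occupation table. For the horizontal linear array of Table 6, for example, each of the four processors is busy in 16 of the 19 steps, so the minimum and maximum utilization coincide there.

```python
# Utilization of a processor = number of steps in which it works / total steps.

def utilization(occupation, num_steps):
    """occupation maps processor name -> list of steps in which it is busy."""
    per_proc = {p: len(steps) / num_steps for p, steps in occupation.items()}
    return min(per_proc.values()), max(per_proc.values()), per_proc

# Occupation of the horizontal linear array for matrix multiplication (Table 6):
# A is busy in steps 1-16, B in 2-17, C in 3-18, D in 4-19, out of 19 steps.
table6 = {
    "A": list(range(1, 17)),
    "B": list(range(2, 18)),
    "C": list(range(3, 19)),
    "D": list(range(4, 20)),
}
lo, hi, per = utilization(table6, num_steps=19)
print(lo, hi)   # min = max = 16/19, i.e. about 0.84 for every processor
```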
As can be seen in Table 11 and Fig. 19, mirroring improves the differences between the smallest and the largest processor utilization in the array. The table and the figure show these conclusions:

- The number of steps needed to execute the algorithm increases with the transformation, but the number of processors decreases significantly, while their utilization is increased.

- When transforming triangular arrays with an even number of processors in the first row of the array, the best transformation is the diagonal one with mirroring. The diagonal transformation is the best even if there is no mirroring.

- When transforming triangular arrays with an odd number of processors in the first row of the array, the best transformation is the interweaved one, since it offers the largest utilization and needs only a few processors.

- When transforming square arrays, any transformation is better than the initial array. Since all processors perform the same operations, it is irrelevant in which direction we contract the array, but horizontal or vertical arrays are much simpler to implement than diagonal ones.

- Processor utilization can be even higher if consecutive multiplication computations are used one after another.

- In all untransformed arrays the number of steps is 3n - 2 and the number of processors is n(n + 1)/2 in triangular and n² in square arrays, where n × n is the size of the matrix.

- In transformed arrays the number of steps is n² + n - 1 and the number of processors is n.

In some common problems there are very big matrices, e.g., 250 × 250, which lead to a large number of required processors. Therefore, in those cases it is appropriate to also use some other techniques with an even smaller number of processors [5, 10].

References

[1] H. Barada, A. El-Amawy, Systolic architecture for matrix triangularisation with partial pivoting, IEE Proceedings, Vol. 135, Pt. E, No. 4, July 1988, pp. 208-213.

[2] P. Blaznik, J. Tasič, D. J. Evans, Parallel Solving the Updated Linear Systems of Equations, in F. Solina and B. Zajc (eds.), Proceedings of the Second Electrotechnical and Computer Science Conference ERK'93, Volume B, 1993, pp. 115-118.

[3] J. Kaniewski, O. Maslennikow, R. Wyrzykowski, VLSI implementation of linear algebraic operations based on the orthogonal Faddeev algorithm, Parallel Computing: State-of-the-Art and Perspectives, Elsevier, 1996, pp. 641-644.

[4] S. Y. Kung, VLSI Array Processors, Prentice Hall, Englewood Cliffs, New Jersey, 1988.

[5] J. G. Nash, S. Hansen, Modified Faddeeva Algorithm for Concurrent Execution of Linear Algebraic Operations, IEEE Transactions on Computers, Vol. 37, No. 2, February 1988, pp. 129-136.

[6] J. G. Nash, C. Petrozolin, VLSI Implementation of a Linear Systolic Array, Proc. 1985 Int. Conf. Acoust., Speech, Signal Processing, T