Elektrotehniški vestnik 77(1): 25-30, 2010 Electrotechnical Review, Ljubljana, Slovenija An Iterative Logarithmic Multiplier Zdenka Babic*, Aleksej Avramovic*, Patricio Bulic t * Faculty of Electrical Engineering, University of Banja Luka, Patre 5, Banja Luka, BiH t Faculty of Computer and Information Science, University of Ljubljana, Tržaška c. 25, Ljubljana, Slovenia Email: patricio.bulic@fri.uni-lj.si Abstract. The paper presents a new multiplier enabling achievement of an arbitrary accuracy. It follows the same idea of number representation as the Mitchell's algorithm, but does not use logarithm approximation. The proposed iterative algorithm is simple and efficient and its error percentage is as small as required. As its hardware solution involves adders and shifters, it is not gate and power consuming. Parallel circuits are used for error correction. The error summary for operands ranging from 8-bit to 16-bit operands indicates a very low error percentage with only two parallel correction circuits. Key words: Computer arithmetic, digital signal processing, multiplier, logarithmic number system Iterativni logaritemski množilnik Povzetek. V članku predstavimo izvedbo logaritemskega množilnika, ki nam omogoča nastavljivo natančnost. MnoZilnik je zasnovan na Mitchellovem postopku množženja in uporabi logaritemske predstavitve podatkov [1]. Množenje je časovno zahtevna operacija, poleg tega pa množilniki v primerjavi z drugimi aritmetičnimi vezji porabijo veliko moči. V kar nekaj aplikacijah pri digitalni obdelavi signalov, kjer je velika prisotnost šuma, pa ne potrebujemo natančnega rezultata mnozenja. Zato je v teh primerih primernejše logaritemsko mnozenje [1,3]. Mitčhell je v delu [1] predstavil osnovno idejo in izvedbo logaritemskega mnozilnika. Ideja je v tem, da se binarna števila predstavijo v t. i. logaritemskem zapisu (enačba 1) ter se produkt aproksimira po enačšbi 3. Mitčhellov postopek mnozenja je zapisan v Algoritmu 1. Mitčhell je v svojem delu pokazal, daje največja relativna napaka okrog 11%, relativna napaka pa 3,8%. V literaturi zasledimo veliko poskusov izboljšanja natančnosti takega mnozilnika, največkrat s korek-čijskimi tabelami, ki zahtevajo prečej pomnilnika [2, 3, 4]. V tem delu predlagamo modifičiran postopek mnozenja, s katerim izboljšamo natančnost, ob tem pa ne potrebujemo dodatnega pomnilnika za hranjenje korekčijskih termov. Osnovna ideja je, da ostanke, ki jih dobimo pri logaritemskem zapisu števil, še enkrat zmnozimo na enak način ter pristejemo h končnemu produktu kot v enačbi 5. Tako sproti tvorimo korekčijske terme. To lahko počšnemo iterativno ali vzporedno, dokler ne dosezšemo zelene največje relativne napake (enačbi 14 in 15). Predlagani postopek mnozenja je prikazan v algoritmu 2. Predlagani postopek mnozenja smo implementirali v FPGA čipu. Osnovni blok mnozšilnika je prikazan na sliki 1. Posamezne korekčijske terme nato izračšunavamo z dodajanjem osnovnih blokov, kot je to prikazano na sliki 2. Poraba sredstev in močši v FPGA čšipu brez, oz. z dodanimi korekčijskimi termi, je prikazana v tabelah 1 in 2. Rezultati analize relativne napake (tabeli 3 in 4) pokazejo, da je ze s samo tremi korekčijskimi termi, največja relativna napaka pod 0,5%. Ključne besede: Računalniška aritmetika, digitalna obdelava signalov, mnozilniki, logaritemski sistem Received 16. september, 2009 Accepted 13. januar, 2010 1 Introduction Multiplication has always been hardware, time and power consuming arithmetic operation, especially for large value operands. This bottleneck is even more emphasized in digital signal processing applications that involve a huge number of multiplications. In many signal processing applications the results of arithmetic operations do not have to be exactly accurate. For example, in signal compression techniques, quantization is usually performed after cosine or some other transform. Therefore, calculation of true transform coefficients values is not necessary and rounded products after multiplication by signal transformation are acceptable. Also, many digital signal processing systems can deal with an extra noise introduced. In these signal processing applications, where speed of calculation is more important than accuracy, logarithm number system (LNS) for multiplication seems to be a suitable method. The main advantage of this method is substitution of multiplication with addition, after conversion operands into logarithms. But this simple idea has a significant weakness: a necessity for approximation of logarithm and antilogarithm. Therefore, logarithmic-based solutions are trade-off between time consumption and accuracy. In the well known Mitchell's algorithm (MA) [1] for multiplication in LNS, a higher error is caused due to the piecewise straight line approximation of the logarithm and antilogarithm curve. Later, many methods for MA error correction are introduced with more or less success [2], [3], [4], [5], [6]. LNS multipliers can be divided into two categories, one based on methods that use lookup tables and interpolations and the other based on Mitchell's algorithm, even there is a lookup-table approach in some of the MA-based methods. MA-based solutions suppressed lookup tables due to hardware area savings. This paper presents a new iterative solution for multiplication, where error correction is realized with parallel correction circuits. It is is organized as follows: Section 2 presents the basic Mitchell's algorithm and its modifications with their benefits and weaknesses, Section 3 describes the proposed iterative solution, in Section 4 hardware implementations of the proposed algorithm are discussed, Section 5 gives an experimental results overview, and Section 6 is a conclusion. 2 MA-Based Multipliers A logarithmic number system is introduced to simplify multiplication, especially in cases when accuracy requirements are not rigorous. One of the most significant multiplication methods in LNS is the Mitchell's algorithm, which approximates the logarithm with a piecewise straight line function. MA multiplies two operands by finding their logarithms, adding them and looking for the antilogarithm of the sum. Approximation of the logarithm and antilogarithm is essential. It is derived from binary representation of numbers: k-i N = 2k (1 + YJ2i-k zi )=2k (1 + x) (1) i=j where k is the characteristic number or place of the most significant bit with the value of '1', Zi is the bit value at i-th position, x is a fraction or mantissa and j depends on the number precision. By logarithm of the product computation, log2(Ni-N2) = ki+k2 +log2(1+xi)+log2(1+x2) (2) log2(1 + x1) is approximated with x1 and the logarithm of the two number product is expressed as a sum of their characteristic numbers and mantissas: log2(Ni • N2) « ki + k2 + xi + X2 (3) The antilogarithm uses the similar approximation. The final MA approximation for multiplication, depending on the carry bit from sum of mantissas is given by: ( 2kl+k2(1 + xi + x2), xi + x2 < 1 (Ni • n2) ma = < [ 2kl+k2+i(xi + x2), xi + x2 > 1 (4) The sum of the characteristic numbers determines MSB of the product. The sum of mantissas is added to complete the final result. The proposed MA-based multiplication for the case xi + x2 < 1 is given in Algorithm 1. MA produces a significant error percentage. The relative error increases with the number of bits with the value of '1' in the mantissas. The maximum possible relative error for MA multiplication is some 11% and the average error is some 3.8%. Mitchell analyzed this error and proposed analytical expressions for error correction. To sum up, the most significant advantage of MA is simplicity, efficiency, i.e. non power-consuming. The most significant disadvantage is a high error percentage. Algorithm 1 Mitchell's Algorithm for the case xi + x2 < 1_ 1. Ni, N2: n-bits binary multiplicands, Pma = 0 : 2n-bits approximate product 2. Calculate ki: leading one position of Ni 3. Calculate k2: leading one position of N2 4. Calculate xi: shift Ni to the left by n — ki bits 5. Calculate x2: shift N2 to the left by n — k2 bits 6. Calculate ki2 = ki + k2 7. Calculate xi2 = xi + X2 8. Decode ki2 and insert '1' in that position of Papprox 9. Append xi2 immediately after this one in Papprox 10. Ni • N2 = Pma Numerous attempts have been made to improve the MA accuracy. Hall [4] derived different equations for error correction in the logarithm and antilogarithm approximation in four separate regions, depending on the mantissa value, reducing the average error to 2%, but increasing complexity of realization. Abed and Siferd [5], [6] derived correction equations with coefficients that are power of two, reducing the error and keeping the simplicity of solution. Among many methods that use look-up tables for error correction in the MA algorithm, McLaren's method [2], which uses a lookup table with 64 correction coefficients calculated in dependence on the mantissas values, can be selected as one with a satisfactory accuracy and complexity. A recent approach to MA error correction reducing the number of bits with the value of '1' in mantissas by operand decomposition was presented by Mahalingam and Rangantathan [3]. The proposed method decreases the error percentage of the MA by 44.7% on the average. 3 Proposed Solution As already mentioned above, the basic disadvantage in the Mitchell's algorithm and similar solutions comes from logarithm approximation. Therefore, the proposed solution avoids logarithm approximation and introduces an iterative algorithm with various possibilities for maximally minimizing the errorand getting an exact result. As the proposed solution is an iterative but not a recursive algorithm, it can be realized with parallel circuits for error reduction. From the binary representation of numbers in (1), we can derive a correct expression for multiplication: Ptme = Ni • N = 2kl (1+ X1 ) • 2k2 (1+ x2) = 2kl+k2 (1 + xi + X2) + 2kl+k2 (X1X2) (5) The similarity with MA is evident. The error in MA is caused by neglecting the second term in (5). To avoid the approximation error, we have to take into account the bellow relation: x • 2k N - 2 k (6) The combination of (5) and (6) gives: Ptme = (Ni • N2)=2(kl + k2) + +(Ni - 2kl )2k2 + (N2 - 2k2 )2kl + + (Ni - 2kl ) • (N2 - 2k2 ) (7) Let po = 2(fci+k2) + (Ni - 2kl)2k2 + (N2 - 2k2)2kl (8) be the first approximation of the product. It is evident that Ptr = Po + (Ni - 2kl ) • (N2 - 2k2 ) If we approximate the product with P(o) Po (9) (10) E(o) = P - P(o) approx p - Po = (N1 - 2kl ) • (N2 - 2k2 ) (11) Note that E(0) > 0. The two multiplicands in (11) are binary numbers that can be obtained simply by removing the leading '1' in numbers Ni and N2 so we can repeat the proposed multiplication procedure with these new multiplicands where C(1) is the approximate value of E(0) and E(1) is an absolute error when approximating E(0). The combination of (9) and (12) gives Ptrue = Po + C(1) + E(1) (13) We can now add the approximate value of E(0) to approximate product Papprox as an correction term with which we decrease the error of approximation Pa(pProx = P0 + C(1) (14) If we repeat this multiplication procedure with i correction terms, we can approximate the product as P (i) approx Po + C(1) + C(2) + ... + C(i) Po + ¿ cj) j=1 (15) then the proposed method is very similar to MA. Actually, Po is equal to the first case in the MA approximation (4). Mitchell proposed an exact correction term as in (9), but the logarithm approximation-based multiplying of the given residues was not sufficient to achieve significant error decreasing. Avoiding the logarithm approximation enables parallel error corrections and more accurate product. For this reason, an iterative calculation of correction terms is proposed as follows. An absolute error after the first approximation is E(o) = C(1) + E(1) (12) The procedure can be repeated in order to achieve the smallest possible, or until at least one of the mantissas becomes a zero. Then the final result is exact: Papprox = Ptrue. The number of iterations required for exact result is equal to the number of bits with the value of '1' in the operand with a smaller number of bits with the value of '1'. The proposed iterative MA-based multiplication is given in Algorithm 2. One of the advantages of the proposed solution is the possibility to achieve an arbitrary accuracy by selecting a number of iterations, i.e. a number of parallel correction circuits. 4 Hardware Implementation In order to evaluate the device utilization and performance of the proposed multiplier, we implemented different multipliers on Xilinx xc3s500e-4fg320 FPGA [7]. We implemented four 16-bits multipliers: a multiplier with no correction terms, and three multipliers with two, three and four correction terms, respectively. 4.1 Basic Block The basic block is the proposed multiplier with no correction terms. The task of the basic block is to calculate one approximate product according to Equation 8. The 16bit basic block is presented in Figure 1. The 16-bit basic block consists of two leading-one detector units (LOD), two encoders, two 32-bit barrel shifters, a decoder unit and two 32-bit adders. Two input operands are given to LODs and the encoders. The LOD units are used to remove the leading one from the operands, which are then passed to the barrel shifters. The LOD units include zero detectors, too. They are used to detect zero operands. The LOD units and zero detectors are implemented as in [5]. Algorithm 2 Iterative MA Based Algorithm with i correction terms_ 1. Ni, N2: n-bits binary multiplicands, P0 = 0 : 2n-bits first approximation, C(i) = 0 : 2n-bits i correction terms, PapproX = 0 : 2n-bits product 2. Calculate k1: leading one position of N1 3. Calculate k2: leading one position of N2 4. Calculate (N1 - 2kl )2k2: shift (N1 - 2kl) to the left by k2 bits 5. Calculate (N2 - 2k2 )2kl: shift (N2 - 2k2) to the left by k1 bits 6. Calculate k12 = k1 + k2 7. Calculate 2(fcl+fc2) : decode k12 8. Calculate P0 : add 2(fcl+fc2), (N1 - 2kl)2k2 and (N2 - 2k2 )2kl 9. Repeat i-times: (a) Set: N1 = N1 - 2fci, N1 = N2 - 2k2 (b) Calculate k1: leading one position of N1 (c) Calculate k2: leading one position of N2 (d) Calculate (N1 - 2kl )2k2: shift (N1 - 2kl) to the left by k2 bits (e) Calculate (N2 - 2k2 )2kl: shift (N2 - 2k2) to the left by k1 bits (f) Calculate k12 = k1 + k2 (g) Calculate 2(kl+k2) : decode k12 (h) Calculate C(i) : add 2(fcl+fc2), (N1 - 2kl )2k2 and (N2 - 2k2 )2kl 10. P (i) approx Po + Ei C(i) Figure 1. Block diagram of a basic block of the proposed iterative multiplier. The barrel shifters are used to shift residues according to Equation 8. The decode unit decodes k1 + k2, i.e. puts the leading one in the product. The leading one and two shifted residues are then added to form the approximate product. The basic block is used in further implemena-tions to calculate P0 and C(i). 4.2 Parallel Implementation We implemented multipliers with parallel correction circuits. For this purpose, we used the cascade of the basic blocks. The block diagram of the proposed logarithmic multiplier with a parallel error-correction circuit is shown in Figure 2. The multiplier is composed of two basic blocks of which the first one calculates the first approximation of the product (P0) while the second one calculates the error correction term C(1). 4.3 Device Utilization For design entry we used Xilinx ISE 10.1.02 - WebPACK and design with VHDL. The design was synthetised with Xilinx Xst Release 10.1.02 for Linux. Device utilization (the number of slices, number of 4-input LUTs and number of input-output blocks) for all the four implemented multipliers is given in Table 1. Table 1. Device utilization. Multiplier 4-LUTs Slices IO Bs Basic Block 353 182 96 Basic Block + 1CT 736 381 96 Basic Block + 2 CT 1088 577 96 Basic Block + 3CT 1438 751 96 Figure 2. Block diagram of the proposed multiplier with one parallel error-correction circuit. The multiplier is composed of two basic blocks of which the first one calculates the approximate product while the second one calculates the error-correction term. The maximum combinational path delays along with the total power consumptions for the basic block and the three parallel implementations are given in Table 2. 5 Error Analysis In order to evaluate the proposed algorithm, it is is applied to all combinations of the n-bit non-negative numbers. The error percentage is calculated from the well-known equation: Ei = Ptrue P( Ptr appro* _ 100% (16) For evaluation, the average error percentage value is used: AE 1 N N Ei (17) i=i where N is the number of multiplications performed. For example, for the 12-bits numbers, all combinations of the numbers from 1 to 4095 are multiplied and the average error percentage is calculated. Error calculation is made in four cases: without error-correction parallel circuit, with one parallel correction, with two parallel corrections and with three parallel corrections. Results from Table 3 present the error percentage rate for various cases. Results from Table 4 give a more precise view on how the algorithm can be modified to wanted error percentage. The obtained results are compared with results from [3], for neing the latest paper providing a complete overview of various solutions. Comparing these solutions with other solutions, a similar error table is obtained. Comparing 8-bit and 16-bit average error percentages, we can notice that our solution with three iterations outperforms others logarithm based-multipliers. 6 Conclusions In this paper, we investgate and propose a new approach to improve the accuracy of the Mitchell algorithm-based multiplication. The approach is based on the iterative calculating of the correction terms. We show that the calculation of correction terms can be performed parallelly in hardware. The iterative approach improves the average error percentage and the error rate compared to the basic MA multiplication. By using only three correction terms, the error of any multiplication result is less than a 0.5%. The parallel implementation of the iterative MA multiplier with only one correction circuit almost doubles the area required compared to the original MA multiplier, but the power consumption increases only from 2% (one correction term) to 16% (three correction terms). This is still significantly less than the area and power required for a standard multiplier. The maximum combinational delay increases by 3045% with each added correction circuit. This can be further significantly improved by pipelining the three main stages in the basic block and the correction circuits. Acknowledgments This work was funded by Slovenian Research Agency (ARRS) through grants P2-0359 and BA/10-11-026. 7 References [1] J.N. Mitchell, Computer multiplication and division using binary logarithms, IRE Transactions on Electronic Computers, vol. EC-11, pp. 512-517, August 1962. [2] D.J. Mclaren, Improved Mitchell-based logarithmic multiplier for low-power DSP applications, Proceedings of IEEE International SOC Conference 2003 pp. 53-56, 1720 September 2003. [3] V. Mahalingam, N. Rangantathan, Improving Accuracy in Mitchells Logarithmic Multiplication Using Operand Decomposition, IEEE Transactions on Computers, Vol. 55, No. 2, pp. 1523-1535, December 2006. [4] E.L. Hall, D.D. Lynch, S. J. Dwyer III, Generation of Products and Quotients Using Approximate Binary Logarithms for Digital Filtering Applications, IEEE Transactions on Computers, Vol. C-19, No. 2, pp. 97-105. February 1970. [5] K.H. Abed, R.E. Sifred, CMOS VLSI Implementation of a Low-Power Logarithmic Converter, IEEE Transactions on Computers, Vol. 52, No. 11, pp. 1421-1433, November 2003. [6] K.H. Abed, R.E. Sifred, VLSI Implementation of a Low-Power Antilogarithmic Converter, IEEE Transactions on Computers, Vol. 52, No. 9, pp. 1221-1228, September 2003. Multiplier Max. combinational delay (ns) Total power (W) Basic Block 24.818 0.13031 Basic Block + 1CT 32.214 0.13855 Basic Block + 2 CT 37.261 0.14743 Basic Block + 3CT 41.555 0.15687 Table 3. Error percentage rate [%]. Error 1CT 8 bits 2CT 3CT 1CT 12 bits 2CT 3CT 1CT 16 bits 2CT 3CT < 0,1 % 32.9 79.9 99.0 20.6 71.6 98,2 19.3 70.6 98.0 < 0,5 % 54.8 96.9 100 48.1 95.7 100 47.4 95.5 100 < 1 % 69.9 99.6 100 65.6 99.4 100 65.2 99.4 100 Table 4. Average relative errors [%] for 0,1,2 and 3 correction terms. No. bits Basic MA 1 C. term 2 C. terms 3 C. terms 8 8.9131 0.8337 0.0708 0.0048 9 9.1336 0.8980 0.0845 0.0069 10 9.2595 0.9369 0.0936 0.0086 11 9.3301 0.9597 0.0994 0.0098 12 9.3692 0.9726 0.1029 0.0106 13 9.3906 0.9799 0.1049 0.0111 14 9.4023 0.9840 0.1060 0.0114 15 9.4086 0.9862 0.1067 0.0116 16 9.4124 0.9874 0.1070 0.0117 [7] Xilinx Inc. Spartan-3E FPGA Family: Complete Data Sheet, DS312. http://www.xilinx.com, April 18, 2008. Zdenka Babic received her B.Sc., M.Sc. and Ph.D. degrees in electrical engineering from the Faculty of Electrical Engineering, University of Banja Luka, Bosnia and Herzegovina, in 1983, 1990 and 1999 respectively. She is an Associate Professor at the same faculty. Her main research interests are digital signal processing, image processing, circuits and systems. She is a member of the IEEE Signal Processing Society and IEEE Circuits and Systems Society. Aleksej Avramovic received his B.Sc. degree in electrical engineering from the Faculty of Electrical Engineering, University of Banja Luka, Bosnia and Herzegovina, in 2007. He is a Teaching Assistant at same faculty. His research interests include digital signal processing and digital image processing. He is a student member of IEEE. respectively. He is an Assistant Professor at the Faculty of Computer and Information Science, University of Ljubljana. His main research interests include computer architecture, digital design, parallel processing and vectorization techniques. He is a member of the IEEE Computer Society and ACM. Patricio Bulic received his B.Sc. degree in electrical engineering, and M.Sc. and Ph.D. degrees in computer science from the University of Ljubljana, Slovenia, in 1998, 2001 and 2004,