https://doi.org/10.31449/inf.v49i14.5742 Informatica 49 (2025) 111–118 111 
Improved Memory Efficient Computing Unit DWT Architecture For 
Satellite Images  
A. Azhagu Jaisudhan Pazhani, P. Gunasekaran and A. Rameshbabu  
Department of ECE, Ramco Institute of Technology, Rajapalayam, India 
E-mail: alagujaisudhan@gmail.com, mailtogunasekar@gmail.com, rameshbabu@ritrjpm.ac.in 
 
Keywords: VLSI architecture, look-up table, distributed arithmetic, DWT, 2D, ROM, memory 
 
Received: February 21, 2024 
The 2D Discrete Wavelet Transform is a signal transform that is frequently used in picture and video 
compression. It is a computationally costly signal transform. VLSI implementation of 2D DWT is 
susceptible to a set of restrictions such as area and power consumption due to its increasing use in high 
data rate communication and storage in portable and handheld devices. The Distributed Arithmetic 
architecture is one of several architectures for constraint-driven VLSI implementation of 2D DWT that 
have been developed in recent years. The Distributed Arithmetic architecture is used efficiently to execute 
inner product computations, eliminating the need for multiplication and increasing computation speed. 
Filtering is the most power-intensive process in DWT, and multipliers are more expensive, so in 
Distributed Arithmetic architecture, multipliers are substituted with shifts and ROM lookup tables. 
However, as the number of filter coefficients grows, the size of the ROM look-up table grows, which can 
be decreased using the lookup table compression technique. In this paper, an Improved Memory Efficient 
Distributed Arithmetic Architecture for DWT has been proposed. The look-up table is used to stock the 
inner product values and then compressed. The performance of the improved LUT compressed algorithm 
is superior than the existing technique. 
Povzetek: Predlagana je optimizirana pomnilniško učinkovita VLSI arhitektura za 2D DWT pri obdelavi 
satelitskih slik. Z uporabo porazdeljene aritmetike in stiskanja LUT zmanjša stroške računanja, izboljša 
hitrost in učinkovitost za aplikacije z visoko hitrostjo prenosa podatkov.
1   Introduction  
Wavelet-based approaches are used to tackle complicated 
problems in math and engineering, with current 
applications including data compression, signal 
processing, image processing, pattern recognition, 
computer graphics, aeroplane and submarine detection, 
and other medical imaging technologies. A wavelet is an 
orthogonal function that may be applied to a limited set of 
data in the sense of the Discrete Wavelet Transform 
(DWT). 
Mohanty B.K. Meher P.K. introduced a distributed 
arithmetic (DA) formulation for DWT computation 
utilising 9/7 filters in 2009, and transferred it to bit-parallel 
and bit-serial architectures for high-throughput and low-
hardware implementations, respectively. For low-
hardware solutions, the bit-serial structure processes the 
input vector's bit-slices in a serial fashion, whereas the bit-
parallel structure processes all the bit-slices in parallel for 
high-throughput computing. The hardware usage 
efficiency of the bit-parallel structure is 100 percent. The 
suggested DA DWT structure has a much greater 
throughput rate and requires less area-delay product than 
conventional multiplier-less arrangements. 
To process N-bit input operands, the fundamental serial 
architecture needs N clock cycles [3]. The primary 
disadvantage of the serial DA design is that it consumes  
 
more clock cycles and the filter's performance is slow. To 
expedite the procedure, it is preferable to apply the DA in 
parallel. The input data is separated into even and odd 
samples based on their location in the parallel 
implementation. Even samples convolve with even and 
odd filter coefficients, while odd samples convolve with 
the same set of coefficients at the same time [2]. The result 
is achieved concurrently for both even and odd input 
samples. The number of clock cycles is lowered, resulting 
in faster processing and less memory. 
Distributed arithmetic calculations are bit-serial 
in nature in their most evident and direct form, i.e., each 
bit of the input samples must be indexed before a new 
output sample becomes available. When the input samples 
are represented with B bits of accuracy, an inner-product 
computation takes B clock cycles to complete. By 
replicating the LUT and adder tree, a parallel realisation 
of distributed arithmetic allows multiple bits to be 
processed in one clock cycle. The odd bits are sent to one 
LUT and adder tree in a 2-bit parallel implementation, 
while the even bits are fed to an identical tree. To suitably 
weight the outcome, the bit partials are left shifted and 
added to the even partials before aggregating the 
aggregate. All input bits can be calculated in parallel and 
then concatenated in a shifting adder tree in the extreme 
scenario [4]. 
112 Informatica 49 (2025) 111–118                                                                                                                     A.A.J. Pazhani et al. 
An LUT, a cascade of shift registers, and a 
scaling accumulator make up the distributed arithmetic 
implementation of the Daubechies 8-tap wavelet FIR 
filter. All potential sums of the Daubechies 8-tap wavelet 
coefficients are stored in the LUT. The bit-wide output is 
delivered to the bit serial shift register cascade, one bit at 
a time, as the input sample is serialised. The input sample 
is stored in a bit-serial format in the cascade, which is then 
utilised to generate the requisite inner-product 
computation. The shift register cascade's bit outputs are 
utilised as address inputs to the LUT. The scaling 
accumulator adds together partial LUT results to generate 
a final result at the filter output port. 
The benefit of utilising DA for a wavelet with a 
greater number of coefficients, on the other hand, may be 
lost over time due to a huge rise in memory size. The 
needed number of table entries is 2n. As the number of 
filter coefficients 'n' rises, the size of the look-up database 
grows exponentially. 
A recent 2D DWT implementation on the NVidia 
GeForce GTX TITAN Black GPU was proposed in [7]. 
The authors of the paper [7] used a register-based 
technique to propose their DWT algorithm, which they 
claimed was four times quicker than existing GPU-based 
software implementations of DWT. 
Darji et al. [8] presented a lifting DWT-based 
multiplier-less 1D/2D DWT architecture. They employed 
an innovative z-scanning method to reduce the transposing 
buffer size to 0 by using an innovative z-scanning method. 
Their temporal buffer size, on the other hand, is 
proportional to the number of input data points. Their 
requirement for adders is likewise quite great. Other newer 
methods may be able to outperform their architecture in 
terms of real-time image decomposition. 9/7 and 5/3 filter 
architectures were proposed by Meher et al. [9]. They 
offered 9/7 and 5/3 architectures with and without 
pipelines, as well as reconfigurable 9/7 and 5/3 systems. 
They concentrated on drastically lowering the size of the 
area and memory. Despite the fact that their design is 
space-efficient and their working speed is sufficient, there 
is still room to reduce their CP and thus increase the 
maximum operating frequency, which is a critical design 
component for real-time signal processing. 
A multiplier-less lifting-based 2D DWT 
architecture was proposed in the work [10]. A flipping-
based 2D DWT architecture was also presented in the 
same paper [10]. The inherent low critical-path delay of 
flipping-based architecture might be realised utilising 
lifting-based DWT design, according to the paper [10]. To 
validate the contributions, both designs were compared to 
other existing works. Despite the fact that the designs 
provided in [10] claim to greatly minimise critical-path 
delays, the critical-path delays of both lifting- and 
flipping-based architectures are significantly higher than 
any convolutional DWT architecture. As a result, there is 
plenty of room for improving timing performance. 
In the work of Hegde et al. [11], the authors 
proposed one lifting- and flipping-based DWT 
architecture which is memory and power efficient. They 
used area consumption, critical-path delay, and power 
consumption as the main performance metrics. They 
proposed ‘look-up table’ (LUT)-based multiplier to 
reduce area and critical-path delay. They developed the 
architecture using gate-level HDL language and provided 
the ASIC implementation details. By proposing LUT-
based multiplier, they successfully achieved to reduce the 
critical-path delay and area consumption of their 
multiplier than any conventional popular multiplier. 
However, they did not completely omit multipliers from 
their designs. Therefore, their design’s critical-path delay 
and power consumption are greater than any other 
multiplierless design. Moreover, LUT-based design uses a 
lot of registers or memory. Therefore, their design is also 
memory extensive. 
We are now concentrating on briefly mentioning 
some of the most current works in the domain of DWT 
architectural design, having discussed some of the most 
recent and benchmark works in the subject. The authors 
introduced 1D/2D DWT architectures based on floating-
point multiply and accumulator circuit' (MAC) units in 
their paper [12]. The 45 nm CMOS technology was used 
to implement the design. Though the validation and 
verification of the work is commendable, the performance 
in terms of critical-path delay, CT, and memory 
consumption should be improved further. 
The study given in [13] is about the LeGall 5/3 
DWT filter's DA-based DWT architecture. The work was 
implemented on an Altera FPGA, and the design's quality 
was compared to that of previous DWT-based works to 
demonstrate its superiority. However, there is still a lot of 
room for improvement in terms of area usage, power 
consumption, and operation speed with the DWT 
architecture. The authors of the paper [14] described a 
LeGall 5/3 DWT filter with a 1D DWT architecture based 
on 'canonical sign digit' (CSD)-based DA. 
The authors used the CSD-based DA approach to 
propose a hardware-efficient DWT architecture that only 
required seven adders, a few shift registers, and 
multiplexers. However, their clock period is 100 ns [14]. 
This means that the working frequency of their design is 
only 10 MHz, which is far too low for many real-time 
applications. The work of [15] offered another major and 
current DWT architecture. A dual-memory controller-
based 2D DWT architecture with a focus on real-time 
image processing was presented in the study [15]. The 
design's memory requirements were said to be streamlined 
to allow for real-time image processing. 
  An architecture that reduces the number of 
adders in a 1D Daub-4 filter module architecture and 
enhances the conventional Daub-4 very large-scale 
integration (VLSI) architecture design was proposed by 
Tiancai Lan et al [16]. The input image has a size of N × 
N matrix, and the output result is saved in the TM. Four 
sub-bands are obtained by reading the high and low 
frequencies one at a time to the second Daub-4 filter 
following the first Daub-4 filter's process.  
Hussin et al. [17] proposed the 2D DWT and 
Huffman encoding for image compression. Once the input 
image has been chosen, the first step begins with RGB 
layer division. Next, superfluous image data at each RGB 
layer is eliminated using the lossy compression (DWT) 
technique. The output of the DWT process is then encoded 
Improved Memory Efficient Computing Unit DWT Architecture… Informatica 49 (2025) 111–118 113 
and stored using lossless compression (the Huffman 
encoding approach). 
 
The major purpose of this study is to create a 
DWT with a memory-efficient multiplier-less 
architecture. In DWT filtering, the distributed arithmetic 
architecture is used to produce multiplier-less computing. 
The size of the ROM look-up table increases when the 
filter coefficients rise in DWT with DA architecture, 
which can be lowered by employing a more effective LUT 
compression mechanism. 
The size of the LUT can be lowered by counting 
the number of toggles between each pair of entries and 
compressing the result. The idea behind compressing the 
table is to reduce the amount of bit transitions per column 
as much as possible, then save the indices just where a bit 
toggling occurs rather than the entire column. Using the 
look-up table decoding approach, the needed inner 
product value is created from the compressed look-up 
table. 
The following is a breakdown of the paper's 
structure. The DA architecture for DWT implementation 
was covered in part II. The suggested DA-based DWT 
architecture with better compression algorithm is 
described in Section III. In section IV, the findings and 
debates are discussed. Section V brings the paper to a 
close. 
 
2 Distributed arithmetic 
architecrure for dwt implementation 
  
FPGA implementation may be difficult due to 
their lack of arithmetic capabilities compared to general-
purpose DSP processors. The reprogrammable 
configuration of FPGA is, nevertheless, its most 
significant benefit. Field Programmable Gate Arrays 
(FPGAs) are utilized in this study to implement DWT in 
hardware. With a large reduction in calculation time, 
DWT gives enough information for analysis and synthesis 
of the original signal. 
The DA-based DWT has several uses in science, 
engineering, mathematics, and computer science. The use 
of DWT as an analogue filter bank in biomedical signal 
processing for the creation of low-power pacemakers, as 
well as in ultra-wideband wireless communications, is 
demonstrated. 
To disguise the multiplications, DA is a bit level 
rearrangement of a multiply accumulate. It's a useful 
strategy for shrinking parallel hardware multiply 
accumulates that's ideally suited to FPGA designs. Since 
its introduction over two decades ago, DA has been 
frequently employed in VLSI implementations of DSP 
systems. The majority of these applications rely heavily 
on computing, with multiplication and/or addition being 
the most common operations. The key benefit of the 
distributed arithmetic technique is that it speeds up the 
multiply process by computing and storing all potential 
medium values in a ROM. After that, the input data may 
be used to address the memory and the result directly. 
 
Formulation of algorithm 
An illustration of normal Multiply Accumulate (MAC) 
operation 
                
1 1 2 2
...........
ii
y A X A X A X = + +
 
             (1) 
A i = Coefficient,  X i = Input       
                              
Distributed arithmetic implementation of DWT 
 
Let Xk be a N-bits scrambled 2’s complement number 
|X k|<1 
 
X k: {b k0, b k1, b k2……, b k(N-1),   
 Where b k0 is the sign bit 
 
X k is expressed as 
 
                              X k = -b k0 + ∑ 9
𝑁 𝑛                         (2)                                                    
 
Substitute equation (2) in equation (1),       
 
y =  ∑ 𝐴 𝑘 𝑘 =1
𝑘 + ∑ 9
𝑁 𝑛   
 𝑦 = ∑ 𝑏 𝑘 0
𝐴 𝑘 + ∑ 𝐴 𝑘 ∑ (𝐴 𝑘 𝑏 𝑘𝑛
) 2
−𝑛 𝑁 −1
𝑛 =1
𝑘 𝑘 =1
𝑘 𝑘 =1
 
                                                                                   
  𝑦 = − ∑ 𝑏 𝑘 0
𝐴 𝑘 +
𝑘 𝑘 =1
             ∑ ∑ (𝐴 𝑘 𝑏 𝑘𝑛
)2
−𝑛 
𝑁 −1
𝑛 =1
𝑘 𝑘 =1
                       (3)                                  
 
Expanding this part        
                                                            
𝑦 = − ∑ 𝑏 𝑘 0
𝐴 𝑘 + ∑ (𝐴 𝑘 𝑏 𝑘 1
)2
−1
+
𝑘 𝑘 =1
𝑘 𝑘 −1
(𝐴 𝑘 𝑏 𝑘 2
)2
−2
+ ⋯ + (𝐴 𝑘 𝑏 (𝑁 −1)
)2
−(𝑁 −1)
      (4) 
𝑦 = −[𝑏 10
𝐴 1
+ 𝑏 20
𝐴 2
+ ⋯ + 𝑏 𝑘 0
𝐴 𝑘 ]           
+ [(𝑏 11
𝐴 1
)2
−1
+ (𝑏 12
𝐴 1
)2
−2
+ ⋯ + 𝑏 1(𝑁 −1)
𝐴 1
2
−(𝑁 −1)
] + ⋯
+ [(𝑏 𝑘 1
𝐴 𝑘 )2
−𝑘 + (𝑏 𝑘 2
𝐴 𝑘 )2
−𝑘 + ⋯ (𝑏 𝑘 (𝑁 −1)
𝐴 𝑘 )2
−(𝑁 −1)
] 
 
y =  − ∑ 𝑏 𝑘 0
𝑘 𝑘 =1
𝐴 𝑘 + ∑ [ 𝑏 1𝑛 𝐴 1
+ 𝑏 2𝑛 𝐴 2
+
𝑁 −1
𝑛 =1
              … + 𝑏 𝑘𝑛
𝐴 𝑘 ] 2
−𝑛                                    (5)                            
 
y = − ∑ 𝐴 𝑘 (𝑏 𝑘 0
)
𝑘 𝑘 =1
+ ∑ [∑ 𝐴 𝑘 𝑏 𝑘𝑛
]
𝑘 −1
𝑘 =1
𝑁 −1
𝑛 =1
 2
−𝑛  
      (6)                                                    
 
Because each b kn can only take on values of 0 and 1, there 
are only 2k potential possibilities. The memory holds the 
result y after N such cycles. 
 
Hardware reduction in DA method  
 
Figure 2.1 gives the hardware realization of the 
original equation (3) and for this original equation, the 
hardware utilization is high. The DA approach decreases 
hardware use, allowing the operation to run faster. 
 
114 Informatica 49 (2025) 111–118                                                                                                                     A.A.J. Pazhani et al. 
y = − ∑ 𝐴 𝑘 (𝑏 𝑘 0
)
𝑘 𝑘 =1
+ ∑ ∑ (𝐴 𝑘 𝑏 𝑘𝑛
)
𝑁 −1
𝑛 =1
𝑘 𝑘 =1
2
−𝑛    
      (7)
                                                            
 
Figure 2.1: Hardware utilization for original equation 
 
Figure 2.2 shows the hardware utilization in bit level 
rearrangement. In that hardware is reduced compared to 
original equation 
 
y = − ∑ 𝐴 𝑘 (𝑏 𝑘 0
)
𝑘 𝑘 =1
+ ∑ [∑ 𝐴 𝑘 𝑏 𝑘𝑛
]
𝑘 −1
𝑘 =1
𝑁 −1
𝑛 =1
 2
−𝑛    
      (8)
                                                                  
         
Figure 2.2: Hardware utilization in bit level 
rearrangement 
 
 
 
DA architecture 
The LUT, Shift registers, and scaling 
accumulator make up the DA architecture of a FIR filter. 
Various sums of the four coefficients make up the LUT 
data. The operands are loaded into the registers through a 
register chain in the shift registers. Depending on whether 
a serial or parallel architecture is used, the operands are 
then shifted 'n' bits at a time. In the scaling accumulator, 
the output of the DA LUT is added to the scaled output. 
It's made with an M-bit adder and a N+M-bit shift register 
at the output. 
 
Serial DA architecture 
As illustrated in Figure 2.3, the basic serial 
architecture requires N clock cycles to handle N-bit input 
operands. The LUT, adder tree, and scaling accumulator 
are all part of the critical path in the DA architecture, 
which runs from the input shift register to the output. The 
critical path delay is dominated by adder delays without 
the pipeline registers. When the design is fully pipelined, 
the significant fan-out loading delay incurred at the output 
of the shift register feeding the DA LUT inputs entirely 
masks the adder delays. If the loading factor is taken into 
account, the adder delays dominate the critical route 
latency, which may be considerably reduced by applying 
the technique outlined in. However, there will be little 
benefit from adopting quicker adder stages until the fan-
out delays are addressed. 
 
 
 
Figure 2.3: Serial DA architecture 
 
The implementation findings show that by using 
parallelism with more than one bit at a time, the 
performance of DA systems may go up virtually linearly. 
Adding parallelism is the same as repeating the 
fundamental structure as many times as needed, each of 
which may function independently without clock 
frequency deterioration caused by pipelining. 
Due to pipelining, the frequency of both 
operations stays the same. Furthermore, because each 
stage of the DA calculation is only a single basic FPGA 
element, the highest potential clock frequency for a 
particular FPGA device may be exploited. The main 
drawback of the serial DA architecture is, it requires more 
clock cycles and the speed of filter is low. 
 
Parallel DA architecture 
The procedure will be slower because the DA 
architecture is bit serial in nature. A parallel distributed 
arithmetic architecture is built to speed up the procedure 
[4]. Figure 2.4 depicts the parallel DA architecture. The 
input data is separated into even and odd samples based 
on their location in parallel implementation. Filter 
coefficients are also divided into even and odd samples. 
Even samples convolve with even and odd filter 
coefficients, while odd samples convolve with the same 
coefficients at the same time. 
It is possible to receive results for both even and 
odd samples of input at the same time. The number of 
clock cycles is lowered, resulting in faster processing and 
less memory. The registers are loaded with the input 
Improved Memory Efficient Computing Unit DWT Architecture… Informatica 49 (2025) 111–118 115 
values for each cycle, and then the reloading procedure to 
registers is enabled for the following set of cycles. The 
serial shift register, which must access the look-up table, 
will receive the input x[n]. The old value will be moved 
into the next register when the new input arrives in the first 
register. Similarly, as new values enter registers, the old 
values are removed from the registers. 
 
 
 
Figure 2.4: Parallel DA architecture 
 
Consider the bit locations and retrieve the values 
of inputs from that bit position to get the address from the 
input values. Consider the LSBs of all serial registers to 
determine the initial address, for example. The initial 
position value will be generated using this address. Obtain 
all of the bit position addresses and the accompanying 
values from the look-up table in the same manner. Shift 
the values by the bit position value and provide them to 
the adder during adding. Finally, the output, which is the 
convolution of the filter coefficients and the inputs, will 
be generated. 
Both the high-pass and low-pass filters will be 
built using the same design. If the input is 8 bits long, the 
convolved value takes 8 clock cycles to compute. The 
filter operations are stated using floating point arithmetic 
while computing the wavelet coefficients. In practice, 
though, integer arithmetic is employed. The filter 
coefficients are shortened as a result. The precision of the 
calculated coefficients suffers as a result of this reduction. 
 
3 Proposed memory efficient da 
architecture for dwt 
Implementing DWT with DA architecture may 
improve computation speed, but it will also increase 
memory size as the number of wavelet coefficients grows. 
The multi-level decomposition requires a high level of 
DWT implementation complexity. As a result, the benefit 
of employing DA will be effectively gone. The size of the 
look-up table in the DA architecture for DWT is reduced 
using a novel way. A table compression approach, as 
shown in Figure 3.1, can be used to minimize the size of 
the look up table required to record all possible 
combinations of input in DA architecture. The algorithm 
for compressing the LUT is the same as that used to save 
a processor's assembly language instructions [5]. A similar 
approach can be used to reduce the number of LUTs in DA 
architecture [1].  
After going through high pass and low pass 
filters, the DWT coefficients are created. The filter 
coefficients are convolved with inputs to perform the filter 
operation with N input variables. The coefficients are 
fixed in this case. Binary can be used to represent inputs. 
The inputs are scaled to have absolute values less than one. 
In ROM look-up tables, the inner product for several 
inputs can be computed and saved in advance. If there are 
n wavelet coefficients, the look-up table will be 2n. All 
LSBs are assumed to be the first to receive data. Similarly, 
all bit positions are determined, and the look-up database 
is used to determine the appropriate values.  
 
 
Figure 3.1: Memory reduced DA architecture 
 
LUT encoding algorithm 
 The size of the LUT can be lowered by counting 
the number of toggles between each entry and 
compressing them [1]. The idea behind compressing the 
table is to reduce the amount of bit transitions per column 
as much as possible, then save the indices just where a bit 
toggling occurs rather than the entire column. Figure 2 
displays an example of a LUT with seven symbols, each 
with eight bits. The table is 56 bits in size (before 
compression). There are 8 distinct binary words in the 
table, with an index length of 3 bits. As a result, if the 
column contains no more than two transitions, it can be 
compressed. Seven columns will be compressed in this 
example, but one column will remain uncompressed. After 
compression, the table's size is reduced to 34 bits (from 56 
bits before). FPGA RAM blocks are used to hold the 
compressed table. 
 
 If the lookup table compression is modified using 
the following steps auxiliary compression can be 
achieved. The steps to be incorporated in the modified 
lookup table compression are as follows: 
 
Total number of locations:  LUT size: 2
n
 
if index< 2n/2  
      use rep with (n-1)-bits 
else 
       n-bits 
 
Using the above steps the table is further compressed as 
shown in Figure 3.2. Hence the LUT compression of 28 
bits can be achieved. 
116 Informatica 49 (2025) 111–118                                                                                                                     A.A.J. Pazhani et al. 
 
 (a)                                              (b) 
 
(c) 
Figure 3.2: a) Uncompressed LUT b) Existing 
Compressed LUT c) Improved LUT Compression 
 
Using the LUT compression methodology and 
the improved LUT compression, the size of the 
compressed LUT is decreased by 39.28% and 50%, 
respectively. Thus the modified LUT can be an efficient 
method for compressing the DWT coefficients. 
 
LUT decoding algorithm 
The needed inner product value is created from 
the compressed look-up table in this decoding process. 
When a certain input to a look-up table comes, it 
determines its location in each compressed table column. 
 
• If the input is greater or equal to the compressed 
look-up table value, then generate ‘1’ 
• If the input is lesser to the compressed look-up 
table value, then generate ‘0’ 
 
The uncompressed table columns' original bits are 
received straight from the ROM. 
 
DA DWT architecture 
 The parallel implementation of DA architecture 
is exposed in Figure 3.3. The input data is separated into 
even and odd samples based on their location in parallel 
implementation. As a result, even samples convolve with 
even and odd filter coefficients, whereas odd samples 
convolve with the same set of coefficients. The results for 
both even and odd samples of input are obtained. Here 
number of clock cycles are abridged which results in 
increased speed and decreased memory.  
 
 
 
 
 
 To access the LUT, the same number of registers 
must be used for accessing filter quantities. The data will 
be sent into a serial shift register, which will need to 
consult the look-up table. The old value will be moved into 
the next register when the new input arrives in the first 
register.  
 
Similarly, when new values enter registers, the 
old values are removed from the registers by examining 
bit positions and determining the values of inputs based on 
that bit position. Finally, all bit position addresses are 
obtained from the look-up table and are given as input to 
adder by shifting its values. Finally, the result, which is 
the convolution of the filter coefficients and the inputs, 
will be achieved. 
The DA architecture speeds up the operation by 
lowering memory use, but as the size of the look-up table 
grows larger, the decoding process becomes more time 
demanding. 
 
Figure 3.3: DA DWT architecture 
 
4 Results and discussion 
In this work, the distributed arithmetic 
architecture for DWT is designed and simulated using 
Verilog in MODEL SIM 0.61xd. Simulation verifies the 
functionality of both high pass and low pass filters. Then 
it is synthesized into Spartan3E FPGA platform using 
Xilinx ISE Design Suite 13.2.  
 
Simulation and synthesized results for single level 
DWT 
The synthesized results for the suggested design 
are presented in Figure 4.1 for the low pass and high pass 
filters. The Parallel DA-DWT Architecture reads input 
vectors from a ROM. The shredded outputs are saved, and 
simulated waveforms are used to illustrate them. 
Improved Memory Efficient Computing Unit DWT Architecture… Informatica 49 (2025) 111–118 117 
 
 Figure 4.1: Synthesized result of single level DWT 
 
Comparison of uncompressed and compressed DA 
ROM Size 
The Table I give the memory size of the look-up 
table for low pass and high pass filter with uncompressed 
DA is reduced to 60% and 40% compared to compressed 
DA respectively. The proposed technique gives the 
compression efficiency of 50% for low pass and 72% for 
high pass filter whereas the existing technique gives the 
compression efficiency of 60% for low pass and 63% for 
high pass filter. 
 
Table 1: Comparison of distributed arithmetic schemes 
Architecture 
Memory 
size (ROM) 
Lowpass 
filter 
Memory size 
(ROM) 
Highpass 
filter 
Uncompressed DA 
[1] 
80 bits 256 bits 
Existing Compressed 
DA [8] 
48 bits 96 bits 
Proposed Improved 
DA 
40 bits 72 bits 
 
Performance comparision 
The performance comparison of different architecture for 
DWT is given in Table II. 
 
Table 2: Performance comparison of various DWT 
architecture 
Scheme Level = 1 
Filter implementation [9] 16   multipliers 
Lifting implementation [8] 6 multipliers 
Serial DA  based 
implementation [14] 
43 adders 
Compressed DA based 
implementation 
4 adders 
4 subtractors 
 
The Table II gives the requirement of adder and 
multiplier for different architectures to design DWT. The 
filter based implementation involves direct multiplication 
for inner product calculation in the filter, which requires 
more number of multipliers. The filter based 
implementation of DWT for single level requires 16 
multipliers. The lifting scheme is implemented to reduce 
the arithmetic computation which requires 6 multipliers to 
implement the DWT for single level. The serial DA based 
architecture involves multiplier less operation for inner 
product calculation; it requires 43 adders to design single 
level DWT. The proposed method reduces up to 4 adders 
and 4 subtractors. 
 
Hardware utilization comparision 
The Table III gives the device utilization of DA 
architecture. It is less compared to convolution based 
architecture. The DA architecture uses LUT instead of 
multiplier for MAC unit to get inner product calculations. 
 
Table 3: Hardware utilization comparison 
LOGIC 
UTILIZATION 
CONVOLUAT
ION BASED 
ARCHITECT
URE (one 
level) [6] 
DA BASED 
ARCHITECT
URE (one 
level) 
Number of slices 
Flip Flops 
47 102 
Number of 4 
input LUTs 
294 115 
Total number of 
occupied slices 
209 91 
Number of 
bonded IOBs 
91 35 
Number of 
BUFGMUXs 
1 1 
 
Images transform comparisons using 2D-DWT 
  
 
     
(a) Input image (b) DWT Processed image (c) Output    
                                                                              image 
 
5 Conclusion 
The memory efficient DA architecture for 
discrete wavelet transform is implemented using Spartan 
3E FPGA. The DA architecture is built on the Look-up 
table technique for effective inner product computation. 
When using DA architecture to implement DWT, the size 
of the ROM look-up table grows as the filter coefficients 
grow. The revised look-up table compression technique 
reduces the size of the LUT up to 115. The compressed 
LUT is kept in the FPGA's ROM. Data can be decrypted 
by decompressing the table while conducting DWT 
calculation. The memory-based method enables the 
Parallel DA-DWT to achieve high computation speeds 
118 Informatica 49 (2025) 111–118                                                                                                                     A.A.J. Pazhani et al. 
while using a little silicon area by replacing multipliers 
with compact ROM tables. Saving adders, quick 
processing time, regular flow of data, and minimal control 
complexity are all advantages of the suggested 
architecture, making it suited for image compression 
systems. The proposed method reduces the memory size 
from 80 bits to 40 bits for LPF and 256 bits to 72 bits for 
HPF, but the decoding process will be time consuming 
while increasing the filter coefficients. The focus of future 
research will be on improving the speed of retrieval from 
LUTs and quick decoding. 
 
Author contributions statement 
"All authors contributed to the study conception and 
design. Material preparation, data collection and analysis 
were performed by [A. Azhagu Jaisudhan Pazhani], [P. 
Gunasekaran] and [A. Rameshbabu]. The first draft of the 
manuscript was written by [A. Azhagu Jaisudhan 
Pazhani]. All authors read and approved the final 
manuscript.” 
 
Conflict of interest 
 There is no conflict of interest in this paper regarding 
publication. 
 
Data availability statement 
The data that supports the findings of this study are 
available within the article. 
 
Funding 
No funding was received for this study. 
 
 
References 
[1]  Remya Ajai A S, Nithin Nagaraj (2012), “A Novel 
methodology For Memory Reduction in Distributed 
Arithmetic Based DWT” International Conference on 
Communication Technology and System Design 
procedia Engineering 30, pp. 226-233.  
[2] K.B. Sowmya, Dr. SavitaSonoli and M. 
Nagabushanam (2012), “Implementation of Parallel 
DA Technique for DWT-IDWT on FPGA for Image 
Compression”, International Journal of Power 
Systems and Integrated Circuits, Vol. 2, pp 143 – 148. 
[3] Mohanty B.K. Meher P.K (2009), “Efficient 
multiplier less designs for 1-D DWT using 9/7 filters 
based on distributed arithmetic”, Dept. of Electronics 
and Communication Engineering., Jaypee Inst. of 
Eng. & Technol., Guna district, India, vol 1.1, no 6, 
pp 364 – 367. 
[4]  Al-Haj AM (2005), “An FPGA-Based Parallel 
Distributed Arithmetic Implementation of the 1-D 
Discrete Wavelet Transform,” Informatica vol. 29, no 
2, pp 241-247. 
[5]  Xixin Cao, QingqingXie, ChunganPeng, Qingchun 
Wang (1996), “An Efficient VLSI Implementation of 
Distributed Architecture for DWT” IEEE Transaction 
VLSI System, vol. 2, no 6, pp. 521-543. 
[6] Basant kumar mohanty, Pramod kumar (2013), 
“Memory-Efficient High-Speed Convolution-Based 
Generic Structure for Multilevel 2-D DWT” IEEE 
Transaction on circuits and systems for video 
technology, vol. 23, No. 2.  
[7]  Enfedaque, P, Auli-Llinas F, Moure J.C (2014), 
“Implementation of the DWT in a GPU through a 
register-based strategy” IEEE Trans. Parallel Distrib. 
Syst. 26(12), 3394–3406. 
[8]  Darji A, Arun R, Merchant S.N, Chandorkar A 
(2015), “Multiplierless pipeline architecture for 
lifting-based two-dimensional discrete wavelet 
transform” IET Comput. Dig. Tech. 9(2), 113–123. 
[9]  Meher P.K, Mohanty B.K., Swamy M.M.S (2015), 
“Low-area and low-power reconfigurable 
architecture for convolution-based 1-D DWT using 
9/7 and 5/3 filters” 28th International Conference on 
VLSI Design, Bangalore, pp. 327–332.  
[10]  Mohanty B.., Meher P.K, Srikanthan T (2015), 
“Critical-path optimization for efficient hardware 
realization of lifting and flipping DWTs” IEEE 
International Symposium on Circuits and Systems 
(ISCAS), Lisbon, pp. 1186–1189. 
[11]  Hegde G, Reddy K.S, Ramesh, T.K.S (2018), “A new 
approach for 1-D and 2-D DWT architectures using 
LUT based lifting and flipping cell”, AEU Int. J. 
Electron. Commun. 97, 165–177. 
[12]  Mohamed Asan Basiri M, Noor Mahammad S (2018), 
“An efficient VLSI architecture for convolution-
based DWT using MAC”, 31st International 
Conference on VLSI Design and 2018 17th 
International Conference on Embedded Systems, 
Pune, pp. 271–276. 
[13]  Aziz, F, Javed S, Iftikhar Gardezi S.E, Jabbar Younis 
C, Alam M (2018), “Design and implementation of 
efficient DA architecture for LeGall 5/3 DWT”, 
International Symposium on Recent Advances in 
Electrical Engineering (RAEE), Islamabad, pp. 1–5. 
[14]  Gardezi S.E.I, Aziz F, Javed S, Younis C.J, Alam M, 
Massoud Y (2019), “Design and VLSI 
implementation of CSD based DA architecture for 5/3 
DWT”, 16th IEEE International Bhurban Conference 
on Applied Sciences and Technology (IBCAST), 
pp.548–552. 
[15]  Naik P, Guhilot H, Tigadi A, Ganesh P (2019), 
“Reconfigured VLSI architecture for discrete wavelet 
transform”, Soft Computing and Signal Processing. 
Springer, Singapore, pp. 709–720. 
[16]  Tiancai Lan, Chih-Hsien Hsia, Po-Ting Lai, Hsien-
Wei Tseng and Cheng-Fu Yang (2022), “Memory 
efficient Very Large-Scale Integration Architecture of 
2D Algebraic-integer-based Daubechies Discrete 
Wavelet   
 Transform”, Sensors and Materials, Vol. 34, No. 9 
3623–3636. 
[17]  M.A. Hussin, F.A. Poad, A. Joret (2021), “A 
comparative study on the performance of DWT and 
huffman compression technique on a 2D signal”, J. 
Electron. Voltage Appl. 2 (1) 11–19.