ERK'2022, Portorož, 24-27 24 On the employment of approximate multipliers in high-level synthesis toolkits Ratko Pilipovi´ c 1 , Patricio Buli´ c 1 , Uroˇ s Lotriˇ c 1 1 University of Ljubljana, Faculty of Computer and Information Science E-mail: ratko.pilipovic@fri.uni-lj.si Rising demands for processing complex and large data volumes stretch the capabilities of modern high-perfo- rmance computing centres. As an answer, computing cen- tres started to employ FPGAs, which offer comparable performance to GPUs with significantly lower power con- sumption. Moreover, high-level synthesis toolkits offer easy-to-use design flows which allow programming FP- GAs in high-level languages like C, C++ and Python. However, although the HLS toolkits have significantly im- proved in last years, it is usually beneficial to prepare the core modules in more efficient hardware-description lan- guages. In this paper, we illustrate the employment of an approximate multiplier in the high-level synthesis of the Sobel edge detector. First, we overview the Intel FPGA SDK programming flow and go through the steps for in- tegrating custom-made modules in high-level synthesis. Then, in the experiments, we compare the synthesis of two implementations of the Sobel filter, one with an ex- act and one with an approximate multiplier. The results show that the Sobel filter with an approximate multiplier offers noticeably smaller FPGA resource utilisation than the Sobel filter with the exact multiplier. 1 Introduction HPC systems mainly rely on the general-purpose graphic processing units (GPUs) to boost the performance of typi- cal HPC workloads, such as machine learning, ODE solvers, and N-body problems. GPUs owe their success to par- allel structure and excellent floating point performance. However, for specific applications, such as multimedia processing and artificial intelligence, field programmable gate arrays (FPGAs) offer a viable alternative [1]. Due to their reconfigurable nature, the FPGA can overcome common bottlenecks in processing AI algorithms and mul- timedia content. Moreover, FPGAs offer lower energy consumption in data centres and edge devices than GPUs. Traditionally, FPGAs are programmed using hardware description languages (HDLs), such as VHDL or Verilog, which offer unique features to design combinational or sequential logic. However, HDLs have many drawbacks, like program verbosity, ridged and error-prone syntax, and longer development time [2]. Contrary to HDLs, the high-level synthesis (HLS) toolkits utilise high-level programming languages (C, C++, SystemC) to describe hardware designs. HLS tools parse and translate the pro- gram written in a high-level language into a correspond- ing register-transfer level (RTL) representation that meets certain user-specified design constraints. Compared to HDLs, HLS offers faster development, design reuse, rapid search of the design space state, and many more. The reconfigurable nature of FPGAs enables the em- ployment of custom computing units specially tailored for different applications. Such dedicated circuits are ap- proximate arithmetic circuits, which facilitate energy-effi- cient processing in error-tolerant applications. Among approximate arithmetic circuits, the approximate multi- pliers deliver significant gains in area utilisation and en- ergy consumption with negligible impact in various ap- plications, e.g. machine learning and multimedia pro- cessing. Unfortunately, employing HLS toolkits to syn- thesise approximate multipliers is an unfruitful task, as they cannot achieve the same design efficiency as HDLs. However, with the recent development of HLS toolkits, it is possible to integrate HDL modules into more complex designs described by high-level languages and synthesise them together using HLS toolkits. In this paper, we illustrate the employment of an ap- proximate multiplier in the HLS design flow and assess its influence on synthesised design. We first implement the approximate multiplier as an RTL module written in Verilog language. Next, we integrate the RTL module in the Sobel filter implemented as the OpenCL kernel and perform high-level synthesis using the Intel FPGA SDK HLS toolkit. Finally, to assess the benefits behind the employment of an approximate multiplier, we compare resource utilisation of the Sobel filter with the exact and approximate multiplier. The remainder of this paper is organised as follows. In Section 2 we describe existing HLS toolkits, while Section 3 concentrates on the Intel’s HLS solution. Sec- tion 4 described the tested applications employing an ap- proximate multiplier. The synthesis results are presented in Section 5 and in the Section 6 we conclude the paper. 2 Related work The need for highly efficient HLS flow led to the de- velopment several HLS toolkits. This section describes some of the commonly used HLS toolkits in the academic 25 world and industry. 2.1 Academic HLS toolkits Bambu [3] is a modular HLS tool developed at Politec- nico di Milano. It preserves the semantics of a program given in the C language without requiring any code ma- nipulation. Furthermore, Bambu’s modularity offers easy customisation and extension with new HLS algorithms or flows. Lastly, it offers standalone verification of gen- erated design on the given dataset. LegUp [4] is a HLS compiler developed at the Univer- sity of Toronto. It leverages a low-level virtual machine (LLVM) framework, which enables LegUp to synthesise most of the C commands. LegUp offers a complete and partial synthesis of C code to the hardware. Here the mi- croprocessor executes one part of code while the rest is synthesised as a hardware accelerator. LegUp support threads and OpenMP to automatically convert the paral- lel code to parallel-operating hardware. It also supports automatic datapath pruning and bitmask analysis. 2.2 Industry HLS toolkits Stratus HLS [5], provided by Cadence, offers synthesis of SystemC, C and C++. The key benefit of Stratus HLS is physically aware HLS, which produces smaller delays and enables low-power optimizations. Stratus HLS pro- vides a rich library of intellectual property (IP) building blocks for easier and faster development. For verification purposes, it provides an automated verification flow. Vivado HLS, developed by Xilinx, can compile al- most any C/C++ program except for dynamic language constructs. For the operation synthesis, Vivado HLS em- ploys the operation chaining technique that performs op- eration scheduling within the clock period. For condi- tional statements, Vivado HLS generates circuits for all conditional branches. Therefore, runtime execution in- volves the selection between all possible results. To im- prove loop executions, it uses loop unrolling and pipeline execution. In Vivado HLS functions are mapped into mod- ules capable of concurrent execution and self-synchro- nization. Vivado HLS serves for programming Xilinx FP- GA chips, supporting Xilinx on-chip memories, DSP el- ements and floating-point operations. The Intel FPGA SDK toolkit [6] employs OpenCL kernels to describe a hardware accelerator. The Intel FPGA SDK translates every command of the OpenCL kernel into custom hardware that may provide more power-effi- cient and flexible use than CPU and GPU architecture would allow. Besides design synthesis, the Intel FPGA SDK offers valuable tools to validate the design func- tionality. Also, a profiler is available to evaluate system performance and reveal the architecture bottlenecks. 3 Intel FPGA SDK toolkit 3.1 Elements and features Figure 1 illustrates the programming flow for Intel FPGA SDK. The main SDK components that participate in pro- gramming FPGA are: Figure 1: Intel FPGA SDK flow • the OpenCL kernel and the Intel FPGA offline com- piler, • the host application and host compiler. Compilation of the OpenCL kernel consists of two phases. In the first phase, the Intel FPGA offline com- piler translates the OpenCL kernel into the Verilog mod- ule. Then, it invokes the Quartus Prime software [7], which synthesises the obtained Verilog modules and gen- erates the FPGA image stream. The FPGA image con- tains the host’s data to create a program object for the targeted FPGA. Compiling an OpenCL kernel is time- consuming, so Quartus must synthesise the OpenCL ker- nel before compiling the host application. The host compiler compiles the host application, which manages the execution of OpenCL kernels on FPGAs. First, the host application discovers existing accelerator boards and allocates the buffers accessible both from the device and the device. Finally, the host application runs the OpenCL kernel on the accelerator device and controls its execution. 3.2 Employment of custom-made RTL modules in OpenCL kernels Although the Intel FPGA SDK offers efficient high-level synthesis, in some cases, the OpenCL kernels need to be expanded with the custom-made RTL modules. For example, we want to use optimized RTL modules inside OpenCL code or implement some features we cannot de- scribe in OpenCL code. Figure 2 depicts the integration of the RTL modules in OpenCL kernel. First, we must create an OpenCL li- brary which includes RTL modules, header files, emula- tion kernel, and meta files. We need to extend the RTL module with additional features to employ it inside the OpenCL kernel. The employed module must be synchro- nized with the rest of the design and needs to support the Avalon interfaces [8] to communicate with the system. Next, we need to add the header file, which acts as an interface between RTL modules and OpenCL kernels. In 26 OpenCL Library OpenCL Kernel Intel FPGA Offline compiler FPGA Image stream RTL modules OpenCL Kernels Meta files Header files Figure 2: Intel FPGA SDK for OpenCL’s Library support addition to the header file, we need to provide a C emu- lation kernel of the RTL module during software simula- tions. Finally, we define the properties of RTL modules with metafiles, describing the number of pipeline stages, expected latency, processing stall, etc. After the OpenCL library is built, the Intel FPGA offline compiler can use the generated OpenCL library to link the RTL modules with the OpenCL kernel and generate the FPGA image stream. 4 Edge detection with an approximate mul- tiplier The Sobel operator [9] is the most commonly used edge detector. It relies on the spatial change in brightness to detect edges. The operator uses kernels K h =K T v =   − 1 0 1 − 2 0 2 − 1 0 1   , (1) whereK h calculates the changes in brightness in the hor- izontal direction, andK v computes brightness change in the vertical direction. By convolving both filters with the original imageI, we get the spatial brightness change E =|K h ∗ I+K v ∗ I| (2) In the final step, the algorithm compares the brightness changeE of each pixel to a predetermined threshold. Pix- els whose value is greater than the threshold constitute edges; otherwise, they belong to the background. Edge detection and other image processing operations represent error-tolerant applications. Therefore, we re- place the exact multiplier in the Sobel edge detector with an approximate one to achieve more energy-efficient pro- cessing. For an approximate multiplier, we chose the it- erative logarithmic multiplier (ILM) [10], which repre- sents a simple and efficient multiplier that achieves ar- bitrary accuracy through an iterative procedure. Unlike the Mitchell multiplier, ILM employs a simpler approxi- mation term. Basic ILM has a relatively high mean rela- tive error of around 10%. However, when the procedure is iteratively applied, the accuracy of the multiplier in- creases. With only one iteration for error correction, the ILM design delivers the mean relative error below 1%. A good feature of ILM is the ability to perform each it- eration concurrently in a pipelined fashion which made the ILM multiplier especially useful in hardware neural networks [11]. X Y PP 1 PP 2 ILM Y 0 X 0 X Y PP 1 PP 2 ILM Y 0 X 0 P Intermediate product addition A B Figure 3: Proposed ILM design with one iteration Let X and Y be two unsigned input numbers. ILM approximates the product ofX andY as X· Y ≈ 2 kx+ky +2 ky · X 0 +2 kx · Y 0 , (3) wherek x =⌊ log 2 X⌋ ,k y =⌊ log 2 Y⌋ ,X 0 = X− 2 kx , andY 0 =Y− 2 ky . Eq. 3 can be rewritten by adding the first and third summand to X· Y ≈ 2 kx · Y +2 ky · X 0 =PP 1 +PP 2 . (4) Figure 3 shows the overall design of ILM with two iter- ations. The ILM block has four outputsPP 1 ,PP 2 ,X 0 , and Y 0 . PP 1 and PP 2 represent intermediate products which are obtained by Eq. 4. On the other hand, X 0 andY 0 represent remainders from the first stage. The In- termediate product addition block adds the intermediate products from both stages using Wallace tree [12]. 5 Results In this section, we present the synthesis results for the Sobel filter. We implemented the Sobel filter with the exact and the approximate multiplier and synthesised it using the Intel FPGA SDK tool. The exact multiplier is synthesised using the multiplication operator in C, while the ILM is integrated into the kernel as a custom RTL module. The synthesis results are reported for Cyclone V FPGA chip present on the C5P development kit. Table 1 shows the resource utilisation and latency for the Sobel operator when the exact multiplier and ILM are employed. We can see that the employment of ILM deliv- ers up to 12% savings in look-up tables (LUTs) utilisation and around 5% savings in flip-flops (FFs) utilisation. Rel- atively small resource utilisation savings can be attributed to the synthesis of the exact multiplier. The Intel FPGA compiler is optimised to map standard arithmetic opera- tions onto FPGA efficiently. Besides multiplication, the synthesised contains additional logic, e.g. memory ac- cess, addition, and flow control, that significantly affects resource utilisation. Concerning the latency, we can see that the employment of ILM increases latency by one cy- cle. Moreover, the HLS toolkit fails to schedule the inte- grated RTL module efficiently. 6 Conclusion This paper illustrates the employment of an approximate multiplier in the HLS toolkits to improve synthesised de- 27 Table 1: Resource utilization and latency of Sobel filter when different multipliers are employed Multiplier Lookup tables Flipflops Latency (No. of cycles) Exact multiplier 7079 8924 28 ILM 6203 8473 29 signs’ resource consumption. We employ ILM with one iteration for error correction for the approximate multi- plier. We describe the employed multiplier with Verilog language and its integration into the Sobel edge detector. The Sobel edge detector is implemented as an OpenCL kernel and synthesised using the Intel FPGA SDK toolkit. This way, we harness the power of both HDL and HLS: HDLs for efficient implementation of small and straight- forward arithmetic circuits and HLSs for synthesising com- plex designs. With the help of the Intel FPGA SDK tool- kit, we have successfully integrated the employed multi- plier into the Sobel edge filter. The synthesis results show that approximate multiplier offers noticeable savings in resource utilisation – LUTs and FFs utilisation. However, a slight increase in latency presents a drawback of the proposed design. In future into the topic, we should in- vestigate the influence of approximate multipliers on the HLS of neural networks. As multiplication dominates in the neural network’s inference and training, we anticipate that our approach will significantly improve resource util- isation in neural network processing. Acknowledgements This research was supported by Slovenian Research Age- ncy under Grants P2-0359 (National research program Pervasive computing), P2-0241 (Synergy of the techno- logical systems and processes) and by Slovenian Research Agency and Ministry of Civil Affairs, Bosnia and Herze- govina, under Grant BI-BA/19-20-047 (Bilateral Collab- oration Project). References [1] Y . Sano, R. Kobayashi, N. Fujita, and T. Boku, “Per- formance evaluation on gpu-fpga accelerated computing considering interconnections between accelerators,” in In- ternational Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, ser. HEART2022. New York, NY , USA: Association for Comput- ing Machinery, 2022, p. 10–16. [Online]. Available: https://doi.org/10.1145/3535044.3535046 [2] R. Mill´ on, E. Frati, and E. Rucci, “A comparative study between hls and hdl on soc for image processing applica- tions,” arXiv preprint arXiv:2012.08320, 2020. [3] C. Pilato and F. Ferrandi, “Bambu: A modular framework for the high level synthesis of memory-intensive applica- tions,” in 2013 23rd International Conference on Field programmable Logic and Applications. IEEE, 2013, pp. 1–4. [4] A. Canis, J. Choi, M. Aldham, V . Zhang, A. Kammoona, J. H. Anderson, S. Brown, and T. Czajkowski, “Legup: high-level synthesis for fpga-based processor/accelerator systems,” in Proceedings of the 19th ACM/SIGDA inter- national symposium on Field programmable gate arrays. ACM, 2011, pp. 33–36. [5] Cadence, “Stratus HLS,” https://www.cadence.com/ content/dam/cadence-www/global/en US/documents/ tools/digital-design-signoff/stratus-ds.pdf, 2019, [Online; accessed 4-July-2019]. [6] “Intel® fpga sdk for opencl™ software technology.” [Online]. Available: https://www.intel.com/content/www/ us/en/software/programmable/sdk-for-opencl/overview. html?wapkw=intel+fpga+sdk+for+opencl [7] “Fpga design software - intel® quar- tus® prime.” [Online]. Available: https: //www.intel.com/content/www/us/en/products/details/ fpga/development-tools/quartus-prime.html [8] Intel, “Avalon Streaming Interface,” https://www.intel.co. jp/content/dam/altera-www/global/ja JP/pdfs/literature/ fs/fs avalon streaming.pdf, 2019, [Online; accessed 10-July-2019]. [9] I. Sobel, “A 3x3 isotropic gradient operator for image pro- cessing,” Stanford Artificial Intelligence Project, 1968. [10] Z. Babi´ c, A. Avramovi´ c, and P. Buli´ c, “An iterative log- arithmic multiplier,” Microprocessors and Microsystems, vol. 35, no. 1, pp. 23–33, 2011. [11] U. Lotriˇ c and P. Buli´ c, “Applicability of approximate mul- tipliers in hardware neural networks,” Neurocomputing, vol. 96, pp. 57–65, 2012. [12] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Transactions on electronic Computers, no. 1, pp. 14–17, 1964.