High-level synthesis and optimization of video signal processing IP

Tadej Skuber, Andrej Trost, Andrej Zemva
Faculty of Electrical Engineering, University of Ljubljana, Slovenia
E-mail: ts7305@student.uni-lj.si

Abstract. The experiment described in this paper studies a high-level design methodology by implementing image rotation with bilinear interpolation. The algorithm is first coded and tested in a high-level language (C). The code is then converted to an RTL design using special compiler commands (directives). Coding style and directives are varied to reach the desired latency and a small circuit size. The effect of the number of fixed-point decimal bits is also analyzed.

1 Introduction

Advanced designs used in today's increasingly complex electronic products are stretching the boundaries of density and performance. They challenge design teams to hit a target release window within the allocated budget. A productive methodology for addressing these design challenges is one where more time is spent at higher levels of abstraction, where verification is faster and the productivity gain is the greatest [1].

The main goal of this paper is to use High-Level Synthesis (HLS) to find the best-performing implementation among several versions of a known image-rotation algorithm by evaluating latency and resource use. HLS lets us check both criteria quickly and without much coding effort.

Image rotation and translation are widely used operations in image processing with considerable potential for parallelism. The only complication is the need for an additional resampling algorithm. In computer vision and image processing, bilinear interpolation is one of the basic resampling techniques. It is used where a perfect image transformation with exact pixel matching is impossible, so that appropriate intensity values can be calculated and assigned to pixels. Unlike other interpolation techniques, such as nearest-neighbor interpolation and bicubic interpolation, bilinear interpolation uses the values of only the 4 nearest pixels, located diagonally around the given point, to find the appropriate color intensity of the chosen pixel [2].

2 Rotation and bilinear interpolation

The algorithm iterates over the resulting image in the x' and y' directions. For every output pixel (x', y') it determines which point (x, y) of the original image maps onto it (Fig. 1). The algorithm uses the following formulas:

\[ \begin{bmatrix} x \\ y \end{bmatrix} = A \left( \begin{bmatrix} x' \\ y' \end{bmatrix} - T \right) + \begin{bmatrix} II_H/2 \\ II_W/2 \end{bmatrix} \qquad (1) \]

\[ A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \qquad (2) \]

\[ T = \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \begin{bmatrix} OI_H/2 - t_{1d} \\ OI_W/2 - t_{2d} \end{bmatrix} \qquad (3) \]

The vector T translates the original image by t_{1d} pixels in the x direction and by t_{2d} pixels in the y direction. Rotation is handled by the matrix A, which rotates the image by the angle θ in the anti-clockwise direction. II_H and II_W are constants representing the height and width of the Input Image (II), and OI_H and OI_W represent the height and width of the Output Image (OI).

Figure 1: Representation of the algorithm operation (fetching values from the source image) [3].

The image transformation does not, in general, produce integer pixel coordinates. Bilinear interpolation is therefore used to find appropriate pixel values at the computed coordinates. The pixel color value f(x, y) is calculated from the color values of the four neighboring pixels (Fig. 2) by the following formula:

\[ f(x, y) = \begin{bmatrix} 1 - x & x \end{bmatrix} \begin{bmatrix} f(0,0) & f(0,1) \\ f(1,0) & f(1,1) \end{bmatrix} \begin{bmatrix} 1 - y \\ y \end{bmatrix} \qquad (4) \]

The values f(0,0), f(1,0), f(0,1) and f(1,1) are the neighboring pixel color values, while x and y here denote the distances of the computed point from the nearest lower integer coordinates. The formula is applied to each of the three RGB (Red, Green, Blue) color channels, giving three f(x, y) values that are combined into one piece of data.
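To make the interpolation step concrete, the short sketch below applies equation (4) to a single 8-bit color channel. It is a minimal illustration only: the helper name and the plain float types are assumptions for readability, not part of the design, which uses the ap_fixed type introduced in Section 2.1.

/* Minimal sketch of equation (4) for one 8-bit color channel.
 * f00..f11 are the four neighboring sample values; dx and dy are the
 * fractional distances of the computed point from the lower neighbor.
 * Plain float is used here only for illustration. */
static unsigned char bilinear_channel(unsigned char f00, unsigned char f01,
                                      unsigned char f10, unsigned char f11,
                                      float dx, float dy)
{
    float v = (1 - dx) * ((1 - dy) * f00 + dy * f01)
            +      dx  * ((1 - dy) * f10 + dy * f11);
    return (unsigned char)(v + 0.5f); /* round to the nearest intensity value */
}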
Figure 2: Geometric visualization of bilinear interpolation [2].

2.1 C code

The following code (Lst. 1) was first used to test the algorithm and to assess the performance of HLS (High-Level Synthesis).

typedef int cnt_t;
typedef ap_fixed<32, 16> fix_t;
typedef ap_uint<24> col_t;

void bilinear(col_t input_image[II_H * II_W],
              col_t output_image[OI_H * OI_W],
              fix_t A[4], fix_t T[2])
{
loopi: for (cnt_t i = 0; i < OI_H; i++) {
loopj:   for (cnt_t j = 0; j < OI_W; j++) {
           fix_t x = ((fix_t)(i) - OI_H_2 - T[1]) * A[0]
                   + ((fix_t)(j) - OI_W_2 - T[0]) * A[1] + II_H_2;
           fix_t y = ((fix_t)(i) - OI_H_2 - T[1]) * A[2]
                   + ((fix_t)(j) - OI_W_2 - T[0]) * A[3] + II_W_2;
           cnt_t x0 = (cnt_t)x, x1 = x0 + 1;   /* neighboring source rows    */
           cnt_t y0 = (cnt_t)y, y1 = y0 + 1;   /* neighboring source columns */
           if ((x0 >= 0 && y0 >= 0) && (x1 < II_H && y1 < II_W)) {
             fix_t dx = x - (fix_t)(x0);
             fix_t dy = y - (fix_t)(y0);
             fix_t dx_dy = dx * dy;
loopk:       for (cnt_t k = 0; k < 3; k++) {
               cnt_t shift = k << 3;
               output_image[i * OI_W + j] |= ((col_t)(
                   (fix_t)(((input_image[x0 * II_W + y0]) & (col_t)(0xFF << shift)) >> shift) * (1 + dx_dy - dx - dy)
                 + (fix_t)(((input_image[x1 * II_W + y0]) & (col_t)(0xFF << shift)) >> shift) * (dx - dx_dy)
                 + (fix_t)(((input_image[x0 * II_W + y1]) & (col_t)(0xFF << shift)) >> shift) * (dy - dx_dy)
                 + (fix_t)(((input_image[x1 * II_W + y1]) & (col_t)(0xFF << shift)) >> shift) * dx_dy
                 )) << shift;
             }
           }
         }
       }
}

Listing 1: Basic unmodified C code.

The main algorithm input is the 1D array input_image of size II_H*II_W with 24-bit elements; it holds the 2D image flattened into a single row. The other inputs are the matrices A and T, containing the image rotation and translation parameters. The output is the rotated and translated image output_image of size OI_H*OI_W, each element again being 24 bits. The RGB color values are coded in 24-bit ap_uint variables of type col_t, giving each color the standard 8-bit range. Variables of type fix_t (ap_fixed) are used for calculations where decimal places are needed.

The basic implementation loops through the resulting image with two for loops, assigning each pixel (with coordinates (i, j)) the proper pixel value from the original image (at coordinates (x, y)). The pixel color is calculated with bilinear interpolation according to equation (4).

3 Directives

Vivado HLS compiler directives allow us to guide the mapping of C/C++ code to hardware [6]. Four directives were used in the design: unroll, pipeline, interface and array_reshape. Unroll was applied to the for loop "loopk", which calculates the individual RGB color values; the loops "loopi" and "loopj" are too large to be unrolled, so the directive was not used there. The pipeline directive was applied to all loops, reducing their initiation interval by allowing concurrent execution of operations [5]. The interface directive with the argument ap_memory was applied to the array function parameters, implementing them as a standard RAM interface, so the images are conveniently stored in RAM. Array_reshape was used to enable parallel access to the data in RAM.
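In Vivado HLS, the directives named above can also be embedded in the source code as pragmas rather than being set through the GUI or a Tcl script. The sketch below shows one plausible placement of such pragmas on a deliberately simplified loop nest; the function, array size and reshape factor are assumptions for illustration, not the settings reported in this paper, while the pragma forms themselves are standard Vivado HLS syntax.

#include "ap_int.h"

#define N 512                       /* image dimension, assumed for illustration */

typedef ap_uint<24> pix_t;

/* Copies an N x N image channel by channel, with the four directives as pragmas. */
void pragma_demo(pix_t in[N * N], pix_t out[N * N])
{
#pragma HLS INTERFACE ap_memory port=in                       /* arrays as RAM ports     */
#pragma HLS INTERFACE ap_memory port=out
#pragma HLS ARRAY_RESHAPE variable=in cyclic factor=2 dim=1   /* 2 elements per access   */

rows: for (int i = 0; i < N; i++) {
cols:   for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1                                     /* new pixel every cycle   */
          pix_t p = 0;
chan:     for (int k = 0; k < 3; k++) {
#pragma HLS UNROLL                                            /* R, G, B in parallel     */
            int shift = k << 3;
            p |= ((in[i * N + j] >> shift) & 0xFF) << shift;
          }
          out[i * N + j] = p;
        }
      }
}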
Synthesis was targeted at the ZedBoard using Vivado HLS 2017.4, with the default clock period of 10 ns. The implementation was done in C and supports various input (II_H being the height and II_W the width) and output (OI_H and OI_W) image sizes. For the evaluation, the standard test image size of 512 * 512 was used. The results obtained with different directives are given in the following table:

DIRECTIVES                  LATENCY / INTERVAL
none                        8127489
only pipeline and unroll    524299
all directives              524299

Table 1: Latency when using different directives.

It can be seen (Tab. 1) that it is possible to achieve a latency of about two clock cycles per pixel (512 * 512 * 2 = 524288), but also that the additional directives (interface ap_memory and array_reshape) do not decrease the latency further. The problem is that only one RAM array element can be accessed per clock cycle. The solution is to load the data of multiple pixels in one read cycle, exploiting the fact that the required input data repeats while iterating through the image pixels. This can be used to lower the latency to one clock cycle per pixel (Lst. 2).

4 Code modification

The code was modified so that it takes the repetition of the input data into account [11]. This was done with a 48-bit data input instead of a 24-bit one: the image data is packed into one array of size II_H*II_W/2, so each array element holds the data of 2 pixels. This small change in the code gives the desired one clock cycle per pixel; for the image size 512 * 512 the latency is 262155 (262155/(512 * 512) ≈ 1).

if ((y0 & 0x1) == (cnt_t)(0)) {
  temp_image[0] = (color_t)(input_image[x0 * II_W/2 + (y0 >> 1)]);
  temp_image[1] = (color_t)(input_image[x0 * II_W/2 + (y0 >> 1)] >> 24);
  temp_image[2] = (color_t)(input_image[x1 * II_W/2 + (y0 >> 1)]);
  temp_image[3] = (color_t)(input_image[x1 * II_W/2 + (y0 >> 1)] >> 24);
} else {
  temp_image[0] = (color_t)(input_image[x0 * II_W/2 + (y0 >> 1)] >> 24);
  temp_image[1] = (color_t)(input_image[x0 * II_W/2 + (y1 >> 1)]);
  temp_image[2] = (color_t)(input_image[x1 * II_W/2 + (y0 >> 1)] >> 24);
  temp_image[3] = (color_t)(input_image[x1 * II_W/2 + (y1 >> 1)]);
}

loopk: for (cnt_t k = 0; k < 3; k++) {
  shift = k << 3;
  temp_out[k] = ((color_t)(
      (fixed_t)(((temp_image[0]) & (color_t)(0xFF << shift)) >> shift) * (1 + dx_dy - dx - dy)
    + (fixed_t)(((temp_image[2]) & (color_t)(0xFF << shift)) >> shift) * (dx - dx_dy)
    + (fixed_t)(((temp_image[1]) & (color_t)(0xFF << shift)) >> shift) * (dy - dx_dy)
    + (fixed_t)(((temp_image[3]) & (color_t)(0xFF << shift)) >> shift) * dx_dy
    )) << shift;
}

Listing 2: Modified C code.

This method, however, reaches one cycle per pixel only for images of size 2^n * 2^m. For all other sizes the latency stays at least II_H * II_W * 2 clock cycles. A test with an image of size 650 * 975 gives a latency of 2535038 (2535038/(650 * 975) ≈ 4), and an image of size 640 * 540 gives a latency of 691235 (691235/(640 * 540) ≈ 2). The modification works only when the array_reshape directive is applied.

RESOURCE   BASIC        MODIFIED
DSP48E     64 / 29%     64 / 29%
FF         2620 / 2%    3135 / 2%
LUT        7690 / 14%   7295 / 13%

Table 2: Resources used by the two algorithms, both using 32-bit fixed-point numbers (16 integer bits) for the calculations.

The use of resources is shown in Table 2. There are no major differences in resource use between the basic and the modified algorithm for an image of size 512 * 512.
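For clarity, the sketch below shows how a testbench might pack a 24-bit-per-pixel image into the 48-bit array consumed by the modified design. The helper and its argument names are assumptions made for illustration; the paper does not show the host-side packing code, only the packing order (even-column pixel in the low 24 bits, odd-column pixel in the high 24 bits) follows Listing 2.

#include "ap_int.h"

/* Hedged sketch: pack two horizontally neighboring 24-bit pixels into one
 * 48-bit word so that the accelerator can read a pixel pair per access. */
void pack_image(const ap_uint<24> src[], ap_uint<48> dst[],
                int height, int width)
{
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col += 2) {
            ap_uint<48> word = src[row * width + col];                 /* even pixel: low 24 bits  */
            word |= (ap_uint<48>)src[row * width + col + 1] << 24;     /* odd pixel: high 24 bits  */
            dst[row * (width / 2) + col / 2] = word;
        }
    }
}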
5 Effects of fixed point

The effects of using fixed-point instead of floating-point arithmetic were also studied. Different numbers of fixed-point decimal (fractional) bits were used to see the effect on image quality. For the assessment, the fixed-point images were compared to the floating-point image; only a small change in the code was needed (Lst. 3).

typedef int cnt_t;
//typedef ap_fixed<32, 16> fix_t;
typedef float fix_t;
typedef ap_uint<24> col_t;

Listing 3: Changing fixed point to floating point to compare the images.

Figure 3: Fixed-point images with 1, 2, 3 and 16 decimal bits of accuracy.

The image comparison was done with two metrics, the Mean Squared Error (MSE) and the Structural Similarity Measure (SSIM) [8]. MSE was implemented directly according to equation (5):

\[ MSE = \frac{1}{m\,n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2 \qquad (5) \]

where I and K are the two compared images of size m * n. The more complicated image assessment with SSIM [7] was done using an existing Python library [9]. To simplify the image quality estimation, gray-scale images were used. The results of the comparison are given in Table 4.

NDB   DSP48E   FF     LUT
8     17       2423   2423
9     17       2467   5982
10    17       2696   6452
11    17       2724   6826
12    17       2778   7014
13    17       2836   7188
14    23       2894   7324
15    43       2820   7351
16    51       3092   7334
17    51       2916   7332
18    51       2964   7311

Table 3: Resources used when changing the number of fixed-point bits, using 10 bits for the integer part and NDB bits for the decimal part.

NDB   MSE     SSIM
1     4782    0.32853
2     4556    0.32055
3     2202    0.47513
4     951     0.60306
5     590     0.68887
6     230     0.81405
7     98      0.90796
8     43      0.96283
9     9.77    0.98872
10    1.56    0.99701
11    0.41    0.99887
12    0.13    0.99948
13    0.06    0.99967
14    0.029   0.99988
15    0.014   0.99993
16    0.011   0.99994
17    0.005   0.99997
18    0.002   0.99999

Table 4: Results with different Numbers of Decimal Bits (NDB), using 10 bits for the integer part.

As can be seen in Fig. 3 and Tab. 4, the image quality quickly becomes sufficient once more than 8 decimal bits are used; image distortion is visible only when just a few decimal bits are used. The assessment methods give a good insight into the image quality and reveal that using more bits does not add significantly to it. Increasing the number of fixed-point bits does, however, affect the resource use (Tab. 3), mainly the number of DSP48E blocks (FFs and LUTs are not greatly affected).
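To make equation (5) concrete, the sketch below computes the MSE between two gray-scale images stored as plain byte arrays. It is an illustration only; the comparison scripts used for the paper are not shown, and the function name and argument layout here are assumptions.

#include <stddef.h>

/* MSE between two m x n gray-scale images according to equation (5).
 * img_i and img_k hold one 8-bit intensity value per pixel, row by row. */
double mse(const unsigned char *img_i, const unsigned char *img_k,
           size_t m, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double d = (double)img_i[i * n + j] - (double)img_k[i * n + j];
            sum += d * d;                 /* accumulate squared differences   */
        }
    }
    return sum / (double)(m * n);         /* normalize by the number of pixels */
}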
6 Conclusions

As shown in this paper, HLS is useful for the development of digital hardware designs. The experiment demonstrates that an algorithm that would take considerably more effort to code in an HDL (Hardware Description Language) such as VHDL can easily be implemented with only a few lines of simple C code and some directives.

An important thing to consider when using an HLS tool is that the synthesized output of C code does not always match our idea of the hardware. The hardware design can be changed by exploiting directives: directives such as unroll or pipeline can, for instance, increase the parallelism of the algorithm and shorten the latency. A more drastic and better optimized change of the hardware design usually requires a slightly different coding style; for instance, Vivado HLS only managed to reduce the latency further after the image array elements were grouped together.

One disadvantage of HLS is certainly the unreadable HDL code produced by the synthesis. Regardless of the representation, the best algorithms for hardware realization are not necessarily the same as those used in software. While HLS tools will produce hardware realizing the given algorithm, there may be better or more appropriate algorithms [10].

References

[1] UltraFast High-Level Productivity Design Methodology Guide, Xilinx, UG1197 (v2018.2), June 6, 2018.
[2] Bilinear interpolation, Wikipedia, https://en.wikipedia.org/wiki/Bilinear_interpolation
[3] Image Rotation With Bilinear Interpolation, http://polymathprogrammer.com/2008/10/06/image-rotation-with-bilinear-interpolation/
[4] SDAccel Development Environment Help, HLS Pragmas, https://www.xilinx.com/html_docs/xilinx2018_1/sdaccel_doc/ijd1517252127848.html
[5] Vivado Design Suite User Guide: High-Level Synthesis, Xilinx, UG902 (v2014.1), May 30, 2014.
[6] Xilinx Vivado HLS Beginners Tutorial, https://medium.com/@chathura.abeyrathne.lk/xilinx-vivado-hls-beginners-tutorial-custom-ip-core-design-for-fpga-59876d5a4119
[7] How-To: Python Compare Two Images, https://www.pyimagesearch.com/2014/09/15/python-compare-two-images/
[8] Zhou Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image Quality Assessment: From Error Visibility to Structural Similarity, IEEE Transactions on Image Processing, vol. 13, no. 4, April 2004.
[9] scikit-image: Image processing in Python, http://scikit-image.org/
[10] The Advantages and Limitations of High Level Synthesis for FPGA Based Image Processing.
[11] Ryan Kastner, Janarbek Matai, and Stephen Neuendorffer, Parallel Programming for FPGAs, May 11, 2018.