Obtaining High Performance via Lower-Precision FPGA Floating Point Units

Junqing Sun

Advisor: Gregory D. Peterson

University of Tennessee, Knoxville

[jsun5, gdp]@utk.edu

1      PROBLEM & MOTIVATION

Higher-precision floating point data formats are widely used because of their high accuracy and wide data range. However, their significant resource cost and slow speed are prohibitive for many timing-critical applications. For example, a double-precision multiplier requires four times the hardware of a single-precision multiplier and requires much longer computation time [1]. To avoid expensive and slower double precision floating point units, many applications use lower-precision floating point or even fixed point data formats to achieve higher performance by sacrificing accuracy. Unfortunately, these algorithms might fail to converge or may produce unexpected results if their data formats do not have sufficiently high accuracy. Moreover, finding the simplest data formats required by real applications usually involves difficult analysis [16].

Unlike previous work, our research aims at a much more aggressive goal - achieving higher performance without losing any of the accuracy of higher-precision data formats by using mixed-precision algorithms and architectures. This goal has been met with both mathematical analysis and experimental results [1]. In this paper, we first analyze the performance of different floating point data formats. Second, we introduce the mixed-precision algorithm and architecture. Our supercomputer implementation results show that a mixed-precision algorithm and architecture can significantly improve performance without losing accuracy. Based on our mathematical analysis and experimental results, we propose mixed-precision architectures for future high performance reconfigurable computers.

 

1.1    Floating Point Performance on FPGAs

Field programmable gate arrays (FPGAs) have shown great potential in accelerating computationally intensive applications because of their intrinsic parallelism, pipelining, and flexible architecture. Previous work concluded that peak floating-point performance on FPGAs exceeds that of CPUs and will increase by orders of magnitude [6]. Our previous work indicates the cost and performance of floating point IP cores directly determines the performance of many computational intensive applications [7]. Because of the importance of floating-point data formats, many researchers and commercial vendors have developed floating-point intellectual property (IP) cores for FPGAs. Xilinx, a large FPGA vendor, has included configurable pipelined floating-point operators in its development tools (ISE) [11]. Related floating-point IP cores have also been developed by academic research groups [12][13]. To reduce hardware resource cost and achieve high performance, operators can be built from either combinational logic slices or embedded circuits such as built-in 18x18 multipliers. For commonly used floating point components, the resource cost increases linearly for adders and quadratically for multipliers. We estimate the floating point performance of FPGAs by multiplying the frequency and maximum number of IP cores that fit on the FPGA. Figure 1 shows the floating point performance of a Xilinx Virtex 4 XC4LX160-10 FPGA. For this test, we assume 70% of the slices can be configured as floating point units, while the rest are available for routing and control. Our test results reveal the floating point performance for lower-precision data formats significantly exceeds that of higher-precision data formats. The customized data format s16e7 (16 bit mantissa and 7 bit exponent) outperforms double precision (s52e11) by a factor of 23x for multipliers configured by DSP48s (built-in structures on the FPGA).

 

Figure 1: Lower-precision floating point achieves higher performance

2      BACKGROUND & RELATED WORK

Linear algebra is widely used in scientific computations. Therefore, tremendous effort has focused on improving linear algebra performance by building optimized libraries for various computer architectures [7][8]. Inspired by the great performance of lower-precision floating point components and previous research using low-precision or fixed-precision data formats, we seek to achieve high performance linear algebra on FPGAs without losing accuracy by using mixed-precision algorithms and architectures. This research is closely related to previous work with reconfigurable computing, linear algebra, and iterative refinement algorithms.

2.1    Reconfigurable Computers

We are interested in high performance reconfigurable computing because the traditional Von Neumann computer architecture is currently facing significant challenges. First, computers architects must invest more power and area on the cache hierarchy to bridge the widening gap between CPU and main memory performance. Meanwhile, heat dissipation problems caused by high clock rates make it increasingly difficult to increase the CPU frequency. Due to these reasons, computer architects struggle to fully utilize the exploding chip capacity brought by modern Integrated Circuit (IC) technology. In contrast, FPGAs have enjoyed substantial growth in performance and capacity [6]. Reconfigurable computers containing FPGAs have shown tens to hundreds of times speedup for many computationally intensive algorithms, such as linear algebra [1][2][7][14], random number generators [15], and molecular dynamics [16]. Hence, mixed-precision algorithms and architectures for reconfigurable computers show promise to assist in LU decomposition and linear direct solvers.

 

2.2    Linear Algebra on Reconfigurable Computers

Due to the intensive floating point computational loads and high communication to computation ratio of linear algebra applications, especially sparse matrix operations, traditional computers usually cannot achieve good performance. Utilizing FPGAs for high performance linear algebra computation has been explored with promising speedups achieved over CPUs. Previous work reveal that the performance and resource cost of linear algebra on FPGAs depend on the size, frequency, and pipeline latency of floating point IP cores [7][14].  

Matrix factorization is widely used to solve linear equations, and LU decomposition is the most commonly used method for matrix factorization. Significant previous work addresses hardware architectures for this important computational kernel [14][17][18][19][20]. For example, Govindu developed a circular linear array architecture on FPGAs and achieved a 10% - 60% reduction in energy over that of a traditional CPU. Zhuo et al improved this design by increasing parallelism through more PEs and achieved higher GFLOPS performance than a 2.2 GHz AMD Opteron processor by using a Virtex-II Pro FPGA [14]. All these previous designs assume that target matrices are positive definite and no pivoting is required. Although pivoting complicates control logic, it increases the numeric stability of LU decomposition. Therefore, our hardware architecture supports pivoting.

Lower-precision designs can achieve higher performance by accommodating more floating point operators, increasing clock frequency, and shortening the pipeline latency. Memory bandwidth is another key factor affecting the performance of FPGA accelerators for linear algebra. For instance, the actual performance of sparse matrix computation [7] is usually limited by the memory bandwidth. Using lower-precision arithmetic directly enables higher performance for these designs due to reduced memory bandwidth needs. However, the solutions of linear solvers usually must meet certain accuracy constraints. We use lower-precision data formats internally (on the FPGA) for high performance and provide high precision externally (on the CPU). Our linear solver provides solutions as accurate as if only high-precision data formats are used but at a much faster speed.

 

2.3    Mixed-Precision Iterative Refinement

Iterative refinement algorithms were introduced in the 1960s [1]. While traditional algorithms use a specific data format, we use mixed-precision data formats to combine the high performance of lower-precision formats with the accuracy of higher-precision data formats. Buttari et al investigated single/double mixed-precision algorithms for linear solvers [5]. Their results show that mixed-precision algorithms can exploit single precision floating point performance and also achieve the double precision accuracy on several architectures, such as traditional CPUs and cell processors [5]. Strzodka et al discussed using mixed-precision algorithms for iterative solvers [4]. Their approach separated the loops of iterative solvers into two parts: lower precision inner loops and higher-precision outer loops. The results from inner loops are used as preconditions for the outer loops. They used software to emulate the possible inner and outer loops required by mixed-precision algorithms and compared the results to double precision algorithms. This previous research demonstrates the possibility of using mixed-precision algorithms but no previous configurable architectures have been built.

3      UNIQUENESS OF THE APPROACH   

3.1    Mixed-Precision Direct Solver

The primary contribution of this work is exploring the performance of mixed-precision direct solvers on FPGAs. This is in contrast to [4], which used different data formats for iterative solver loops, required numerous refinement loops, and resulted in high execution time for many test cases. Suppose matrix A can be factored as PA=LU with partial pivoting, where L is a lower triangular matrix, U is an upper triangular matrix, and P is a permutation matrix used for pivoting. The direct solver with iterative refinement [5] is shown in Figure 2, where the first two steps are very similar to traditional direct solver algorithms and the refinement loops are taken to improve the accuracy based on the available solution. The iterative refinement process is similar to Newton’s method applied to f(x) = b - Ax. The required number of refinement iterations depends on the matrix condition numbers and data precision. Table 1 gives the average number of iterations required for random matrices whose entries are generated by Gaussian random number generators [1]. In this analysis, we keep the same number of exponent bits as in double precision and vary the size of the mantissa. The number of iterations increases from right to left and from top to bottom, which corresponds to reduced precision and matrices with bigger condition numbers  [1].

Figure 2: Iterative refinement direct solver algorithm

Table 1: Refinement iterations for different precision

             Mantissa Bits

Problem Size

16

23

31

48

52

128

4

2

1

1

0

256

5.1

2.1

1

1

0

512

6.1

2.5

1

1

0

1024

6.3

2.6

1

1

0

2048

9.3

3

1.3

1

0

4096

13.3

3.1

1.43

1

0

The key idea here is that the factorization PA=LU and the triangular solver LUx=Pb are computed in lower precision; while the residual and solution update are computed in higher precision. This algorithm produces a solution correct to the higher-precision, provided matrix A is not too ill–conditioned [5]. The behavior of the mixed-precision method depends strongly on the accuracy with which the residual is computed [1]. The potential performance gain of using this algorithm comes from faster factorization computation, which is O(n3) and dominates the runtime of the algorithm. The other steps, including the triangular solver, residual computation, and solution update, are O(n2). Furthermore, shorter data formats usually reduce the memory bandwidth requirement.

 

3.2    Implementation on Cray XD1 Supercomputer

We realized a unique LU decomposition and mixed-precision direct solver on a reconfigurable supercomputer. The Cray XD1 supercomputer incorporates reconfigurable computing accelerators to deliver significant speedup of targeted applications. The basic architectural unit of the Cray XD1 system is the chassis, which can contain one to six compute blades. Each compute blade includes two 64-bit AMD Opteron processors configured as a two-way symmetric multiprocessor (SMP) that runs Linux. Each compute processor can be assigned 1 to 8 GB DDR. FPGAs can be included as coprocessors by adding an expansion module on the compute blade. As shown in Figure 3, the processors, FPGAs, and memory are linked within a chassis and between chasses by a high-speed fabric called the RapidArray interconnect. Besides the main memory, each FPGA module contains four QDR II SRAMs for high-speed storage. The Cray XD1 machine at ORNL (Tiger) has 12 chasses containing 144 Opteron processors and 6 Xilinx XC2VP50-7 FPGAs.

Figure 3: FPGA organization on Cray XD1 supercomputer [10]

The mixed-precision algorithm needs floating point components with different data formats contained in a single system. This can be realized by implementing different precision floating point IP cores on a single FPGA chip. An alternative approach is to have different precision arithmetic on different FPGAs, or on both FPGAs and CPUs. The former approach requires the FPGAs have enough capacity to accommodate parallel computational engines for multiple data formats. An issue with the second approach is that the data movement within FPGAs, or between FPGAs and CPUs, should not be too frequent to affect the performance. Considering the limited capacity of current FPGAs, our work uses the latter approach. Figure 4 shows our innovative mixed-precision hybrid direct solver on the Cray-XD1 supercomputer. The lower-precision LU decomposer is mapped to the FPGA for high performance while the forward/backward solver and iterative refinements are mapped to the Opteron processor in higher-precision. The latter could also be implemented on FPGAs, but complicated control logic is required to achieve fine grained parallelism. Furthermore, the division operation of each iteration cannot be easily parallelized and requires an expensive floating point divider. On the other hand, the computational complexity for the operations on the CPU is only O(n2), while that for the LU decomposition on the FPGA is O(n3). For an n by n matrix, the computational load for LU decomposition is n times of that for forward and backward solvers.

 

Figure 4: Hybrid mixed-precision direct solver on a reconfigurable supercomputer (Cray XD1)

To the best of our knowledge, our work is the first to design pivoting circuits for LU decomposition on FPGAs [1]. As shown in the left half of Figure 4, we build an LU decomposer with multiple parallel processing elements (PEs). The complete LU design is deeply pipelined with minimal overhead [1]. We improve the performance of the direct solver by accelerating LU decomposition on FPGAs. The maximum number of PEs and their local memory size are limited by the available resources of the targeted FPGA. The frequency shown in Table 2 is the frequency for our design running on the ORNL Cray XD1. Because more PEs can be implemented for lower-precision designs than for higher-precision designs on a FPGA, lower-precision designs can achieve higher performance.

 

Table 2: LU Implementation on Xilinx XC2VP50-7 FPGA

Design

Double (s52e11)

s31e8

s16e7

PE number

8

16

32

Max size

128

128

256

Achievable Frequency

120MHz

130MHz

140MHz

Slices

21044 (89%)

20356 (86%)

20907 (88%)

BRAMs

84 (36%)

84 (36%)

130 (56%)