Obtaining High Performance via Lower-Precision FPGA
Floating Point Units
Junqing Sun
Advisor: Gregory D. Peterson
[jsun5, gdp]@utk.edu
Higher-precision floating point
data formats are widely used because of their high accuracy and wide data
range. However, their significant resource cost and slow speed are prohibitive
for many timing-critical applications. For example, a double-precision
multiplier requires four times the hardware of a single-precision multiplier
and requires much longer computation time [1]. To avoid expensive and slower double precision
floating point units, many applications use lower-precision floating point or
even fixed point data formats to achieve higher performance by sacrificing accuracy.
Unfortunately, these algorithms might fail to converge or may produce
unexpected results if their data formats do not have sufficiently high
accuracy. Moreover, finding the simplest data formats required by real
applications usually involves difficult analysis [16].
Unlike previous work, our
research aims at a much more aggressive goal - achieving higher performance without
losing any of the accuracy of higher-precision data formats by using
mixed-precision algorithms and architectures. This goal has been met with both
mathematical analysis and experimental results [1]. In this paper, we first analyze the performance of
different floating point data formats. Second, we introduce the mixed-precision
algorithm and architecture. Our supercomputer implementation results show that a
mixed-precision algorithm and architecture can significantly improve
performance without losing accuracy. Based on our mathematical analysis and
experimental results, we propose mixed-precision architectures for future high performance
reconfigurable computers.
Field programmable gate arrays (FPGAs)
have shown great potential in accelerating computationally intensive
applications because of their intrinsic parallelism, pipelining, and flexible
architecture. Previous work concluded that peak floating-point performance on
FPGAs exceeds that of CPUs and will increase by orders of magnitude [6]. Our previous work indicates the cost and performance
of floating point IP cores directly determines the performance of many
computational intensive applications [7]. Because of the importance of floating-point data
formats, many researchers and commercial vendors have developed floating-point intellectual
property (IP) cores for FPGAs. Xilinx, a large FPGA vendor, has included configurable
pipelined floating-point operators in its development tools (ISE) [11]. Related floating-point IP cores have also been developed
by academic research groups [12][13]. To reduce hardware resource cost and achieve high
performance, operators can be built from either combinational logic slices or
embedded circuits such as built-in 18x18 multipliers. For commonly used floating
point components, the resource cost increases linearly for adders and quadratically
for multipliers. We estimate the floating point performance of FPGAs by multiplying
the frequency and maximum number of IP cores that fit on the FPGA. Figure 1 shows the floating point performance of a Xilinx
Virtex 4 XC4LX160-10 FPGA. For this test, we assume 70% of the slices can be
configured as floating point units, while the rest are available for routing and
control. Our test results reveal the floating point performance for lower-precision
data formats significantly exceeds that of higher-precision data formats. The
customized data format s16e7 (16 bit mantissa and 7 bit exponent) outperforms
double precision (s52e11) by a factor of 23x for multipliers configured by
DSP48s (built-in structures on the FPGA).

Figure 1: Lower-precision floating point achieves
higher performance
Linear algebra is widely used in
scientific computations. Therefore, tremendous effort has focused on improving linear
algebra performance by building optimized libraries for various computer
architectures [7][8]. Inspired by the great performance of lower-precision
floating point components and previous research using low-precision or fixed-precision
data formats, we seek to achieve high performance linear algebra on FPGAs without
losing accuracy by using mixed-precision algorithms and architectures. This
research is closely related to previous work with reconfigurable computing,
linear algebra, and iterative refinement algorithms.
We are interested in high
performance reconfigurable computing because the traditional Von Neumann computer
architecture is currently facing significant challenges. First, computers architects
must invest more power and area on the cache hierarchy to bridge the widening gap
between CPU and main memory performance. Meanwhile, heat dissipation problems caused
by high clock rates make it increasingly difficult to increase the CPU
frequency. Due to these reasons, computer architects struggle to fully utilize
the exploding chip capacity brought by modern Integrated Circuit (IC)
technology. In contrast, FPGAs have enjoyed substantial growth in performance
and capacity [6]. Reconfigurable computers containing FPGAs have shown tens to
hundreds of times speedup for many computationally intensive algorithms, such
as linear algebra [1][2][7][14], random number generators [15], and molecular dynamics [16]. Hence, mixed-precision algorithms and architectures for
reconfigurable computers show promise to assist in LU decomposition and linear
direct solvers.
Due to the intensive floating
point computational loads and high communication to computation ratio of linear
algebra applications, especially sparse matrix operations, traditional
computers usually cannot achieve good performance. Utilizing FPGAs for high
performance linear algebra computation has been explored with promising
speedups achieved over CPUs. Previous work reveal that the performance and
resource cost of linear algebra on FPGAs depend on the size, frequency, and
pipeline latency of floating point IP cores [7][14].
Matrix factorization is widely
used to solve linear equations, and LU decomposition is the most commonly used
method for matrix factorization. Significant previous work addresses hardware
architectures for this important computational kernel [14][17][18][19][20]. For example, Govindu developed a circular linear
array architecture on FPGAs and achieved a 10% - 60% reduction in energy over
that of a traditional CPU. Zhuo et al improved this design by increasing
parallelism through more PEs and achieved higher GFLOPS performance than a 2.2
GHz AMD Opteron processor by using a Virtex-II Pro FPGA [14]. All these previous designs assume that target
matrices are positive definite and no pivoting is required. Although pivoting
complicates control logic, it increases the numeric stability of LU
decomposition. Therefore, our hardware architecture supports pivoting.
Lower-precision designs can
achieve higher performance by accommodating more floating point operators,
increasing clock frequency, and shortening the pipeline latency. Memory
bandwidth is another key factor affecting the performance of FPGA accelerators
for linear algebra. For instance, the actual performance of sparse matrix
computation [7] is usually limited by the memory bandwidth. Using lower-precision
arithmetic directly enables higher performance for these designs due to reduced
memory bandwidth needs. However, the solutions of linear solvers usually must
meet certain accuracy constraints. We use lower-precision data formats
internally (on the FPGA) for high performance and provide high precision
externally (on the CPU). Our linear solver provides solutions as accurate as if
only high-precision data formats are used but at a much faster speed.
Iterative refinement algorithms were
introduced in the 1960s [1]. While traditional algorithms use a
specific data format, we use mixed-precision data formats to combine the high
performance of lower-precision formats with the accuracy of higher-precision
data formats. Buttari et al investigated single/double mixed-precision
algorithms for linear solvers [5]. Their results show that mixed-precision algorithms
can exploit single precision floating point performance and also achieve the
double precision accuracy on several architectures, such as traditional CPUs
and cell processors [5]. Strzodka et al discussed using mixed-precision
algorithms for iterative solvers [4]. Their approach separated the loops of iterative
solvers into two parts: lower precision inner loops and higher-precision outer
loops. The results from inner loops are used as preconditions for the outer
loops. They used software to emulate the possible inner and outer loops
required by mixed-precision algorithms and compared the results to double
precision algorithms. This previous research demonstrates the possibility of using
mixed-precision algorithms but no previous configurable architectures have been
built.
The primary contribution of this work is exploring the performance of
mixed-precision direct solvers on FPGAs. This is in contrast to [4], which used different data formats for iterative
solver loops, required numerous refinement loops, and resulted in high execution
time for many test cases. Suppose matrix A can be factored as PA=LU with partial pivoting, where L is a lower
triangular matrix, U is an upper triangular matrix, and P
is
a permutation matrix used for pivoting. The direct solver with iterative
refinement [5] is shown in Figure
2, where the first two steps are very similar to
traditional direct solver algorithms and the refinement loops are taken to
improve the accuracy based on the available solution. The iterative refinement
process is similar to
|
Figure 2: Iterative refinement
direct solver
algorithm |
Table 1: Refinement iterations for different precision
|
The key idea here is that the factorization PA=LU and the
triangular solver LUx=Pb are computed in lower precision; while the
residual and solution update are computed in higher precision. This algorithm produces a solution correct to the higher-precision, provided matrix A is not
too ill–conditioned
[5]. The behavior of the mixed-precision method depends
strongly on the accuracy with which the residual is computed [1]. The potential performance gain of using this
algorithm comes from faster factorization computation, which is O(n3)
and dominates the runtime of the algorithm. The other steps, including the
triangular solver, residual computation, and solution update, are O(n2).
Furthermore, shorter data formats usually reduce the memory bandwidth requirement.
We realized a unique LU decomposition and mixed-precision direct solver
on a reconfigurable supercomputer. The Cray XD1 supercomputer incorporates
reconfigurable computing accelerators to deliver significant speedup of
targeted applications. The basic architectural unit of the Cray XD1 system is
the chassis, which can contain one to six compute blades. Each compute blade
includes two 64-bit AMD Opteron processors configured as a two-way symmetric
multiprocessor (SMP) that runs Linux. Each compute processor can be assigned 1
to 8 GB DDR. FPGAs can be included as coprocessors by adding an expansion
module on the compute blade. As shown in Figure 3, the processors, FPGAs, and memory are linked within
a chassis and between chasses by a high-speed fabric called the RapidArray
interconnect. Besides the main memory, each FPGA module contains four QDR II
SRAMs for high-speed storage. The Cray XD1 machine at ORNL (Tiger) has 12
chasses containing 144 Opteron processors and 6 Xilinx XC2VP50-7 FPGAs.

Figure 3: FPGA organization on Cray XD1 supercomputer [10]
The mixed-precision algorithm needs floating point components with
different data formats contained in a single system. This can be realized by
implementing different precision floating point IP cores on a single FPGA chip.
An alternative approach is to have different precision arithmetic on different FPGAs, or on both
FPGAs and CPUs. The former approach requires the FPGAs have enough capacity to
accommodate parallel computational engines for multiple data formats. An issue
with the second approach is that the data movement within FPGAs, or between
FPGAs and CPUs, should not be too frequent to affect the performance.
Considering the limited capacity of current FPGAs, our work uses the latter
approach. Figure 4 shows our innovative mixed-precision hybrid direct
solver on the Cray-XD1 supercomputer. The lower-precision LU decomposer is
mapped to the FPGA for high performance while the forward/backward solver and
iterative refinements are mapped to the Opteron processor in higher-precision. The
latter could also be implemented on FPGAs, but complicated control logic is
required to achieve fine grained parallelism. Furthermore, the division
operation of each iteration cannot be easily parallelized and requires an
expensive floating point divider. On the other hand, the computational
complexity for the operations on the CPU is only O(n2), while
that for the LU decomposition on the FPGA is O(n3). For an n
by n matrix, the computational load for LU decomposition is n
times of that for forward and backward solvers.

Figure 4: Hybrid mixed-precision direct solver on a
reconfigurable
supercomputer (Cray XD1)
To
the best of our knowledge, our work is the first to design pivoting circuits
for LU decomposition on FPGAs [1]. As shown
in the left half of Figure
4, we build an LU decomposer with multiple parallel processing
elements (PEs). The complete LU design is deeply pipelined with minimal
overhead [1]. We improve the performance of the direct solver by
accelerating LU decomposition on
FPGAs. The maximum number of PEs and their local memory size are limited by the
available resources of the
targeted FPGA. The frequency shown
in Table 2 is the frequency for our design
running on the ORNL Cray XD1. Because
more PEs can be implemented for lower-precision designs than for
higher-precision designs on a FPGA, lower-precision designs can achieve higher
performance.
Table
2: LU Implementation on Xilinx XC2VP50-7
FPGA
|
Design |
Double (s52e11) |
s31e8 |
s16e7 |
|
PE number |
8 |
16 |
32 |
|
Max size |
128 |
128 |
256 |
|
Achievable
Frequency |
120MHz |
130MHz |
140MHz |
|
Slices |
21044 (89%) |
20356 (86%) |
20907 (88%) |
|
BRAMs |
84 (36%) |
84 (36%) |
130 (56%) |