N-Body Simulation on Heterogeneous Cluster

1. The N-body algorithm.

The N-body simulation models a universe of many particles interacting under gravitational force. Each particle is described by seven values representing its state.

P_i = ( m x y z vx vy vz )

That is, the particle P_i has mass 'm', is located at (x, y, z) and is traveling at velocity (vx, vy, vz). For N particles, the index 'i' runs from 0 to N-1. The gravitational forces between P_i and all other particles in the universe exert a net force, and hence an acceleration a_i = (ax, ay, az), on it.
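
Although the benchmark source is not reproduced here, the accesses p[i].m, p[i].x, ... used in the code below suggest a particle record of seven floats; the following layout is an assumption for illustration:

typedef struct {
    float m;            /* mass     */
    float x, y, z;      /* position */
    float vx, vy, vz;   /* velocity */
} particle_t;

/* per-particle acceleration, filled in by the inner loop below */
typedef struct {
    float x, y, z;
} accel_t;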

The simulation takes place in discrete time steps, dt. In each step, the velocity is used to update the position of each particle and the computed acceleration is used to update its velocity. All particles thus have a new state after each simulation step.

The acceleration is computed using the following procedure, where EPS is a small softening constant that prevents extreme force values when two particles are very close to each other. In practice, EPS = 0 is used in this example.

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    if (j != i) {

      /* displacement from particle i to particle j */
      rx = p[j].x - p[i].x;
      ry = p[j].y - p[i].y;
      rz = p[j].z - p[i].z;

      /* softened squared distance and 1 / |r|^3 */
      dd = rx*rx + ry*ry + rz*rz + EPS;
      d = 1 / sqrtf(dd * dd * dd);

      /* contribution of particle j, scaled by its mass */
      s = p[j].m * d;

      /* accumulate the acceleration of particle i */
      a[i].x += rx * s;
      a[i].y += ry * s;
      a[i].z += rz * s;
    }
  }
}
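
In equation form, the loop body accumulates the softened gravitational acceleration of P_i (the gravitational constant does not appear explicitly, i.e. G = 1 in the units used):

a_i = sum over j != i of  m_j * (r_j - r_i) / (|r_j - r_i|^2 + EPS)^(3/2)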

In this example, the duration of each time step is chosen as the unit time for the velocity and acceleration computation. The position and velocity of each particle are therefore updated simply by adding the previous velocity and the current acceleration, respectively.
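
A minimal sketch of this update step (assuming the arrays sketched above and a unit time step; the reference source may structure it differently):

for (i = 0; i < N; i++) {
  /* position advances by one unit time step of the previous velocity */
  p[i].x += p[i].vx;
  p[i].y += p[i].vy;
  p[i].z += p[i].vz;

  /* velocity advances by one unit time step of the current acceleration */
  p[i].vx += a[i].x;
  p[i].vy += a[i].y;
  p[i].vz += a[i].z;
}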

All data, including input, output and internal computation, are in 32-bit IEEE-754 single precision format. To compute the acceleration of a particle, the position and mass of every other particle must be read, so 16 bytes (4 bytes x 4 values) of data are read in each iteration of the inner loop of the above algorithm. The inner loop performs 17 floating point operations: 3 subtractions, 3 additions, 6 multiplications, 1 square root, 1 inverse and 3 accumulations (each multiply-accumulate into a[i] counted as one operation). So the computation to data ratio is

Rcd = 17/16 = 1.0625 fopb (Floating point Operation Per Byte)

Both the inner and outer loops iterate over all particles, so there are in total N*(N-1)*17 operations in each simulation step.
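
As a concrete example, for the 81920-particle data set used below this amounts to

N * (N-1) * 17 = 81920 * 81919 * 17
               = 114,083,676,160 (about 1.14 * 10^11) operations per step.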

2. Input and Output.

The original input source of this example is the Dubinski 1995 data set from the N-body Data Archive web site.

http://bima.astro.umd.edu/nemo/archive/
http://bima.astro.umd.edu/nemo/archive/dubinski/dubinski.tab.gz

There are 81920 particles in the file, stored in ASCII format. To simplify and speed up the benchmark programs, the file is reformatted into binary. The binary data file contains a sequence of 32-bit floating point numbers following the order in which they appear in the original Dubinski data file. The simulation data file for the FPGA implementation contains hexadecimal numbers in ASCII format.
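
A sketch of how the binary file can be read on the host (assuming the seven floats of each particle are stored consecutively in the order given in section 1, and the hypothetical particle_t layout sketched there):

#include <stdio.h>
#include <stdlib.h>

/* Read n particles (7 consecutive 32-bit floats each) from the binary
   data file.  Returns a malloc'd array, or NULL on error. */
particle_t *load_particles(const char *path, int n)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    particle_t *p = malloc(n * sizeof(particle_t));
    if (p && fread(p, sizeof(particle_t), n, f) != (size_t)n) {
        free(p);
        p = NULL;
    }
    fclose(f);
    return p;
}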

The output file has the same format as the input file and contains the updated particle states after the simulation.

3. Single thread CPU implementation.

The reference implementation is a single-threaded C program targeting the Intel x86 architecture. It has been compiled using GCC and tested on a Linux platform. To run the reference design, type the following command:

./nbody ../dat/nbody_81920.dat 81920 0 1

The first command line argument is the path and file name of the input data. The second is the number of particles to be simulated. The third argument is the value of EPS and the last one is the number of steps to be run in the simulation.

The reference implementation displays the performance information. One example is shown below.

nbody : 81920 particles with EPS=0.000000 in 1 iterations
compu time = 99305125 usec
total time = 99512072 usec

4. Multi-threaded CPU implementation.

OpenMP is used as the framework for parallelizing the computation on a multi-core system. An OpenMP parallel FOR-LOOP #pragma is applied to the main loop that computes the acceleration of all particles in the current time step. The particle data, marked as a shared variable, are partitioned evenly according to the loop index. There is no data dependency between the acceleration computations of different particles. Each result is written to the corresponding location in the acceleration array, which is also marked as shared, so no write conflicts exist in this stage. A sketch of the construct is shown below.
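
The sketch below uses the same arithmetic as the reference loop; variable scoping is simplified and may differ from the actual source:

#pragma omp parallel for schedule(static) shared(p, a)
for (int i = 0; i < N; i++) {
    /* each thread handles a contiguous block of i; it writes only a[i],
       so there are no write conflicts on the shared acceleration array */
    a[i].x = a[i].y = a[i].z = 0.0f;
    for (int j = 0; j < N; j++) {
        if (j != i) {
            float rx = p[j].x - p[i].x;
            float ry = p[j].y - p[i].y;
            float rz = p[j].z - p[i].z;
            float dd = rx*rx + ry*ry + rz*rz + EPS;
            float d  = 1 / sqrtf(dd * dd * dd);
            float s  = p[j].m * d;
            a[i].x += rx * s;
            a[i].y += ry * s;
            a[i].z += rz * s;
        }
    }
}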

The update of position by current velocity and the update of velocity by computed acceleration are performed outside the OpenMP parallel construct.

On the AXEL cluster, we run the OpenMP version of the N-body simulation on a dual-CPU platform with 4 cores per CPU. The program is instructed to run with 4 parallel threads so that the side effects of system load are minimized. The performance is measured in the same way as for the single-threaded reference design and is shown below.

nbody : 81920 particles with EPS=0.000000 in 1 iterations
In 4 parallel threads.
compu time = 29058890 usec
total time = 29271313 usec

5. GPU implementation.

An implementation targeting a many-core GPU was created for the N-body simulation. The computation of the accelerations of all particles in the current time step is offloaded to the GPU, while the CPU updates the positions and velocities afterwards.

The particle information needed for the acceleration computation is copied to GPU memory at the beginning of each simulation time step. This includes the mass and position in the format (m, x, y, z). The acceleration vectors (ax, ay, az) are read back from GPU memory after the kernel finishes.

In the kernel, we dedicate one thread per particle. Each thread loops through all other particles to compute the acceleration of its particle and writes the result to GPU memory. The computation in the loop is the same as in the source code of the CPU reference design. No special optimization is applied to the kernel, and the data are stored in GPU main memory with the same layout as in CPU main memory.

The performance of this design is not as good as the highly optimized N-body simulation in the CUDA SDK. Here, we want to provide a fair comparison of processing power by ensuring that the same type and amount of computation is performed on each processor.

nbody : 81920 particles with EPS=0.000000 in 1 iterations
compu time = 9259148 usec
total time = 9531584 usec

For the same set of input vectors, the results of the GPU program differ from those of the CPU counterpart. This is due to differences in the floating point implementations on each platform.

6. FPGA implementation.

This implementation assumes that four independent memory banks, each with a 32-bit data bus, are connected to the FPGA. The host program places the values of m, x, y and z in separate memory banks so that this information can be loaded into the FPGA in the same memory access cycle, as sketched below. The host program also sends auxiliary information to the core, including the number of particles in memory, N, and the starting and ending index of i in the outer loop.
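
A sketch of how the host might arrange the data for the four banks (the helper below and its buffers are hypothetical; how the buffers are actually transferred to the physical banks is platform specific):

/* Split the interleaved particle records into four per-bank arrays so
   that m, x, y and z of a particle can be fetched in the same memory
   access cycle. */
void split_banks(const particle_t *p, int n,
                 float *bank_m, float *bank_x, float *bank_y, float *bank_z)
{
    for (int i = 0; i < n; i++) {
        bank_m[i] = p[i].m;
        bank_x[i] = p[i].x;
        bank_y[i] = p[i].y;
        bank_z[i] = p[i].z;
    }
}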

A deeply pipelined data path is created for the acceleration computation. All floating point operations are performed by operators generated with the Xilinx Core Generator. The latency of the pipeline is 132 clock cycles and it can operate at up to 400MHz.

First, the (x, y, z) values of P_i are read from memory and stored in internal registers. Then the values of all particles are read as the P_j inputs. The P_j data are fed into the pipeline, one per clock cycle, whenever the data from the memory read operation are valid. After all P_j have been read, the values held inside the pipeline are drained. Extra clock cycles (56 in this example) are needed to perform the final reduction and output the internal values of the accumulator before the next iteration of the outer loop starts with a new P_i.

Further optimization is possible by overlapping the draining and reduction phase of the current iteration with the feeding phase of the next iteration. However, this would cost extra control logic resources and could lower the achievable maximum working frequency. We consider the current time overhead (192 clock cycles per outer iteration) negligible since the number of particles, N, is sufficiently large; it is less than 0.3% when 81920 particles are processed as in the example.

The final acceleration vectors (ax, ay, az) are stored in a FIFO which the host program reads through the register interface of the FPGA (uint32_t fpgaReg[USER_REG]). The index of the current P_i is also output so that the host program knows how many results are available by comparing it with its local index. The host program updates the position and velocity vectors after all acceleration vectors have been read. A polling scheme with a 3us interval, sketched below, is used to minimize both host CPU utilization and FPGA idle time due to the FIFO becoming full. The host-FPGA communication time overlaps with the FPGA computation time, so no extra time overhead is introduced.
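
A hedged sketch of the host-side polling loop (the register indices below are hypothetical placeholders; the real fpgaReg[] layout and read sequence are not reproduced here):

#include <stdint.h>
#include <unistd.h>

enum { REG_PI_INDEX = 0, REG_FIFO_DATA = 1 };   /* hypothetical indices */

/* Poll the FPGA every 3 us and drain the acceleration vectors that have
   become available since the last visit. */
void collect_accel(volatile uint32_t *fpgaReg, accel_t *a, uint32_t n)
{
    uint32_t local_idx = 0;
    while (local_idx < n) {
        uint32_t fpga_idx = fpgaReg[REG_PI_INDEX];  /* index of current P_i */
        while (local_idx < fpga_idx) {
            union { uint32_t u; float f; } w;
            w.u = fpgaReg[REG_FIFO_DATA]; a[local_idx].x = w.f;
            w.u = fpgaReg[REG_FIFO_DATA]; a[local_idx].y = w.f;
            w.u = fpgaReg[REG_FIFO_DATA]; a[local_idx].z = w.f;
            local_idx++;
        }
        usleep(3);                                  /* 3 us polling interval */
    }
}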

The theoretical performance of the N-body core at 400MHz, without I/O constraints, is

T_t = (81920 + 192) * 81920 * 2.5 ns = 16.82 s

For the 333MHz DDR2 SDRAM on the target platform, with a 171MBps DMA write speed and a 230MBps DMA read speed, the expected performance of the FPGA implementation is

T_e = (81920 + 192) * 81920 * 3 ns +
      2293760 bytes / 171MBps
    = 20.18 + 0.01 = 20.19 s

The measured performance is

./userapp ../dat/nbody_81920.dat 81920 0 1
nbody : 81920 particles with EPS=0.000000 in 1 iterations
compu time = 46627148 usec
total time = 48886544 usec

The resource utilization of FPGA is listed below:

Number of Slice Registers:                32,483 out of 207,360   15%
Number of Slice LUTs:                     27,674 out of 207,360   13%
Number of DSP48Es:                            18 out of     192    9%

The difference between the expected and measured performance is due to memory read overhead. We store all particle data in external DDR2 memory, which has an effective 4 x 32-bit wide interface. In computing the expected performance, we assumed that all data would be ready when needed, i.e. an uninterrupted, continuous read of four 32-bit words per clock cycle from the external DDR2 memory. Due to the nature of the DDR2 memory ICs and the implementation of the memory controller (mainly the command buffer size), this assumption does not hold in real operation.

For the same set of input vectors, the results of the FPGA program differ from those of the CPU counterpart. Two factors contribute to this deviation, and both are related to the pipelined accumulator.

First, the sequential CPU program accumulates the acceleration vectors as soon as they are computed from P_i and P_j. In the FPGA design, these vectors are first accumulated into 12 partial sums, matching the pipeline stages of the floating point adder. The 12 partial sums are then summed together in the final reduction stage. The difference in accumulation order introduces differences in the final results. This factor was confirmed and analyzed with a special CPU version that recreates the 12 partial sums of the FPGA version, as sketched below.

The second factor is the stall cycles created by the external memory interface. During the clock cycles when the data from external DDR2 memory are not yet ready, the N-body pipeline has to be stalled. Using a single shared enable signal to stall the pipeline is not realistic, since its fan-out would grow beyond 10,000 and the timing constraints would never be met. A delay pipe and input selection at the accumulators are used instead to avoid a global enable signal. This scheme changes the summation order of the acceleration vectors when invalid data are skipped at the accumulator input, so the final results differ as discussed above. This factor is affected non-deterministically by the state of the external DDR2 memory chip. It was analyzed by inserting invalid stages in RTL simulation.
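
The accumulation-order effect of the first factor can be reproduced on the CPU with a simple model of a 12-stage accumulator, applied to each acceleration component; the actual verification code may differ:

/* Accumulate n terms the way a 12-stage pipelined adder would: first
   into 12 interleaved partial sums, then reduced at the end.  In single
   precision the result generally differs from a plain sequential sum. */
float pipelined_sum(const float *term, int n)
{
    float partial[12] = {0.0f};
    for (int k = 0; k < n; k++)
        partial[k % 12] += term[k];

    float total = 0.0f;
    for (int k = 0; k < 12; k++)
        total += partial[k];
    return total;
}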

Since we use only 3% of the logic resources and 9% of the DSP blocks of the target FPGA for the pipelined data path, we can easily place 10 copies of the data path in the FPGA core. This effectively unrolls the outer loop by a factor of 10. Apart from a small number of extra memory read operations for the additional P_i values, the speedup is linear because all pipelines share the same P_j values read from memory. Overheads also include the extra control logic for sequencing the P_i accesses and the longer traces connecting the memory ports to the multiple pipelined data paths. Thus the maximum working frequency of the multi-core design is lower than that of the single-core counterpart.

At 266.67MHz, the multi-core FPGA implementation is measured to be 17.7 times faster than the single-threaded CPU implementation.

./userapp ../dat/nbody_81920.dat 81920 0 1
nbody : 81920 particles with EPS=0.000000 in 1 iterations
compu time = 5621557 usec
total time = 8429580 usec

The total FPGA resource utilization is reported below:

Number of Slice Registers:               117,393 out of 207,360   56%
Number of Slice LUTs:                     92,219 out of 207,360   44%
Number of DSP48Es:                           180 out of     192   93%

7. Heterogeneous implementation.

For more information, see our paper at International Symposium on FPGAs.