Barnes-Hut: An Implementation and Study

Calculation of Cell and Body Interactions

Note:

Compared with other Barnes-Hut implementations in the literature, our tree-walk has a major disadvantage: our force calculations are not hand-coded in assembly, and we do not use vector units even when they are available, whereas most implementors do. For this reason, this report does not focus on direct comparisons of interaction time or total execution time.

Implementation:

O(n^2):

As detailed in the section on Motivation, we took an iterative approach to implementing the message-passing version of Barnes-Hut. We started with the naive O(n^2) method, which is as follows:

fun StepSystem(N,Bodies[1..N]) =
  for i in 1..N do
    Initialize(Bodies[i])
  done
  for i in 1..N and j in 1..N do
    Interact(Bodies[i],Bodies[j])
  done
  for i in 1..N do
    Update(Bodies[i])
  done
end

The for statements are meant to show that the iterations within each loop are independent. All of the initialization must finish before the interactions begin, and all of the interactions must finish before the updates begin, but within each phase the statements are independent. The computational work in this algorithm grows as n^2, which is clearly sub-optimal given that the Barnes-Hut algorithm grows as n log n. For a sufficiently small system, however, the overhead of the n log n algorithm could easily outweigh the savings from doing fewer interactions. In fact, as we will show later, our implementation incurs substantial overhead in the tree-build and tree-walk steps.
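The three phases can be sketched in Python as follows. This is a minimal serial illustration, not the message-passing code described in this report: the Body fields, the softening length EPS, the unit gravitational constant, and the simple velocity-then-position update are all assumptions made for the example. Self-interaction (i = j) is skipped explicitly, and interact writes only to its first argument, so iterations for different i do not write to the same body.

```python
from dataclasses import dataclass, field

G = 1.0      # gravitational constant in arbitrary units (assumed)
EPS = 1e-3   # softening length to avoid division by zero (assumed)


@dataclass
class Body:
    mass: float
    pos: list
    vel: list
    acc: list = field(default_factory=lambda: [0.0, 0.0, 0.0])


def initialize(body):
    """Phase 1: clear the acceleration accumulated in the previous step."""
    body.acc = [0.0, 0.0, 0.0]


def interact(bi, bj):
    """Phase 2: add to bi the acceleration caused by bj. Only bi is written."""
    d = [bj.pos[k] - bi.pos[k] for k in range(3)]
    r2 = sum(x * x for x in d) + EPS * EPS
    inv_r3 = r2 ** -1.5
    for k in range(3):
        bi.acc[k] += G * bj.mass * d[k] * inv_r3


def update(body, dt):
    """Phase 3: advance velocity, then position, by one time step."""
    for k in range(3):
        body.vel[k] += body.acc[k] * dt
        body.pos[k] += body.vel[k] * dt


def step_system(bodies, dt):
    # Each loop must finish completely before the next one starts.
    for b in bodies:
        initialize(b)
    for bi in bodies:
        for bj in bodies:
            if bi is not bj:      # skip self-interaction
                interact(bi, bj)
    for b in bodies:
        update(b, dt)
```

Calling step_system(bodies, dt) on a list of Body objects advances the whole system by one step; the doubly nested loop is where the n^2 growth comes from.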

O(n log n):

Our later implementations use the preferred O(n log n) method, in which interactions are pruned. Once the problem size reaches reasonable values, the O(n log n) algorithm is clearly the method of choice. We quickly saw the importance of tree-building and how it affected the force calculation phase: problem size is limited by the maximum tree size, not by the interaction calculations (see Results).
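To illustrate what pruning means here, the following is a small serial octree sketch in Python. It is not our message-passing implementation. It uses the standard Barnes-Hut opening test (treat a cell as a single point mass when its side length divided by its distance is below a threshold theta), and the names Node, insert, accel, theta, and counter are hypothetical, chosen only for this example. Positions are plain lists, G is taken as 1, and no two bodies may share an identical position, since the leaf would then split without end.

```python
import math


class Node:
    def __init__(self, center, half):
        self.center = center          # centre of this cubic cell
        self.half = half              # half of the cell's side length
        self.mass = 0.0               # total mass inside the cell
        self.com = [0.0, 0.0, 0.0]    # centre of mass of the cell
        self.body = None              # (mass, pos) while the cell is a one-body leaf
        self.children = None          # list of 8 sub-cells once the leaf has split


def insert(node, mass, pos):
    if node.children is None and node.body is None:    # empty leaf
        node.body, node.mass, node.com = (mass, pos), mass, list(pos)
        return
    if node.children is None:                           # occupied leaf: split it
        node.children = [None] * 8
        old_mass, old_pos = node.body
        node.body = None
        _insert_child(node, old_mass, old_pos)
    _insert_child(node, mass, pos)
    total = node.mass + mass                            # fold new body into the aggregate
    node.com = [(node.com[k] * node.mass + pos[k] * mass) / total for k in range(3)]
    node.mass = total


def _insert_child(node, mass, pos):
    # One bit per axis selects the octant.
    idx = sum((pos[k] >= node.center[k]) << k for k in range(3))
    if node.children[idx] is None:
        h = node.half / 2
        c = [node.center[k] + (h if (idx >> k) & 1 else -h) for k in range(3)]
        node.children[idx] = Node(c, h)
    insert(node.children[idx], mass, pos)


def accel(node, pos, theta, counter):
    """Acceleration at pos due to the bodies in node; counter[0] counts interactions."""
    if node is None:
        return [0.0, 0.0, 0.0]
    is_leaf = node.children is None
    if is_leaf and (node.body is None or node.body[1] == pos):
        return [0.0, 0.0, 0.0]        # empty cell, or the body itself
    d = [node.com[k] - pos[k] for k in range(3)]
    r = math.sqrt(sum(x * x for x in d))
    if is_leaf or 2 * node.half < theta * r:
        counter[0] += 1               # one body-body or body-cell interaction
        f = node.mass / r ** 3
        return [f * x for x in d]
    total = [0.0, 0.0, 0.0]           # cell too close: open it and recurse
    for child in node.children:
        a = accel(child, pos, theta, counter)
        total = [total[k] + a[k] for k in range(3)]
    return total
```

With theta set to 0 every cell is opened and the walk reduces to the direct sum over all other bodies; larger values of theta replace distant groups of bodies with a single cell interaction, which is where the savings over the O(n^2) method come from.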

Analysis of our initial implementation made it clear that its speed was limited by the lack of node caching. Without a software caching method to reduce communication bandwidth and mask latency, the full cost of a global read had to be paid:

a minimum of once for each insertion, plus...

a multiple of eight times when a leaf splits and becomes an internal node.

Note that if the partition is a space partition and two nodes lie at nearly identical points, the leaf keeps splitting until the two points fall into different cells, so this multiple of eight is bounded only by the minimum perturbation constant or by floating-point accuracy.
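The effect of nearly identical points can be seen with a short Python sketch that counts how many times a cubic cell must be halved before two points land in different octants. The function name split_depth, the unit root cell centred at the origin, and the max_depth cutoff (standing in for the floating-point limit) are assumptions made for this illustration; the sketch is not taken from our implementation.

```python
def split_depth(p, q, center=(0.0, 0.0, 0.0), half=1.0, max_depth=64):
    """Return the number of splits needed to separate p and q, or None if
    they are still in the same cell after max_depth splits."""
    center = list(center)
    for depth in range(1, max_depth + 1):
        # Octant of each point relative to the current cell centre.
        oct_p = [p[k] >= center[k] for k in range(3)]
        oct_q = [q[k] >= center[k] for k in range(3)]
        if oct_p != oct_q:
            return depth              # this split separates the two points
        # Both points fall in the same child cell: descend into it.
        half /= 2
        center = [center[k] + (half if oct_p[k] else -half) for k in range(3)]
    return None
```

Two well-separated points are split apart after a few levels, while points that differ by only 1e-9 need roughly thirty levels, and identical points are never separated. If each split costs on the order of eight global reads, as described above, the cost of a single insertion grows with this depth.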

Thus, during the second portion of our study, we added explicit software caching of nodes.

Continue to Caching

-----

maintained by <hodes@cs.berkeley.edu>