## Barnes-Hut: An Implementation and Study

### Calculation of Cell and Body Interactions

#### Note:

In comparison to other Barnes-Hut implementations in the literature,
our tree-walk suffers a major disadvantage: our force calculations
are not hand-coded in assembly, and we also do not use vector
units when they are available. Most implementors choose to do so.
Since this is the case, this report does not focus on directly comparing
interaction time or total execution time.

#### Implementation:

##### O(n^2):

As detailed in the section on Motivation,
we took an iterative approach to implementing the message passing
version of Barnes-Hut. We started with the naive *O(n^2)* method,
which is as follows:

fun StepSystem(N,Bodies[1..N]) =
for i in 1..N do
Initialize(Bodies[i]);
done
for i in 1..N and j in 1..N do
Interact(Bodies[i],Bodies[j])
done
for i in 1..N do
Update(Bodies[i])
done
end

The intent of the for statements is to show that each of the steps
are independent. It is important that all of the initialization finish
before the interactions, and that all of the interactions finish before
the updates, but within each part, the statements are independent.
Now the computational work in this algorithm grows with *n^2,*
which is obviously sub-optimal given that the Barnes-Hut algorithm
grows with *n log n,* however for a sufficiently small system,
the overhead of the *n log n* algorithm could easily dwarf the
speedup from doing fewer interactions. In fact as we will show later, in
our implementation, there is a substantial overhead for the tree-build
and tree-walking steps.

##### O(n log n)

Our later implementations use the preferred *O(n log n)*
method where interactions are pruned. Once the problem size reaches
reasonable values, the *O(n log n)* algorithm clearly is the
method of choice. We quickly the saw the import of tree-building
and how it affected the force calculation phase: problem size is limited
by the maximum tree size, not by the interaction calculations. (See
Results).

After analyzing our initial implementation, it became clear that
its speed was being limited by the lack of caching of nodes;
without using a software caching method to reduce communication bandwidth
and mask latency, the full cost of global reads were necessary:

a minimum of once for each insertion, plus...
a multiple of eight times when a leaf splits and becomes
an internal node.
Note that this multiple of eight is only bounded by either
the minimum perturbation constant or the floating point accuracy
if the partition is a space-partition and the nodes are at
nearly identical points.

Thus, during the second portion of our study, we added explicit
software caching of nodes.

Continue to
Caching

-----

maintained by <hodes@cs.berkeley.edu>