To get an idea of how TC compares with PM, I ran a number of tests on both non-uniform and uniform sets, the latter consisting of either randomly generated sets or regular (``monotonous'') grid patterns. I also ran a number of these tests with PP, but as was to be expected the larger sets could not be run within a reasonable time, so I skipped PP for the sets with more than about 2000 particles. Note that all tests in this chapter used 2D forces only, to ensure that the results of TC and PP can be compared with the PM output.
Some remarks about the time measurements in this chapter: the runtime listed in all results is the so-called ``user-time'' spent on the processor of our HP 755/125 systems. Its value was obtained using the clock() function, so it does not show the ``real-time'' or ``system-time'' spent on the program. However, the ratios between these different sorts of time are about the same for all runs, and that is what we are most interested in.
But unfortunately I found that these values were complete nonsense on the HP for the larger sets (those taking more than about 1 hour), although on a Linux PC the same values were fine. After a long search I found that the clock resolution on the HP is so high ($10^6$ ticks per second) that the available space for the clock value (4 bytes, i.e. $2^{32}$ ticks) only allows runtimes of less than $2^{32}/10^6 \approx 4295$ seconds to be measured. As this can render many hours or even days of computation totally useless (and in fact it did so in my case!), I decided to give this some special attention: ``forewarned is forearmed'' :-)
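To make the pitfall concrete, here is a minimal sketch (not the actual measurement code from the programs) of timing with clock(); the limit printed at the end assumes a 4-byte clock value, and CLOCKS_PER_SEC is $10^6$ on the HP:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        clock_t start = clock();

        /* ... run the simulation here ... */

        clock_t stop = clock();

        /* clock() counts processor ticks; dividing by CLOCKS_PER_SEC
         * gives the user-time in seconds. */
        double seconds = (double)(stop - start) / CLOCKS_PER_SEC;

        /* With a 4-byte tick counter the value wraps around after
         * 2^32 / CLOCKS_PER_SEC seconds (about 4295 s at 10^6 ticks/s),
         * so longer runs produce nonsense. */
        printf("user time: %.2f s (only valid below %.0f s)\n",
               seconds, 4294967296.0 / (double)CLOCKS_PER_SEC);
        return 0;
    }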
First let's take a look at some non-uniform sets (only runtime and energy are shown; the error in the momentum is omitted to avoid giving too much rather meaningless information). The results for the 3 non-uniform sets from the previous section can also be found here (see table ).
Non-uniform sets:

              |       | Time per 10000 steps (s) |          | Energy leak (%)
Particles     | Steps | PP     | TC    | PM      | Meshsize | PP      | TC      | PM
--------------|-------|--------|-------|---------|----------|---------|---------|--------
4             | 8000  | 0.19   | 1.05  | 85.0    | 32x32    | 27.02   | 27.02   | -9.540
16            | 2500  | 2.72   | 7.20  | 85.1    | 32x32    | 0.198   | 0.199   | 0.188
50            | 10000 | 26.36  | 34.53 | 86.3    | 32x32    | 12.20   | 12.26   | -3.435
300 (cluster) | 400   | 955.0  | 313.3 | 95.5    | 32x32    | 1.319   | 1.319   | 1.025
500           | 50    | 2700.0 | 748   | 106     | 32x32    | 0.222   | 0.222   | 0.160
2000          | 30    | 43880  | 3907  | 467     | 64x64    | 0.0009  | 0.0009  | 0.0009
The run for 4 particles could not be completed with PM, as one particle moved outside the mesh. The energy error shown is therefore for only 2624 iterations, about one third of the 8000 steps of the other two runs.
Note that PM always uses a meshsize of at least 32x32, which results in runtimes of about the same order for all problems except for 2000 particles, where the meshsize was increased to 64x64.
There is another important remark to be made: in all cases, as long as the number of iterations and the timestep are identical, PP necessarily has the highest accuracy. If there are large errors in the PP results, they are almost always caused by collisions and close encounters.
Well, TC is as vulnerable to close encounters as PP, and so should show more or less the same error as PP, as long as the error made by the tree approximation is small compared with the close-encounter error. So the closer the TC error is to the PP error, the more likely it is that the tree approximation is indeed a good one!
PM does not suffer from collisions and therefore does not have the energy-leak problem in close encounters; energy (and momentum) are often reasonably well preserved. However, all details caused by close encounters are completely washed out or blurred as well. For example, the 4-body example from the introduction (with the sun, the planet and the moon: see picture ) runs fine with both TC and PP, but shows something that only vaguely resembles those pictures when run with PM (see picture 1).
Now let's take some uniform sets and do the same comparison. Table shows the results for the smaller meshes (where PP was still practically feasible).
Uniform meshes:

          |       | Time per 10000 steps (s) |          | Energy leak (%)
Particles | Steps | PP    | TC    | PM       | Meshsize | PP     | TC     | PM
----------|-------|-------|-------|----------|----------|--------|--------|-------
16        | 10000 | 2.68  | 6.50  | 85.5     | 32x32    | 1.094  | 1.093  | 0.956
64        | 2500  | 43.16 | 44.04 | 87.0     | 32x32    | 1.460  | 1.460  | 1.182
256       | 1000  | 694.1 | 273.5 | 93.7     | 32x32    | 4.702  | 4.699  | 3.564
500       | 100   | 2682  | 632.0 | 104.0    | 32x32    | 0.009  | 0.009  | 0.007
1024      | 100   | 11314 | 1543  | 123.0    | 32x32    | 5.590  | 5.589  | 4.078
If we put all results from tables 1 and 2 into one picture, we can compare the efficiency of PP, PM and TC. We see in figure that the computation time for PP indeed increases much faster than for PM and TC. It is also nice to see that the TC computation time increases smoothly with the number of particles, while PM stays constant for a while but suddenly jumps when the meshsize has to be increased. The latter stems from the fact that the FFT routines only accept meshsizes that are a power of 2.
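As an illustration of why the PM runtime jumps, a hypothetical helper that rounds a requested meshsize up to the next power of 2 could look like this (the function name is mine, not taken from the actual program):

    /* Round a requested size up to the next power of 2, as required
     * by the FFT routines; e.g. a 40x40 request becomes a 64x64 mesh. */
    static int next_pow2(int n)
    {
        int m = 1;
        while (m < n)
            m *= 2;
        return m;
    }

So between two powers of 2 the mesh (and thus the PM runtime) stays the same, while doubling it quadruples the number of mesh cells at once.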
I also ran some larger uniform problems - both meshes and random sets - with only PM and TC, to see what their relative accuracy is and how the computation time scales. These larger problems - all generated by the 2 helper programs mentioned earlier - are the same ones already discussed in the Auto-Correlation section of chapter .
A special problem arose with the computation of the energy error: the potential energy computation is unfortunately an $O(N^2)$ operation. This implies that for the large problems the bulk of the time would be spent in the energy error computation and not in the simulation itself! To give an idea: the total runtime for 131072 particles including the energy error computation was about 10 hours for PM and about 10.5 hours for TC, so this computation of the potential energy almost completely washes out the difference between PM and TC if only the total running time is considered.
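For reference, a minimal sketch of such a direct pairwise sum is given below; the array names and the logarithmic 2D pair potential (matching a 1/r force) are my assumptions, not necessarily the program's exact form:

    #include <math.h>

    /* Direct potential-energy sum over all particle pairs: the double
     * loop visits N*(N-1)/2 pairs, which makes this O(N^2). */
    double potential_energy(int n, const double *x, const double *y,
                            const double *mass, double g)
    {
        double u = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {   /* each pair counted once */
                double dx = x[i] - x[j];
                double dy = y[i] - y[j];
                u += g * mass[i] * mass[j] * log(sqrt(dx * dx + dy * dy));
            }
        return u;
    }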
For really large problems it may be an idea to use the Poisson equation and an FFT to compute an approximation of the potential energy, just as was done in the previous chapter (Analysis).
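As a rough sketch of that idea (an assumption on my part; the constant in the Poisson equation depends on the dimension, and in the 2D case used here it would be $2\pi G$): grid the density onto the mesh, solve for the potential with the FFT as the PM force computation already does, and sum over the cells:

\[
\nabla^2 \phi = 2\pi G \rho , \qquad
E_{\mathrm{pot}} \;\approx\; \frac{1}{2} \sum_{\mathrm{cells}\;(i,j)} \rho_{ij}\, \phi_{ij}\, \Delta x \, \Delta y ,
\]

where the factor $\frac{1}{2}$ corrects for counting every pair interaction twice. This would reduce the energy check from $O(N^2)$ to roughly the cost of one extra PM step.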
Particles (P) | Steps (S) | Runtime (s) | Energy leak | Mom. leak X | Mom. leak Y

Random sets and Tree-Code (Q = 1):
256           | 1000      | 28.32       | 5.1247 %    | 3905        | -5891
1024          | 200       | 31.02       | 0.7801 %    | 3373        | -2504
4096          | 100       | 102.50      | 0.0397 %    | 96          | -61
16384         | 50        | 398.86      | 0.0067 %    | 19          | 11
65536         | 10        | 629.77      | 0.0008 %    | 31          | 34
131072        | 5         | 943.93      | 0.0002 %    | 0.15        | 46

Monotonous sets and Tree-Code (Q = 1):
256           | 1000      | 27.66       | 4.9800 %    | -2750       | 3103
1024          | 200       | 30.52       | 0.7547 %    | -4621       | 7284
4096          | 100       | 86.53       | 0.0739 %    | -177        | 328
16384         | 50        | 381.33      | 0.0717 %    | -663        | 1165
65536         | 10        | 681.62      | 0.0085 %    | -1073       | 1928
131072        | 5         | 812.01      | 0.0002 %    | -20         | 45
The output for Tree-Code on the generated sets can be found in table 3, while the same results for Particle-Mesh can be found in table . For the reasons explained before, the runtimes given in these tables exclude the initialisation and the energy error computation.
Particles (P) | Steps (S) | Meshsize | Runtime (s) | Energy leak | Mom. leak X | Mom. leak Y

Random sets and Particle-Mesh:
256           | 1000      | 32x32    | 9.43        | 3.6068 %    | 36918       | 39992
1024          | 200       | 32x32    | 2.44        | 0.5712 %    | 10176       | 28111
4096          | 100       | 64x64    | 5.69        | 0.0322 %    | -66.23      | -168.8
16384         | 50        | 128x128  | 15.35       | 0.0048 %    | -8.52       | -9.75
65536         | 10        | 256x256  | 16.88       | 0.0067 %    | -15.97      | -8.83
131072        | 5         | 512x512  | 34.49       | 0.0002 %    | 0.611       | -3.48

Monotonous sets and Particle-Mesh:
256           | 1000      | 32x32    | 9.38        | 3.7670 %    | -435.6      | -484.1
1024          | 200       | 32x32    | 2.45        | 0.5713 %    | -310.3      | -251.0
4096          | 100       | 64x64    | 5.65        | 0.0559 %    | -1.135      | -0.928
16384         | 50        | 128x128  | 15.07       | 0.0543 %    | -4.257      | -3.532
65536         | 10        | 256x256  | 15.46       | 0.0070 %    | -0.913      | -0.755
131072        | 5         | 512x512  | 30.12       | 0.0002 %    | -1.620      | -1.692
To give an idea of the complexity of these problems, the number of computations for $O(N^2)$ and $O(N \log N)$ operations for all used combinations of particle counts and iteration steps is given in table .
Particles (P) | Steps (S) | P^2·S     | P·log P·S
256           | 1000      | 6.5·10^7  | 2.0·10^6
1024          | 200       | 2.1·10^8  | 2.0·10^6
4096          | 100       | 1.7·10^9  | 4.9·10^6
16384         | 50        | 1.3·10^10 | 1.1·10^7
65536         | 10        | 4.3·10^10 | 1.0·10^7
131072        | 5         | 8.6·10^10 | 1.1·10^7
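As a check on the first row: $P^2 \cdot S = 256^2 \cdot 1000 \approx 6.5 \cdot 10^7$ and $P \cdot \log_2 P \cdot S = 256 \cdot 8 \cdot 1000 \approx 2.0 \cdot 10^6$, so the logarithms in the last column are apparently taken base 2.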
As should be clear when comparing the numbers from table 5 with the results from tables 3 and 4, the runtimes increase a little faster than expected if we consider that both methods are supposed to scale with $O(N \log N)$. There are however several explanations for this phenomenon:
In the introduction I listed some of the targets for this research project, and I think I can say that most of them have been reached. However, although I already worked on the parallelization of the PP method described in this report 2 years ago, and although I have also done some work on the parallelization of the TC and PM programs, I did not find the time to complete this properly; finishing it half-done would have resulted in, for instance, a very poor speedup.
For this reason, and also because this report is already quite a bit longer than I intended at the beginning, I decided to omit the parallelization part completely and put those results in a separate report, probably to be submitted next year.
Ok, now it is time to summarize some of the conclusions resulting from this report: