This is the README file for the GADGET application benchmark,
distributed with the DEISA Benchmark Suite:
http://www.deisa.eu/science/benchmarking/

Last modified by the DEISA Benchmark Team on 2008-08-25.

-------------
GADGET readme
-------------

Contents
--------

1. General description
2. Code structure
3. Parallelization
4. Building
5. Execution
6. Data


1. General description
======================

The simulation code GADGET-2 evolves cosmological density fields,
either with dark matter only, or with dark matter and gas. Any such
calculation requires suitable initial conditions, which can easily
reach several dozen GBytes in size for large problems. Therefore, a
parallel initial conditions code is provided as part of this benchmark
as well. It is called N-GenIC.


2. Code structure
=================

GADGET uses a hierarchical multipole expansion (organized in a “tree”)
to calculate gravitational forces. In this method, particles are
hierarchically grouped, multipole moments are calculated for each node,
and then the force on each particle is obtained by approximating the
exact force with a sum over multipoles. The list of multipoles to be
used is obtained with a so-called tree-walk, in which the allowed force
error can be tuned in a flexible way. A great strength of the tree
algorithm is the near insensitivity of its performance to clustering of
matter, and its ability to adapt to arbitrary geometries of the
particle distribution. Periodic boundary conditions, if needed, can be
realized with Ewald summation, which is the approach followed by
GADGET.

While the high spatial accuracy of tree algorithms is ideal for the
strongly clustered regime on small scales, there are faster methods to
obtain the gravitational field on large scales. In particular, the
well-known particle-mesh (PM) approach based on Fourier techniques is
probably the fastest method to calculate the gravitational field on a
homogeneous mesh.
The obvious limitation of this method, however, is that the force
resolution cannot be better than the size of one mesh cell, and the
latter cannot be made small enough to resolve all the scales of
interest in cosmological simulations.

GADGET therefore offers a compromise between the two methods. The
gravitational field on large scales is calculated with a particle-mesh
(PM) algorithm, while the short-range forces are delivered by the tree.


3. Parallelization
==================

Domain decomposition plays a central role in GADGET for parallelizing
the calculation, balancing the work-load, and making use of the total
storage available in distributed-memory computer partitions. To this
end, GADGET uses a space-filling fractal, a Peano-Hilbert curve. The
local “depth” of the curve is regulated by the local particle density
and work-load, such that the decomposition naturally becomes finer in
high-density regions. The domains themselves are then generated by
cutting the one-dimensional space-filling curve into Ncpu pieces that
carry approximately the same computational work-load (as estimated by
interaction counters for each particle), which automatically induces a
decomposition of space.

The advantages of the Peano-Hilbert curve are that it allows for
domains of arbitrary shape, while at the same time it always generates
a relatively small surface-to-volume ratio for the domains, and this
compactness reduces communication overheads. Also, since the divisions
generated by cuts of the Peano-Hilbert curve are commensurate with the
oct-tree structure of the particle tree, this induces a natural
decomposition of a global fiducial Barnes & Hut tree for the full
particle set. As a result, a parallelization scheme is obtained in
which the final result of the tree calculation is strictly independent
(up to numerical round-off) of the detailed decomposition used by the
code, and hence of the number of processors employed.


4. Building
===========

The GADGET-2 code, as well as the included initial conditions code
N-GenIC, are ANSI-C codes which use the MPI-1.1 library for
parallelization. Compilation of the source code should be free of
problems on essentially all current systems. However, the codes also
rely on two additional open source C libraries. These are:

- FFTW, available at http://www.fftw.org
  (Note: Only the 2.X code line (most recent version is 2.1.5) can be
  used, as the 3.X line does not support MPI yet.)

  For compiling FFTW, one needs to pass "--enable-mpi" to the
  configure script. I also recommend compiling the library in both a
  double-precision and a single-precision version. This can be
  accomplished by building and installing the package twice, once with
  "--enable-type-prefix", and the second time with
  "--enable-type-prefix --enable-float".

- GSL, available at http://www.gnu.org/software/gsl (e.g. version 1.6)

  GSL is included in the benchmark package and is compiled
  automatically by JuBE.


5. Execution
============

First, the IC code must be compiled and executed by activating the
first task (N-GenIC) in the top level xml-file of the corresponding
platform. Please deactivate all following benchmark runs in this file.
The command:

   perl ../../bench/jube -pdir ../../platform -start bench-PLATFORM_XXXX.xml

will start the compilation of GSL (which takes about 30 minutes) and
then of N-GenIC. It will also submit a command file to run N-GenIC.

The output of N-GenIC is distributed over several files, one for each
processor. The number of processors used to generate the initial
conditions does not affect the results or the subsequent simulation
run, so a larger number of processors can be used for this step.

The parameterfile that is used in this package sets up the initial
conditions for a cosmological hydrodynamical simulation, with 400^3
dark matter particles and 400^3 gas particles, i.e. 128 million
particles in total. The file size of the initial conditions is about
3.57 GB in total.
This is a moderate-sized problem by today's standards, but it
corresponds to a realistic scientific application of the code. The test
simulation exercises both the TreePM gravity solver of GADGET and its
SPH formalism to treat hydrodynamics.

Ideally, the runtime of the code should be tested on a range of
processor counts, covering at least 32 to 512 CPUs. Of course, for 512
CPUs, the problem size per CPU is already getting very small, and one
would never use that many CPUs for this problem in practice. (Rather,
it is more typical to operate the code in a weak scaling regime, where
the problem size is increased along with the number of CPUs. It is then
much easier to keep parallelization losses reasonably small.)

When N-GenIC has finished, the directory run/data should contain the
initial conditions files. Then the N-GenIC task in the top level
xml-file should be deactivated and the benchmark run can be activated.
Entering the same command ("perl ...") as above then starts the
compilation of GADGET and submits command files to run the GADGET
benchmarks.

The simulation code has been set up for this test such that it will
terminate itself automatically after 3 full steps have been completed.


6. Data
=======

The execution times in different parts of the code during these 3 steps
represent the result of the benchmark. The code itself measures several
internal performance metrics by timing certain parts of the code (using
the MPI wall-clock timer, MPI_Wtime).

The output of the code (which includes these performance metrics) is
written to a separate directory for each run. The pathname of this
directory is composed of the output path given in the parameterfile
("OutputDir", which can be changed as appropriate), and the name of a
subdirectory run_XXX, where XXX is the number of CPUs used ("run_128"
for 128 CPUs, for example). This subdirectory is generated
automatically, but "OutputDir" needs to exist.
The code can hence easily be run with different numbers of processors,
and the results are automatically put into separate output directories.

The primary result of the benchmark run is contained in the file
"cpu.txt", which is put into the output directory for each processor
number tested. These files should be collected for a detailed
evaluation.

The file cpu.txt contains a detailed break-down of the wallclock times
required for different parts of the code, as labelled. For each entry,
the first number gives the cumulative time spent in this part in
seconds, while the second number gives the relative contribution of
this part to the total time in percent.

There are three primary calculations in the code that consume the
cycles. These are the short-range gravity done by a hierarchical
tree algorithm ("treegrav"), the long-range gravity done by an
FFT-based particle-mesh algorithm ("pmgrav"), and the smoothed particle
hydrodynamics ("sph"). The domain decomposition also consumes a bit of
time ("domain"); the other parts should be largely negligible.

In principle it is interesting to check the relative break-down of
these numbers, which for example includes information about
communication times. But for the purpose of evaluating overall
performance it suffices to focus on the total time, which is labelled
"total". However, simply taking the total time reported for step 2 can
give a misleading answer, since step 0 contains multiple start-up
initialization procedures which do not all parallelize as well as the
actual evolution code, yet are completely unimportant for the run time
of any real application of the code (where we have > 1000 steps).

To obtain an unbiased measurement of the time required per step, one
should therefore take the total time reported for step 2, subtract from
it the total time reported for step 0, and divide by two. This gives
the average time required for the last two steps (the two individual
step times should also be very close to each other).
As an example, for the cpu.txt file that is reproduced below, the run
time to report would be:

   Time_per_Step = 0.5 * (469.30 - 197.76) = 135.77 seconds


In case of problems with this benchmark program, please direct your
questions to:

   Volker Springel, MPI for Astrophysics


------------ Example cpu.txt file for 128 processor test run ---------

Step 0, Time: 0.02, CPUs: 128
total         197.76  100.0%
treegrav       61.88   31.3%
treebuild       2.34    1.2%
treeupdate      0.00    0.0%
treewalk       56.74   28.7%
treecomm        0.26    0.1%
treeimbal       1.84    0.9%
pmgrav         25.71   13.0%
sph            63.86   32.3%
density        37.66   19.0%
denscomm        0.47    0.2%
densimbal       2.22    1.1%
hydrofrc       20.41   10.3%
hydcomm         0.31    0.2%
hydimbal        0.77    0.4%
hmaxupdate      0.20    0.1%
domain         11.84    6.0%
potential       0.00    0.0%
predict         0.28    0.1%
kicks           1.41    0.7%
i/o             8.40    4.2%
peano           2.31    1.2%
sfrcool         0.00    0.0%
blackholes      0.00    0.0%
fof/subfind     0.00    0.0%
smoothing       0.00    0.0%
misc           22.05   11.2%

Step 1, Time: 0.0201534, CPUs: 128
total         329.44  100.0%
treegrav      123.30   37.4%
treebuild       3.73    1.1%
treeupdate      0.00    0.0%
treewalk      113.50   34.5%
treecomm        0.52    0.2%
treeimbal       4.23    1.3%
pmgrav         49.51   15.0%
sph           102.40   31.1%
density        52.87   16.0%
denscomm        0.54    0.2%
densimbal       3.03    0.9%
hydrofrc       40.93   12.4%
hydcomm         0.41    0.1%
hydimbal        1.74    0.5%
hmaxupdate      0.41    0.1%
domain         14.95    4.5%
potential       0.00    0.0%
predict         1.07    0.3%
kicks           3.44    1.0%
i/o             8.40    2.6%
peano           3.98    1.2%
sfrcool         0.00    0.0%
blackholes      0.00    0.0%
fof/subfind     0.00    0.0%
smoothing       0.00    0.0%
misc           22.39    6.8%

Step 2, Time: 0.020308, CPUs: 128
total         469.30  100.0%
treegrav      189.61   40.4%
treebuild       5.15    1.1%
treeupdate      0.00    0.0%
treewalk      170.28   36.3%
treecomm        1.19    0.3%
treeimbal      11.01    2.3%
pmgrav         73.39   15.6%
sph           143.94   30.7%
density        68.08   14.5%
denscomm        0.62    0.1%
densimbal       5.01    1.1%
hydrofrc       61.47   13.1%
hydcomm         0.59    0.1%
hydimbal        4.40    0.9%
hmaxupdate      0.62    0.1%
domain         18.06    3.8%
potential       0.00    0.0%
predict         1.75    0.4%
kicks           5.58    1.2%
i/o             8.40    1.8%
peano           5.63    1.2%
sfrcool         0.00    0.0%
blackholes      0.00    0.0%
fof/subfind     0.00    0.0%
smoothing       0.00    0.0%
misc           22.93    4.9%
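
The per-step timing rule from section 6 (total of step 2 minus total of
step 0, divided by two) can be sketched as a small script. This is only
an illustration and is not part of the benchmark suite or of JuBE: the
function names totals_per_step and time_per_step are hypothetical, and
the parser assumes that cpu.txt holds one entry per line, with a
"Step N, ..." header followed by a "total" line, as in the example
listing.

```python
import re

# Hypothetical helpers for illustration; not shipped with GADGET or JuBE.
STEP_RE = re.compile(r"^Step\s+(\d+),")
TOTAL_RE = re.compile(r"^total\s+([0-9.]+)")


def totals_per_step(text):
    """Return {step number: cumulative 'total' time in seconds}."""
    totals = {}
    step = None
    for line in text.splitlines():
        line = line.strip()
        match = STEP_RE.match(line)
        if match:
            step = int(match.group(1))
            continue
        match = TOTAL_RE.match(line)
        if match and step is not None:
            totals[step] = float(match.group(1))
    return totals


def time_per_step(totals, first=0, last=2):
    """Average wallclock time per step, excluding the start-up cost of 'first'."""
    return (totals[last] - totals[first]) / (last - first)


# Excerpt of the 128-processor example, reduced to the lines that matter.
sample = """\
Step 0, Time: 0.02, CPUs: 128
total 197.76 100.0%
treegrav 61.88 31.3%
Step 1, Time: 0.0201534, CPUs: 128
total 329.44 100.0%
Step 2, Time: 0.020308, CPUs: 128
total 469.30 100.0%
"""

totals = totals_per_step(sample)
# prints: Time_per_Step = 135.77 seconds
print(f"Time_per_Step = {time_per_step(totals):.2f} seconds")
```

To apply the sketch to a real run, the sample string would be replaced
by the contents of the cpu.txt file found in the run_XXX output
directory of that run.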