This is the README file for the PEPC application benchmark,
distributed with the DEISA Benchmark Suite:
http://www.deisa.eu/science/benchmarking/

Last modified by the DEISA Benchmark Team on 2010-04-15

-----------
PEPC readme
-----------

Contents
--------

1. General description
2. Code structure
3. Parallelization
4. Building
5. Execution
6. Output data

1. General description
======================

PEPC is a parallel tree code for rapid computation of long-range (1/r)
Coulomb forces for large ensembles of charged particles. The heart of
the code is a Barnes-Hut style algorithm employing multipole expansions
to accelerate the potential and force sums, leading to a computational
effort O(N log N) instead of the O(N^2) which would be incurred by
direct summation. Parallelism is achieved via a 'Hashed Oct Tree'
scheme, which uses a space-filling curve to map the particle
coordinates onto processors.

The kernel (tree routines and force computation) is separated from the
application 'front-end' so that the code can be easily adapted to both
electrostatic and gravitational problems. Currently this code family
consists of:

  i)   PEPC-B for modeling high-intensity laser and particle beam
       interactions with dense plasmas
  ii)  PEPC-E, a stand-alone Coulomb-solver code with a transparent
       interface to the kernel
  iii) PEPC-G = PEGS for studying star-disc encounters

The DEISA benchmark version uses the PEPC-E front-end. It is the first
code to employ a mesh-free algorithm for modeling kinetic (non-fluid)
phenomena such as non-local electron transport and particle
acceleration in dense plasmas. In contrast to traditional mesh-based
Particle-in-Cell codes, PEPC-E can operate in a fully collisional or
strongly coupled regime, and can tackle open-boundary problems.
Currently PEPC is being used to investigate ion acceleration in
Petawatt laser-plasma interactions - issues relevant to the Fast
Ignitor Laser-Fusion concept (HIPER, ELI).

2. Code structure
=================

The code is divided into kernel routines and 'front-end' applications.
The source directory is structured in the following way:

src/
  lpepcsrc/            Kernel tree routines comprising library liblpepc.a
  pepc-e/              Source code of PEPC-E code
  makefile             Makefile for compiling library and code
  makefile.defs.jube   Makefile definitions for compiler, flags etc.
  run.h                Sample input file

The plasma application front-end in PEPC-E has the following structure:

pepce.f90            Main program
  openfiles.f90      Set up I/O files
  setup.f90          Set defaults, read input parameters (see 5.) and
                     allocate fields
  pepc_setup         Call to kernel routine setup_treearrays.f90 in
                     lpepcsrc/ - allocate particle and tree arrays
  configure.f90      Initialise particle properties: positions,
                     velocities, charges, masses

  < Main loop in pepc.f90:
    pepc_fields      Call to fields_p.f90 in lpepcsrc/ - construct
                     tree and compute fields on particles
    velocities       Calculate velocities from accelerations
    push_x           Update particle position
    energy_cons      Calculate potential and kinetic energies
    write_particles  Output of particle data
  />

  pepc_cleanup
  closefiles         Tidy up I/O

3. Parallelization
==================

A detailed description of each kernel routine in the lpepcsrc/
directory is beyond the scope of this document. However, a brief
summary is provided here to help identify potential performance
bottlenecks.

Routine           Function
-----------------------------------------------------------------------
tree_domains      Construct keys from particle coordinates
                  Sort keys (pbal_sort in tree_utils)
                  Domain decomposition and load balancing
tree_build        Local tree construction (hash table)
tree_branches     Construct branch nodes
tree_fill         Fill in top level local tree nodes
tree_properties   Compute multipole moments
tree_aswalk       Construct interaction lists (tree traversal)
sum_forces        Compute forces and potential

All of the above routines can be performed in parallel, and thus
require a computational effort O(N/P), give or take a slowly varying
logarithmic factor. Typical single-timestep benchmarks are illustrated
in the table below. Most of the time is spent in the tree-traversal and
force-summation routines: the total overhead incurred by the tree
construction, which includes the domains, build, branches, fill and
properties routines above, is around 3-5%, although this figure
excludes tree nodes copied locally during the traversal.

Routine                 8 CPUs   16 CPUs   64 CPUs
------------------------------------------------
Domain decomposition     0.2      0.24      0.33
Tree building            2.3      2.3       2.7
Tree traversal          32.9     36.1      40.8
Force summation         64.4     61.2      55.7

The increasing fraction of time spent in the tree traversal reflects
the rising communication overhead - exchange of multipole information -
with number of CPUs. This fraction can also vary depending on the
geometry: highly clustered systems will require deeper searches for
interaction partners, and so longer traversals.

4. Building
===========

The source code (Fortran 90) resides in directories src/lpepcsrc and
src/pepc-e. It is assumed that MPI is pre-installed locally.

All source files are copied by JuBE to the compilation directory, in
general located in PEPC/tmp/... . Log files for each of the following
steps can be found in this directory, too.

The compilation of PEPC is done by JuBE.
First, JuBE generates the appropriate makefile.defs from the file
src/makefile.defs.jube. The basic compiler and linker flags are taken
from the 'compile.xml' file in the PEPC directory and the
'platform.xml' file in the DEISA_BENCH/platform/ directory. Special
optimization and linking flags can be specified in the top-level xml
file of the corresponding platform, e.g. bench-IBM-P6-vip.xml, using
the 'opflags' and 'aldflags' variables, respectively. Finally, JuBE
compiles the code by executing the 'make' command.

5. Execution
============

The type of PEPC run is specified in the top-level xml file in the
'params' section with the variable 'geom'. 'geom = 1' generates a
homogeneous particle distribution, 'geom = 2' a single sphere of
particles and 'geom = 3' two spheres. Furthermore, the following
parameters can be specified:

npart       Number of particles
nt          Number of time steps
wscheme     Communication scheme (0: point-to-point, 1: collectives)
fetchmult   Memory allocation parameter (machine dependent)
npmult      Memory allocation parameter (machine dependent)
idump       Frequency of particle data output (=0 no output)
writemode   Format of particle data output (text, binary, sionlib)

The 'idump' parameter determines the number of time steps after which
particle data are written to disk. Setting it >0 allows for I/O
testing.

For the DEISA benchmarks a uniform sphere of ions is set up (geom = 2)
which is allowed to expand under its own electrostatic forces (Coulomb
explosion), and the following parameters are used:

geom        2
npart       5000000, 15000000, 25000000
nt          20
wscheme     1
fetchmult   3
npmult      -40, -85, -100
idump       0
writemode   text

The value of 'npmult' depends on the architecture the code runs on and
on the number of particles used. In general, npmult=-40 is used.
However, on larger shared memory systems like IBM Power6, npmult=-85
and/or npmult=-100 have to be used.

JuBE prepares the input file 'run.h' using the input/run.h.jube
template and copies it to the run directory.
Then JuBE generates the jobscript and submits the job.

6. Output data
==============

The output files are stored in the run directory, the stdout and stderr
in the log/ directory. Important output files are:

energy.dat       Kinetic and potential energies expressed in keV PER
                 PARTICLE (10 y-columns in ASCII format)
                 Total energy should be conserved
timing.dat       Timing information
trajectory.dat   Particle positions
run.out          Printed diagnostics

The 'trajectory.dat' file is compared with the reference data in the
corresponding file in reference/traj/ in order to verify the results.

PEPC reports four timings in the stdout file:

pepc timing - pre:     Time for setup
pepc timing - inner:   Time for the main loop
pepc timing - post:    Time for post-main-loop work
pepc timing - total:   Total time used by PEPC

For the DEISA benchmark, simulations of 20 time steps are performed and
the average time per time step of the main loop is reported. Since
idump = 0 is used, I/O is excluded from the measurement.
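
The reported benchmark figure (main-loop time divided by the number of
time steps) can be recomputed by hand from the stdout file. The Python
sketch below is only an illustration and is not part of PEPC or JuBE.
It assumes that each timing appears on its own line as
'pepc timing - <label>: <number>'; the exact stdout layout is not
documented here, so the pattern may need adjusting. The helper names
parse_timings and time_per_step are hypothetical, and the sample
numbers are made up.

```python
import re

# ASSUMPTION: each timing is printed as "pepc timing - <label>: <number>".
# The real stdout layout may differ; adjust the pattern if so.
TIMING_RE = re.compile(r"pepc timing - (pre|inner|post|total):\s*([0-9.+\-eE]+)")


def parse_timings(stdout_text):
    """Return {label: value} for every 'pepc timing' line found."""
    return {m.group(1): float(m.group(2)) for m in TIMING_RE.finditer(stdout_text)}


def time_per_step(stdout_text, nt=20):
    """Average main-loop time per time step: 'inner' timing divided by nt."""
    timings = parse_timings(stdout_text)
    if "inner" not in timings:
        raise ValueError("no 'pepc timing - inner' line found")
    return timings["inner"] / nt


# Made-up sample output, for illustration only.
sample = """\
pepc timing - pre: 1.5
pepc timing - inner: 40.0
pepc timing - post: 0.5
pepc timing - total: 42.0
"""

print(time_per_step(sample, nt=20))
```

Here nt defaults to 20 to match the DEISA benchmark setting; for other
runs, pass the 'nt' value used in the top-level xml file.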