This is the README file for the NEMO application benchmark,
distributed with the DEISA Benchmark Suite:
http://www.deisa.eu/science/benchmarking/

Last modified by the DEISA Benchmark Team on 2008-08-25.

-----------
NEMO readme
-----------

Contents
--------

1. General description
2. Code structure
3. Parallelization
4. Building the code
5. Running the code
6. Output data

1. General description
======================

NEMO (Nucleus for European Modelling of the Ocean) is a state-of-the-art
modeling framework for oceanographic research and operational
oceanography. The core of the system is a primitive equation model
applied to both regional and global ocean circulation. It is intended to
be a flexible tool for studying the ocean, the sea-ice and their
interactions with the other components of the earth climate system
(atmosphere, land surfaces, vegetation, ...).

NEMO is written entirely in Fortran 90 and is parallelized with a domain
decomposition using the MPI library. All output is written with the
NetCDF library (release 3.6.1).

This benchmark has been built to test the performance of the OPA OGCM
component (release 9.0) on various HPC platforms.

General information about the physics of OPA can be found on the web at
http://www.lodyc.jussieu.fr/opa
or by contacting the OPA System Team (ESOPA) at
opatlod@locean-ipsl.upmc.fr

2. Code structure
=================

NEMO basic building blocks are the following:

a) Read input file (namelist file).
b) Initialisation of data and 2D domain decomposition along the two
   horizontal directions on a processor grid.
c) Main loop of "nitend" time steps (nitend parameter is set in namelist
   input file and depends on timesteps parameter set in bench-ARCH.xml):
   1) At each time step, uses the Red-Black Successive Over-Relaxation
      algorithm (SOR) to solve elliptic equations in NEMO,
   2) Every "nwrite" time steps, it writes output files (nwrite parameter
      is set in namelist file and depends on writetimesteps parameter set
      in bench-ARCH.xml):
      - if nwrite > nitend or nwrite = nitend, each MPI task writes four
        output files once, at the end of the run. For the DEISA
        benchmark, nwrite = nitend for runs labeled "without IO".
      - if nwrite < nitend, each MPI task writes its four output files
        every nwrite time steps.
   3) Every "nstock" time steps, it writes restart files (nstock parameter
      is set in namelist file and depends on restarttimesteps parameter
      set in bench-ARCH.xml):
      - if nstock > nitend or nstock = nitend, each MPI task writes one
        restart file once, at the end of the run. For the DEISA
        benchmark, we always set nstock = nitend.
      - if nstock ...

   ... where NEWARCH is the same as in the directory ~/platform.
   Set values for compilers (names, default flags) and all needed
   library paths.

3) For each application, create a new top-level xml file for the new
   architecture. Most often this is done by copying one of the files
   already available, for example:

      cp bench-jump.xml bench-NEWARCH.xml

   Edit bench-NEWARCH.xml and correct the values according to NEWARCH.
   Change the values of $threadspertask, $taskspernode and $nodes
   according to the characteristics of NEWARCH.

4) Edit compile.xml.
   Create a new section where NEWARCH is the same as in the directory
   DEISA_BENCH/platform. Set the values in the new compile section to
   those appropriate for the new architecture. Pay particular attention
   to the values:
   ARFLAGS, FFLAGS, F90FLAGS, CFLAGS, CXXFLAGS, LDFLAGS

5) Run the compile step with JuBE:
   edit bench-NEWARCH.xml and be sure that you have set something like
   this:

   This will build 3 binaries of 1/4 degree ORCA model (confcoef="50")
   for 3 runs on 64, 128 and 256 MPI tasks (threadspertask="1",
   taskspernode="4" and nodes="16,32,64").

   If the compile process fails, go to the directory where JuBE has run
   the compile command (tmp/.../src) and try to run the command gmake
   manually.
   Analyze the error and try to fix it by modifying the file
   Makefile.defs. Once you have a working Makefile.defs file, report the
   correct configuration values back into the compile.xml file.

5. Running the code
===================

For each benchmark (e.g. number of processors and data set), JuBE
automatically generates the binaries, the job script and the directory
where the job is run and where NEMO will write its output files. The file
execute.xml describes how the job script is set up for the different
architectures (see section ). Inputs for the different benchmark cases
are taken from the directory "input", as described in the prepare.xml
file.

To select a given benchmark case, edit the file bench-ARCH.xml and set
active="1" in the benchmark tag you are interested in (set active="0"
for the others).

To run the benchmarks within JuBE, simply execute:

   ../../bench/jube bench-ARCH.xml

To run the benchmarks manually:
- Create a run directory in a filesystem visible to all compute nodes.
- Copy the binary and the input file namelist into this run directory
  (check the values of the parameters nitend, nwrite and nstock, which
  control your run).
- Create a job script suitable for your queuing system containing the
  following command. For example, to run NEMO with the 1/4 degree
  configuration on 64 processors:

     ./nemo.50.64 > nemo.out

  preceded by the program used to load the executable on the remote
  nodes, for example "mpirun -n 64".

Remarks:
- timesteps parameter (in bench-ARCH.xml) sets the length of the run
  (it is the number of time steps of the simulation). Currently the only
  possible value is 1500, because reference output to validate runs is
  only available for this value.
- restarttimesteps parameter (in bench-ARCH.xml) sets the frequency at
  which the restart files are written. We always set
  restarttimesteps=timesteps to write restart files only once, at the end
  of the run.
- writetimesteps parameter (in bench-ARCH.xml) sets the frequency at
  which the output files are written.
  - For runs with maximum IO, we set writetimesteps=60. Thus, files are
    written 1500/60=25 times during the run (every 60 time steps, which
    means every 5 days of simulation).
  - For runs with minimum IO, we set writetimesteps=timesteps to write
    output files only once, at the end of the run.

Warning:
- A deadlock in MPI communications can appear when you run big
  configurations of the model (confcoef=25, 50 and 150). In this case,
  you get no error message and your job ends when it reaches the walltime
  limit. By default, NEMO uses standard-mode MPI sends, so MPI_Send can
  block when the message is large enough. To use another send mode, just
  change the value of the c_mpi_send variable defined in the input file
  (input/namelist.in). You can select:
     c_mpi_send = 'S' to use Standard blocking send (default),
     c_mpi_send = 'B' to use Buffered blocking send,
     c_mpi_send = 'I' to use Immediate non-blocking send.
  A good choice is immediate non-blocking sends.

6. Output data
==============

6.1 Timings:
------------

Timing information can be found:
- on standard output (or in the output of the batch job), issued by the
  time command,
- in the nemo.out file (timing per MPI task), issued by the binary.

6.2 Validating the run:
-----------------------

To validate a run, just check the solver.stat file it produces (this is
the output of the SOR solver). Here is an extract of the file
solver.stat:

   it : 1 niter : 104 res : 0.9178231072E-05 b : 0.9345697597E-01
   it : 2 niter : 112 res : 0.8303733661E-05 b : 0.2356295440E+00
   it : 3 niter : 319 res : 0.9851944414E-05 b : 0.7258561144E+02
   it : 4 niter : 312 res : 0.9921774191E-05 b : 0.6812613629E+01

For each time step (it), you get the corresponding number of iterations
of the solver (niter), the residual (res) and the norm of the right-hand
side (b).

You must compare the values in your solver.stat to the corresponding
reference values given in the file solver_coef_...stat in the reference
directory.
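The comparison against the reference file can be scripted. The following is a minimal sketch, assuming that every record of solver.stat has the "it / niter / res / b" layout of the extract above. The names parse_solver_stat and compare_runs and the parameters res_tol, niter_tol and check_b are hypothetical helpers introduced for illustration; they are not part of NEMO or JuBE. With the default arguments the comparison is strict.

```python
import re
from typing import List, NamedTuple


class SolverStep(NamedTuple):
    it: int       # time step index
    niter: int    # number of solver iterations at this time step
    res: float    # residual
    b: float      # norm of the second member (right-hand side)


# Matches one record such as:
#   it : 1 niter : 104 res : 0.9178231072E-05 b : 0.9345697597E-01
# Assumption: all records of solver.stat follow this layout.
_RECORD = re.compile(
    r"\bit\s*:\s*(\d+)\s+niter\s*:\s*(\d+)\s+res\s*:\s*(\S+)\s+b\s*:\s*(\S+)"
)


def parse_solver_stat(text: str) -> List[SolverStep]:
    """Turn the text of a solver.stat file into a list of records."""
    return [
        SolverStep(int(it), int(niter), float(res), float(b))
        for it, niter, res, b in _RECORD.findall(text)
    ]


def compare_runs(
    run: List[SolverStep],
    ref: List[SolverStep],
    res_tol: float = 0.0,    # hypothetical knob: allowed difference in res
    niter_tol: int = 0,      # hypothetical knob: allowed difference in niter
    check_b: bool = True,    # hypothetical knob: also require identical b
) -> List[int]:
    """Return the time steps (it) at which run deviates from ref."""
    if len(run) != len(ref):
        raise ValueError("the two runs do not have the same number of records")
    mismatches = []
    for mine, theirs in zip(run, ref):
        differs = (
            mine.it != theirs.it
            or abs(mine.niter - theirs.niter) > niter_tol
            or abs(mine.res - theirs.res) > res_tol
            or (check_b and mine.b != theirs.b)
        )
        if differs:
            mismatches.append(theirs.it)
    return mismatches


if __name__ == "__main__":
    extract = """\
    it : 1 niter : 104 res : 0.9178231072E-05 b : 0.9345697597E-01
    it : 2 niter : 112 res : 0.8303733661E-05 b : 0.2356295440E+00
    it : 3 niter : 319 res : 0.9851944414E-05 b : 0.7258561144E+02
    it : 4 niter : 312 res : 0.9921774191E-05 b : 0.6812613629E+01
    """
    steps = parse_solver_stat(extract)
    print(len(steps), steps[2].niter)   # prints "4 319"
    print(compare_runs(steps, steps))   # prints "[]"
```

In practice the two texts would be read from your solver.stat and from the reference file. An empty list means that the run matches the reference within the chosen tolerances.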
To be sure that 2 runs are identical (a run on N processors versus a run
on M processors), the 2 associated solver.stat files must show no
differences: they should be strictly identical, bit for bit. This holds
only for two runs of the same model configuration, on the same
architecture and with the same fixed nitend value:
- If you run on two different architectures, some numerical differences
  may appear in the values of res and b. Only variations of res are
  really significant, and they should be less than 1.0E-08. The value of
  it must not change, and niter should be the same or differ only
  slightly (+/-2).
- If you change the model configuration, the two files will be completely
  different.

6.3 Other files:
----------------

Except for the solver.stat file, which is a text file, all other output
files are NetCDF files.

Outputs are distributed among all MPI tasks. Each MPI task writes at
least one restart file and 4 output files (named GYRE*.nc) at the end of
the run. For runs with IO, the difference is that the 4 output files are
written several times during the run.

The global amount of I/O depends on:
- the chosen configuration. The larger the configuration, the larger the
  global amount of I/O.
- the frequency of writing output files (nwrite parameter in namelist)
  for runs with IO. The higher the frequency, the larger the amount of
  I/O.
- the number of time steps (nitend parameter in namelist) to perform
  during the simulation.

Because each NetCDF file contains a header part, the global I/O volume
increases slightly with the number of processors (there are more files).
Here are the total amounts of I/O for different configurations of NEMO:

- 1/2 degree GYRE configuration:
  For runs with I/O minimum: nitend=nstock=nwrite=1500
  For runs with I/O maximum: nitend=nstock=1500 and nwrite=60 (25 outputs)

            Number of |      I/O volume (GB)       |
            processes |  GYRE*  | restart |  Total |
   --------------------------------------------------
   min I/O        64  |    1.65 |    1.16 |   2.81 |
   min I/O       128  |    1.66 |    1.16 |   2.82 |
   min I/O       256  |    1.69 |    1.16 |   2.85 |
   min I/O       512  |    1.73 |    1.16 |   2.89 |
   --------------------------------------------------
   25 I/O         64  |   11.90 |    1.16 |  13.06 |
   25 I/O        128  |   12.44 |    1.16 |  13.60 |
   25 I/O        256  |   13.00 |    1.16 |  14.16 |
   25 I/O        512  |   13.58 |    1.16 |  14.74 |
   --------------------------------------------------

- 1/4 degree GYRE configuration:
  For runs with I/O minimum: nitend=nstock=nwrite=1800
  For runs with I/O maximum: nitend=nstock=1800 and nwrite=60 (30 outputs)
                             or nwrite=20 (90 outputs)

            Number of |      I/O volume (GB)       |
            processes |  GYRE*  | restart |  Total |
   --------------------------------------------------
   min I/O        64  |    1.74 |    4.58 |   6.32 |
   min I/O       128  |    1.78 |    4.58 |   6.36 |
   min I/O       256  |    1.82 |    4.58 |   6.40 |
   min I/O       512  |    1.86 |    4.58 |   6.44 |
   min I/O      1024  |    1.96 |    4.58 |   6.54 |
   min I/O      2048  |    2.09 |    4.58 |   6.67 |
   --------------------------------------------------
   30 I/O         64  |   50.49 |    4.58 |  55.07 |
   30 I/O        128  |   51.68 |    4.58 |  56.26 |
   30 I/O        256  |   52.90 |    4.58 |  57.48 |
   30 I/O        512  |   54.00 |    4.58 |  58.58 |
   30 I/O       1024  |   56.51 |    4.58 |  61.09 |
   30 I/O       2048  |   59.99 |    4.58 |  64.57 |
   --------------------------------------------------
   90 I/O        128  |  155.00 |    4.58 | 159.58 |
   90 I/O        256  |  158.60 |    4.58 | 163.18 |
   --------------------------------------------------

- 1/12 degree GYRE configuration:
  For runs with I/O minimum: nitend=nstock=nwrite=600
  For runs with I/O maximum: nitend=nstock=600 and nwrite=60 (10 outputs)

            Number of |      I/O volume (GB)       |
            processes |  GYRE*  | restart |  Total |
   --------------------------------------------------
   min I/O      1024  |   15.99 |   41.09 |  57.08 |
   min I/O      2048  |   16.34 |   41.09 |  57.43 |
   --------------------------------------------------
   10 I/O       1024  |  155.50 |   41.09 | 196.59 |
   10 I/O       2048  |  158.77 |   41.09 | 199.86 |
   --------------------------------------------------

-------------------------------------------------
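The output counts quoted with the tables (25, 30, 90 and 10 outputs) and the "Total" column follow from simple arithmetic. The sketch below reproduces it; the function names output_count and total_io_gb are hypothetical and chosen for illustration only. It assumes that output files are written every nwrite time steps when nwrite is smaller than nitend, and once at the end of the run otherwise.

```python
def output_count(nitend: int, nwrite: int) -> int:
    """Number of times the output files are written during a run.

    Assumption: one write every nwrite time steps when nwrite < nitend,
    and a single write at the end of the run otherwise.
    """
    if nitend <= 0 or nwrite <= 0:
        raise ValueError("nitend and nwrite must be positive")
    if nwrite >= nitend:
        return 1
    return nitend // nwrite


def total_io_gb(gyre_gb: float, restart_gb: float) -> float:
    """Total I/O volume: GYRE* output files plus restart files."""
    return round(gyre_gb + restart_gb, 2)


if __name__ == "__main__":
    # (nitend, nwrite) pairs taken from the configurations listed above.
    for nitend, nwrite in [(1500, 60), (1800, 60), (1800, 20), (600, 60)]:
        print(nitend, nwrite, output_count(nitend, nwrite))
    # 1/2 degree configuration, 25 outputs, 64 processes.
    print(total_io_gb(11.90, 1.16))
```

The same two helpers can be used to estimate how many output phases a run will perform before choosing a value for writetimesteps.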