This is the README file for the IFS application benchmark,
distributed with the DEISA Benchmark Suite:
http://www.deisa.eu/science/benchmarking/

Last modified by the DEISA Benchmark Team on 2009-05-06.

--------------
The IFS readme
--------------

The IFS Forecast Model Benchmark - Version IFS_35r2_Feb09

  1 Introduction
  2 Directory Structure
  3 Building and Execution
  4 Input Data
  5 Output Data
  6 Analysis and Verification
  7 Reference Output
  8 Additional Information

1 Introduction
==============

The ECMWF Integrated Forecast System (IFS) model benchmark (version
IFS_35r2_Feb09) was incorporated into the DEISA benchmark suite in May 2009.
The benchmark is based on the latest cycle, 35r2, plus some technical
enhancements due to be incorporated into the next cycle.

The IFS sources are not available from the DEISA benchmark site. Please see
the LICENCE file in this directory for information about how to obtain the
IFS sources from ECMWF.

Input data for a low resolution forecast at T159 is provided with the
distribution from the DEISA benchmark site. This is suitable for testing that
the build of the IFS has been successful. Further data sets at T799 and T1279
resolutions are available from the ECMWF website. These data sets are 700MB
and 1.7GB in size respectively. T799 is the resolution of the current
operational forecast at ECMWF. T1279 will be implemented in operations later
this year on the new Power6 production system.

Large portions of the IFS relating to 4DVAR Data Assimilation are not used by
the forecast model benchmark and the relevant routines have been replaced by
dummy routines.

The IFS is a mixed mode MPI/OpenMP program which at high resolution scales
well up to several thousand processor cores. It is written mainly in Fortran
90/95 with some Fortran 77 and C routines.

2 Directory Structure
=====================

A brief description of the file structure for this code follows.
LICENCE      file containing information about obtaining a licence agreement
README       this file
bin/         contains a perl script that can be used to merge information
             obtained when using the inbuilt instrumentation routines.
input/       folder containing 3 subdirectories
    RTABLES/            data tables used by the radiation code
    local_definitions/  data tables used by the GRIB decoding routines
    T159/               input data used by the T159 model run
job/         folder containing a file prepare_data_files.in used by JuBE to
             set up the input data for a run.
reference/   folder containing reference output files
run/         folder where the executable should be after the compilation.
             Also contains a perl script used by JuBE to check that a run
             has executed with correct results.
src/         folder which will contain the file src.contents.tar.gz obtained
             from the ECMWF website

3 Building and Execution
========================

Background Information
----------------------

Before describing how to build the IFS with JuBE, the underlying IFS build
mechanisms are discussed.

The IFS is a very large and complex application that has been substantially
reduced in size for the DEISA benchmark suite, from almost 3M lines of code to
approx 700K. The application is split into different project areas, each of
which is built separately and for which a separate library is created. The
project areas, in the order that they are built, are

  ifsaux  trans  surf  algor  ifs  dummy  ifsiodummy

Directories for each project area will be found in the src directory after it
has been unpacked by JuBE. Other directories present in src are

  build  gribex  lib  log  scripts  ECMWF_logfiles

build contains the master IFS Makefile along with project specific Makefiles
such as Makefile.root.trans, which has dependency rules for building trans.
These files should not need any modification. Also present in build is a
subdirectory called arch. Normally this contains a set of architecture
specific files such as Makefile.in.ibm_power6, Makefile.in.opteron,.....
For this benchmark it contains a template file Makefile.in used by JuBE to
generate a Makefile.in.NEWARCH for the target architecture (NEWARCH).

lib will contain, after building, the separate project libraries such as
libifsaux.a.

gribex is an auxiliary library maintained separately from the IFS which uses a
different Makefile mechanism.

log will contain, after building, the logfiles for each separate project
compilation.

scripts contains the shell and perl scripts used by the build process. It also
contains a subdirectory called include which holds some additional
configuration files. Normally for a new architecture it is necessary to
manually edit the configuration files, but this will be done by JuBE from
information contained in the compile.xml file.

ECMWF_logfiles contains a full set of logfiles from a build of the benchmark
on the ECMWF IBM Power6 machine called C1A.

Some routines in the IFS have a habit of causing compilers to run very slowly
if they are compiled with optimisation. In particular these are routines whose
names start with the strings 'rrtm_kgb' or 'srtm_kgb' in ifs/phys_radi and
'sueca' or 'sueco' in ifs/phys_ec. To overcome this a wrapper script is used
to drive the Fortran compiler. In this benchmark the wrapper script is
generated from one of two template files: scripts/ibm_wrapper.in for IBM
systems, or scripts/wrapper.in for non IBM systems. The wrapper script selects
the appropriate optimisation flags and then invokes the compiler with these
flags and the rest of the flags from the command line.

At ECMWF a separate library called EMOSLIB is linked with the IFS. It contains
routines for manipulating meteorological data. A subset of EMOSLIB called
gribex has been extracted for this benchmark, but being a separate library it
has a different build mechanism.

Note that Fortran 95 routines contained in this release are coded using one of
the following approaches

- Free source format (*.F90) and use of KIND statements.
  For these routines no auto double is required.
- Fixed source format (*.F) and no use of KIND statements.
  For these routines an auto double capability (e.g. -r8) is required.

Initially, when building for a new platform, it is recommended that OpenMP is
not used and that a safe compilation optimization level such as -O1 is tried.
When the IFS is running correctly, try turning on OpenMP support. Then try
increasing the optimization level. It is essential when running the IFS that
the execution results do not change when the number of MPI tasks or OpenMP
threads is varied. For this reason it is normally necessary to avoid the
highest levels of compiler optimization. On the IBM platforms the chosen
options are -O3 -qstrict.

Building with JuBE
------------------

To build the IFS with the JuBE tool on a new architecture (NEWARCH), do the
following steps:

1) Create a new top-level XML file for the new architecture
   (bench-NEWARCH.xml). Use one of the already existing files as a starting
   point: bench-Cray-XT4-HECToR.xml, bench-IBM-P6-vip.xml, or
   bench-IBM-BGP-Babel.xml. Normally it is sufficient to change the values of
   nodes, taskspernode and threadspertask in the tasks section and, in the
   params section, the value of nproc, which corresponds to the total number
   of MPI tasks. Choose a small number of nodes/tasks on which to run T159.
   On a vector based platform it will also be necessary to change
   scalar="true" to scalar="false". The purpose of nproma, nrproma and stats
   is explained in the additional information section below.

2) Edit compile.xml: Create a new compile section for NEWARCH, where NEWARCH
   is the same as in the file DEISA_BENCH/platform/platform.xml. Substitute
   values in the new compile section with those suitable for the new
   architecture. There are 8 separate sections which will need modification.

   scripts/include/config
     #SMS_TARGET#    The name of the target processor/architecture

   scripts/include/makelib
     #AR#            Only change from the default $ar if special flags are
                     required, such as -X64 to archive 64bit objects

   scripts/include/mkabs_fc
     #LDRFLG#        Flags for the loader, such as linking for OpenMP
     #LD#            Name of the MPI compiler/loader
     #BLAS_DIR#      Location of the optimised BLAS libraries
     #BLAS_LIB#      Name of the optimised BLAS libraries
     #LOADMAP#       Flags to generate a load map if required
     #MASSLIB#       On IBM, the names of the libmass libraries
     #EXTRALIBS#     The name of any extra libraries required

   scripts/include/project_config
     #MPIF_PATH#     The absolute path for the mpif.h file

   scripts/wrapper
     #FC#            The name of the Fortran 90 compiler
     #OPTDEFAULT#    The default optimisation level, e.g. -O3

   build/arch/Makefile
     #CC#            The name of the C compiler
     #F90FLAGS#      F90 compilation flags
     #F90AUTO#       Compiler flag for autodouble real*4 to real*8
     #F90FIXED#      Normally blank unless an IBM
     #F90FREE#       Normally blank unless an IBM
     #FF90SUFFIX#    Normally blank unless an IBM
     #FFSUFFIX#      Normally blank unless an IBM
     #FMODINC#       Normally "-I"
     #FMODEXT#       Normally "-mod", the extension for F90 modules
     #FMODUP#        Normally "no", but "yes" if module names become upper
                     case, as with the pathscale compiler
     #CCFLAGS#       C compilation flags
     #CPPFLAGS#      CPP flags for use with F90
     #CPPCLAGS#      CPP flags for use with C
     #AR#            Only change from the default $ar if special flags are
                     required, such as -X64 to archive 64bit objects

   gribex/config/config
     #AR#            Normally $ar
     #ARFLAGS#       Normally -rv unless special flags are required, such as
                     -X64 to archive 64bit objects
     #CC#            The name of the C compiler
     #CFLAGS#        C compilation and CPP flags for building gribex
     #FC#            The name of the Fortran compiler
     #FFLAGS#        Fortran compilation and CPP flags for building gribex
     #RANLIB#        Normally $ranlib

   scripts/compile.sh
     #MODULE_SWAP#   Normally blank unless a Cray
     #MODULE_LOAD    Normally blank unless a Cray

3) Edit execute.xml: Create a new execute section for NEWARCH, where NEWARCH
   is the same as in the file
   DEISA_BENCH/platform/platform.xml. Substitute values in the new execute
   section with those suitable for the new architecture. Note that the
   following environment variables are used by the IFS

     LOCAL_DEFINITION_TEMPLATES=./local_definitions
     DR_HOOK=0
     DR_HOOK_OPT="none"
     NCPUS_PER_NODE=N    where N is the number of cpus on a node

   Other environment variables set in the environment section are platform
   specific.

4) Execute

     perl ../../bench/jube bench-NEWARCH.xml

Note that the compilation of the IFS takes at least 30 minutes, and even
longer on a busy system. It may be necessary to do the compilation in batch
or, if allowed, to use nohup. It is possible to monitor the progress of the
compilation by cd'ing to the temporary directory created by JuBE for the
build. Then cd src/log and examine the log files such as bld_ifsaux.job.out.
The compilation script will be executing the following commands in turn

  ./bld_job bld_ifsaux.job
  ./bld_job bld_trans.job
  ./bld_job bld_surf.job
  ./bld_job bld_ifs.job
  ./bld_job bld_dummy.job
  ./bld_job bld_ifsiodummy.job
  ./bld_job gribex.job

followed by

  ./bld_job mkabs_fc.job

If the compilation and build is successful, the executable will be copied into
the top level IFS/run directory and a job will be submitted for execution.

The more likely event is that the initial compilation will fail. Then the log
files can be examined to find the cause of the failure, and manual changes can
be made to the Makefile and configuration files to fix the problem. bld_job
can also be run manually to check that a change has fixed the problem. Just

  cd src/scripts
  ./bld_job bld_......

Remember to reflect any changes made in the top level compile.xml file.

In the event that problems are encountered within the gribex library, it is
useful to know that a small example test program is included in the gribex
directory.

  cd src/gribex/examples
  make
  ./test.sh

test.sh runs a demonstration program that decodes some data.
Reference output can be found in the file
IFS/src/gribex/examples/agrdemo_output

If the IFS is linked manually, the executable will be placed in the temporary
directory created by JuBE for the build. This should be copied into IFS/run.
Then modify bench-NEWARCH.xml to change version="new" to version="reuse" and
execute jube as before.

4 Input Data
============

For a run of T159 the following files and directories will be used

  159l_2/        ICMGGf410INIT   ICMSHf410INIT  fort.4              ref_0159_0024
  ICMCLf410INIT  ICMGGf410INIUA  RTABLES/       local_definitions/

Symbolic links are created by the JuBE prepare step for all of these apart
from the fort.4 file, which is created from a template file by the JuBE
execute step.

The ICM... files contain GRIB encoded meteorological fields representing the
state of the atmosphere at the start of the forecast.
The 159l_2 folder contains climatology data.
The RTABLES folder contains data tables used by the radiation code.
The local_definitions folder contains data tables used by the GRIB decoding
routines.
ref_0159_0024 contains reference spectral norm values used to check for
correctness at the end of a run.

The T159 input data sets are included in the tarball downloaded from the DEISA
website. Higher resolution data sets at T799 and T1279 are available from the
ECMWF ftp site.

For T799 the files/directories are

  799l_2/        ICMGGf40zINIT   ICMSHf40zINIT  fort.4              ref_0799_0024
  ICMCLf40zINIT  ICMGGf40zINIUA  RTABLES/       local_definitions/

and for T1279 the files/directories are

  1279l_2/       ICMGGf411INIT   ICMSHf411INIT  fort.4              ref_1279_0064
  ICMCLf411INIT  ICMGGf411INIUA  RTABLES/       local_definitions/

5 Output Data
=============

For a normal run of T159 only 3 files will be produced. They are

  NODE.001_01  ifs.stat  res_0159_0024

For T799 they are

  NODE.001_01  ifs.stat  res_0799_0040

and for T1279

  NODE.001_01  ifs.stat  res_1279_0064

These 3 files will be concatenated into the stdout log for a job.
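
The output file names listed above can be checked with a small helper. The
following is only a sketch: the function names, the run directory argument and
the assumption that the result file is always named res_RRRR_SSSS (two
zero-padded four-digit fields) are inferred from the three examples above and
are not part of the benchmark.

```python
import os
from typing import List


def expected_outputs(resolution: int, suffix: int) -> List[str]:
    """Return the three output file names for one run.

    resolution is the T number (159, 799, 1279); suffix is the trailing
    four-digit field seen in the examples (24, 40, 64). The res_ naming
    pattern is an assumption generalised from those examples.
    """
    return ["NODE.001_01", "ifs.stat", "res_%04d_%04d" % (resolution, suffix)]


def missing_outputs(run_dir: str, resolution: int, suffix: int) -> List[str]:
    """List the expected output files that are not present in run_dir."""
    return [
        name
        for name in expected_outputs(resolution, suffix)
        if not os.path.isfile(os.path.join(run_dir, name))
    ]


if __name__ == "__main__":
    # Hypothetical usage: check the current directory after a T159 run.
    print(missing_outputs(".", 159, 24))
```

An empty list means all three files were found in the given directory.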

6 Analysis and Verification
===========================

ifs.stat contains timing information for each timestep. Here is some sample
output

                                   A             B
  09:55:20 0AAA00AAA STEPO  1  12.17  12.17   6.29  0:13  0:08
  09:55:23 0AAA00AAA STEPO  2   4.86   4.86   2.49  0:17  0:11
  09:55:25 0AAA00AAA STEPO  3   4.87   4.87   2.49  0:22  0:13
  09:55:31 0AAA00AAA STEPO  4  12.04  12.04   6.10  0:34  0:19
  09:55:34 0AAA00AAA STEPO  5   4.83   4.83   2.49  0:39  0:22
  09:55:36 0AAA00AAA STEPO  6   4.83   4.83   2.49  0:44  0:24

These times are for the master task. Column A is the cpu time for the timestep
(12.17 for step 1), column B is the elapsed time for the timestep (6.29 for
step 1). This run was done with 4 OpenMP threads using SMT. The cpu time
cannot be more than twice the wall time since 2 threads share each cpu core.
Steps 1 and 4 are radiation timesteps; steps 2, 3, 5 and 6 are normal
timesteps.

The file NODE.001_01 is output by the master task and contains detailed
information about the progress of an execution. It also contains DEISA
specific timing information that gives times for 8 groups of timesteps.

At T159 a group consists of 3 timesteps, each of 3600 forecast seconds
duration; 2 are normal timesteps and the third is a more expensive radiation
timestep. Hence the T159 run corresponds to a 24 hour forecast.

At T799 the timestep length is 720 forecast seconds and a group of timesteps
corresponds to 1 forecast hour, consisting of 4 normal timesteps and 1
radiation timestep. Hence the T799 run corresponds to an 8 hour forecast.

At T1279 the timestep length is 450 forecast seconds and again a group of
timesteps corresponds to 1 forecast hour, but now consists of 7 normal
timesteps and 1 radiation timestep.

For DEISA timing purposes the times for each group are measured, the first and
last groups are ignored and the remainder are averaged.

The IFS benchmark automatically validates the results against results obtained
from a run at ECMWF. Spectral norms are computed at the end of the run and
compared with the norms held in the ref_.... file.
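
The group arithmetic and the DEISA timing rule described in this section
(first and last groups ignored, remainder averaged) can be sketched as
follows. This is an illustration only: the function names and the sample
group times are invented here, and the benchmark itself reports these values
through JuBE.

```python
from statistics import mean
from typing import Sequence


def forecast_hours(groups: int, steps_per_group: int, step_seconds: int) -> float:
    """Forecast length in hours covered by the timed groups of timesteps."""
    return groups * steps_per_group * step_seconds / 3600.0


def deisa_group_average(group_times: Sequence[float]) -> float:
    """Average the group times after dropping the first and last group."""
    if len(group_times) < 3:
        raise ValueError("need at least 3 groups to drop the first and last")
    return mean(group_times[1:-1])


if __name__ == "__main__":
    # Group layouts taken from the text: 8 groups per run.
    print("T159 :", forecast_hours(8, 3, 3600))  # 3 steps of 3600 s per group
    print("T799 :", forecast_hours(8, 5, 720))   # 4 normal + 1 radiation step
    print("T1279:", forecast_hours(8, 8, 450))   # 7 normal + 1 radiation step

    # Hypothetical elapsed times (seconds) for the 8 groups of one run.
    sample = [10.0, 8.0, 8.2, 7.8, 8.0, 8.1, 7.9, 12.0]
    print("average of middle groups:", deisa_group_average(sample))
```

Dropping the first and last groups keeps start-up and end-of-run costs out of
the reported average.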
A run is considered to be correct if the spectral norms differ by less than
1%. Changing compilation options will normally lead to different results. It
is a requirement that, for the same executable, the results must not change
as the number of tasks and threads is varied.

To obtain the times and check the results of a run do

  perl ../../bench/jube N

where N is the run id supplied by JuBE when the run was launched. Output will
be the average, minimum and maximum times for the groups of timesteps and the
results of the correctness check.

7 Reference Output
==================

In the reference folder can be found stdout and stderr log files for runs of
T159, T799 and T1279 on the Cray XT4 system called HECToR.

8 Additional Information
========================

In the bench-..... file for each platform, the values of nproma, nrproma and
stats can be changed for each execution.

nproma and nrproma are used to control the length of the innermost loops
(grid point chunks) used in the grid point and radiation calculations. A value
of zero tells the IFS to use one of two sets of default values, suitable for
either a scalar or a vector machine. The values chosen will be output in the
NODE.... output file.

Values different from the default ones can be obtained by setting nproma and
nrproma. If the values are positive then the IFS will adjust them slightly to
get a best fit to the number of grid points per task and the number of grid
point chunks per thread. If the values are negative then the IFS will make
them positive but not adjust them, e.g. nproma="-24" and nrproma="-8" will
result in NPROMA set to 24 and NRPROMA set to 8.

On a vector machine large values should be chosen to reflect the size of the
vector registers. On a scalar machine small values will help increase the
cache hit rate at the expense of subroutine call overheads. Finding the
optimal values for an architecture will require experimentation.
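
The sign convention for nproma and nrproma described above can be summarised
in a few lines. This is a sketch of the convention only, not of the IFS code:
the class and function names are invented for illustration, and the actual
default values and the best-fit adjustment are done inside the IFS and are not
reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkSetting:
    value: Optional[int]  # None: the IFS chooses its own scalar/vector default
    mode: str             # "default", "adjustable" or "fixed"


def interpret_chunk_length(requested: str) -> ChunkSetting:
    """Interpret an nproma/nrproma attribute value as described in the text."""
    n = int(requested)
    if n == 0:
        # Zero: the IFS picks one of its two built-in default sets.
        return ChunkSetting(value=None, mode="default")
    if n < 0:
        # Negative: made positive and used exactly as given.
        return ChunkSetting(value=-n, mode="fixed")
    # Positive: starting point that the IFS may adjust slightly.
    return ChunkSetting(value=n, mode="adjustable")


if __name__ == "__main__":
    for name, text in (("nproma", "-24"), ("nrproma", "-8"), ("nproma", "0")):
        print(name, text, interpret_chunk_length(text))
```

Using negative values is the way to pin an exact chunk length while
experimenting, since positive values may be changed by the IFS.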

Setting stats="true" can be used to generate more detailed timing information
in the NODE.... output file. The times for different sections of the code and
the costs of communications (which include load imbalance) will be output.

In the execute.xml file there are two environment variables that can be very
useful. These are DR_HOOK and DR_HOOK_OPT.

Set DR_HOOK=1 if runtime failures with no tracebacks occur. The IFS calls the
dr_hook routines at the beginning and end of every subroutine. If DR_HOOK=0
then no action is taken, but if DR_HOOK=1 the software keeps track of the call
tree and will print it out on receipt of an OS failure signal.

Set DR_HOOK=1 and DR_HOOK_OPT="prof" to obtain a detailed code profile with
the time spent in every routine, the number of calls and the time spent in
child routines. The information is output in a series of files called
drhook.prof.NN, where NN corresponds to the MPI task number starting from 1.
Change to the temporary directory made by JuBE for a run in order to examine
the files. The files can also be summarised by running a utility as follows

  cat drhook* | perl ../../../bin/drhook_merge_walltime.pl > filename

where filename can be any name. Typically a profile of the IFS is very flat,
with the most expensive routines only consuming about 5 to 7% of the total.

Note that when doing benchmark timing runs always have DR_HOOK=0 and
DR_HOOK_OPT="none".
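
The three DR_HOOK configurations described above can be collected in one
place. The helper below is a sketch only: the mode names "timing",
"traceback" and "profile" are invented labels, and leaving DR_HOOK_OPT at
"none" in traceback mode is an assumption, since the text does not state a
value for that case.

```python
from typing import Dict


def dr_hook_env(mode: str) -> Dict[str, str]:
    """Return the DR_HOOK environment settings for a given kind of run."""
    settings = {
        # Benchmark timing runs: instrumentation switched off.
        "timing": {"DR_HOOK": "0", "DR_HOOK_OPT": "none"},
        # Call tree printed on an OS failure signal (DR_HOOK_OPT assumed).
        "traceback": {"DR_HOOK": "1", "DR_HOOK_OPT": "none"},
        # Per-routine profile written to drhook.prof.NN files.
        "profile": {"DR_HOOK": "1", "DR_HOOK_OPT": "prof"},
    }
    if mode not in settings:
        raise ValueError("unknown mode: %r" % mode)
    return dict(settings[mode])


if __name__ == "__main__":
    for mode in ("timing", "traceback", "profile"):
        env = dr_hook_env(mode)
        print(mode, " ".join("%s=%s" % item for item in sorted(env.items())))
```

The returned values are meant to be copied into the environment section of
execute.xml by hand; nothing here edits that file.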