This is the README file for the BQCD application benchmark,
distributed with the DEISA Benchmark Suite:
http://www.deisa.eu/science/benchmarking/

Last modified by the DEISA Benchmark Team on 2008-08-25.

-----------
BQCD readme
-----------

Note: all base information is taken from the BQCD document; it has been
updated for JuBE and for newly ported platforms.

Subdirectories in src:
~~~~~~~~~~~~~~~~~~~~~~

clover     routines for the clover improvement
comm       communication routines
d          multiplication of a vector with "D-slash"
modules    (some) Fortran90 modules
platform   Makefiles and service routines for various platforms

General remarks
~~~~~~~~~~~~~~~

BQCD has been ported to various platforms (see platform/Makefile-*.var):

# Makefile-altix.var        - settings on SGI-Altix 3700 and SGI-Altix 4700
# Makefile-bgl.var          - settings on BlueGene/L
# Makefile-cray.var         - settings on Cray T3E and Cray XT4
# Makefile-hitachi-omp.var  - settings on Hitachi SR8000
# Makefile-hitachi.var      - settings on Hitachi SR8000 (pure MPI version)
# Makefile-hp.var           - settings for HP-UX Fortran Compiler
# Makefile-ibm.var          - settings on IBM
# Makefile-intel.var        - settings for Intel Fortran Compiler
# Makefile-nec.var          - settings on NEC SX-8
# Makefile-sun.var          - settings on Sun

The corresponding files platform/service-*.F90 contain interfaces to
service routines / system calls.  Not all of these files have been used
recently.  They are kept as a starting point.

A "Makefile.var" and a "service.F90" that work correctly with your system
have to be provided in the source directory.
The contents of these files are explained in:

   platform/Makefile-EXPLAINED.var
   platform/service-EXPLAINED.var

"gmake prep-" followed by the platform name will create the symbolic
links accordingly:

   berni1> gmake prep-ibm
   gmake prep PLATFORM=ibm
   rm -f Makefile.var service.F90
   ln -s platform/Makefile-ibm.var Makefile.var
   ln -s platform/service-ibm.F90 service.F90

Resource requirements
~~~~~~~~~~~~~~~~~~~~~

The resource requirements are approximately:

benchmark  lattice      total memory  size of output  execution time
---------------------------------------------------------------------------
MPP        48*48*48*96  497 GByte     4 GByte         268.2 s at 758.52 GFlop/s
SMP        24*24*24*48  37 GByte      0.25 GByte      44.4 s at 608.96 GFlop/s

Standard porting
~~~~~~~~~~~~~~~~

*** make

The Makefiles use the macro $(MAKE) and the "include" statement.  Some of
the Makefile-*.var files are quite standard, some require GNU make.

"make fast" can be used for a parallel "make".  "make fast" builds the
binary "bqcd".  Without "make fast" one has to enter:

   make Modules
   make libs
   make bqcd

JuBE porting
~~~~~~~~~~~~

For Altix:

Change the following lines in the execution file bensh:

   the first line:
      #!/usr/local/bin/perl -w
   to
      #!/usr/bin/perl -w

   the line 1235:
      $cmd="cp -rp $srcdir/$file $dir/src/";
   to
      $cmd="cp -rp $srcdir/* $dir/src/";

*** ANSI C preprocessor

The C preprocessor is needed for building the source.  The C preprocessor
must be able to handle the token concatenation operator "##".  Recent
versions of the GNU C Preprocessor do not work because they refuse to
process the Fortran90 separator "%".

*** Service routines and "stderr"

Service routines are needed for aborting, measuring CPU time, getting
arguments from the command line, etc.  The corresponding routines have to
be inserted in the file service.F90.

It is assumed that Fortran unit 0 is pre-connected to stderr.  If this is
not the case on your machine you should re-#define STDERR in "defs.h".

For the time measurements it is important to use a time function with
high resolution in the function "sekunden".
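
To illustrate what "high resolution" means for a timer, the following is
a minimal sketch in Python, not part of BQCD.  It estimates the smallest
step a clock can resolve by reading it repeatedly until the value changes.
The function name observed_resolution and its parameters are hypothetical;
the real "sekunden" is a Fortran routine in service.F90, and the same idea
would have to be applied to whichever system timer it calls.

```python
import time


def observed_resolution(clock=time.perf_counter, samples=1000):
    """Return the smallest positive step seen between successive clock reads.

    'clock' is any zero-argument function returning seconds as a float.
    """
    smallest = float("inf")
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:          # spin until the clock value advances
            t1 = clock()
        smallest = min(smallest, t1 - t0)
    return smallest


if __name__ == "__main__":
    print("observed tick:      ", observed_resolution(), "s")
    # Resolution that Python itself reports for the same clock.
    print("reported resolution:",
          time.get_clock_info("perf_counter").resolution, "s")
```

If the observed tick is not much smaller than the regions being timed,
the measurements are dominated by timer granularity and a finer clock
should be used.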

*** Message passing / Communication library

Originally the communication was programmed with the shmem library on a
Cray T3E.  Now MPI is mainly used.  There is also a single CPU version
(that needs no communication library) and a combination of shmem for the
most time-consuming part and MPI.  See $(LIBCOMM) in
platform/Makefile-EXPLAINED.var and "Hints on optimisation" below.

*** OpenMP

In addition to setting your compiler's OpenMP option you have to add
"-D_OPENMP" in "Makefile.var":

   MYFLAGS = ... -D_OPENMP

Verification
~~~~~~~~~~~~

*** Random numbers

Correctness of random numbers can be checked by:

   make the_ranf_test

The test is done by comparison with reference output.  On most platforms
there is no difference.  However, on Intel "diff" usually reports
differences in the last digit of the floating point representation of the
random numbers; the integer representations match exactly, e.g.:

   <          1                 4711   0.5499394951912783
   ---
   >          1                 4711   0.5499394951912784

*** Arguments from the command line

Try option -V:

   berni1> ./bqcd -V
   This is bqcd benchmark2
      input format:     4
      conf info format: 3
      MAX_TEMPER:       50
      real kind:        8
      version of D:     2
      D3: buffer vol:   0
      communication:    single_pe + OpenMP

*** BQCD

To check that BQCD works correctly execute the following sequence of
commands:

   berni1> cd work
   berni1> ../bqcd input.TEST > out.TEST
   berni1> grep ' %[fim][atc]' out.TEST > out.tmp
   berni1> grep ' %[fim][atc]' out.TEST.reference | diff - out.tmp
   18c18
   <  %fa    -1     1   0.4319366404   1.0173348431     43    407     38
   ---
   >  %fa    -1     1   0.4319366404   1.0173348433     43    407     38

The test can be run for any domain decomposition and any number of
threads.  In every case the results should agree.  Floating point numbers
might differ in the last digit as shown above.  (In total 20 lines
containing floating point numbers are compared.)

*** Check sums

BQCD writes restart files in the working directory.  The extension of the
file containing information on the run is ".info".
It contains check sums of the binary data files (the example was run
after the test run):

   berni1> tail -6 bqcd.000.1.info
    >BeginCheckSum
      bqcd.000.1.00.u   286125633    24576
      bqcd.000.1.01.u   804770858    24576
      bqcd.000.1.02.u   657813015    24576
      bqcd.000.1.03.u   3802083338   24576
    >EndCheckSum

These check sums should be identical to check sums calculated by the
"cksum" command:

   berni1> cksum bqcd.000.1.*.u | awk '{print $3, $1, $2}'
   bqcd.000.1.00.u 286125633 24576
   bqcd.000.1.01.u 804770858 24576
   bqcd.000.1.02.u 657813015 24576
   bqcd.000.1.03.u 3802083338 24576

Structure of the input
~~~~~~~~~~~~~~~~~~~~~~

run 204                                names of restart files will contain "run";
                                       can be set to 0
lattice 24 24 24 48                    lattice size; can e.g. be modified for
                                       weak scaling analysis
processes 1 2 2 4                      number of MPI processes per direction
                                       (1 1 1 1 in the pure OpenMP case)
boundary_conditions_fermions 1 1 1 -1  do not change
beta 5                                 do not change
kappa 0.13                             do not change
csw 2.3327                             do not change
hmc_test 0                             do not change
hmc_model C                            do not change
hmc_rho 0.1                            do not change
hmc_trajectory_length 0.2              do not change
hmc_steps 10                           can be lowered -> shorter execution time
hmc_accept_first 1                     do not change
hmc_m_scale 3                          do not change
start_configuration cold               do not change
start_random default                   do not change
mc_steps 1                             do not change
mc_total_steps 100                     do not change
solver_rest 1e-99                      do not change
solver_maxiter 100                     can be lowered -> shorter execution time
solver_ignore_no_convergence 2         do not change (CG will not converge; the
                                       number of iterations per call will be
                                       exactly "solver_maxiter")
solver_mre_vectors 7                   do not change

Hints on optimisation
~~~~~~~~~~~~~~~~~~~~~

Before starting any optimisation one should find the fastest variant in
the existing code.  There are two libraries to look at: $(LIBD) and
$(LIBCOMM).

*** LIBCOMM ("communication", directory: comm)

There are the following variants:

   lib_single_pe.a: Single CPU version (PE: "processing element").
   lib_mpi.a:       MPI version.
   lib_shmempi.a:   shmem for nearest neighbour communication,
                    MPI for the rest.

*** Caveat

Not all combinations of LIBD and LIBCOMM have been implemented.  The
following combinations should work (lib_mpi.a always works):

LIBD       LIBCOMM
--------------------------------------------------
libd.a     lib_single_pe.a  lib_mpi.a
libd2.a    lib_single_pe.a  lib_mpi.a  lib_shmempi.a
libd3.a                     lib_mpi.a
libd21.a   lib_single_pe.a  lib_mpi.a  lib_shmempi.a

Rules for time measurements
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In "Makefile.var" "-DTIMING" must always be set:

   MYFLAGS = -DTIMING ...

All time measurements (TIMING_START() ... TIMING_STOP()) must be kept.

There is one exception: if you restructure routines d() and d_dag() it
might occur that the current regions of time measurements (which are per
direction) do not make sense.  (For example, this would occur when
combining loops from more than one direction.)  In that case, please
report in addition the best measurement obtained with the existing code.
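
As a small aid for the "Caveat" table in "Hints on optimisation", the
following Python sketch encodes the listed LIBD/LIBCOMM combinations as a
lookup table and enumerates the builds that are worth timing when looking
for the fastest variant.  It is only an illustration and not part of BQCD;
the names SUPPORTED, combination_supported and candidate_builds are
hypothetical, and the table is transcribed from this README.

```python
# Supported LIBD -> LIBCOMM combinations, transcribed from the "Caveat" table.
SUPPORTED = {
    "libd.a":   {"lib_single_pe.a", "lib_mpi.a"},
    "libd2.a":  {"lib_single_pe.a", "lib_mpi.a", "lib_shmempi.a"},
    "libd3.a":  {"lib_mpi.a"},
    "libd21.a": {"lib_single_pe.a", "lib_mpi.a", "lib_shmempi.a"},
}


def combination_supported(libd, libcomm):
    """Return True if the README lists this LIBD/LIBCOMM pair as working."""
    return libcomm in SUPPORTED.get(libd, set())


def candidate_builds():
    """List every listed (LIBD, LIBCOMM) pair in a stable, sorted order."""
    return [(d, c) for d in sorted(SUPPORTED) for c in sorted(SUPPORTED[d])]


if __name__ == "__main__":
    for libd, libcomm in candidate_builds():
        print(libd, libcomm)
```

Checking a choice before editing "Makefile.var" avoids building a
combination that has not been implemented, for example libd3.a together
with lib_single_pe.a.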