4 File Systems and Data Management
The main platform for data management in DEISA is currently IBM's Multicluster General Parallel File System (MC-GPFS), which is distributed over various HPC sites and compute platforms in Europe and uses the high-speed network provided by GEANT2. Almost all DEISA sites are already integrated into this DEISA-wide service or, where the latest non-IBM HPC systems are concerned, will be integrated. This multi-site shared file system enables users to access and organize their data transparently from most DEISA sites.
Besides the multi-site shared file system, two other file systems are provided for the user. This section describes:
- the general organization of the file systems in DEISA,
- the access to and usage of the file systems on the BlueGene/P systems,
- the alternative data management using GridFTP.
DEISA provides a consistent view of the file systems that users access from any DEISA site to manage their applications and data, including input and output data that need to be staged in to and out of a faster file space close to the execution platform. Besides the initial site-local home directory, three types of file systems are defined for users:
- the DEISA home file system, realized as a multi-site shared file system
- the DEISA data file system, realized as a multi-site shared file system
- temporary scratch space, realized as a local cluster-wide high-throughput shared file system
The path names of the DEISA-wide shared file systems follow the scheme shown in the table below. In addition, specific environment variables point to these file systems. Note that these DEISA-specific environment variables are only available after running the command 'module load deisa'. Please see the DEISA Common Production Environment manual for more information.
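|File system||Path, environment variable and purpose
|DEISA home file system||$DEISA_HOME: multi-site shared file system holding the user's DEISA home directory
|DEISA data file system||$DEISA_DATA: multi-site shared file system for larger data volumes
|Local scratch space||$DEISA_SCRATCH: local cluster-wide high-throughput file system for temporary data
|Local home directory||$HOME: initial site-local directory after interactive or batch login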
The DEISA file systems have been classified with regard to both the possible evolution of DEISA policies and site-specific policies. These paths and variables may point to separate file systems or to one single file system; each site decides which option is implemented. Users therefore do not need to care about site-specific path names for their data management. As a first preference, the environment variables should be used (e.g. in scripts and job definition files). The actual path should be used only with services such as GridFTP, which is used for exchanging data between sites that do not share a common file system, because such services do not always support the definition of environment variables in a prolog.
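The preference for environment variables over explicit paths can be sketched as follows. This is a minimal example, assuming a POSIX shell in which 'module load deisa' has already been run; the helper name deisa_dir is hypothetical and not part of the DEISA environment.

```shell
# Hypothetical helper (not part of DEISA): print the directory behind one
# of the DEISA environment variables, or fail if the variable is unset,
# which usually means 'module load deisa' has not been run yet.
deisa_dir() {
    case "${1:-}" in
        home)    value=${DEISA_HOME:-} ;;
        data)    value=${DEISA_DATA:-} ;;
        scratch) value=${DEISA_SCRATCH:-} ;;
        *)       echo "unknown file system: ${1:-}" >&2; return 2 ;;
    esac
    if [ -z "$value" ]; then
        echo "DEISA variable for '$1' is not set; run 'module load deisa'" >&2
        return 1
    fi
    printf '%s\n' "$value"
}
```

A job script could then build its paths from, for example, the output of `deisa_dir data` instead of hard-coding a site-specific path name.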
In general, four different types of file systems have been identified. A file system can be either local or non-local. Non-local file systems may be available only on a specific subset of hardware or may be accessible from all computers participating in DEISA. Policies may also differ between file systems: a quota may limit the available file space to less than the physical maximum, some file systems may be backed up, and others may be available only during a single job and cleared on exit.
Each DEISA HPC system assigned to the user's project provides a home directory, i.e. the initial directory for the user after interactive or batch login. The pathname of this home directory may vary from system to system, and the sites may apply different quotas to this directory. In general, the user's home directory ($HOME) must not be used for user applications, job scripts or data. DEISA-specific file systems should be used instead. Depending on each site's policy, the usage of $HOME is typically limited to a few hundred megabytes.
Since the pathname of $HOME may vary from system to system, DEISA introduced the environment variable $DEISA_HOME, which on many systems points to the same physical location on the shared multi-cluster file system. Please be aware that the pathname of this directory may be site-specific, so always use the environment variable $DEISA_HOME instead of the explicit pathname.
The subdirectories and files in $DEISA_HOME can be physically located at your home site while remaining transparently accessible from any other system within the DEISA shared multi-cluster file system. This file system is realized using IBM's Multi-Cluster GPFS. $DEISA_HOME can be restricted by quota and, depending on the site-specific policy, its data can be backed up regularly by the site hosting the corresponding GPFS file server.
Since the file space in $DEISA_HOME may be too limited, there is another file system, which can be accessed using the environment variable $DEISA_DATA. This file system is also available from each computer within DEISA, and the location addressed by this environment variable is guaranteed to point to the same physical location at least on each of the homogeneous platforms. However, as for $DEISA_HOME, the path for $DEISA_DATA may vary from site to site.
The data is also physically located at your home site but accessible from each site. Whether the file system is backed up depends on the policy of your home site. There may also be site-dependent policies for quotas and/or for cleaning up by removing older files.
When accessed remotely (i.e. from a site that does not physically host the files), all files in $DEISA_DATA or $DEISA_HOME need to be transferred through the DEISA network. For performance reasons, a job that runs remotely rather than at the user's home site and that reads and produces a large amount of data is normally not suited to accessing $DEISA_HOME and $DEISA_DATA during job execution. Instead, the data should be staged to a fast, high-throughput file system at the job execution site. Every DEISA site offers such disk space, which may differ from cluster to cluster (e.g. it can be physically different on a Power6 and a BlueGene/P system).
This disk space is commonly addressed using the environment variable $DEISA_SCRATCH. Note that this scratch space is managed by each site under its own policy. With a few exceptions, users cannot expect the scratch file system to exist much longer than the end of the job execution. Even though the scratch space of one job can generally be expected to be separated automatically from that of another job, no rule guarantees this. Users must therefore expect to stage their data into the scratch space shortly before job execution, to organize their input and output data so that jobs running at the same time on the same machine cannot interfere with each other, and to stage the data out of the scratch space as soon as possible.
The scratch file system is best used only for fast I/O operations from within a job. Data located in this file system is not persistent and has to be copied to $DEISA_DATA or $DEISA_HOME, e.g. at the end of the batch job.
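The staging cycle described above (stage in, compute in scratch, stage out) could look roughly like the sketch below. It assumes a POSIX shell with $DEISA_DATA and $DEISA_SCRATCH set; the function name run_staged_job, the job identifier argument and the input/, output/ and results/ directory layout are illustrative assumptions, not DEISA conventions.

```shell
# Sketch of a stage-in / compute / stage-out cycle for one batch job.
# Assumptions (not DEISA conventions): the function name, the job id
# argument, and the input/, output/ and results/ directory layout.
run_staged_job() {
    job_id=${1:-}                 # unique per job, e.g. the batch job id
    if [ -z "$job_id" ]; then
        echo "usage: run_staged_job JOB_ID COMMAND [ARGS...]" >&2
        return 2
    fi
    shift                         # remaining arguments: command to run
    if [ -z "${DEISA_DATA:-}" ] || [ -z "${DEISA_SCRATCH:-}" ]; then
        echo "DEISA variables not set; run 'module load deisa'" >&2
        return 1
    fi

    # A private directory per job keeps concurrent jobs from interfering.
    work_dir=$DEISA_SCRATCH/job_$job_id
    result_dir=$DEISA_DATA/results/job_$job_id

    # Stage in: copy the input data to the fast local scratch space.
    mkdir -p "$work_dir/input" "$work_dir/output" "$result_dir" || return 1
    cp -R "$DEISA_DATA/input/." "$work_dir/input/" || return 1

    # Compute: run the command inside the private scratch directory.
    status=0
    ( cd "$work_dir" && "$@" ) || status=$?

    # Stage out: copy results back even after a failed run, and remove
    # the scratch directory only if the copy succeeded.
    if cp -R "$work_dir/output/." "$result_dir/"; then
        rm -rf "$work_dir"
    else
        echo "stage-out failed; data left in $work_dir" >&2
        return 1
    fi
    return "$status"
}
```

The command passed to run_staged_job is expected to read from input/ and write to output/ relative to its working directory; anything written elsewhere in scratch is discarded by this sketch.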
The DEISA sites IDRIS, FZJ and RZG each provide more than one HPC system: an IBM Power6 and a BlueGene/P system. For performance reasons (the BlueGene/P partitions, including the I/O nodes, reboot every time a new job starts), it was decided to avoid mounting the DEISA file systems on the numerous I/O nodes of the BlueGene/P. Therefore, jobs running on the BlueGene/P systems are not able to read data from or write data to the DEISA shared file system.
However, the DEISA file systems are mounted on the BlueGene/P frontend node at each of these sites. This allows users to stage data between the DEISA file systems and the local scratch space on the BlueGene/P. The scratch space assigned to $DEISA_SCRATCH on the BlueGene/P systems should be available for a longer period (e.g. 1-2 weeks), so that users can still access their data after the job has completed.
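Because scratch data on the BlueGene/P is kept only for a limited period, it can help to check from the frontend node which files are getting old and should be copied to $DEISA_DATA. The following is a minimal sketch; the helper name scratch_stale_files and the 7-day default are assumptions, and the actual retention period depends on the site.

```shell
# Hypothetical helper: list regular files under $DEISA_SCRATCH whose last
# modification lies more than N days back (default 7). Note that find
# counts whole 24-hour periods, so '+7' matches files at least 8 days old.
# The default of 7 is an assumption; check the policy of your site.
scratch_stale_files() {
    days=${1:-7}
    if [ -z "${DEISA_SCRATCH:-}" ]; then
        echo "DEISA_SCRATCH is not set; run 'module load deisa'" >&2
        return 1
    fi
    find "$DEISA_SCRATCH" -type f -mtime +"$days" -print
}
```

Run on the frontend node, `scratch_stale_files 10` would print the files that have not been modified for more than ten days, which can then be copied to the DEISA data file system.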
Because the DEISA computing environment evolves dynamically, new compute platforms cannot always be integrated into the MC-GPFS infrastructure, at least not soon enough after they have been commissioned for production. Therefore, GridFTP is provided as an alternative solution for transferring huge data volumes efficiently between the HPC sites over the DEISA internal network. In addition, some GridFTP servers within DEISA also allow data to be transferred between DEISA and non-DEISA sites via the public internet.
Details on how GridFTP servers in DEISA can be used to move data between machines that are not integrated into the DEISA multi-cluster shared file system are described in the document Data transfer with GridFTP.
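As a rough illustration of such a transfer, the sketch below builds a globus-url-copy command line for sending one file from the local DEISA data file system to a remote GridFTP server. It only prints the command and does not run it. The function name gridftp_put_cmd, the host name, the remote path and the default of four parallel streams are placeholders, not DEISA settings; the real server names and paths are given in the GridFTP document mentioned above.

```shell
# Sketch: print a globus-url-copy command for one file transfer.
# $DEISA_DATA is expanded locally into an explicit path, because the
# remote side cannot be relied on to know DEISA environment variables.
# Host, remote path and stream count are placeholders (assumptions).
# Paths containing spaces are not handled by this simple sketch.
gridftp_put_cmd() {
    remote_host=$1     # GridFTP server of the target site
    local_rel=$2       # file path relative to $DEISA_DATA
    remote_path=$3     # explicit absolute path on the remote site
    streams=${4:-4}    # number of parallel data streams (-p)

    if [ -z "${DEISA_DATA:-}" ]; then
        echo "DEISA_DATA is not set; run 'module load deisa'" >&2
        return 1
    fi
    # -vb reports transfer performance, -p sets the parallelism.
    printf 'globus-url-copy -vb -p %s file://%s gsiftp://%s%s\n' \
        "$streams" "$DEISA_DATA/$local_rel" "$remote_host" "$remote_path"
}
```

Printing the command first makes it easy to check the expanded paths before starting a large transfer.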