Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Content

combining several sequential jobs in one PBS job to fill a complete node

Sometimes trivial parallelism is the most efficient way to parallelize work, e.g. for parameter studies with a sequential program. If only complete nodes may be allocated on a certain cluster, several sequential runs can very easily be bundled into one job file:

#!/bin/bash -l
# allocate 1 node (4 CPUs) for 8 hours
#PBS -l nodes=1:ppn=4,walltime=08:00:00
# job name
#PBS -N  xyz
# first non-empty non-comment line ends PBS options

# jobs always start in HOME
# but we want to go to the directory where we submitted the job
cd  $PBS_O_WORKDIR

# run 4 sequential parameter studies in parallel and bind each one
# to a specific core
(taskset -c 0  ./a.out input1.dat ) &
(taskset -c 1  ./a.out input2.dat ) &
(taskset -c 2  ./a.out input3.dat ) &
(taskset -c 3  ./a.out input4.dat ) &

# wait for all background processes to finish ("wait" is a bash built-in)
wait

The bash builtin wait ensures that all background processes have finished once wait returns.

For this to work efficiently, of course all parameter runs should take about the same time …
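
With more input files, the same pattern can be written as a loop. The following is only a minimal sketch: the function name run_bundle is made up for illustration, and it assumes that there are not more input files than cores on the node, because run number i is simply pinned to core i.

```shell
#!/bin/bash
# run_bundle PROGRAM INPUT...   (hypothetical helper, not an RRZE tool)
# starts one background run per input file; run number i is pinned to core i
run_bundle() {
    local program=$1 core=0 pids="" status=0 input pid
    shift
    for input in "$@"; do
        taskset -c "$core" "$program" "$input" &
        pids="$pids $!"
        core=$((core + 1))
    done
    # wait for every run separately so that a failed run is noticed
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# inside the job script, after "cd $PBS_O_WORKDIR":
# run_bundle ./a.out input1.dat input2.dat input3.dat input4.dat
```

In contrast to a plain wait, the return code of the function tells you whether at least one of the runs failed.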

Running STAR-CCM+ jobs on Woody


We now have a first user who is using STAR-CCM+ in parallel on the Woody cluster. Starting jobs in batch mode seems to be quite easy. As STAR-CCM+ internally uses HP-MPI, Infiniband should automatically be used correctly, too (although I have not explicitly verified this yet).

Here is what this user currently uses (again no idea if automatic stopping actually works with STAR-CCM+, thus, there might be room for improvements):

#!/bin/bash -l
#PBS -l nodes=2:ppn=4
#PBS -l walltime=24:00:00
#PBS -N some-jobname

cd  $PBS_O_WORKDIR

module add star-ccm+/3.04.008

# specify the time you want to have to save results, etc.
export TIME4SAVE=800

# detect number of available CPUs (should be modified for Nehalems with active SMT)
ncpus=`cat $PBS_NODEFILE | wc -l`

# STAR-CCM+ starts a master plus $ncpus slaves; on Woody it's o.k. to
# oversubscribe the nodes in this way (i.e. ncpus+1 processes on ncpus CPUs);
# however, on Nehalem nodes (e.g. TinyBlue) it seems to be a very bad idea;
# to avoid oversubscription, uncomment the following line
## ncpus=$(($ncpus-1))

# check if enough licenses should be available
/apps/rrze/bin/check_lic.sh -c $CDLMD_LICENSE_FILE ccmpsuite 1 hpcdomains $(($ncpus-1))
. /apps/rrze/bin/check_autorequeue.sh

export MPIRUN_OPTIONS="-v -prot"
# or with pinning: e.g.
## export MPIRUN_OPTIONS="-v -prot -cpu_bind=v,rank,v"
## export MPIRUN_OPTIONS="-v -prot -cpu_bind=v,MAP_CPU:0,1,2,3,4,5,6,7,v"

# if there are messages about "mpid: Not enough shared memory" you may try to set
# the maximum shared memory size in bytes by hand - but usually the message means
# that there is really not enough memory available; so forget about this option!
## export MPI_GLOBMEMSIZE=...

export MPI_REMSH=ssh

# automatically detect how much time this batch job requested and adjust the
# sleep accordingly;
# make sure you defined the "stop file" within STAR-CCM+ accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE                   \
    '{if ( $0 ~ /Resource_List.walltime/ )                            \
        { split($3,duration,":");                                     \
          print duration[1]*3600+duration[2]*60+duration[3]-t }}' `;  \
 touch ABORT ) >& /dev/null  &
export SLEEP_ID=$!

starccm+ -batch -np $ncpus -machinefile $PBS_NODEFILE -load myjob.sim

pkill -P $SLEEP_ID
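
The awk one-liner in the script above extracts the requested walltime from the qstat output and converts it to seconds minus the time reserved for saving. The conversion itself can be tried without a batch system. The sketch below uses a hypothetical helper function walltime_to_seconds which gets the HH:MM:SS string as argument instead of parsing the output of qstat -f.

```shell
#!/bin/bash
# walltime_to_seconds HH:MM:SS [RESERVE]   (hypothetical helper)
# prints the walltime in seconds, reduced by RESERVE seconds (default 0)
walltime_to_seconds() {
    echo "$1" | awk -F: -v t="${2:-0}" '{ print $1*3600 + $2*60 + $3 - t }'
}

# a 24 hour job with 800 seconds reserved for saving results
walltime_to_seconds 24:00:00 800     # prints 85600
```

So with the settings of the job script above, the ABORT file would be created 85600 seconds after the start of the sleep.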

Running Turbomole 5.10 in parallel on RRZE’s clusters

Recent versions of Turbomole come bundled with HP-MPI. This allows using different types of network interconnects for parallel runs and HP-MPI is supposed to select the fastest one available – but sometimes HP-MPI requires manual intervention to select the correct interconnect … As parallel runs of Turbomole also require one control process, additional intervention is required to avoid overbooking of the CPUs.

Here is a short recipe for RRZE’s clusters:

#!/bin/bash -l
#PBS -lnodes=4:ppn=4,walltime=12:00:00
#PBS -q opteron
#... further PBS options as desired

module load turbomole-parallel/5.10

# shorten node list by one
. /apps/rrze/bin/chop_nodelist.sh

# prevent IB initialization attempts if IB is not available
ifconfig ib0 >& /dev/null
if [ $? -ne 0 ]; then
   export MPIRUN_OPTIONS="-v -prot -TCP"
fi

# now call your TM binaries (jobx or whatever)

manually setting the TM architecture: As the s2 queue of the Transtec cluster recently got access to both 32-bit and 64-bit nodes, it may be a good idea to restrict Turbomole to the 32-bit binaries, which can run on both types of nodes, by including export TURBOMOLE_SYSNAME="i786-pc-linux" in the PBS job script.

pinning with HP-MPI: As Turbomole internally uses HP-MPI, the usual pinning mechanisms of RRZE’s mpirun cannot be used with Turbomole. If you want to use pinning, you have to check the HP-MPI documentation and use their specific command line or environment variables! Cf. page 16/17 of /apps/HP-MPI/latest/doc/hp-mpi.02.02.rn.pdf.

stopping STAR-CD at latest just before the wallclock time is exceeded

A similar approach to the one described for CFX is also possible for STAR-CD as shown in the following snippet (Thanks to one of our users for the feedback!):

#!/bin/bash -l
#PBS -l nodes=2:ppn=4
#PBS -l walltime=24:00:00
#PBS -N somename

#  Change to the directory where qsub was invoked
cd $PBS_O_WORKDIR

### add the module of the STAR-CD version, e.g. 4.02
module add star-cd/4.02_64bit

# specify the time needed to write the result and info files, e.g. 900 seconds
export TIME4SAVE=900

#automatically detect how much time this job requested and
#adjust the sleep accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE                   \
    '{if ( $0 ~ /Resource_List.walltime/ )                            \
        { split($3,duration,":");                                     \
          print duration[1]*3600+duration[2]*60+duration[3]-t }}' `;  \
  star -abort ) >& /dev/null  &
export SLEEP_ID=$!

# the normal STAR-CD start follows ...
star -dp `cat  $PBS_NODEFILE`

pkill -P $SLEEP_ID
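
One detail of the snippet above deserves a remark: as far as I can tell, pkill -P only kills the sleep, so the subshell afterwards still executes the command following the semicolon even if the solver finished in time. A minimal sketch of a variant which avoids this by using && is given below. start_watchdog and stop_watchdog are names made up for this illustration, and the delay has to be computed beforehand, e.g. from the requested walltime.

```shell
#!/bin/bash
# start_watchdog DELAY COMMAND [ARGS...]   (hypothetical helper)
# runs COMMAND after DELAY seconds unless stop_watchdog is called before
start_watchdog() {
    local delay=$1
    shift
    # "&&" instead of ";": COMMAND is skipped if the sleep gets killed
    ( sleep "$delay" && "$@" ) > /dev/null 2>&1 &
    WATCHDOG_PID=$!
}

# stop_watchdog: cancel a pending watchdog without triggering COMMAND
stop_watchdog() {
    pkill -P "$WATCHDOG_PID" 2> /dev/null || true   # the sleep
    kill "$WATCHDOG_PID" 2> /dev/null || true       # the subshell
    wait "$WATCHDOG_PID" 2> /dev/null || true
}

# usage in the job script (the delay of 85500 s is just an example value):
# start_watchdog 85500 star -abort
# star -dp `cat $PBS_NODEFILE`
# stop_watchdog
```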

Automatic requeuing of jobs if not enough licenses are available

A common problem with queuing systems and commercial software using floating licenses is that you cannot easily guarantee that the licenses you need are available when your job starts. Some queuing systems and schedulers can consider license usage – the solution at RRZE does not (at least not reliably).

A partial solution (although by far not optimal) is outlined below. With effectively two additional lines in your job script you can at least ensure that your job gets requeued if not enough licenses are available – and does not just abort. (The risk of undetected race conditions of course still exists, and you may have to wait again for some time until compute resources are available for your new jobs … but that is better than only seeing the error message after the weekend …)

#!/bin/bash -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -N myjob

# it is important that "bash" is executed on the first line above!
#
# check for 16 hpcdomains and 1 starpar license and automatically
# requeue the job if not enough licenses are available right now.
# This check is based on the situation right now - it may
# change just in the next second, thus, there is no guarantee
# that the license is still available in just a few moments.
# We do not checkout, borrow or reserve anything here!
# CHANGE license server and feature list according to your needs!
# instead of $CDLMD_LICENSE_FILE you can use the PORT@SERVER syntax
/apps/rrze/bin/check_lic.sh -c $CDLMD_LICENSE_FILE hpcdomains 16 starpar 1

# the next line must follow immediately after the check_lic.sh line
# with no commands in between!
# (the "." at the beginning is also correct and important)
. /apps/rrze/bin/check_autorequeue.sh

# now continue with your normal tasks ...
# if there were not enough licenses in the preliminary check,
# the script does not get here because the job has been requeued.

This approach is not at all limited to STAR-CD and should work on Cluster32 and Woody.

 

ATTENTION: this approach does NOT work if license throttling is active, i.e. in cases where licenses are in principle available but the license server limits the number of licenses you or your group may get by using some MAX setting in the option file on the license server!

Most licenses at RRZE are throttled; thus, the check_lic.sh and check_autorequeue.sh scripts are only of limited use these days.
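
The scripts themselves are not shown here. Just to illustrate the idea of such a preliminary check: for FlexLM-based licenses, lmutil lmstat prints a summary line per feature from which the number of currently free licenses can be derived. The sketch below assumes that lmutil is in the PATH and that the summary line has the usual form "Users of FEATURE: (Total of N licenses issued; Total of M licenses in use)". The function name free_licenses is made up, and it is not claimed that check_lic.sh works this way.

```shell
#!/bin/bash
# free_licenses FEATURE   (hypothetical helper)
# reads "lmstat -a" output from stdin and prints issued minus used
# licenses of FEATURE; prints nothing if the feature is not listed
free_licenses() {
    awk -v f="$1:" '$1 == "Users" && $3 == f { print $6 - $11 }'
}

# usage (assumes a FlexLM license server and lmutil in the PATH):
# lmutil lmstat -a -c $CDLMD_LICENSE_FILE | free_licenses hpcdomains
```

As with the scripts above, the number only describes the situation right now, and with throttling it says nothing about how many licenses you are actually allowed to check out.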

mpiexec + taskset: code snippet to generate config file


We usually use Pete Wyckoff’s mpiexec to start MPI processes from within batch jobs. The following code snippet may be used to pin individual MPI processes to their CPU. A similar snippet could be used to allow round-robin distribution of the MPI processes …

#!/bin/bash
# usage: run.sh 8 "0 2 4 6" ./a.out "-b -c"
#        start a.out with arguments -b -c
#        use 8 MPI processes in total (4 per node)
#        pin processes to CPUs 0,2,4,6

NUMPROC=$1
TASKCPUS="$2"
EXE=$3
ARGS=$4

TASKS=`echo $TASKCPUS | wc -w`
echo "running $NUMPROC MPI processes with $TASKS per node"

cat /dev/null > cfg.$PBS_JOBID
for node in `uniq $PBS_NODEFILE`; do
   for cpu in $TASKCPUS; do
      echo "$node : taskset -c $cpu $EXE $ARGS" >> cfg.$PBS_JOBID
      NUMPROC=$((NUMPROC - 1))
      if [ $NUMPROC -eq 0 ]; then
        break 2
      fi
   done
done
mpiexec -comm pmi -kill -config cfg.$PBS_JOBID
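
The round-robin variant mentioned above could look as follows. This is only a sketch under the assumption that swapping the two loops is all that is needed: consecutive MPI ranks are then placed on different nodes first and only afterwards on the next CPU of each node. The function name make_round_robin_cfg is made up, and the function only prints the config lines instead of calling mpiexec.

```shell
#!/bin/bash
# make_round_robin_cfg NUMPROC NODEFILE "CPU LIST" EXE [ARGS]
# (hypothetical helper) prints one mpiexec config line per MPI process,
# cycling over the nodes before moving on to the next CPU
make_round_robin_cfg() {
    local numproc=$1 nodefile=$2 cpus=$3 exe=$4 args=$5
    local cpu node
    for cpu in $cpus; do
        for node in $(uniq "$nodefile"); do
            [ "$numproc" -gt 0 ] || return 0
            echo "$node : taskset -c $cpu $exe $args"
            numproc=$((numproc - 1))
        done
    done
}

# usage inside a batch job:
# make_round_robin_cfg 8 $PBS_NODEFILE "0 2 4 6" ./a.out "-b -c" > cfg.$PBS_JOBID
# mpiexec -comm pmi -kill -config cfg.$PBS_JOBID
```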

Running STAR-CD over Infiniband

STAR-CD 4.02 works out of the box; there are currently some warnings ERROR: ld.so: object 'libmpi.so' from LD_PRELOAD cannot be preloaded: ignored. As these messages seem to be uncritical, I’m not sure if I’ll further debug their cause.

STAR-CD 3.26 also works out of the box.

User subroutines are not yet tested; some additional steps will be required to get them compiled as the required PGI compiler is not installed locally…

In first tests, star -chkpnt failed with the message TAR checkpoint failed due to invalid "star.pst" file. or TAR checkpoint failed due to invalid "star.ccm" file.

=====================================================================

Using STAR-CD in principle works as follows (access to the STAR-CD module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not intended for interactive work.
    If you have to use the RRZE systems for some reason for pre/postprocessing, do not start prostar, etc. on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the "-l" is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N STARCD-woody
    #... any other PBS option you like
    
    # let's go to the directory where the script was submitted
    cd $PBS_O_WORKDIR
    
    # load the STAR-CD module; either "star-cd/3.26_64bit" or "star-cd/4.02_64bit"
    module add star-cd/3.26_64bit
    
    # here we go
    star -dp `cat $PBS_NODEFILE`
    
  • submit your job to the PBS batch system using qsub
  • wait until the job has finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable

Some more details on the “ERROR: ld.so: object ‘libmpi.so’ from LD_PRELOAD cannot be preloaded: ignored” message: Looking into the script which is actually used to call the STAR-CD binary, I can guess where the message might come from. They use something like $HPMPI/bin/mpirun ... -e LD_PRELOAD=libmpi$PNP_DSO ... -f .starboot.mpi, i.e. they do not specify a path for the library to be preloaded. I’m not sure what the current policy of ld.so from glibc is: does it look at the current LD_LIBRARY_PATH (if set), or does it only look at “secure” (predefined) directories? If the latter is the case, star of course cannot find the library …
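
Whether the library is at least present in one of the directories listed in LD_LIBRARY_PATH can be checked quickly. The following is a minimal sketch with a made-up function name find_in_ld_path. It only looks for the file and does not answer the question which directories ld.so actually searches for preloaded libraries.

```shell
#!/bin/bash
# find_in_ld_path LIBNAME   (hypothetical helper)
# prints the first match of LIBNAME in the directories of LD_LIBRARY_PATH
# and returns 1 if the library is not found in any of them
find_in_ld_path() {
    local lib=$1 dir
    local IFS=:
    for dir in ${LD_LIBRARY_PATH:-}; do
        if [ -e "$dir/$lib" ]; then
            echo "$dir/$lib"
            return 0
        fi
    done
    return 1
}

# usage:
# find_in_ld_path libmpi.so || echo "libmpi.so not in LD_LIBRARY_PATH"
```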

Problems with STAR-CD-4.02 (64-bit) and Infiniband: Running the PGI variant of STAR-CD-4.02 (64-bit) over Infiniband (i.e. using HP-MPI) currently may fail on our new Woody cluster as well as on the Infiniband partition of the Transtec cluster. The observations (currently based on just a single test case) are as follows:

  • STAR-CD-4.02/PGI using HP-MPI/VAPI runs fine for a few iterations but then suddenly stops consuming CPU time on most of the nodes.
  • STAR-CD-4.02/PGI using HP-MPI/TCP or mpich runs fine.
  • STAR-CD-4.02/Absoft runs fine even with HP-MPI/VAPI!

Further tests are currently on their way …

For the moment, module add star-cd/4.02_64bit; star -mpi=hp -mppflags="-v -prot -TCP" is the recommended way of starting STAR-CD-4.02.

Sometimes also problems of STAR-CD 3.26 with Infiniband: According to a user report, STAR-CD 3.26 with HP-MPI over Infiniband also has the problem that it suddenly stops running. It seems that the AMG preconditioner is the reason for the problems.

So, check if Infiniband runs fine for your cases; if not (and only if not), add -mpi=hp -mppflags="-v -prot -TCP"

Upgraded IB stack seems to solve Infiniband problems: Updating from the Voltaire ibhost-stack to the Voltaire GridStack 4.1 (which is OFED-1.1 based) seems to have solved the issue with hanging STAR-CD processes. Please try to run without the argument -TCP!

As a technical note: the HP-MPI version which comes with STAR-CD 4.02 or 3.26 is too old to work with OFED; thus, the latest HP-MPI (i.e. 2.02.05) has been installed on Woody. The module files for STAR-CD have been adapted to automatically use this updated version. Your output (if -v -prot is used) should now show IBV instead of VAPI if the high speed network is used.

Using IPoverIB as fallback: If native Infiniband does not work even with the upgraded IB stack (e.g. due to a bug in connection with AMG), try IP-over-IB by using -mppflags="-v -prot -TCP -netaddr 10.188.84.0/255.255.254.0".

Another option available in HP-MPI (which is completely unrelated to the communication network but I’m too lazy to create another thread right now) is the ability to pin processes to specific CPUs of a node. On Woody, -mppflags="... -cpu_bind=v,map_cpu:0,1,2,3 ..." is the right choice if you run with 4 CPUs per node; if you have huge memory requirements and thus only use every second core, the correct option would be -mppflags="... -cpu_bind=v,map_cpu:0,1 ...".

Running CFX over Infiniband

Running CFX-10 over Infiniband and using user subroutines should work now – of course you must have a valid license!

In principle it works as follows (access to the CFX module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not intended for interactive work.
    If you have to use the RRZE systems for some reason for pre/postprocessing, do not start cfx5pre/cfx5post directly on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the "-l" is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N CFX-woody
    #... any other PBS option you like
    
    # let's go to the directory where the script was submitted
    cd $PBS_O_WORKDIR
    
    # transfer the list of nodes in a format suitable for CFX
    nodes=`cat $PBS_NODEFILE`
    nodes=`echo $nodes | sed -e 's/ /,/g'`
    
    # load the (recommended) CFX module
    module add cfx
    
    # here we go
    cfx5solve -name woody-$PBS_JOBID -size-mms 2.0 -def xyz.def -double   \
                -par-dist $nodes -start-method "HP MPI Distributed Parallel for x86 64"
    
  • submit your job to the PBS batch system using qsub
  • wait until the job has finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable
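
The two lines of the batch file above which turn the node file into a comma-separated list can also be written in one step with paste. A minimal sketch (the function name nodefile_to_list is made up for illustration):

```shell
#!/bin/bash
# nodefile_to_list FILE   (hypothetical helper)
# joins all lines of FILE with commas, e.g. for cfx5solve -par-dist
nodefile_to_list() {
    paste -s -d, "$1"
}

# usage inside the batch job:
# nodes=$(nodefile_to_list $PBS_NODEFILE)
```

The result is the same string as before, i.e. each node name appears once per allocated CPU.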

=====================================================================

Some special “tuning” of the PBS script is of course possible, e.g. to stop the simulation at latest just before the wallclock time is exceeded:

# specify the time you want to have to save results, etc.
export TIME4SAVE=600
# directory where CFX will be running
export SIMDIR=woody-${PBS_JOBID}_001.dir
# automatically detect how much time this batch job requested and adjust the
# sleep accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE                   \
    '{if ( $0 ~ /Resource_List.walltime/ )                            \
        { split($3,duration,":");                                     \
          print duration[1]*3600+duration[2]*60+duration[3]-t }}' `;  \
  cfx5stop -dir $SIMDIR ) >& /dev/null  &
export SLEEP_ID=$!
cfx5solve -name woody-$PBS_JOBID ...
pkill -P $SLEEP_ID

This “add-on” is only for advanced users who hopefully understand what this code does …

UPDATE 2012-10-24: recent Ansys/CFX versions have a command line argument to limit the wall clock time:
-max-elapsed-time <elapsed time>
Set the maximum elapsed time (wall clock time) that the ANSYS CFX Solver will run. Elapsed time must be in quotes and have correct units in square brackets (eg: -maxet "10 [min]" or -maxet "5 [hr]").
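
The value for this option can be derived from the requested walltime. The following is only a sketch: cfx_max_elapsed is a made-up helper which takes the walltime as HH:MM:SS string plus the number of seconds to reserve for writing results, and prints the remaining time rounded down to full minutes in the format shown above.

```shell
#!/bin/bash
# cfx_max_elapsed HH:MM:SS [RESERVE]   (hypothetical helper)
# prints the walltime minus RESERVE seconds as "N [min]", rounded down
cfx_max_elapsed() {
    echo "$1" | awk -F: -v t="${2:-0}" \
        '{ printf "%d [min]\n", int(($1*3600 + $2*60 + $3 - t) / 60) }'
}

# usage in a 24 hour job with 10 minutes reserved for saving:
# cfx5solve -name woody-$PBS_JOBID ... -max-elapsed-time "$(cfx_max_elapsed 24:00:00 600)"
cfx_max_elapsed 24:00:00 600     # prints 1430 [min]
```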

=====================================================================

Some more technical notes which are not really relevant for pure users … they just document the steps which were required to get everything up and running.

CFX-10 (and its service packs) comes with a bundled version of HP-MPI. No extra license is required to use HP-MPI together with CFX. Just specify
-start-method "HP MPI Distributed Parallel for x86 64"
and HP-MPI automatically selects the fastest interconnect available.
However, the version distributed by Ansys (HP-MPI 2.1-2) does not work together with the Infiniband stack installed on the new Woody cluster or the “old” Infiniband part of the Transtec cluster. Therefore, the current version of HP-MPI (2.2.0.2) must be downloaded from hp.com and installed manually.
CFX-10.0/etc/start-methods.ccl has to be adapted to reflect the new path (unless one does some dirty directory renames/links).

By default, CFX also tries to connect via rsh to remote hosts. Setting the environment variable CFX5RSH to ssh solves this problem. It might also be necessary to set the environment variable MPI_REMSH to ssh so that HP-MPI also uses the correct piece of software. Using PBS-mpiexec probably does not work with HP-MPI; otherwise some further hacking in CFX-10.0/etc/start-methods.ccl might also be possible.

To force the use of the Infiniband interconnect, the environment variable MPIRUN_OPTIONS may be set to -VAPI – note the capital letters.

To support user subroutines, a PGI Fortran compiler is required to compile the user code once (using cfx5mkext *.F AND cfx5mkext -double *.F). As we do not have the PGI compiler installed on the Woody cluster, compilation has to be done elsewhere. The required PGI runtime libraries were copied (in accordance with the PGI license agreement) from a PGI 6.1.4 installation to the directory CFX-10.0/lib/linux-amd64 where they are found from within CFX without setting any special LD_LIBRARY_PATH.