Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things


Intel compiler and -mcmodel=...

As this seems to be becoming a FAQ (although it is documented in a small note in the ifort man page): if the 64-bit Intel compilers (for EM64T/Opteron, but not for IA64) are used and statically allocated data (e.g. in Fortran common blocks) exceeds 2 GB, the -mcmodel=medium or -mcmodel=large switch must be used. As a consequence, the -i-dynamic or -shared-intel flag (depending on the compiler version) must also be specified for the linking step; otherwise, strange error messages about truncated relocations occur for libifcore.a routines during linking.
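
As a rough illustration of the rule above, the following sketch prints the extra flags for a given amount of static data. The helper name intel_flags, the exact threshold of 2^31 bytes, the default of -shared-intel and the file name prog.f90 are assumptions made for this example only; check the man page of your compiler version for the flag it actually expects.

```shell
# Hypothetical helper: print the extra ifort flags needed for a given amount
# of statically allocated data (in bytes). The 2 GiB threshold and the default
# of -shared-intel are assumptions; older compilers want -i-dynamic instead.
intel_flags() {
    bytes=$1
    flag=${2:--shared-intel}
    if [ "$bytes" -gt 2147483648 ]; then
        echo "-mcmodel=medium $flag"
    fi
}

# Example: a common block holding 300 million 8-byte reals (about 2.4 GB).
static_bytes=$(( 300000000 * 8 ))

# Only print the command here; prog.f90 is a placeholder file name.
echo ifort $(intel_flags "$static_bytes") -o prog prog.f90
```

For an older compiler, pass the other flag as second argument, e.g. intel_flags "$static_bytes" -i-dynamic.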

Stopping STAR-CD at the latest just before the wallclock time is exceeded

A similar approach to the one described for CFX is also possible for STAR-CD, as shown in the following snippet (thanks to one of our users for the feedback!):
#!/bin/bash -l
#PBS -l nodes=2:ppn=4
#PBS -l walltime=24:00:00
#PBS -N somename

# Change to the directory from which qsub was called
cd $PBS_O_WORKDIR

### add the module of the STAR-CD version, e.g. 4.02
module add star-cd/4.02_64bit

# specify the time needed to write the result and info files, e.g. 900 seconds
export TIME4SAVE=900

#automatically detect how much time this job requested and
#adjust the sleep accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE \
'{if ( $0 ~ /Resource_List.walltime/ ) \
{ split($3,duration,":"); \
print duration[1]*3600+duration[2]*60+duration[3]-t }}' `; \
star -abort ) >& /dev/null &
export SLEEP_ID=$!

# the normal STAR-CD start follows …
star -dp `cat $PBS_NODEFILE`

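# STAR-CD finished on its own: stop the background sleep so that
# "star -abort" is not triggered later
pkill -P $SLEEP_ID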
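
The walltime arithmetic in the awk one-liner can be tried out in isolation. The following is a minimal sketch of the same conversion as a standalone function; the name walltime_to_sleep and the clamping to zero are additions for this example and not part of the job script above.

```shell
# Convert a PBS walltime (HH:MM:SS) into the number of seconds to sleep
# before aborting, leaving $2 seconds for writing result and info files.
walltime_to_sleep() {
    echo "$1" | awk -F: -v t="$2" '{
        s = $1 * 3600 + $2 * 60 + $3 - t
        if (s < 0) s = 0      # never hand a negative value to sleep
        print s
    }'
}

walltime_to_sleep 24:00:00 900    # prints 85500
```

With the 24 hours requested above and TIME4SAVE=900, the abort would thus be sent 15 minutes before the walltime limit.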

Automatic requeuing of jobs if not enough licenses are available

A common problem with queuing systems and commercial software using floating licenses is that you cannot easily guarantee that the licenses you need are available when your job starts. Some queuing systems and schedulers can consider license usage – the solution at RRZE does not (at least not reliably).

A partial solution (although far from optimal) is outlined below. With effectively two additional lines in your job script you can at least ensure that your job gets requeued if not enough licenses are available, instead of just aborting. (The risk of undetected race conditions still exists, of course, and you may have to wait again for some time until compute resources are available for the requeued job, but that is better than only seeing the error message after the weekend.)

#!/bin/bash -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -N myjob

# it is important that "bash" is executed on the first line above!
# check for 16 hpcdomains and 1 starpar license and automatically
# requeue the job if not enough licenses are available right now.
# This check is based on the situation right now; it may
# change within the next second, thus, there is no guarantee
# that the license is still available in just a few moments.
# We do not check out, borrow or reserve anything here!
# CHANGE license server and feature list according to your needs!
# instead of $CDLMD_LICENSE_FILE you can use the PORT@SERVER syntax
/apps/rrze/bin/ -c $CDLMD_LICENSE_FILE hpcdomains 16 starpar 1

# the next line must follow immediately after the check above
# with no commands in between!
# (the "." at the beginning is also correct and important)
. /apps/rrze/bin/

# now continue with your normal tasks …
# if there were not enough licenses in the preliminary check,
# the script will not get this far because the job was requeued.

This approach is not at all limited to STAR-CD and should work on Cluster32 and Woody.
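
To illustrate the kind of check involved, here is a minimal sketch that decides from a usage line whether enough licenses of a feature are free. It is not the RRZE script: the helper name licenses_free is hypothetical, and the input format is an assumption, namely the "Users of FEATURE: (Total of N licenses issued; Total of M licenses in use)" lines that FlexLM-style status tools print. Like the check above, it only looks at the situation right now and reserves nothing.

```shell
# Hypothetical helper, not the RRZE script: reads lmstat-style text on stdin
# and succeeds only if at least $2 licenses of feature $1 are currently free.
licenses_free() {
    awk -v feat="$1" -v need="$2" '
        $1 == "Users" && $2 == "of" && $3 == (feat ":") {
            # Users of FEAT: (Total of N licenses issued; Total of M licenses in use)
            issued = $6
            used = $11
            found = 1
        }
        END { exit !(found && issued - used >= need) }
    '
}

# Sample line with made-up numbers: 32 issued, 20 in use, i.e. 12 free.
sample='Users of hpcdomains:  (Total of 32 licenses issued;  Total of 20 licenses in use)'

if echo "$sample" | licenses_free hpcdomains 16; then
    echo "enough licenses, continue"
else
    echo "not enough licenses, job should be requeued"
fi
```

With the sample line only 12 licenses are free, so the request for 16 fails and the second message is printed.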


ATTENTION: this approach does NOT work if license throttling is active, i.e. in cases where licenses are in principle available but the license server limits the number of licenses you or your group may get by using some MAX setting in the option file on the license server!

Most licenses at RRZE are throttled, so the two scripts used above are of limited use these days.