Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Content

STAR-CCM+ fails with “mpid: Not enough shared memory”

If STAR-CCM+ fails on large shared memory nodes with the message “mpid: Not enough shared memory”, your sysadmin might need to increase the kernel limits for SHMMAX (maximum size of shared memory segment in bytes), i.e. sysctl -w kernel.shmmax=.... Especially, the Ubuntu/Debian default of 32 MB seems to be too small even for 2-socket nodes with 8-core AMD Opteron processors, i.e. 16 cores/node …

Running STAR-CCM+ jobs on Woody

Running STAR-CCM+ jobs on Woody

We now have a first user who is using STAR-CCM+ in parallel on the Woody cluster. Starting jobs in batch mode seems to be quite easy. As STAR-CCM+ internally uses HP-MPI, Infiniband should automatically be used correctly, too (although I did not explicitly verify this yet).

Here is what this user currently uses (again no idea if automatic stopping actually works with STAR-CCM+, thus, there might be room for improvements):

[shell]
#!/bin/bash -l
#PBS -l nodes=2:ppn=4
#PBS -l walltime=24:00:00
#PBS -N some-jobname

cd $PBS_O_WORKDIR

module add star-ccm+/3.04.008

# specify the time you want to have to save results, etc.
export TIME4SAVE=800

# detect number of available CPUs (should be modified for Nehalems with active SMT)
ncpus=`cat $PBS_NODEFILE | wc -l`

# STAR-CCM+ starts a master plus N $ncpus slaves; on Woody it’s o.k. to
# oversubscribe the nodes in this way (i.e. ncpus+1 processes on ncpus
# however, on Nehalem nodes (e.g. TinyBlue) it seems to be a very had idea
# to avoid oversubscription, uncomment the following line
## ncpus=$(($ncpus-1))

# check if enough licenses should be available
/apps/rrze/bin/check_lic.sh -c $CDLMD_LICENSE_FILE ccmpsuite 1 hpcdomains $(($ncpus-1))
. /apps/rrze/bin/check_autorequeue.sh

export MPIRUN_OPTIONS=”-v -prot”
# or with pinning: e.g.
## export MPIRUN_OPTIONS=”-v -prot -cpu_bind=v,rank,v”
## export MPIRUN_OPTIONS=”-v -prot -cpu_bind=v,MAP_CPU:0,1,2,3,4,5,6,7,v”

# if there are messages about “mpid: Not enough shared memory” you may try to set
# the maximum shared memory size in bytes by hand – but usually the message means
# that there is really not enough memory available; so forget about this option!
## export MPI_GLOBMEMSIZE=…

export MPI_REMSH=ssh

# automatically detect how much time this batch job requested and adjust the
# sleep attempt;
# make sure you defined the “stop file” within STAR-CCM+ accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE \
‘{if ( $0 ~ /Resource_List.walltime/ ) \
{ split($3,duration,”:”); \
print duration[1]*3600+duration[2]*60+duration[3]-t }}’ `; \
touch ABORT ) >& /dev/null &
export SLEEP_ID=$!

starccm+ -batch -np $ncpus -machinefile $PBS_NODEFILE -load myjob.sim

pkill -P $SLEEP_ID
[/shell]

stopping STAR-CD at latest just before the wallclock time is exceeded

A similar approach to the one described for CFX in is also possible for STAR-CD as shown in the following snippet (Thanks to one of our users for the feedback!):
[shell]
#!/bin/bash -l
#PBS -l nodes=2:ppn=4
#PBS -l walltime=24:00:00
#PBS -N somename

# Change to the directory where qsub was made
cd $PBS_O_WORKDIR

### add the module of the STAR-CD version, e.g. 4.02
module add star-cd/4.02_64bit

# specify the time needed to write the result and info files, e.g. 900 seconds
export TIME4SAVE=900

#automatically detect how much time this job requested and
#adjust the sleep accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE \
‘{if ( $0 ~ /Resource_List.walltime/ ) \
{ split($3,duration,”:”); \
print duration[1]*3600+duration[2]*60+duration[3]-t }}’ `; \
star -abort ) >& /dev/null &
export SLEEP_ID=$!

# the normal STAR-CD start follows …
star -dp `cat $PBS_NODEFILE`

pkill -P $SLEEP_ID
[/shell]

Automatically requeuing of jobs if not enough licenses are available

A common problem with queuing systems and commercial software using floating licenses is that you cannot easily guarantee that the licenses you need are available when your job starts. Some queuing systems and schedulers can consider license usage – the solution at RRZE does not (at least not reliably).

A partial solution (although by far not optimal) is outlined below. With effectively two additional lines in your job script you can at least ensure that your job gets requeued if not enough licenses are available – and does not just abort. (The risk for race conditions which are not detected of course still exists, and you may have to wait again some time until compute resources are available for your new jobs … but better than only seeing the error message after the weekend …

[shell]
#!/bin/bash -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -N myjob

# it is important that “bash” is executed on the first line above!
#
# check for 16 hpcdomains and 1 starpar license and automatically
# requeue the job if not enough licenses are available right now.
# This check is based on the situation right now – it may
# change just in the next second, thus, there is no guarantee
# that the license is still available in just a few moments.
# We do not checkout, borrow or reserve anything here!
# CHANGE license server and feature list according to your needs!
# instead of $CDLMD_LICENSE_FILE you can use the PORT@SERVER syntax
/apps/rrze/bin/check_lic.sh -c $CDLMD_LICENSE_FILE hpcdomains 16 starpar 1

# the next line must follow immediately after the check_lic.sh line
# with no commands in between!
# (the “.” at the beginning is also correct and important)
. /apps/rrze/bin/check_autorequeue.sh

# now continue with your normal tasks …
# if there were not enough licenses in the preliminary check,
# the script will not come until here but it got requeued.
[/shell]

This approach is not at all limited to STAR-CD and should work on Cluster32 and Woody.

 

ATTENTION: this approach does NOT work if license throttling is active, i.e. in cases where licenses are in principle available but the license server limits the number of licenses you or your group may get by using some MAX setting in the option file on the license server!

Most licenses at RRZE are throttled, thus, the check_lic.sh and check_autorqueue.sh scripts are of limited use only these days.

Compiling user subroutins for STAR-CD at RRZE

STAR-CD can be extended with user subroutines. To compile the user code, a compatible Fortran compiler is required. Unfortunately, CD adapco’s standard compiler for Linux still is Absoft for which no license is available. However, for most versions of STAR-CD for Linux x86_64 at least, also a PGI version is available. Thanks to the financial engagement of one of the main users’ group, a PGI license can now be used on the frontends of the Woody cluster (and also on sfront03). If you have user subroutines, please generally use one of the 64-bit PGI-STAR-CD versions (usually but not always with “pgi” in the module name), to compile the user code login to woody.rrze (or sfront03.rrze if you run your simulations e.g. on the opteron queue of Cluster32 (“Transtec cluster”), load the appropriate STAR-CD module and “pgi/6.2-2” and compile your user routines using star [-dp] -ufile. Now, you can submit your code as usual. Automatic compilation of the user subroutines from within a job file may or may not work. Thus, please compile them in advance before hand.

An important note: STAR-CD as of 3.2x does not work together with the latest PGI versions (7.x). Thus, you have to explicitly select version 6.2-2 of the PGI compiler; nobody tested yet which PGI versions are compatible with STAR-CD 4.0x … please drop a note if you are the first volunteer.

Attention: Interactive access to thor.rrze for compiling user code is no longer possible (and also no longer necessary). Use the woody login nodes or sfront03 instead.

The description in my previous article thus in principle is still valid.

STAR-CD 4.06 has just been installed. If the corresponding module is loaded, the appropriate PGI compiler module will be loaded automatically.

Existing star.reg files may cause problems: With STAR-CD 4.06 we just discovered a very strange behavior: if there is a star.reg file present, star -ufile tries to ssh to the first node listed in PNP_HOSTS (if present in star.reg and do the compilation there – which of course usually fails as users are not allowed to login to batch nodes without having a job running right now on them. The definition of a COMPILERHOST also is no solution as (1) star.reg is still evaluated and (2) STAR-CD tries to make a ssh connection to COMPILERHOST which works but then fails as there is no PGI module loaded.
To sum up: if you have to compile user subroutines, do this on the login nodes by calling star [-dp] -ufile but make sure that there is no star.reg file in the current directory. As star.reg does not contain critical information, it should be safe to just delete it if is is in the way.

Common license pool for STAR-CD probably continues for next three years

It took quite long until a solution for prolonging the joint license pool with an increased number of licenses for parallel runs could be found. But everything seems to be solved now for the next three years.

Further chairs can join at any time – of course, the license may only be used for education and scientific research (and not for industrial research or projects). If additional groups join, this will not increase the total costs (unsless additional license features will be required) but reduce the amount the individual groups have to pay annually …

Also check some notes if you use STAR-CD on RRZE’s new parallel computer Woody.

Running STAR-CD over Infiniband

STAR-CD 4.02 works out of the box; there are currently some warnings ERROR: ld.so: object 'libmpi.so' from LD_PRELOAD cannot be preloaded: ignored. As these messages seem to be uncritical, I’m not sure if I’ll further debug their cause.

STAR-CD 3.26 also works out of the box.

User subroutines are not yet tested; some additional steps will be required to get them compiled as the required PGI compiler is not installed locally…

In first tests, star -chkpnt failed with the message TAR checkpoint failed due to invalid "star.pst" file. or TAR checkpoint failed due to invalid "star.ccm" file.

=====================================================================

Using STAR-CD in principle works as follows (access to the STAR-CD module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not supposed for interactive work.
    If you have to use the RRZE systems for some reason for pre/postprocessing, do not start prostar, etc. on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    [shell]
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the “-l” is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N STARCD-woody
    #… any other PBS option you like

    # let’s go to the directory where the script was submitted
    cd $PBS_O_WORKDIR

    # load the STAR-CD module; either “star-cd/3.26_64bit” or “star-cd/4.02_64bit”
    module add star-cd/3.26_64bit

    # here we go
    star -dp `cat $PBS_NODEFILE`
    [/shell]

  • submit your job to the PBS batch system using qsub
  • wait until the job finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable

Some more details on “ERROR: ld.so: object ‘libmpi.so’ from LD_PRELOAD cannot be preloaded: ignored” message: Looking into the script which is actually used to call the STAR-CD binary, I can guess where the message might come from. They use something like $HPMPI/bin/mpirun ... -e LD_PRELOAD=libmpi$PNP_DSO ... -f .starboot.mpi, i.e. they do not specify a path for the library to be preloaded. I’m not sure what the current policy of ld.so from glibc is (does it look at the current LD_LIBRARY_PATH – if set) or does it only look at “secure” (predefined) directories. If the latter is the case, star of course cannot find the library …

Problems with STAR-CD-4.02 (64-bit) and Infibiband: Running the PGI variant of STAR-CD-4.02 (64-bit) over Infiniband (i.e. using HP-MPI) currently may fail on our new Woody-cluster as well as on the Infiniband partition of the Transtec-cluster. The observations (currently based on just a signle testcase) are as follows:

  • STAR-CD-4.02/PGI using HP-MPI/VPAI runs fine for a few iterations but then suddenly stops to consume CPU time on most of the nodes.
  • STAR-CD-4.02/PGI using HP-MPI/TCP or mpich runs fine.
  • STAR-CD-4.02/Absoft runs fine even with HP-MPI/VAPI!

Further tests are currently on their way …

For the moment, module add star-cd/4.02_64bit; star -mpi=hp -mppflags="-v -prot -TCP" is the recommended way of starting STAR-CD-4.02.

Sometimes also problems of STAR-CD 3.26 with Infiniband: According to a user report, also STAR-CD 3.26 with HP-MPI over Infiniband has the problem that it suddenly stops to run. It seems the the AMG-preconditioner is the reason for the problems.

So, check if Infiniband runs fine for your cases, if not (and only if not) add -mpi=hp -mppflags="-v -prot -TCP"

Upgraded IB stack seems to solve Infiniband problems: Updating from the Voltaire ibhost-stack to the Voltaire GridStack 4.1 (which is OFED-1.1 based) seems to have solved the issue with hanging STAR-CD processes. Please try to run without the argument -TCP!

As a technical note: the HP-MPI version which comes with STAR-CD 4.02 or 3.26 is too old to work with OFED; thus, the latest HP-MPI (i.e. 2.02.05) has been installed on Woody. The module files for STAR-CD have been adapted to automatically use this updated version. Your output (if -v -prot is used) should now show IBV instead of VAPI if the high speed network is used.

Using IPoverIB as fallback: If native Infiniband does not work even with the upgraded IB stack (e.g. due to a bug in connection with AMG), try IP-over-IB by using -mppflags="-v -prot -TCP -netaddr 10.188.84.0/255.255.254.0".

Another option available in HP-MPI (which is completely unrelated to the communication network but I’m to lazy to create an other thread right now) is the ability to pin processes to specific CPUs of a node. On woody, -mppflags="... -cpu_bind=v,map_cpu:0,1,2,3 ..." is the right choice if you run with 4 CPUs per node; if you have huge memory requirements and thus only use every second core, the correct option would be -mppflags="... -cpu_bind=v,map_cpu:0,1 ...".

STAR-CD 4.0 available

Since resently, STAR-CD 4.0 is available for x86 and x86_64. Unfortunately, the currently builds again require the ABSOFT compiler to include user subroutines 🙂 Further builds and support or IA64 and MS Windows hopefully apprear in the following months …

STAR-CD 4.0 has already installed on Cluster32. All STAR-CD users are encouraged to test this new version. Iy you are using user subroutines, some changes are probably required as CD-adapco moved from Fortran77 to Fortran90 and eliminated one common data structure. Check the documentation for more details.

For those, running STAR-CD on their local systems, the installation files are available at the usual place.