Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things


Intel compiler and -mcmodel=...

As this seems to have become a FAQ – although it is documented as a small note in the ifort man page: if the 64-bit Intel compilers (for EM64T/Opteron, but not for IA64) are used and statically allocated data (e.g. in Fortran common blocks) exceeds 2 GB, the -mcmodel=medium or -mcmodel=large switch must be used. As a consequence, the -i-dynamic or -shared-intel flag (depending on the compiler version) must also be specified for the linking step, otherwise strange "relocation truncated" error messages occur for libifcore.a routines during linking.
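
A minimal sketch of what such a compile/link line could look like (the source file name prog.f90 is just a placeholder; -shared-intel is used by the 10.x compilers, while the older 9.1 compilers want -i-dynamic instead):

[shell]
# compile a Fortran code whose static data exceeds 2GB
ifort -c -O2 -mcmodel=medium prog.f90

# link dynamically against the Intel runtime libraries
ifort -o prog -mcmodel=medium -shared-intel prog.o
# with the 9.1 compilers the linking step would be:
# ifort -o prog -mcmodel=medium -i-dynamic prog.o
[/shell]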

Automatic requeuing of jobs if not enough licenses are available

A common problem with queuing systems and commercial software using floating licenses is that you cannot easily guarantee that the licenses you need are available when your job starts. Some queuing systems and schedulers can consider license usage – the solution at RRZE does not (at least not reliably).

A partial solution (although by far not optimal) is outlined below. With effectively two additional lines in your job script you can at least ensure that your job gets requeued if not enough licenses are available – instead of just aborting. (The risk of undetected race conditions of course remains, and you may again have to wait some time until compute resources become available for the requeued job … but that is better than only seeing the error message after the weekend.)

[shell]
#!/bin/bash -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -N myjob

# it is important that "bash" is executed on the first line above!
#
# check for 16 hpcdomains and 1 starpar license and automatically
# requeue the job if not enough licenses are available right now.
# This check is based on the situation right now – it may
# change just in the next second, thus, there is no guarantee
# that the license is still available in just a few moments.
# We do not checkout, borrow or reserve anything here!
# CHANGE license server and feature list according to your needs!
# instead of $CDLMD_LICENSE_FILE you can use the PORT@SERVER syntax
/apps/rrze/bin/check_lic.sh -c $CDLMD_LICENSE_FILE hpcdomains 16 starpar 1

# the next line must follow immediately after the check_lic.sh line
# with no commands in between!
# (the "." at the beginning is also correct and important)
. /apps/rrze/bin/check_autorequeue.sh

# now continue with your normal tasks …
# if there were not enough licenses in the preliminary check,
# the script never reaches this point but the job got requeued.
[/shell]

This approach is not at all limited to STAR-CD and should work on Cluster32 and Woody.

 

ATTENTION: this approach does NOT work if license throttling is active, i.e. in cases where licenses are in principle available but the license server limits the number of licenses you or your group may get by using some MAX setting in the option file on the license server!

Most licenses at RRZE are throttled, thus the check_lic.sh and check_autorequeue.sh scripts are only of limited use these days.

CFX-11.0SP1 and Windows CCS

According to Ansys, CFX-11 is not supported on Windows CCS. According to Microsoft, running CFX-11 on Windows CCS is no problem at all …

When we first tried some months ago on our small Windows CCS test cluster, we suffered from very strange error messages complaining about wrong paths, etc. even if only the GUIs were started on the terminal server (login node). Thus, we gave up again quite soon as diving into all details did not seem to be worth the effort.

I now tried again with a fresh installation of ANSYS/CFX-11 on our Windows CCS headnode – and what a surprise, no error messages any more.

At least for the CFX solver itself, it seems not to be a problem if UNC paths are listed in ...\ANSYS Inc\v110\CFX\conf\hosts.ccl for the compute nodes – although the CFX documentation mentions that UNC paths do not work for the CFX solver.

The default settings in ...\ANSYS Inc\v110\CFX\cfxccs.pl are a bit strange – we do not have a \\headnode\cfxworkdir\ directory where everyone is allowed to write, and it also seems a bad idea to always use just stdout and stderr as names for the output files. Thus, I locally made the following changes to cfxccs.pl in the shared directory of the CCS headnode:

  my $workingdirectory='//'.$hostname.'/ccsshare/'.$ENV{USERNAME};
  my $stdoutfile=$workingdirectory.'/stdout.%CCP_JOBID%.txt';
  my $stderrfile=$workingdirectory.'/stderr.%CCP_JOBID%.txt';

Now, first, all output goes into the user’s own shared directory, and second, the stdout/stderr files get the job ID appended (plus the suffix .txt so that they are automatically opened with a text editor).

Initial tests look fine – let’s see what the users say.

fork and OFED Infiniband stack

Attention: OFED disallows system(const char*) or fork/exec after initializing the Infiniband libraries. Some documentation mentions the following:
… the Mellanox InfiniBand driver has issues with buffers sharing pages when fork() is used. Pinned (locked in memory) pages are normally marked copy-on-write during a fork. If a page is pinned before a fork and subsequently written to while RDMA operations are being performed on the same page, silent data corruption can occur as RDMA operations continue to stream data to a page that has moved. To avoid this, the Mellanox driver does not use copy-on-write behavior during a fork for pinned pages. Instead, access to these pages by the child process will result in a segmentation violation.
Fork support from kernel 2.6.12 and above is available provided that applications do not use threads. fork() is supported as long as the parent process does not run before the child exits or calls exec(). The former can be achieved by calling wait(childpid), the latter by application-specific means. The Posix system() call is supported.

Woody is running a SuSE SLES9 kernel, i.e. 2.6.5. Thus, no support for fork and similar things!

Some users have already hit this problem! Even a Fortran user who had call system('some command') in his code! In the latter case, the application just hung in some (matching) MPI_Send/MPI_Recv calls.

building mvapich/mvapich2 for woody

Just as an internal note for ourselves: mvapich and mvapich2 heavily depend on the underlying Infiniband stack. Therefore, both MPI libraries had to be recompiled when changing from the Voltaire IBHOST-3.4.5 stack to GridStack 4.1.5 (aka OFED-1.1).

As mvapich/mvapich2 are statically linked into executables, those applications probably have to be recompiled, too.
On the other hand, Intel-MPI (which is the default on woody) is not affected, thus most of the users’ applications will not require recompilation.
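
A quick way to check whether a given binary is affected might look as follows (a.out is of course only a placeholder): dynamically linked Intel-MPI executables list a shared MPI library in their ldd output, whereas statically linked mvapich/mvapich2 executables do not, but typically still contain MVAPICH-related symbols or strings.

[shell]
# dynamically linked against a shared MPI library (Intel MPI case)?
ldd ./a.out | grep -i mpi

# otherwise: look for statically linked MVAPICH symbols/strings
nm ./a.out 2>/dev/null | grep -i mvapich | head
strings ./a.out | grep -i mvapich | head
[/shell]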

The RRZE versions of mvapich-0.9.8 and mvapich2-0.9.8 were now compiled using:

module load intel64-c/9.1.049 intel64-f/9.1.045
export CC=icc
export CXX=icpc
export F77=ifort
export F90=ifort
export ARCH=_EM64T_
export OPEN_IB_HOME=/usr/local/ofed
export PREFIX=/apps/mvapich2/0.9.8-gridstack-4.1.5
./make.mvapich[2].gen2

some notes on using Intel Trace Collector (ITC/ITA)


Preliminary note: these are just some “random” notes for myself …

  • Intel 10.0 (beta) compilers introduced -tcollect switch
    • inserts instrumentation probes calling the ITC API, i.e. shows function names instead of just “user code” in the trace
    • has to be used for compilation and linking (a minimal usage sketch follows after this list)
    • not sure yet if the ITC libraries have to be specified during linking – but probably one wants to add “$ITC_LIB” (defined by RRZE) anyway
    • works fine (at least in initial tests)
  • source code locations are not included in the trace file by default; compile with -g and set VT_PCTRACE to some value (either ON or the call level to be recorded, with optional skip levels); see ITC-RefGuide Sec. 3.8, p. 19/20 and 95
  • OS counters can be recorded; VT_COUNTER variable has to be set accordingly; see ITC-RefGuide Sec. 3.10, p. 22
  • VT_LOGFILE_NAME can be used to specify the name of the tracefile
  • to change colors, one has to use the “all functions” item as the color toggle is deactivated in the default “generated groups/major functions” item
  • unfortunately, I did not yet find a “redraw” button – and changing the colors only redisplays the current chart automatically
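
A minimal usage sketch putting the above together (the mpiifort/mpirun wrappers and the file name prog.f90 are assumptions that may have to be adapted; $ITC_LIB follows the RRZE convention mentioned above):

[shell]
# compile and link with ITC instrumentation and debug info;
# $ITC_LIB is the RRZE-provided linker line for the trace collector
mpiifort -tcollect -g -c prog.f90
mpiifort -tcollect -g -o prog prog.o $ITC_LIB

# record source code locations and give the trace file a sensible name
export VT_PCTRACE=ON
export VT_LOGFILE_NAME=prog_trace.stf

mpirun -np 4 ./prog
[/shell]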

mpiexec + taskset: code snippet to generate config file


We usually use Pete Wyckoff’s mpiexec to start MPI processes from within batch jobs. The following code snippet may be used to pin individual MPI processes to their CPU. A similar snippet could be used for a round-robin distribution of the MPI processes (a sketch follows after the script below) …

[shell]
#!/bin/bash
# usage: run.sh 8 "0 2 4 6" ./a.out "-b -c"
# start a.out with arguments -b -c
# use 8 MPI processes in total, 4 per node
# pin the processes of each node to CPUs 0,2,4,6

NUMPROC=$1
TASKCPUS="$2"
EXE=$3
ARGS=$4

TASKS=`echo $TASKCPUS | wc -w`
echo "running $NUMPROC MPI processes with $TASKS per node"

cat /dev/null > cfg.$PBS_JOBID
for node in `uniq $PBS_NODEFILE`; do
  for cpu in $TASKCPUS; do
    echo "$node : taskset -c $cpu $EXE $ARGS" >> cfg.$PBS_JOBID
    NUMPROC=$((NUMPROC - 1))
    if [ $NUMPROC -eq 0 ]; then
      break 2
    fi
  done
done
mpiexec -comm pmi -kill -config cfg.$PBS_JOBID
[/shell]
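
As mentioned above, a round-robin variant is also conceivable. A minimal (untested) sketch, assuming the same calling convention as the script above, could place consecutive MPI ranks on alternating nodes:

[shell]
#!/bin/bash
# usage: run-rr.sh 8 "0 2 4 6" ./a.out "-b -c"
NUMPROC=$1
TASKCPUS="$2"
EXE=$3
ARGS=$4

nodes=(`uniq $PBS_NODEFILE`)
cpus=($TASKCPUS)

cat /dev/null > cfg.$PBS_JOBID
i=0
while [ $i -lt $NUMPROC ]; do
  # rank i goes to node (i mod number-of-nodes) ...
  node=${nodes[$(( i % ${#nodes[@]} ))]}
  # ... and uses the next free CPU slot on that node
  cpu=${cpus[$(( (i / ${#nodes[@]}) % ${#cpus[@]} ))]}
  echo "$node : taskset -c $cpu $EXE $ARGS" >> cfg.$PBS_JOBID
  i=$((i + 1))
done
mpiexec -comm pmi -kill -config cfg.$PBS_JOBID
[/shell]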

Vector-TRIAD on woody using different versions of the Intel EM64T Fortran compiler

Switching from one compiler version to another can have a significant influence on performance, but even moving one patch level ahead may change your performance …

The Vector-TRIAD benchmark (a(:)=b(:)+c(:)*d(:) according to Schoenauer) was run on the new Woodcrest cluster at RRZE, which consists of HP DL140G3 boxes. The performance is given in MFlop/s for a loop length of 8388608. The value is the aggregate performance of 4 MPI processes running on the node in saturation mode.

SNOOP filter of the 5000X chipset enabled

Performance in MFlop/s for loop length 8388608
on 2-socket Woodcrest node with 4 MPI processes
compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 374.0   | 352.5
9.1-039 | 374.1   | 352.3
9.1-045 | 359.0 ! | 352.4
10.0-13 | 373.4   | 377.6 !
10.0-17 | 373.7   | 352.0

SNOOP filter of the 5000X chipset disabled (switching the snoop filter off only works with the latest BIOS (v1.12) released on April 16!)

compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 376     | 332
9.1-039 | 376     | 331
9.1-045 | 341  !! | 331
10.0-13 | 376     | 380 !!
10.0-17 | 376     | 331

The “default” version always refers to arrays which were known at compile time; “USE_COMMON” means that the arrays have additionally been put into a common block.

 

And for reference the STREAM values in MB/s for 4 OpenMP threads (and added NONTEMPORAL directives and Array size = 20000000, Offset = 0) are also given:

      Snoopfilter on | Snoopfilter off
Function Rate (MB/s) | Rate (MB/s)
Copy:      7492.0444 | 6178.7991
Scale:     7485.3591 | 6174.8763
Add:       6145.5004 | 6180.6296
Triad:     6152.6559 | 6189.2369

The results were more or less identical when using fce-9.1.039 and fce-9.1.045!

Reasons for performance differences: The main reason for the performance differences is unrolling (add -unroll0 to avoid it – thanks to Intel for pointing this out). The 10.0 compilers seem to be much more aggressive with optimization and vectorization. By default, non-temporal stores may now be used automatically in certain cases by the 10.0 compiler. If you want to avoid that, use -opt-streaming-stores never. Even if non-temporal stores are disabled via the command line, the compiler directive vector nontemporal will still be respected. A directive to avoid non-temporal stores for a specific loop only is not (yet) available.
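
For illustration, the compile lines used to compare these effects might look as follows (triad.f90 is only a placeholder for the benchmark source):

[shell]
# default optimization (unrolling; the 10.0 compilers may also emit non-temporal stores)
ifort -O3 -o triad_default triad.f90

# disable unrolling, the main source of the observed differences
ifort -O3 -unroll0 -o triad_nounroll triad.f90

# 10.0 compilers only: additionally suppress automatic non-temporal stores
ifort -O3 -unroll0 -opt-streaming-stores never -o triad_nostream triad.f90
[/shell]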

Running STAR-CD over Infiniband

STAR-CD 4.02 works out of the box; there are currently some warnings of the form ERROR: ld.so: object 'libmpi.so' from LD_PRELOAD cannot be preloaded: ignored. As these messages seem to be uncritical, I’m not sure whether I’ll debug their cause any further.

STAR-CD 3.26 also works out of the box.

User subroutines are not yet tested; some additional steps will be required to get them compiled as the required PGI compiler is not installed locally…

In first tests, star -chkpnt failed with the message TAR checkpoint failed due to invalid "star.pst" file. or TAR checkpoint failed due to invalid "star.ccm" file.

=====================================================================

Using STAR-CD in principle works as follows (access to the STAR-CD module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not intended for interactive work.
    If you have to use the RRZE systems for some reason for pre/postprocessing, do not start prostar, etc. on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    [shell]
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the “-l” is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N STARCD-woody
    #… any other PBS option you like

    # let’s go to the directory where the script was submitted
    cd $PBS_O_WORKDIR

    # load the STAR-CD module; either “star-cd/3.26_64bit” or “star-cd/4.02_64bit”
    module add star-cd/3.26_64bit

    # here we go
    star -dp `cat $PBS_NODEFILE`
    [/shell]

  • submit your job to the PBS batch system using qsub
  • wait until the job has finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable

Some more details on the “ERROR: ld.so: object ‘libmpi.so’ from LD_PRELOAD cannot be preloaded: ignored” message: looking into the script which actually calls the STAR-CD binary, I can guess where the message might come from. It uses something like $HPMPI/bin/mpirun ... -e LD_PRELOAD=libmpi$PNP_DSO ... -f .starboot.mpi, i.e. no path is specified for the library to be preloaded. I’m not sure about the current policy of ld.so from glibc: does it consult LD_LIBRARY_PATH (if set), or does it only look in “secure” (predefined) directories? If the latter is the case, star of course cannot find the library …
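
A small experiment to narrow this down could look like the following (the full library path is only an assumption and has to be adapted to the local HP-MPI installation):

[shell]
# reproduce the warning: preload by bare file name, as the STAR-CD script does
env LD_PRELOAD=libmpi.so /bin/true

# does preloading work when the full path is given instead?
# (the path below is hypothetical – adapt it to the local HP-MPI installation)
env LD_PRELOAD=/opt/hpmpi/lib/linux_amd64/libmpi.so /bin/true

# would ld.so find libmpi.so via the standard search path at all?
ldconfig -p | grep libmpi
echo $LD_LIBRARY_PATH
[/shell]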

Problems with STAR-CD-4.02 (64-bit) and Infiniband: Running the PGI variant of STAR-CD-4.02 (64-bit) over Infiniband (i.e. using HP-MPI) currently may fail on our new Woody cluster as well as on the Infiniband partition of the Transtec cluster. The observations (currently based on just a single test case) are as follows:

  • STAR-CD-4.02/PGI using HP-MPI/VAPI runs fine for a few iterations but then suddenly stops consuming CPU time on most of the nodes.
  • STAR-CD-4.02/PGI using HP-MPI/TCP or mpich runs fine.
  • STAR-CD-4.02/Absoft runs fine even with HP-MPI/VAPI!

Further tests are currently under way …

For the moment, module add star-cd/4.02_64bit; star -mpi=hp -mppflags="-v -prot -TCP" is the recommended way of starting STAR-CD-4.02.

Sometimes STAR-CD 3.26 also has problems with Infiniband: According to a user report, STAR-CD 3.26 with HP-MPI over Infiniband also shows the problem that it suddenly stops running. It seems that the AMG preconditioner is the reason for the problems.

So, check whether Infiniband runs fine for your cases; if not (and only if not), add -mpi=hp -mppflags="-v -prot -TCP".

Upgraded IB stack seems to solve Infiniband problems: Updating from the Voltaire ibhost-stack to the Voltaire GridStack 4.1 (which is OFED-1.1 based) seems to have solved the issue with hanging STAR-CD processes. Please try to run without the argument -TCP!

As a technical note: the HP-MPI version which comes with STAR-CD 4.02 or 3.26 is too old to work with OFED; thus, the latest HP-MPI (i.e. 2.02.05) has been installed on Woody. The module files for STAR-CD have been adapted to automatically use this updated version. Your output (if -v -prot is used) should now show IBV instead of VAPI if the high speed network is used.

Using IPoverIB as fallback: If native Infiniband does not work even with the upgraded IB stack (e.g. due to a bug in connection with AMG), try IP-over-IB by using -mppflags="-v -prot -TCP -netaddr 10.188.84.0/255.255.254.0".

Another option available in HP-MPI (which is completely unrelated to the communication network, but I’m too lazy to create another thread right now) is the ability to pin processes to specific CPUs of a node. On woody, -mppflags="... -cpu_bind=v,map_cpu:0,1,2,3 ..." is the right choice if you run with 4 CPUs per node; if you have huge memory requirements and thus only use every second core, the correct option would be -mppflags="... -cpu_bind=v,map_cpu:0,1 ...".

Running CFX over Infiniband

Running CFX-10 over Infiniband and using user subroutines should work now – of course you must have a valid license!

In principle it works as follows (access to the CFX module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not intended for interactive work.
    If you have to use the RRZE systems for some reason for pre/postprocessing, do not start cfx5pre/cfx5post directly on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    [shell]
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the “-l” is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N CFX-woody
    #… any other PBS option you like

    # let’s go to the directory where the script was submitted
    cd $PBS_O_WORKDIR

    # transfer the list of nodes in a format suitable for CFX
    nodes=`cat $PBS_NODEFILE`
    nodes=`echo $nodes | sed -e 's/ /,/g'`

    # load the (recommended) CFX module
    module add cfx

    # here we go
    cfx5solve -name woody-$PBS_JOBID -size-mms 2.0 -def xyz.def -double \
    -par-dist $nodes -start-method "HP MPI Distributed Parallel for x86 64"
    [/shell]

  • submit your job to the PBS batch system using qsub
  • wait until the job has finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable

=====================================================================

Some special “tuning” of the PBS script is of course possible, e.g. to stop the simulation at the latest just before the wallclock time is exceeded:

[shell]
# specify the time you want to have to save results, etc.
export TIME4SAVE=600
# directory where CFX will be running
export SIMDIR=woody-${PBS_JOBID}_001.dir
# automatically detect how much time this batch job requested and adjust the
# sleep accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE \
'{if ( $0 ~ /Resource_List.walltime/ ) \
{ split($3,duration,":"); \
print duration[1]*3600+duration[2]*60+duration[3]-t }}' `; \
cfx5stop -dir $SIMDIR ) >& /dev/null &
export SLEEP_ID=$!
cfx5solve -name woody-$PBS_JOBID …
pkill -P $SLEEP_ID
[/shell]

This “add-on” is only for advanced users who hopefully understand what this code does …

UPDATE 2012-10-24: recent Ansys/CFX versions have a command line argument to limit the wall clock time:
-max-elapsed-time <elapsed time>
Set the maximum elapsed time (wall clock time) that the ANSYS CFX Solver will run. The elapsed time must be given in quotes and have correct units in square brackets (e.g. -maxet "10 [min]" or -maxet "5 [hr]").

=====================================================================

Some more technical notes which are not really relevant for pure users … they just document the steps which were required to get everything up and running.

CFX-10 (and its service packs) comes with a bundled version of HP-MPI. No extra license is required to use HP-MPI together with CFX. Just specify
-start-method "HP MPI Distributed Parallel for x86 64"
and HP-MPI automatically selects the fastest interconnect available.
However, the version distributed by Ansys (HP-MPI 2.1-2) does not work together with the Infiniband stack installed on the new Woody cluster or the “old” Infiniband part of the Transtec cluster. Therefore, the current version of HP-MPI (2.2.0.2) must be downloaded from hp.com and installed manually.
CFX-10.0/etc/start-methods.ccl has to be adapted to reflect the new path (unless one does some dirty directory renames/links).

By default, CFX also tries to connect to remote hosts via rsh. Setting the environment variable CFX5RSH to ssh solves this problem. It might also be necessary to set the environment variable MPI_REMSH to ssh so that HP-MPI also uses the correct piece of software. Using PBS-mpiexec probably does not work with HP-MPI; otherwise some further hacking in CFX-10.0/etc/start-methods.ccl might also be possible.

To force the use of the Infiniband interconnect, the environment variable MPIRUN_OPTIONS may be set to -VAPI – note the capital letters.
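
Putting these environment settings together, the corresponding lines in a job script might look like this (a sketch only – whether all three variables are needed depends on the actual setup):

[shell]
# use ssh instead of rsh for CFX itself and for HP-MPI remote startup
export CFX5RSH=ssh
export MPI_REMSH=ssh

# force HP-MPI to use the Infiniband (VAPI) interconnect – note the capital letters
export MPIRUN_OPTIONS=-VAPI
[/shell]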

To support user subroutines, a PGI Fortran compiler is required to compile the user code once (using cfx5mkext *.F and cfx5mkext -double *.F). As we do not have the PGI compiler installed on the Woody cluster, compilation has to be done elsewhere. The required PGI runtime libraries were copied (in accordance with the PGI license agreement) from a PGI 6.1.4 installation to the directory CFX-10.0/lib/linux-amd64, where they are found from within CFX without setting any special LD_LIBRARY_PATH.