Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Running STAR-CD over Infiniband

STAR-CD 4.02 works out of the box. It currently prints some warnings of the form ERROR: ld.so: object 'libmpi.so' from LD_PRELOAD cannot be preloaded: ignored. As these messages seem to be harmless, I’m not sure whether I’ll debug their cause any further.

STAR-CD 3.26 also works out of the box.

User subroutines have not been tested yet; some additional steps will be required to get them compiled, as the required PGI compiler is not installed locally…

In first tests, star -chkpnt failed with the message TAR checkpoint failed due to invalid "star.pst" file. or TAR checkpoint failed due to invalid "star.ccm" file.

=====================================================================

Using STAR-CD in principle works as follows (access to the STAR-CD module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not intended for interactive work.
    If you have to use the RRZE systems for pre/postprocessing for some reason, do not start prostar, etc. on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    [shell]
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the “-l” is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N STARCD-woody
    #… any other PBS option you like

    # let’s go to the directory where the script was submitted
    cd $PBS_O_WORKDIR

    # load the STAR-CD module; either “star-cd/3.26_64bit” or “star-cd/4.02_64bit”
    module add star-cd/3.26_64bit

    # here we go
    star -dp `cat $PBS_NODEFILE`
    [/shell]

  • submit your job to the PBS batch system using qsub
  • wait until the job has finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable

Some more details on the “ERROR: ld.so: object ‘libmpi.so’ from LD_PRELOAD cannot be preloaded: ignored” message: Looking at the script that actually calls the STAR-CD binary, I can guess where the message might come from. It uses something like $HPMPI/bin/mpirun ... -e LD_PRELOAD=libmpi$PNP_DSO ... -f .starboot.mpi, i.e. it does not specify a path for the library to be preloaded. I’m not sure what the current policy of ld.so from glibc is: does it look at the current LD_LIBRARY_PATH (if set), or does it only look at “secure” (predefined) directories? If the latter is the case, star of course cannot find the library …
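One way to narrow this down is to check whether a file with that name is present in any LD_LIBRARY_PATH directory at all. The following is only a minimal sketch: the function name find_in_ld_library_path is made up for illustration, and it performs a plain directory search, which is much simpler than the actual lookup rules of ld.so.

```shell
#!/bin/bash
# Hypothetical helper: print the first match for a library file name in the
# directories listed in LD_LIBRARY_PATH. This is a plain path search and does
# NOT reproduce the full lookup logic of ld.so.
find_in_ld_library_path() {
    local lib=$1 dir
    local IFS=:
    for dir in $LD_LIBRARY_PATH; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            echo "$dir/$lib"
            return 0
        fi
    done
    return 1
}

# Check for the library named in the warning
find_in_ld_library_path libmpi.so || echo "libmpi.so not found via LD_LIBRARY_PATH"
```

If nothing is found after loading the STAR-CD module, the missing path in the LD_PRELOAD setting would be a plausible explanation for the warning.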

Problems with STAR-CD-4.02 (64-bit) and Infiniband: Running the PGI variant of STAR-CD-4.02 (64-bit) over Infiniband (i.e. using HP-MPI) may currently fail on our new Woody cluster as well as on the Infiniband partition of the Transtec cluster. The observations (currently based on just a single test case) are as follows:

  • STAR-CD-4.02/PGI using HP-MPI/VAPI runs fine for a few iterations but then suddenly stops consuming CPU time on most of the nodes.
  • STAR-CD-4.02/PGI using HP-MPI/TCP or mpich runs fine.
  • STAR-CD-4.02/Absoft runs fine even with HP-MPI/VAPI!

Further tests are currently on their way …

For the moment, module add star-cd/4.02_64bit; star -mpi=hp -mppflags="-v -prot -TCP" is the recommended way of starting STAR-CD-4.02.

Occasional problems of STAR-CD 3.26 with Infiniband as well: According to a user report, STAR-CD 3.26 with HP-MPI over Infiniband also suddenly stops running. It seems that the AMG preconditioner is the reason for the problems.

So, check whether Infiniband runs fine for your cases; if not (and only if not), add -mpi=hp -mppflags="-v -prot -TCP".

Upgraded IB stack seems to solve Infiniband problems: Updating from the Voltaire ibhost-stack to the Voltaire GridStack 4.1 (which is OFED-1.1 based) seems to have solved the issue with hanging STAR-CD processes. Please try to run without the argument -TCP!

As a technical note: the HP-MPI version which comes with STAR-CD 4.02 or 3.26 is too old to work with OFED; thus, the latest HP-MPI (i.e. 2.02.05) has been installed on Woody. The module files for STAR-CD have been adapted to automatically use this updated version. Your output (if -v -prot is used) should now show IBV instead of VAPI if the high speed network is used.

Using IPoverIB as fallback: If native Infiniband does not work even with the upgraded IB stack (e.g. due to a bug in connection with AMG), try IP-over-IB by using -mppflags="-v -prot -TCP -netaddr 10.188.84.0/255.255.254.0".
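The three transport variants mentioned so far (native Infiniband, plain TCP, IP-over-IB) differ only in the value passed to -mppflags. A small sketch of how a batch script could switch between them is shown below; the function name mppflags_for and the mode names ib, tcp and ipoib are invented for this example, while the flags themselves are the ones quoted above.

```shell
#!/bin/bash
# Hypothetical helper: print the value for star's -mppflags option.
# Mode names (ib, tcp, ipoib) are made up; the flags are taken from the text.
mppflags_for() {
    case "$1" in
        ib)    echo "-v -prot" ;;
        tcp)   echo "-v -prot -TCP" ;;
        ipoib) echo "-v -prot -TCP -netaddr 10.188.84.0/255.255.254.0" ;;
        *)     echo "unknown transport: $1" >&2; return 1 ;;
    esac
}

# In a batch script this might be used roughly as follows (not executed here):
#   star ... -mpi=hp -mppflags="$(mppflags_for tcp)"
echo "-mppflags=\"$(mppflags_for ipoib)\""
```

Keeping -v -prot in all three variants makes the output show which interconnect was actually selected.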

Another option available in HP-MPI (which is completely unrelated to the communication network, but I’m too lazy to create another thread right now) is the ability to pin processes to specific CPUs of a node. On Woody, -mppflags="... -cpu_bind=v,map_cpu:0,1,2,3 ..." is the right choice if you run with 4 CPUs per node; if you have huge memory requirements and thus only use every second core, the correct option would be -mppflags="... -cpu_bind=v,map_cpu:0,1 ...".
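Both examples bind N processes per node to the CPUs 0 to N-1, so the option can be generated from the number of processes per node. The following sketch assumes exactly that numbering; cpu_bind_arg is a hypothetical helper name, not part of HP-MPI or STAR-CD.

```shell
#!/bin/bash
# Hypothetical helper: build the -cpu_bind option for N processes per node,
# mapping them to CPUs 0..N-1 as in the two examples above.
cpu_bind_arg() {
    local n=$1 i list=""
    for ((i = 0; i < n; i++)); do
        list+="${list:+,}$i"
    done
    echo "-cpu_bind=v,map_cpu:$list"
}

cpu_bind_arg 4   # 4 processes per node
cpu_bind_arg 2   # only every second core used
```

The result would then be placed inside the -mppflags="..." string together with the other HP-MPI options.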

Running CFX over Infiniband

Running CFX-10 over Infiniband and using user subroutines should work now – of course you must have a valid license!

In principle it works as follows (access to the CFX module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not intended for interactive work.
    If you have to use the RRZE systems for pre/postprocessing for some reason, do not start cfx5pre/cfx5post directly on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    [shell]
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the “-l” is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N CFX-woody
    #… any other PBS option you like

    # let’s go to the directory where the script was submitted
    cd $PBS_O_WORKDIR

    # transfer the list of nodes in a format suitable for CFX
    nodes=`cat $PBS_NODEFILE`
    nodes=`echo $nodes | sed -e 's/ /,/g'`

    # load the (recommended) CFX module
    module add cfx

    # here we go
    cfx5solve -name woody-$PBS_JOBID -size-mms 2.0 -def xyz.def -double \
    -par-dist $nodes -start-method "HP MPI Distributed Parallel for x86 64"
    [/shell]

  • submit your job to the PBS batch system using qsub
  • wait until the job has finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable
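The node-list conversion used in the batch file above can be tried outside of PBS with a stand-in for $PBS_NODEFILE. This is only a sketch: the host names w0101 and w0102 are made up, and the file layout (one line per requested core) is an assumption about the node file for nodes=2:ppn=2.

```shell
#!/bin/bash
# Stand-in for $PBS_NODEFILE: one line per requested core.
# The host names are made up for this sketch.
nodefile=$(mktemp)
printf '%s\n' w0101 w0101 w0102 w0102 > "$nodefile"

# Same conversion as in the batch file: newlines -> spaces -> commas
nodes=`cat $nodefile`
nodes=`echo $nodes | sed -e 's/ /,/g'`
echo "$nodes"

# Equivalent one-step variant using paste
nodes_paste=$(paste -sd, "$nodefile")
echo "$nodes_paste"

rm -f "$nodefile"
```

Both variants produce a single comma-separated line, which is the form handed to -par-dist in the batch file.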

=====================================================================

Some special “tuning” of the PBS script is of course possible, e.g. to stop the simulation shortly before the requested wallclock time is exceeded:

[shell]
# specify the time you want to have to save results, etc.
export TIME4SAVE=600
# directory where CFX will be running
export SIMDIR=woody-${PBS_JOBID}_001.dir
# automatically detect how much time this batch job requested and adjust the
# sleep accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE \
'{if ( $0 ~ /Resource_List.walltime/ ) \
{ split($3,duration,":"); \
print duration[1]*3600+duration[2]*60+duration[3]-t }}' `; \
cfx5stop -dir $SIMDIR ) >& /dev/null &
export SLEEP_ID=$!
cfx5solve -name woody-$PBS_JOBID …
pkill -P $SLEEP_ID
[/shell]

This “add-on” is only for advanced users who hopefully understand what this code does …
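The core of the script is the conversion of the requested walltime (HH:MM:SS) into a sleep duration in seconds, minus the time reserved for saving. That arithmetic can be checked in isolation without qstat; sleep_seconds is a hypothetical helper name used only for this sketch.

```shell
#!/bin/bash
# Hypothetical helper: convert a walltime given as HH:MM:SS into the number
# of seconds to sleep, leaving a reserve (in seconds) for saving results.
sleep_seconds() {
    local walltime=$1 reserve=$2
    echo "$walltime" | awk -F: -v t="$reserve" '{ print $1*3600 + $2*60 + $3 - t }'
}

sleep_seconds 24:00:00 600   # prints 85800
```

With the values from the scripts above (24 hours of walltime, TIME4SAVE=600), cfx5stop would thus be triggered after 85800 seconds.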

UPDATE 2012-10-24: recent Ansys/CFX versions have a command line argument to limit the wall clock time:
-max-elapsed-time <elapsed time>
Set the maximum elapsed time (wall clock time) that the ANSYS CFX Solver will run. Elapsed time must be in quotes and have correct units in square brackets (e.g. -maxet "10 [min]" or -maxet "5 [hr]").
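The value for this option could be derived from the requested PBS walltime in the same spirit as the sleep-based approach. The sketch below is an assumption about how one might do that: maxet_value is a hypothetical helper, it ignores the seconds field, and it always reports minutes.

```shell
#!/bin/bash
# Hypothetical helper: turn a walltime (HH:MM:SS) and a reserve in minutes
# into a value such as "1430 [min]". The seconds field is ignored.
maxet_value() {
    local walltime=$1 reserve_min=$2 h m s
    IFS=: read -r h m s <<< "$walltime"
    # 10# forces base 10 so that fields like "08" are not read as octal
    echo "$(( 10#$h * 60 + 10#$m - reserve_min )) [min]"
}

maxet_value 24:00:00 10   # prints: 1430 [min]

# Possible use in a batch script (not executed here):
#   cfx5solve ... -max-elapsed-time "$(maxet_value 24:00:00 10)"
```

The double quotes around the command substitution keep the number and the unit together as a single argument.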

=====================================================================

Some more technical notes which are not really relevant for pure users … they just document the steps which were required to get everything up and running.

CFX-10 (and its service packs) comes with a bundled version of HP-MPI. No extra license is required to use HP-MPI together with CFX. Just specify
-start-method "HP MPI Distributed Parallel for x86 64"
and HP-MPI automatically selects the fastest interconnect available.
However, the version distributed by Ansys (HP-MPI 2.1-2) does not work together with the Infiniband stack installed on the new Woody cluster or the “old” Infiniband part of the Transtec cluster. Therefore, the current version of HP-MPI (2.2.0.2) must be downloaded from hp.com and installed manually.
CFX-10.0/etc/start-methods.ccl has to be adapted to reflect the new path (unless one does some dirty directory renames/links).

By default, CFX also tries to connect to remote hosts via rsh. Setting the environment variable CFX5RSH to ssh solves this problem. It might also be necessary to set the environment variable MPI_REMSH to ssh so that HP-MPI also uses the correct piece of software. Using PBS-mpiexec probably does not work with HP-MPI; otherwise, some further hacking in CFX-10.0/etc/start-methods.ccl might also be possible.

To force the use of the Infiniband interconnect, the environment variable MPIRUN_OPTIONS may be set to -VAPI – note the capital letters.
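Collected in one place, the environment settings from the last two paragraphs look as follows. Where they are set (for example in the module file or in the batch script before cfx5solve) is an assumption of this sketch.

```shell
#!/bin/bash
# Environment settings described above, collected in one place.
export CFX5RSH=ssh            # let CFX use ssh instead of rsh
export MPI_REMSH=ssh          # let HP-MPI use ssh as well
export MPIRUN_OPTIONS=-VAPI   # force the Infiniband interconnect (capital letters)
```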

To support user subroutines, a PGI Fortran compiler is required to compile the user code once (using cfx5mkext *.F AND cfx5mkext -double *.F). As we do not have the PGI compiler installed on the Woody cluster, compilation has to be done elsewhere. The required PGI runtime libraries were copied (in accordance with the PGI license agreement) from a PGI 6.1.4 installation to the directory CFX-10.0/lib/linux-amd64 where they are found from within CFX without setting any special LD_LIBRARY_PATH.

Using DDT parallel debugger on Woody-Cluster

DDT is a parallel debugger developed and sold by Allinea. Together with the Woody cluster, RRZE bought a 32-process floating license.

Here are some first tips on how to get DDT running in our environment

  • compile your program with debugging information enabled (e.g. -g -fp)
  • login to the cluster frontend (woody.rrze) with ssh and X11-forwarding enabled (e.g. ssh -X or ssh -Y)
  • submit an interactive batch job with X11-forwarding enabled using qsub -I -X -lnodes=#:ppn=4,walltime=##:##:## (# of course has to be replaced by some suitable numbers)
  • wait for the job to start, i.e. until you get a prompt back; you are now on one of the compute nodes
  • start DDT using /apps/ddt/ddt1.10-Suse-10.0-x86_64/bin/ddt; do NOT use an "&" here! We might provide a module for DDT in the future removing the need to give a full path …
  • when you start DDT for the first time, you have to “create a new configuration file”:
    • select “intel mpi” as your MPI implementation (no tests with mvapich/mvapich2 yet)
    • select “do not configure DDT for attaching at this time”
  • in the “session control” select “advanced”
    • enter (select) your executable in the top box (“application”)
    • give arguments in the next line if necessary for your code (not tested yet) – no idea yet about STDIN redirect
    • select the number of processes at the bottom
    • press “change” on the lower right hand side (normally only required the first time you use DDT)
      • tick “submit job through batch or configure own mpirun command”
      • use mpiexec -comm pmi -n NUM_PROCS_TAG /apps/ddt/ddt1.10-Suse-10.0-x86_64/bin/ddt-debugger PROGRAM_ARGUMENTS_TAG as “submit command”. NUM_PROCS_TAG and PROGRAM_ARGUMENTS_TAG get automatically substituted when the command is run.
    • to start your application press “submit”
  • here you are! Set break points, advance in different process groups, etc.

Full documentation can be found in /apps/ddt/ddt1.10-Suse-10.0-x86_64/doc/. Check in particular the userguide.pdf and the quickstart-?.pdf files.

Updates on using DDT: The DDT version numbers mentioned in my original post are of course no longer valid (we arrived at 2.3 in the meantime) and we provide “modules”. Moreover, the recommended way of configuring the custom run command is as follows: specify mpirun -n NUM_PROCS_TAG -ddt PROGRAM_ARGUMENTS_TAG as “submit command” as described above. -ddt will automatically be replaced by the path of the currently loaded ddt-debugger.

Experiences with the Intel 10.0 beta compilers

Intel recently started the beta test for the upcoming Intel 10.0 compilers. According to the release notes, quite a bit has changed …

Here are my first experiences with the Linux version (x86_64 unless noted otherwise):

  • -tpp is now only supported on IA64; in particular, -tpp7 is no longer recognized on x86/x86_64. Since “Pentium4” is the default anyway, this option can simply be omitted. Tuning hints can now be given via -mtune=. The -x and -ax options, by contrast, still exist unchanged (although further letters have been added for new processors).
    -xK == -march=pentium3 and -xW == -march=pentium4
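Existing build scripts that still pass -tpp7 could filter the option out before calling the new compiler. The following is a minimal sketch of such a filter; strip_tpp is a hypothetical helper name, and the flag list in the example is made up.

```shell
#!/bin/bash
# Hypothetical helper: drop any -tpp<N> option from a list of compiler flags
# and print the remaining flags on one line.
strip_tpp() {
    local flag out=()
    for flag in "$@"; do
        case "$flag" in
            -tpp*) ;;                    # skip -tpp7 and similar
            *)     out+=("$flag") ;;
        esac
    done
    printf '%s\n' "${out[*]}"
}

strip_tpp -O3 -tpp7 -xW   # prints: -O3 -xW
```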

LINPACK optimized & start of the stability test

Over the last few weeks, a lot of hard work went into optimizing hardware and software in order to ensure stability and performance …

The effort has paid off, though. The performance of the LINPACK benchmark, now using 185 compute nodes, could be increased to 5.585 TFlop/s. This would correspond to rank 119 on the TOP500 list of last November, where we reached “only” rank 124 with the value from October 2006.

The stability test phase, and with it the final phase of the acceptance procedure, has now officially begun, and test users from various disciplines are gradually being admitted to the new system. Looking at the queues even now, however, there is no need to worry about the utilization of the new system …

Before any major extensions of the HPC systems at RRZE, however, an “optimization” of the cooling/air conditioning of the machine room is due first …