Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Content

some notes on using Intel Trace Collector (ITC/ITA)


Preliminary note: these are just some “random” notes for myself …

  • Intel 10.0 (beta) compilers introduced -tcollect switch
    • inserts instrumentation probes calling the ITC API, i.e. the trace shows function names instead of just “user code”
    • has to be used for both compilation and linking
    • not sure yet whether the ITC libraries have to be specified during linking – but one probably wants to add “$ITC_LIB” (defined by RRZE) anyway
    • works fine (at least in initial tests)
  • source code locations are not included in the trace file by default; compile with -g and set VT_PCTRACE to some value (either ON or the call level to be recorded, with optional skip levels); see ITC-RefGuide Sec. 3.8, p. 19/20 and 95
  • OS counters can be recorded; VT_COUNTER variable has to be set accordingly; see ITC-RefGuide Sec. 3.10, p. 22
  • VT_LOGFILE_NAME can be used to specify the name of the tracefile
  • to change colors, one has to use the “all functions” item, as the color toggle is deactivated in the default “generated groups/major functions” item
  • unfortunately, I have not yet found a “redraw” button – and changing the colors only redisplays the current chart automatically
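
The settings from the list above could be combined in a job script roughly as follows. This is only a minimal sketch, assuming the VT_* options are picked up as environment variables before the traced program starts; the trace file name and the call level of 5 are made-up example values, and the compile line is only hinted at in a comment because the name of the compiler wrapper depends on the local MPI installation.

```shell
#!/bin/bash
# compile AND link with -tcollect (plus -g for source locations), e.g.
#   <your MPI compiler wrapper> -g -tcollect ... $ITC_LIB
# (the wrapper name is site-specific and deliberately left open here)

# record source code locations: either ON or a call level (example: 5)
export VT_PCTRACE=5

# name of the trace file (example value, not a required name)
export VT_LOGFILE_NAME=mytrace.stf
```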

mpiexec + taskset: code snippet to generate config file


We usually use Pete Wyckoff’s mpiexec to start MPI processes from within batch jobs. The following code snippet may be used to pin individual MPI processes to their CPU. A similar snippet could be used to allow round-robin distribution of the MPI processes …

#!/bin/bash
# usage: run.sh 8 "0 2 4 6" ./a.out "-b -c"
#        start a.out with arguments -b -c
#        use 8 MPI processes in total, 4 per node
#        pin processes to CPUs 0,2,4,6 on each node

NUMPROC=$1
TASKCPUS="$2"
EXE=$3
ARGS=$4

TASKS=`echo $TASKCPUS | wc -w`
echo "running $NUMPROC MPI processes with $TASKS per node"

cat /dev/null > cfg.$PBS_JOBID
for node in `uniq $PBS_NODEFILE`; do
   for cpu in $TASKCPUS; do
      echo "$node : taskset -c $cpu $EXE $ARGS" >> cfg.$PBS_JOBID
      NUMPROC=$((NUMPROC - 1))
      if [ $NUMPROC -eq 0 ]; then
        break 2
      fi
   done
done
mpiexec -comm pmi -kill -config cfg.$PBS_JOBID
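
The round-robin variant mentioned above could look roughly like the following. This is a sketch only, not a tested production script: the function name gen_round_robin_cfg and its argument order are hypothetical. It swaps the two loops of the script above so that consecutive MPI ranks land on different nodes, and it prints the config lines instead of writing a file.

```shell
#!/bin/bash
# gen_round_robin_cfg NODEFILE NUMPROC "CPU LIST" EXE [ARGS]
# (hypothetical helper) prints one mpiexec config line per MPI process;
# the CPU loop is the outer loop, so ranks cycle through the nodes first
gen_round_robin_cfg() {
    local nodefile="$1" numproc="$2" cpus="$3" exe="$4" args="$5"
    local node cpu
    for cpu in $cpus; do
        for node in $(uniq "$nodefile"); do
            # stop as soon as all requested processes are placed
            [ "$numproc" -le 0 ] && return 0
            echo "$node : taskset -c $cpu $exe${args:+ $args}"
            numproc=$((numproc - 1))
        done
    done
}
```

Inside a batch job it could be used as gen_round_robin_cfg $PBS_NODEFILE 8 "0 2 4 6" ./a.out "-b -c" > cfg.$PBS_JOBID, followed by the same mpiexec call as above.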

Running STAR-CD over Infiniband

STAR-CD 4.02 works out of the box; there are currently some warnings of the form “ERROR: ld.so: object 'libmpi.so' from LD_PRELOAD cannot be preloaded: ignored.” As these messages seem to be harmless, I’m not sure whether I’ll debug their cause any further.

STAR-CD 3.26 also works out of the box.

User subroutines are not yet tested; some additional steps will be required to get them compiled as the required PGI compiler is not installed locally…

In first tests, star -chkpnt failed with the message TAR checkpoint failed due to invalid "star.pst" file. or TAR checkpoint failed due to invalid "star.ccm" file.

=====================================================================

Using STAR-CD in principle works as follows (access to the STAR-CD module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not intended for interactive work.
    If you have to use the RRZE systems for some reason for pre/postprocessing, do not start prostar, etc. on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the "-l" is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N STARCD-woody
    #... any other PBS option you like
    
    # let's go to the directory where the script was submitted
    cd $PBS_O_WORKDIR
    
    # load the STAR-CD module; either "star-cd/3.26_64bit" or "star-cd/4.02_64bit"
    module add star-cd/3.26_64bit
    
    # here we go
    star -dp `cat $PBS_NODEFILE`
    
  • submit your job to the PBS batch system using qsub
  • wait until the job has finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable

Some more details on the “ERROR: ld.so: object ‘libmpi.so’ from LD_PRELOAD cannot be preloaded: ignored” message: Looking into the script which is actually used to call the STAR-CD binary, I can guess where the message might come from. They use something like $HPMPI/bin/mpirun ... -e LD_PRELOAD=libmpi$PNP_DSO ... -f .starboot.mpi, i.e. they do not specify a path for the library to be preloaded. I’m not sure what the current policy of ld.so from glibc is: does it look at the current LD_LIBRARY_PATH (if set), or does it only look at “secure” (predefined) directories? If the latter is the case, star of course cannot find the library …
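
One way to narrow this down is to check whether the library is reachable through LD_LIBRARY_PATH at all. The following is a small sketch with a hypothetical helper name (find_in_ld_library_path); it does not settle what ld.so itself does, it only reports whether a file of the given name exists in one of the listed directories.

```shell
#!/bin/bash
# find_in_ld_library_path LIBNAME
# (hypothetical helper) prints the first LIBNAME found in the directories
# listed in LD_LIBRARY_PATH; returns 1 if there is no match.
# Empty path entries are skipped here.
find_in_ld_library_path() {
    local lib="$1" dir
    local IFS=:
    for dir in $LD_LIBRARY_PATH; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            echo "$dir/$lib"
            return 0
        fi
    done
    return 1
}
```

For the case above, the call would be find_in_ld_library_path libmpi.so, run in the same environment in which star is started.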

Problems with STAR-CD-4.02 (64-bit) and Infiniband: Running the PGI variant of STAR-CD-4.02 (64-bit) over Infiniband (i.e. using HP-MPI) currently may fail on our new Woody cluster as well as on the Infiniband partition of the Transtec cluster. The observations (currently based on just a single test case) are as follows:

  • STAR-CD-4.02/PGI using HP-MPI/VAPI runs fine for a few iterations but then suddenly stops consuming CPU time on most of the nodes.
  • STAR-CD-4.02/PGI using HP-MPI/TCP or mpich runs fine.
  • STAR-CD-4.02/Absoft runs fine even with HP-MPI/VAPI!

Further tests are currently on their way …

For the moment, module add star-cd/4.02_64bit; star -mpi=hp -mppflags="-v -prot -TCP" is the recommended way of starting STAR-CD-4.02.

Occasional problems of STAR-CD 3.26 with Infiniband as well: According to a user report, STAR-CD 3.26 with HP-MPI over Infiniband also suddenly stops running. The AMG preconditioner seems to be the cause of the problems.

So, check whether Infiniband runs fine for your cases; if not (and only if not), add -mpi=hp -mppflags="-v -prot -TCP"

Upgraded IB stack seems to solve Infiniband problems: Updating from the Voltaire ibhost-stack to the Voltaire GridStack 4.1 (which is OFED-1.1 based) seems to have solved the issue with hanging STAR-CD processes. Please try to run without the argument -TCP!

As a technical note: the HP-MPI version which comes with STAR-CD 4.02 or 3.26 is too old to work with OFED; thus, the latest HP-MPI (i.e. 2.02.05) has been installed on Woody. The module files for STAR-CD have been adapted to automatically use this updated version. Your output (if -v -prot is used) should now show IBV instead of VAPI if the high speed network is used.

Using IPoverIB as fallback: If native Infiniband does not work even with the upgraded IB stack (e.g. due to a bug in connection with AMG), try IP-over-IB by using -mppflags="-v -prot -TCP -netaddr 10.188.84.0/255.255.254.0".
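
The three network variants discussed so far can be summarized in a small helper. This is just a sketch: the function name star_cmd and the mode names ib, tcp and ipoib are made up for illustration, and the function only prints the command line instead of running star.

```shell
#!/bin/bash
# star_cmd MODE
# (hypothetical helper) prints the star command line for one of the
# network variants described above:
#   ib    - native Infiniband (no -TCP)
#   tcp   - plain TCP fallback
#   ipoib - IP-over-IB fallback
star_cmd() {
    local flags="-v -prot"
    case "$1" in
        ib)    ;;
        tcp)   flags="$flags -TCP" ;;
        ipoib) flags="$flags -TCP -netaddr 10.188.84.0/255.255.254.0" ;;
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
    echo "star -mpi=hp -mppflags=\"$flags\""
}
```

For example, star_cmd ipoib prints the IP-over-IB variant, which can then be pasted into the batch file after the module add line.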

Another option available in HP-MPI (which is completely unrelated to the communication network, but I’m too lazy to create another thread right now) is the ability to pin processes to specific CPUs of a node. On woody, -mppflags="... -cpu_bind=v,map_cpu:0,1,2,3 ..." is the right choice if you run with 4 CPUs per node; if you have huge memory requirements and thus only use every second core, the correct option would be -mppflags="... -cpu_bind=v,map_cpu:0,1 ...".

Running CFX over Infiniband

Running CFX-10 over Infiniband and using user subroutines should work now – of course you must have a valid license!

In principle it works as follows (access to the CFX module is restricted by ACLs)

  • prepare your input files on your local machine; the RRZE systems are not intended for interactive work.
    If you have to use the RRZE systems for some reason for pre/postprocessing, do not start cfx5pre/cfx5post directly on the login nodes but submit an interactive batch job using qsub -I -X -lnodes=1:ppn=4,walltime=1:00:00!
  • transfer all input files to the local filesystem on the Woody cluster using SSH (scp/sftp), i.e. copy them to /home/woody1/.../.../...
  • Use a batch file as follows:
    #!/bin/bash -l
    # DO NOT USE #!/bin/sh in the line above as module would not work; also the "-l" is required!
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=24:00:00
    #PBS -N CFX-woody
    #... any other PBS option you like
    
    # let's go to the directory where the script was submitted
    cd $PBS_O_WORKDIR
    
    # transfer the list of nodes in a format suitable for CFX
    nodes=`cat $PBS_NODEFILE`
    nodes=`echo $nodes | sed -e 's/ /,/g'`
    
    # load the (recommended) CFX module
    module add cfx
    
    # here we go
    cfx5solve -name woody-$PBS_JOBID -size-mms 2.0 -def xyz.def -double   \
                -par-dist $nodes -start-method "HP MPI Distributed Parallel for x86 64"
    
  • submit your job to the PBS batch system using qsub
  • wait until the job has finished
  • transfer the required result files to your local PC, analyze the results locally (using your fast graphics card)
  • delete all files you no longer need from the RRZE system as disk space is still valuable
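
The node-list conversion used in the batch file above can be tried outside of a batch job with any text file that lists one host name per line. The sketch below uses paste instead of the echo/sed combination, which is a swapped-in alternative and not the form used in the script; the function name nodes_csv is hypothetical.

```shell
#!/bin/bash
# nodes_csv NODEFILE
# (hypothetical helper) prints all lines of NODEFILE as a single
# comma-separated line, the format passed to -par-dist above
nodes_csv() {
    paste -sd, "$1"
}
```

Within the batch file this would read nodes=$(nodes_csv $PBS_NODEFILE).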

=====================================================================

Some special “tuning” of the PBS script is of course possible, e.g. to stop the simulation just before the wallclock time is exceeded at the latest:

# specify the time you want to have to save results, etc.
export TIME4SAVE=600
# directory where CFX will be running
export SIMDIR=woody-${PBS_JOBID}_001.dir
# automatically detect how much time this batch job requested and adjust the
# sleep accordingly
( sleep ` qstat -f $PBS_JOBID | awk -v t=$TIME4SAVE                   \
    '{if ( $0 ~ /Resource_List.walltime/ )                            \
        { split($3,duration,":");                                     \
          print duration[1]*3600+duration[2]*60+duration[3]-t }}' `;  \
  cfx5stop -dir $SIMDIR ) >& /dev/null  &
export SLEEP_ID=$!
cfx5solve -name woody-$PBS_JOBID ...
pkill -P $SLEEP_ID

This “add-on” is only for advanced users who hopefully understand what this code does …
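
The core of the awk expression above is the conversion of the requested walltime (HH:MM:SS) into seconds. It can be checked in isolation with a small helper; the function name walltime_to_seconds is hypothetical and the snippet is only a sketch of that one step.

```shell
#!/bin/bash
# walltime_to_seconds HH:MM:SS
# (hypothetical helper) converts a PBS walltime string into seconds,
# using the same arithmetic as the awk expression above
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}
```

With the 24:00:00 walltime from the batch file and TIME4SAVE=600, the background sleep would last $(( $(walltime_to_seconds 24:00:00) - 600 )) seconds before cfx5stop is called.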

UPDATE 2012-10-24: recent Ansys/CFX versions have a command line argument to limit the wall clock time:
-max-elapsed-time <elapsed time>
Set the maximum elapsed time (wall clock time) that the ANSYS CFX Solver will run. Elapsed time must be in quotes and have correct units in square brackets (e.g. -maxet "10 [min]" or -maxet "5 [hr]").
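
The value for this option could be derived from the requested walltime minus a safety margin. The following is a minimal sketch with a hypothetical helper name (maxet_arg); it takes both times in seconds and rounds down to whole minutes, using the "[min]" unit from the help text quoted above.

```shell
#!/bin/bash
# maxet_arg WALLTIME_SECONDS RESERVE_SECONDS
# (hypothetical helper) prints a value for -max-elapsed-time in whole
# minutes, leaving RESERVE_SECONDS for writing results
maxet_arg() {
    echo "$(( ($1 - $2) / 60 )) [min]"
}
```

A call such as cfx5solve ... -max-elapsed-time "$(maxet_arg 86400 600)" would then replace the sleep/cfx5stop construction; the double quotes around the command substitution are required because the value contains a space.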

=====================================================================

Some more technical notes which are not really relevant for pure users … it just documents the steps which were required to get everything up-and-running.

CFX-10 (and its service packs) comes with a bundled version of HP-MPI. No extra license is required to use HP-MPI together with CFX. Just specify
-start-method "HP MPI Distributed Parallel for x86 64"
and HP-MPI automatically selects the fastest interconnect available.
However, the version distributed by Ansys (HP-MPI 2.1-2) does not work together with the Infiniband stack installed on the new Woody cluster or the “old” Infiniband part of the Transtec cluster. Therefore, the current version of HP-MPI (2.2.0.2) must be downloaded from hp.com and installed manually.
CFX-10.0/etc/start-methods.ccl has to be adapted to reflect the new path (unless one does some dirty directory renames/links).

By default, CFX also tries to connect to remote hosts via rsh. Setting the environment variable CFX5RSH to ssh solves this problem. It might also be necessary to set the environment variable MPI_REMSH to ssh so that HP-MPI also uses the correct piece of software. Using PBS-mpiexec probably does not work with HP-MPI; otherwise, some further hacking in CFX-10.0/etc/start-methods.ccl might also be possible.

To force the use of the Infiniband interconnect, the environment variable MPIRUN_OPTIONS may be set to -VAPI – note the capital letters.
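
Put together, the environment settings mentioned in the last two paragraphs would look like this in a job script or module file, placed before the cfx5solve call. This is only a sketch of the variables named above; whether all three are needed depends on the installation.

```shell
#!/bin/bash
# let CFX use ssh instead of rsh to reach remote hosts
export CFX5RSH=ssh
# let HP-MPI use ssh as well
export MPI_REMSH=ssh
# force the Infiniband interconnect (note the capital letters)
export MPIRUN_OPTIONS="-VAPI"
```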

To support user subroutines, a PGI Fortran compiler is required to compile the user code once (using cfx5mkext *.F AND cfx5mkext -double *.F). As we do not have the PGI compiler installed on the Woody cluster, compilation has to be done elsewhere. The required PGI runtime libraries were copied (in accordance with the PGI license agreement) from a PGI 6.1.4 installation to the directory CFX-10.0/lib/linux-amd64, where they are found from within CFX without setting any special LD_LIBRARY_PATH.

Using DDT parallel debugger on Woody-Cluster

DDT is a parallel debugger developed and sold by Allinea. Together with the Woody cluster, RRZE bought a floating license for 32 processes.

Here are some first tips on how to get DDT running in our environment

  • compile your program with debugging information enabled (e.g. -g -fp)
  • log in to the cluster frontend (woody.rrze) with ssh and X11-forwarding enabled (e.g. ssh -X or ssh -Y)
  • submit an interactive batch job with X11-forwarding enabled using qsub -I -X -lnodes=#:ppn=4,walltime=##:##:## (# of course has to be replaced by some suitable numbers)
  • wait for the job to start, i.e. until you get a prompt back; you are now on one of the compute nodes
  • start DDT using /apps/ddt/ddt1.10-Suse-10.0-x86_64/bin/ddt; do NOT use an "&" here! We might provide a module for DDT in the future removing the need to give a full path …
  • when you start DDT for the first time, you have to “create a new configuration file”:
    • select “intel mpi” as your MPI implementation (no tests with mvapich/mvapich2 yet)
    • select “do not configure DDT for attaching at this time”
  • in the “session control” select “advanced”
    • enter (select) your executable in the top box (“application”)
    • give arguments in the next line if necessary for your code (not tested yet) – no idea yet about STDIN redirect
    • select the number of processes at the bottom
    • press “change” on the lower right hand side (normally only required the first time you use DDT)
      • tick “submit job through batch or configure own mpirun command”
      • use mpiexec -comm pmi -n NUM_PROCS_TAG /apps/ddt/ddt1.10-Suse-10.0-x86_64/bin/ddt-debugger PROGRAM_ARGUMENTS_TAG as “submit command”. NUM_PROCS_TAG and PROGRAM_ARGUMENTS_TAG get automatically substituted when the command is run.
    • to start your application press “submit”
  • here you are! Set break points, advance in different process groups, etc.

Full documentation can be found in /apps/ddt/ddt1.10-Suse-10.0-x86_64/doc/. Check in particular the userguide.pdf and the quickstart-?.pdf files.

Updates on using DDT: The DDT version numbers mentioned in my original post are of course no longer valid (we arrived at 2.3 in the meantime) and we provide “modules”. Moreover, the recommended way of configuring the custom run command is as follows: specify mpirun -n NUM_PROCS_TAG -ddt PROGRAM_ARGUMENTS_TAG as “submit command” as described above. -ddt will automatically be replaced by the path of the currently loaded ddt-debugger.

LINPACK optimized & start of the stability test

Over the last few weeks, a lot of hard work went into optimizing hardware and software in order to ensure stability and performance …

The effort has paid off, though. The performance of the LINPACK benchmark, now using 185 compute nodes, was raised to 5.585 TFlop/s. This would correspond to rank 119 on the TOP500 list of last November, where we reached “only” rank 124 with the value from October 2006.

The stability test phase, and thus the final phase of the acceptance procedure, has now officially begun, and test users from different disciplines are gradually being admitted to the new system. Looking at the queues even now, however, there is no need to worry about the utilization of the new system …

Before any larger expansion of the HPC systems at RRZE, however, an “optimization” of the cooling/air conditioning of the machine room is due first …

First MPI PingPong benchmark data

More benchmark numbers again today: MPI PingPong results with different MPI implementations on the new Woodcrest cluster and, for comparison, a few data points from the old “Cluster32”. In short, DDR Infiniband pays off considerably in the PingPong benchmark, and the tested MPI variants differ only minimally. IPoverIB is currently very disappointing. The old Cluster32 with the Transtec settings of the TCP stack is surprisingly fast, in particular regarding the latencies for very small messages.

MPI Ping-Pong

Day 1 of the Woody installation - Off to new worlds, or the battle with the elevator

… after a long wait, the time has finally come: 8.6 TFlop/s of compute power are rolling in: the new HPC cluster from Bechtle/HP with Intel Woodcrest CPUs for the HPC crew at the University of Erlangen. Reports on the contract signing already appeared in the Erlanger Nachrichten as well as in the VDI-Nachrichten. A short technical overview can be found in the current BI (BenutzerInformation).

But it is a long way to the machine room (on the first upper floor), with quite a few obstacles: an elevator stuck on the 12th floor and elevator technicians who need hours to get to FAU. The racks are a few centimeters too large to fit into the two small elevators. Extremely annoying staff of the computer science department who cannot or will not understand that, with such a delivery and a broken elevator, some bulky items (have to) stand somewhat in the “way”.

Hence, the machine room is still yawningly empty …

Ah, our new machine? No, not really, but perhaps the users of the day after tomorrow! School class@RRZE

And even with the elevator working and fortified by lunch, it is hard work

A new day? Not yet, just power-saving measures …

Slowly, things are really starting to happen

although a rack weighing around 800 kg is not that easy to maneuver

Problems? Not really since the elevator 🙂 And all uncertainties about the load-bearing capacity of the floor tiles can be quickly dispelled.

… and fairly quickly, most of the hardware is then where it is supposed to be.

But of course that is far from the end of the work. Aligning the racks properly, bolting them together, and a “little” configuration work … and the next day is sure to come.

It will probably still take some time until regular user operation, though 🙂

Completion of assembly and configuration, performance verification, stability test, RRZE fine-tuning, etc. are all still to come … and not to forget the documentation on the RRZE web pages, so that not every user calls us right away