Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Content

Using DDT parallel debugger on Woody-Cluster

DDT is a parallel debugger developed and sold by Allinea. Together with the Woody cluster, RRZE bought a 32-process floating license.

Here are some first tips on how to get DDT running in our environment:

  • compile your program with debugging information enabled (e.g. -g -fp)
  • login to the cluster frontend (woody.rrze) with ssh and X11-forwarding enabled (e.g. ssh -X or ssh -Y)
  • submit an interactive batch job with X11-forwarding enabled using qsub -I -X -lnodes=#:ppn=4,walltime=##:##:## (# of course has to be replaced by some suitable numbers)
  • wait for the job to start, i.e. until you get a prompt back; you are now on one of the compute nodes
  • start DDT using /apps/ddt/ddt1.10-Suse-10.0-x86_64/bin/ddt; do NOT use an "&" here! We might provide a module for DDT in the future removing the need to give a full path …
  • when you start DDT for the first time, you have to “create a new configuration file”:
    • select “intel mpi” as your MPI implementation (no tests with mvapich/mvapich2 yet)
    • select “do not configure DDT for attaching at this time”
  • in the “session control” select “advanced”
    • enter (select) your executable in the top box (“application”)
    • give arguments in the next line if necessary for your code (not tested yet) – no idea yet about STDIN redirect
    • select the number of processes at the bottom
    • press “change” on the lower right hand side (normally only required the first time you use DDT)
      • tick “submit job through batch or configure own mpirun command”
      • use mpiexec -comm pmi -n NUM_PROCS_TAG /apps/ddt/ddt1.10-Suse-10.0-x86_64/bin/ddt-debugger PROGRAM_ARGUMENTS_TAG as “submit command”. NUM_PROCS_TAG and PROGRAM_ARGUMENTS_TAG get automatically substituted when the command is run.
    • to start your application press “submit”
  • here you are! Set breakpoints, step through the different process groups, etc. (The essential shell commands are summarized in the sketch below.)
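
For reference, the shell side of the recipe above condenses to roughly the following (a sketch only; the compiler call, node count, walltime and program name are placeholders):

    # compile with debugging information enabled
    ifort -g -o a.out my_source.f90
    # log in to the cluster frontend with X11 forwarding
    ssh -X woody.rrze
    # request an interactive batch job with X11 forwarding
    qsub -I -X -lnodes=2:ppn=4,walltime=01:00:00
    # once the prompt is back you are on a compute node: start DDT (no "&"!)
    /apps/ddt/ddt1.10-Suse-10.0-x86_64/bin/ddt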

Full documentation can be found in /apps/ddt/ddt1.10-Suse-10.0-x86_64/doc/. Check in particular the userguide.pdf and the quickstart-?.pdf files.

Updates on using DDT: The DDT version numbers mentioned in my original post are of course no longer valid (we have arrived at 2.3 in the meantime), and we now provide “modules”. Moreover, the recommended way of configuring the custom run command has changed: specify mpirun -n NUM_PROCS_TAG -ddt PROGRAM_ARGUMENTS_TAG as “submit command” as described above. -ddt is automatically replaced by the path of the currently loaded ddt-debugger.
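
With the modules in place, the updated workflow looks roughly like this (a sketch; the exact module name is an assumption):

    # on a compute node inside an interactive batch job
    module load ddt        # exact module name may differ
    ddt
    # and as "submit command" in DDT's run dialogue:
    #   mpirun -n NUM_PROCS_TAG -ddt PROGRAM_ARGUMENTS_TAG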

LINPACK optimized & start of the stability test

The last few weeks involved a lot of hard work, and quite a bit of hardware and software was tuned in order to guarantee stability and performance …

The effort has paid off, though: the performance of the LINPACK benchmark, now using 185 compute nodes, could be increased to 5.585 TFlop/s. This would correspond to rank 119 on last November's TOP500 list, where our October 2006 result earned us “only” rank 124.

The stability test phase, and with it the final phase of the acceptance procedure, has now officially begun, and test users from various disciplines are gradually being given access to the new system. Looking at the queues already now, there is certainly no need to worry about the utilization of the new system …

Before any larger extensions of the HPC systems at RRZE, however, the cooling/air conditioning of the computer room will first have to be “optimized” …

First MPI PingPong benchmark data

More benchmark numbers today: MPI PingPong results with different MPI implementations on the new Woodcrest cluster, compared with a few data points from the old “Cluster32”. In short, DDR InfiniBand gains quite a lot in the PingPong benchmark, while the tested MPI variants differ only marginally. IP over IB is very disappointing at the moment. The old Cluster32 with the Transtec settings of the TCP stack is surprisingly fast, in particular regarding the latencies for very small messages.

MPI Ping-Pong

Day 1 of the Woody installation – off to new worlds, or the battle with the elevator

… after a long wait the time has finally come: 8.6 TFlop/s of compute power are rolling in: the new HPC cluster from Bechtle/HP with Intel Woodcrest CPUs for the HPC crew at the University of Erlangen. The signing of the contract was already reported in the Erlanger Nachrichten as well as in the VDI-Nachrichten. A short technical overview can be found in the current issue of the BI (BenutzerInformation).

But it is a long way to the computer room (1st floor), with quite a few hurdles: an elevator stuck on the 12th floor and elevator technicians who need hours to get to FAU; racks that miss fitting into the two small elevators by a few centimeters; and terribly annoying computer-science staff who cannot or will not understand that, with such a delivery and a broken elevator, some bulky parts simply have to stand “in the way” for a while.

The computer room, therefore, is still gapingly empty …

Ah, our new computer? No, not really, but perhaps the users of the day after tomorrow: Schulklasse@RRZE (a school class visiting RRZE).

And even with a working elevator and fortified by lunch, it is hard going.

A new day? Not yet, but power-saving measures …

Slowly but surely, things are really starting to happen

even though a rack weighing around 800 kg is not that easy to maneuver

Problems? Not really, not since the elevator 🙂 And all doubts about the load-bearing capacity of the floor panels can be dispelled quickly.

… and fairly quickly the hardware is then largely in the place where it belongs.

That, of course, is far from the end of the work: aligning the racks properly, bolting them together, and a “little” bit of configuration work … and the next day is sure to come.

It will still be quite some time until regular user operation, though 🙂

Completing assembly and configuration, performance verification, the stability test, RRZE fine-tuning, etc. is all still ahead of us … and not to forget the documentation on the RRZE web pages, so that not every user immediately calls us.

MpCCI 3.0.5 released — but not yet working at RRZE

A new version of MpCCI (3.0.5) has been released. See www.mpcci.org for more details. Many things have changed — in particular, the MpCCI library itself should now be much better encapsulated so that it is completely independent of the MPI version used within the application program.

However, owing to licensing problems, the new version could not be tested on the HPC systems of RRZE right away … The solution turned out to be rather trivial: the license file did not contain the FQN of the server, and therefore checking out or even testing the license from a remote host did not work if the FQN was used there. The fix was simply to replace the shortened hostname by the full hostname (including the domain), or alternatively by the IP address. After restarting the license server, the licensing problems disappeared.
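
For illustration only (assuming a FLEXlm-style license file; the hostname, hostid and port below are made up), the change amounts to something like:

    # before: short hostname -- checking out the license via the FQN fails
    SERVER license1 001122334455 27000
    # after: full hostname including the domain (or alternatively the IP address)
    SERVER license1.rrze.uni-erlangen.de 001122334455 27000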

In the meantime (early August 2006), MpCCI 3.0.5 is also available on the LRZ machines …

However, there still seem to be some problems on the application side — probably not a big surprise, as the internal structure of MpCCI 3.0.5 and the API have completely changed …

… especially on SGI Altix systems (or other machines where CPUsets are used), a start mechanism based on ssh cannot be used, as the CPUsets would be escaped and “batch” processes would start running on the interactive/boot CPUsets …

MpCCI on SGI Altix (3)

Tests of using the client-server mode with SGI MPT … — a never-ending tragedy:

  • link your application with: ccilink -client -nompilink ..... -lmpi
  • produce the procgroup file with ccirun -server -norun ex2.cci
  • run your simulation:
    /opt/MpCCI/mpiexec/bin/mpiexec -server &
    /opt/MpCCI/mpiexec/bin/mpiexec -config=ccirun_server.procgroup >& s & x1 & x2 < /dev/null
    sleep 10
    
  • The number of CPUs requested must be at least as large as the number of server processes, i.e. those started with mpiexec (see the job-script sketch below).

    If you use mpich instead of MPT, all processes have to be started with mpiexec. As a consequence, the number of requested CPUs must be equal to the total number of processes. The PBS-TM interface used by mpiexec does not allow overbooking CPUs.
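
    Put together in a PBS job script, the client-server attempt above would look roughly as follows (only a sketch: the executable names x1 and x2, the CPU count, the log file name and the sleep time are placeholders):

        #!/bin/bash
        #PBS -l ncpus=4,walltime=01:00:00
        cd $PBS_O_WORKDIR
        # start the MpCCI coupling server processes (procgroup from "ccirun -server -norun")
        /opt/MpCCI/mpiexec/bin/mpiexec -server &
        /opt/MpCCI/mpiexec/bin/mpiexec -config=ccirun_server.procgroup >& server.log &
        sleep 10
        # start the two coupled client codes (linked with "ccilink -client -nompilink ... -lmpi")
        ./x1 &
        ./x2 < /dev/null
        wait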

    … after many hours of testing: IT SEEMS THAT MpCCI WITH SGI MPT DOES *NOT* WORK RELIABLY AT ALL … mpi_comm_rank==0 on all processes 🙂 despite using the correct mpif.h files and mpiruns for the server and client applications.

    My current conclusions:

    • MpCCI does not support SGI MPT natively
    • using mpich on SGI Altix for all communications is NO option as benchmarks showed that CFD applications are slower by a factor of 2 or more when using mpich instead of MPT
    • using MpCCI in client-server mode also seems not to work (see above)

    That means MpCCI is not usable on SGI Altix at all. Sorry for those who rely on it.

SGI Altix extension

Recently, the SGI Altix at RRZE has been extended. We now have a batch-only system altix-batch (an SGI Altix 3700 with 32 CPUs and 128 GB shared memory) and a front-end system altix (an SGI Altix 330 with 16 CPUs and 32 GB shared memory; 4 CPUs + 8 GB are used as login partition (boot cpuset) – the remaining ones are also used for batch processing).

An important thing to note is that the new machine has only half the amount of memory per CPU compared to the “old” one. As the cpusets introduced with SuSE SLES9/SGI ProPack 4.x do not have all the features known from the old SGI Origin cpusets (in particular, “policy kill” is missing), the system starts swapping as soon as one process exceeds the amount of memory available in its cpuset. As a result, the complete system becomes unresponsive.

It is therefore very important to request the correct machine, or to specify the amount of memory required in addition to the number of CPUs. The amount of interactive work possible is also much more limited now, as the login partition (boot cpuset) only has access to 4 CPUs and 8 GB of memory!
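
For example (the exact resource syntax depends on our PBS Professional setup, so check the web page mentioned below; the numbers are placeholders):

    # request 8 CPUs together with 16 GB of memory
    qsub -l ncpus=8,mem=16gb job.sh
    # or explicitly select one of the two machines
    qsub -l host=altix-batch -l ncpus=8 job.sh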

Check the official web page of the SGI Altix systems at RRZE for more details and the correct syntax for specifying resource requirements.

MpCCI on SGI Altix (part 2)

MpCCI is a [commercial] tool (basically a library) which allows coupling different numerical codes. MpCCI is only distributed in binary form and depends on a number of other tools, in particular on MPI. For the Itanium architecture, only an mpich-based version is currently provided.
Using a standard MPICH (with the p4 device) on SGI Altix is a rather bad idea, as the ssh-based start mechanism does not respect CPUsets, proper clean-up is not guaranteed, etc.

Due to the problems related to the ssh-based start mechanism of a standard ch_p4 MPICH, the corresponding mpirun was removed on Sept. 14, 2005! This guarantees better stability of our SGI Altix system, but requires some additional steps for users of MpCCI:

  1. load the module mpcci/3.0.3-ia64-glibc23 or mpcci/3.0.3-ia64-glibc23-intel9 as usual (I hope both still work fine)
  2. compile your code as usual (and as in the past) – MPICHHOME and MPIROOTDIR are automatically set by the MpCCI module
  3. create your MpCCI specific input files
  4. interactively run ccirun -log -norun xxx.cci
  5. edit the generated ccirun.procgroup file:
    • on the first line, you have to add --mpcci-inputfile ccirun.inputfile (see the last line of the ccirun output)
    • on all lines you have to replace the number (either 0 or 1) after the hostname by a colon.
    • a complete ccirun.procgroup file now might look like
      altix : some-path/ccirun.cci-control.0.spawn --mpcci-inputfile some-path/ccirun.inputfile
      altix : some-path/ccirun.fhp.itanium_mpccimpi.1.spawn
      altix : some-path/ccirun.Binary.2.spawn
      
  6. now prepare your PBS job file; use the following line to start your program – it replaces the previous mpirun line!
    /opt/MpCCI/mpiexec-0.80/bin/mpiexec -mpich-p4-no-shmem -config=ccirun.procgroup
    
  7. and submit your job. The number of CPUs you request must be equal to (or larger than) the number of processes you start, i.e. you have to count the MpCCI control process! (A minimal job file sketch follows below.)
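
A minimal job file along these lines might look as follows (a sketch; the CPU count matches the three-process procgroup example above, the walltime is a placeholder):

    #!/bin/bash
    #PBS -l ncpus=3,walltime=04:00:00
    cd $PBS_O_WORKDIR
    # start all processes listed in ccirun.procgroup (including the MpCCI
    # control process) through the PBS-TM interface of mpiexec
    /opt/MpCCI/mpiexec-0.80/bin/mpiexec -mpich-p4-no-shmem -config=ccirun.procgroup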

Some additional remarks:

  • it is not clear at the moment whether the runtime of such jobs can be extended once they are submitted/running. We will probably have to check this on an actual run …
  • if your code reads from STDIN, you need an additional step to get it working again (see the example after this list):
    • if you have something like read(*,*) x or read *,x, you have to set the environment variable FOR_READ to the file which contains the input
    • if you have something like read(5,*) x or read 5,x, you have to set the environment variable FORT5 to the file which contains the input
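
For example, in the job script before the mpiexec line (the input file name is a placeholder):

    # READ(*,*) / READ *,...  reads from STDIN: redirect via FOR_READ
    export FOR_READ=my_input.dat
    # READ(5,*) / READ 5,...  reads from unit 5: redirect via FORT5
    export FORT5=my_input.dat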

Two additional remarks – ccirun.procgroup for mpiexec:

  • for some reason, it seems to be necessary to use only the short hostname (e.g. “altix”) instead of the fully qualified hostname (e.g. altix.rrze.uni-erlangen.de)
  • with some applications, the first line in the procgroup file must be “altix : some-path/ccirun.cci-control.0.spawn --mpcci-inputfile ...”; with other applications, this line must be omitted (and the option “--mpcci-inputfile ...” has to be passed to the first actual executable)

Additional MpCCI remarks: … as we now have two different SGI Altix systems in our batch system, you either have to explicitly request one host using -l host=altix or -l host=altix-batch, or you have to generate the config file for mpiexec dynamically.
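
One possible way to generate the procgroup file dynamically at job start is to substitute the short hostname of the execution host into a template (a sketch; the template file and its HOST placeholder are assumptions, cf. the remark above about short hostnames):

    # replace the HOST placeholder by the short hostname of the node the job runs on
    sed "s/^HOST /$(hostname -s) /" ccirun.procgroup.in > ccirun.procgroup
    /opt/MpCCI/mpiexec/bin/mpiexec -config=ccirun.procgroup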

In addition, mpiexec has been upgraded to a newer version. Just use /opt/MpCCI/mpiexec/bin/mpiexec to always get the latest version. -mpich-p4-no-shmem is no longer necessary as it is now the compiled-in default.

MpCCI on SGI Altix

MpCCI is a library for coupling different codes (e.g. fluid mechanics and aeroacoustics) and exchanging mesh-based data between them. It is developed and sold by the Fraunhofer Institute SCAI (http://www.scai.fraunhofer.de/mpcci.html).

MpCCI is available for several platforms and relies on MPI for communication.

The good news: MpCCI SDK is available for IA64.
The bad news: it relies on mpich

On our SGI Altix system we use PBS Professional as the batch queuing system, and each running job gets its own CPUset.

When an MpCCI job is now started, a procgroup file is generated and the processes are started via ssh. And that is exactly the problem: the sshd daemon (started by root at boot time) runs outside the CPUsets. Consequently, all processes started via ssh also end up outside the allocated cpuset … 🙂

Solutions?
* shared-memory mpich does not work, as the shm device of mpich does not work with MPMD, i.e. a procgroup file is not supported
* using SGI MPT (SGI’s own MPI) does not work as the binary-only MpCCI library relies on some mpich symbols
* starting the code with mpiexec does not work as there are some problems with accessing stdin from within the application
* …

Performance problems of parallel AMBER7/sander runs – SOLVED

Since the kernel update on the cluster we have been seeing severe performance problems when running a parallel sander on more than 4 CPUs. The effect cannot yet be explained. Further investigations are under way … see the comments for the solution.

Reasonable performance after update of switch firmware: After updating/upgrading the firmware of the HP ProCurve 2848 switches, the performance of parallel AMBER runs is reasonable again. Tests with 16 CPUs (8 nodes) and a few with 32 CPUs (16 nodes) show reasonable runtimes again. Strange that the performance with the old switch firmware depended so strongly on the version of the Linux kernel on the compute nodes …