Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

mutt and smime (encryption/signing)
After receiving my Grid certificate, I wanted to use it for signing and encrypting mails as well. With Thunderbird this is no problem – just import it into your keystore. However, with my favorite mail tool “mutt” I did not manage to get the complete certificate chain in at first 🙂

Keystore with complete certificate chain required:

Once the complete certificate chain is available in the keystore, smime_keys add_p12 KEYSTORE works fine. See
http://www.rrze.uni-erlangen.de/dienste/arbeiten-rechnen/hpc/grid/zertifikate.shtml for a detailed description of how to create a keystore with the complete certificate chain.
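
The RRZE page describes the exact procedure; as a rough generic sketch (the file names usercert.pem, userkey.pem and ca-chain.pem are only assumptions for your certificate, private key and the concatenated CA chain), such a PKCS#12 keystore can be built with openssl:
      # build a PKCS#12 keystore that also contains the CA chain
      openssl pkcs12 -export -in usercert.pem -inkey userkey.pem \
              -certfile ca-chain.pem -out keystore.p12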

However, you still need to add the CA certs with smime_keys add_root PCA.pem and smime_keys add_root UserCA.pem before add_p12 succeeds.
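
Putting it together, the import sequence looks roughly like this (keystore.p12 is just an assumed name for the keystore created above; PCA.pem and UserCA.pem are the CA certificates):
      # import the CA certificates first
      smime_keys add_root PCA.pem
      smime_keys add_root UserCA.pem
      # then import the keystore containing key, certificate and complete chain
      smime_keys add_p12 keystore.p12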

Further references:

  • http://wiki.austriangrid.at/index.php/Certificate_installation_in_mutt
  • http://kb.wisc.edu/middleware/page.php?id=4091
  • and a little patch to make smime_keys work with certain mutt versions: http://launchpadlibrarian.net/20204370/smime_keys-1.5.17%2B20080114-1ubuntu1.diff

GRID-RA at RRZE

RRZE will soon become a GRID-RA within the DFN-PKI

Once all organisational requirements are fulfilled at RRZE, the “personal identification” for user/server certificates according to the EUGridPMA standard can be obtained from the HPC group of RRZE for O=GridGermany/OU=Universitaet Erlangen-Nuernberg.

Further details will be announced once everything has been negotiated with DFN – hopefully no later than the end of March.

GRID-RA@RRZE in beta test: After considerable organisational effort, the GRID-RA@RRZE has reached its beta test phase. See http://www.rrze.uni-erlangen.de/dienste/arbeiten-rechnen/hpc/grid/ for details about the procedure for obtaining a certificate. If there are any questions, just contact me.

Globus GT4 Installation

We are still working on the upgrade of our grid gateway! GT2.4 is no longer available, and the installation of GT4 is not yet finished. This effectively means that RRZE currently does not have a working grid gateway. Please be patient! GT4 will probably not be available before the end of March.

… but as emails from other (larger German) sites show, we are not the only ones who do not have a running GT4 installation yet …

Availability of GT4 at RRZE will depend on the support and detailed getting-started instructions (in particular on the integration of GT4 and PBS) from LRZ, which is responsible for the Globus Toolkit within D-Grid.

Using STAR-CD with USER-routines on Cluster32

Using STAR-CD with USER-routines on Cluster32 is a little bit tricky for the following reason: only a limited environment is available on the compute nodes; in particular, no compilers or linkers are available. Therefore, it is necessary to compile the user routines before submitting the job to PBS (but after all input files are available).

Therefore, proceed as follows:

  1. prepare all input files and copy them to the directory where you plan to run the simulation
  2. execute the following commands in an interactive shell (i.e. a login shell):
            module add star-cd/XXXX
            star [-dp] -ufile
    

    The optional flag -dp is necessary if you plan to use double precision.
    The star -ufile line compiles your user routines on the login node, where a complete development environment is available.

  3. Now you can submit your job; the simulation can then run on the compute nodes without needing a compiler. A minimal example job file is sketched below.
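
A minimal sketch of such a PBS job file (the module version, the resource request and the way the parallel run is actually launched are placeholders/assumptions and depend on the local setup and your case):
      #!/bin/bash
      #PBS -l nodes=4:ppn=2
      #PBS -l walltime=24:00:00

      # run in the directory prepared in step 1 (input files plus the
      # user routines pre-compiled in step 2)
      cd $PBS_O_WORKDIR

      # same STAR-CD module as used interactively for "star -ufile"
      module add star-cd/XXXX

      # start the simulation with the pre-compiled user routines;
      # add -dp here if double precision was used in step 2
      star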

MpCCI on SGI Altix (part 2)

MpCCI is a commercial tool (basically a library) which allows coupling different numerical codes. MpCCI is only distributed in binary form and depends on a number of other tools, in particular on MPI. For the Itanium architecture, only an mpich-based version is currently provided.
Using a standard MPICH (with the p4 device) on SGI Altix is a rather bad idea, as the ssh-based start mechanism does not respect CPU sets, proper clean-up is not guaranteed, etc.

Due to the problems related to the ssh-based start mechanism of a standard ch_p4 MPICH, the corresponding mpirun was removed on Sept. 14, 2005! This guarantees better stability of our SGI Altix system, but requires some additional steps for users of MpCCI:

  1. load the module mpcci/3.0.3-ia64-glibc23 or mpcci/3.0.3-ia64-glibc23-intel9 as usual (I hope both still work fine)
  2. compile your code as usual (and as in the past) – MPICHHOME and MPIROOTDIR are automatically set by the MpCCI module
  3. create your MpCCI specific input files
  4. interactively run ccirun -log -norun xxx.cci
  5. edit the generated ccirun.procgroup file:
    • on the first line, you have to add --mpcci-inputfile ccirun.inputfile (see the last line of the ccirun output)
    • on all lines, you have to replace the number (either 0 or 1) after the hostname with a colon.
    • a complete ccirun.procgroup file now might look like
      altix : some-path/ccirun.cci-control.0.spawn --mpcci-inputfile some-path/ccirun.inputfile
      altix : some-path/ccirun.fhp.itanium_mpccimpi.1.spawn
      altix : some-path/ccirun.Binary.2.spawn
      
  6. now prepare your PBS job file; use the following line to start your program – it replaces the previous mpirun line!
    /opt/MpCCI/mpiexec-0.80/bin/mpiexec -mpich-p4-no-shmem -config=ccirun.procgroup
    
  7. and submit your job. The number of CPUs you request must be equal to (or larger than) the number of processes you start, i.e. you have to count the MpCCI control process as well! A complete job file sketch follows below.
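
For reference, a complete PBS job file might look roughly like this (walltime, module version and the exact resource-request syntax are placeholders/assumptions; ncpus=3 matches the three-line procgroup example above, i.e. the MpCCI control process plus two codes):
      #!/bin/bash
      #PBS -l ncpus=3
      #PBS -l walltime=12:00:00

      cd $PBS_O_WORKDIR
      module add mpcci/3.0.3-ia64-glibc23

      # replaces the previous mpirun line
      /opt/MpCCI/mpiexec-0.80/bin/mpiexec -mpich-p4-no-shmem -config=ccirun.procgroup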

Some additional remarks:

  • it is not clear at the moment whether the runtime of such jobs can be extended once they are submitted/running. We’ll probably have to check this on an actual run …
  • if your application reads from STDIN, you need an additional step to get it working again (see the example below):
    • if you have something like read(*,*) x or read *,x you have to set the environment variable FOR_READ to the file which contains the input
    • if you have something like read(5,*) x or read 5,x you have to set the environment variable FORT5 to the file which contains the input
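
For example (input.dat is just an assumed name for the file that was previously fed to STDIN):
      # for read(*,*) x  or  read *,x
      export FOR_READ=input.dat
      # for read(5,*) x  or  read 5,x
      export FORT5=input.dat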

two additional remarks – ccirun.procgroup for mpiexec:

  • for some reason, it seems to be necessary to use only the short hostname (e.g. “altix”) instead of the fully qualified hostname (e.g. altix.rrze.uni-erlangen.de)
  • with some applications, the first line in the procgroup file must be “altix : some-path/ccirun.cci-control.0.spawn --mpcci-inputfile ...”; with other applications, this line must be omitted (and the option “--mpcci-inputfile ...” has to be passed to the first actual executable instead)

Additional MpCCI remarks: … as we now have two different SGI Altix systems in our batch system, you either have to explicitly request one host using -l host=altix or -l host=altix-batch, or you have to dynamically generate the config file for mpiexec.

In addition, mpiexec has been upgraded to a newer version. Just use /opt/MpCCI/mpiexec/bin/mpiexec to always get the latest version; -mpich-p4-no-shmem is no longer necessary as it is now the compiled-in default.
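
Combining both remarks, a job script could regenerate the procgroup file for whichever Altix the job actually lands on, roughly like this (a sketch only; the sed pattern assumes the prepared template uses the short hostname “altix” in the host column, as in the example above):
      # short hostname of the system the job is running on
      HOST=$(hostname -s)
      # rewrite the host column of the prepared procgroup template
      sed "s/^altix /$HOST /" ccirun.procgroup > ccirun.procgroup.$HOST
      # latest mpiexec; -mpich-p4-no-shmem is now the compiled-in default
      /opt/MpCCI/mpiexec/bin/mpiexec -config=ccirun.procgroup.$HOST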

STAR-CD 3.24 Update 025 on SGI Altix

For some reason, update 025 of STAR-CD 3.24 on our SGI Altix system tried to use Intel MPI by default instead of SGI MPT. After setting the environment variable STARFLAGS=-mpi=sgi, SGI MPT is used again for this version as well.
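
For example, before starting STAR-CD (in a bash-type shell; csh-type shells would use setenv instead):
      # force STAR-CD 3.24 update 025 back to SGI MPT
      export STARFLAGS=-mpi=sgi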

MpCCI on SGI Altix

MpCCI is a library for coupling different codes (e.g. fluid mechanics and aeroacoustics) and exchanging mesh-based data between them. It is developed and sold by the Fraunhofer Institute SCAI (http://www.scai.fraunhofer.de/mpcci.html).

MpCCI is available for several platforms and relies on MPI for communication.

The good news: MpCCI SDK is available for IA64.
The bad news: it relies on mpich.

On our SGI Altix system we use PBS Professional as the batch queuing system, and each running job gets its own CPU set.

When starting an MpCCI job, a procgroup file is generated and the processes are started via ssh. And that is exactly the problem: the sshd daemon (started by root at boot time) runs outside the CPU set. Consequently, all processes started via ssh also run outside the allocated CPU set … 🙂

Solutions?
* shared-memory mpich does not work, as the shm device of mpich does not work with MPMD, i.e. a procgroup file is not supported
* using SGI MPT (SGI’s own MPI) does not work, as the binary-only MpCCI library relies on some mpich symbols
* starting the code with mpiexec does not work, as there are some problems with accessing STDIN from within the application
* …

Performance problems of parallel AMBER7/sander runs – SOLVED

Since the kernel update on the cluster, we have seen severe performance problems when running parallel sander on more than 4 CPUs. The effect cannot yet be explained. Further investigations are under way … see the comments for the solution.

Reasonable performance after update of switch firmware: After upgrading the firmware of the HP ProCurve 2848 switches, the performance of parallel AMBER runs is reasonable again. Tests with 16 CPUs (8 nodes) and a few with 32 CPUs (16 nodes) show reasonable runtimes again. Strange that the performance with the old switch firmware depended so strongly on the version of the Linux kernel on the compute nodes …