Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things


Defaults or special options of different MPI implementations to remember

More and more MPI implementations try to scope with the increasing complexity of modern cluster nodes. However, sometimes these automatic defaults are counterproductive when trying to pinpoint special effects …

Things to remember:

  • Open MPI:
  • Mvapich2:
    • recent version of mvapich2 default to MV2_ENABLE_AFFINITY=1 which enables internal pinning and overwrites any taskset given on the command line. Either use MV2_CPU_MAPPING (containing colon-separated CPU numbers) to make the correct pinning or set MV2_ENABLE_AFFINITY=0 and do external pinning. RRZE’s mpirun wrapper (as of Feb. 2010) disables the internal mvapich2 affinity routines to make our -pin argument work (again).
    • pinning of hybrid codes (mvapich2 + OpenMP) using RRZE’s pin_omp: PINOMP_SKIP=1,2 (cf.
      The resulting command line can be rather length: e.g. env PINOMP_SKIP=1,2 OMP_NUM_THREADS=4 KMP_AFFINITY=disabled mpirun -npernode 2 -pin 0,1,2,3_4,5,6,7 env LD_PRELOAD=/apps/rrze/lib/ ./a.out
  • Intel MPI:
    • starting large jobs over Infiniband: using the socket connection manager (SCM) instead of the default connection manager (CMA) might improve the start-up sequence. One might try env I_MPI_DEVICE=rdssm:OpenIB-mthca0-1 on Woody and TinyGPU, env I_MPI_DEVICE=rdssm:ofa-v2-mthca0-1 on Townsend or I_MPI_DEVICE=rdssm:OpenIB-mlx4_0-1 on TinyBlue or TinyFat.
      (The newer syntax to use together with I_MPI_FABRICS is I_MPI_DAPL_PROVIDER or I_MPI_DAPL_UD_PROVIDER.)
      Thanks to Intel for pointing out this OFED/Intel-MPI option. Unfortunately, there is only very little OFED documentation on the differences between CMA and SCM: (Release notes from December 2009 for OFED-1.5).
      UCM might be even more scalable; use I_MPI_DAPL_UD=enable I_MPI_FABRICS=shm:dapl. — on Townsend, Woody and Tiny{Blue,FAT,GPU} UCM is available if the experimental “dapl” moule is loaded before any Intel MPI module.
    • bug (at least) in Intel MPI connections to MPD daemons may fail if $TMPDIR is set to something different than /tmp
    • bug in Intel MPI Intel’s mpirun as shortcut for mpdboot - mpiexec - mpdallexit cannot read properly from STDIN; the problem is the newly with this release introduced sequence of mpiexec "$@" & wait $! instead of the traditional mpiexec "$@". RRZE’s PBS-based mpirun is not affected by this problem.
    • incompatibilities between RRZE’s PBS-based mpirun (i.e. Pete Wyckoff’s mpiexec) and I_MPI_DEVICE=rdssm: from time to time we observe that processes hang either during startup (at MPI_Init or at the first larger scale communication) or even more often at MPI_Finalize. I_MPI_DEVICE=rdma usually does not have these problems but is noticeable slower for certain applications. As the behavior is non-deterministic, it’s hard/impossible to debug and as the PBS-based start mechanism is not supported by Intel we cannot file a problem report neither. And of course Intel’s extensions of the PMI startup protocol also not publically documented …
    • you have recent Qlogic Infiniband HCA? I_MPI_FABRICS=tmi I_MPI_TMI_PROVIDER=psm should be appropriate (PS: I_MPI_FABRICS is the replacement for I_MPI_DEVICE introduced with Intel-MPI 4.0)
  • HP-MPI:
    • The environment variable MPIRUN_OPTIONS can be used to pass special options to HP-MPI, e.g. "-v -prot" to be verbose and report the used interconnect; –"-TCP -netaddr" to force TCP and select a specific interface (e.g. for IPoIB).
    • HP-MPI has its own pinning mechanism; cf. page 16/17 of /apps/HP-MPI/latest/doc/hp-mpi.02.02.rn.pdf dor details; MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,2,1,3" should be fine for regular jobs on Woody. If you require a lot of memory and would like to use only one MPI process per socket, the correct option should be MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,1"
  • Parastation MPI as e.g. used on JUROPA of FZ-Jülich
    • Any pointer to pinning mechanisms for processes would highly be appreciated.
    • The latest versions (cf. psmgmt 5.0.32-0) have the environment variable __PSI_CPUMAP which can be set to individual cores or core ranges, e.g. "0,2,4,6,1,3,5,7" or "0-3,7-4".
  • to be continued