Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things


Defaults or special options of different MPI implementations to remember

More and more MPI implementations try to scope with the increasing complexity of modern cluster nodes. However, sometimes these automatic defaults are counterproductive when trying to pinpoint special effects …

Things to remember:

  • Open MPI:
  • Mvapich2:
    • recent version of mvapich2 default to MV2_ENABLE_AFFINITY=1 which enables internal pinning and overwrites any taskset given on the command line. Either use MV2_CPU_MAPPING (containing colon-separated CPU numbers) to make the correct pinning or set MV2_ENABLE_AFFINITY=0 and do external pinning. RRZE’s mpirun wrapper (as of Feb. 2010) disables the internal mvapich2 affinity routines to make our -pin argument work (again).
    • pinning of hybrid codes (mvapich2 + OpenMP) using RRZE’s pin_omp: PINOMP_SKIP=1,2 (cf.
      The resulting command line can be rather length: e.g. env PINOMP_SKIP=1,2 OMP_NUM_THREADS=4 KMP_AFFINITY=disabled mpirun -npernode 2 -pin 0,1,2,3_4,5,6,7 env LD_PRELOAD=/apps/rrze/lib/ ./a.out
  • Intel MPI:
    • starting large jobs over Infiniband: using the socket connection manager (SCM) instead of the default connection manager (CMA) might improve the start-up sequence. One might try env I_MPI_DEVICE=rdssm:OpenIB-mthca0-1 on Woody and TinyGPU, env I_MPI_DEVICE=rdssm:ofa-v2-mthca0-1 on Townsend or I_MPI_DEVICE=rdssm:OpenIB-mlx4_0-1 on TinyBlue or TinyFat.
      (The newer syntax to use together with I_MPI_FABRICS is I_MPI_DAPL_PROVIDER or I_MPI_DAPL_UD_PROVIDER.)
      Thanks to Intel for pointing out this OFED/Intel-MPI option. Unfortunately, there is only very little OFED documentation on the differences between CMA and SCM: (Release notes from December 2009 for OFED-1.5).
      UCM might be even more scalable; use I_MPI_DAPL_UD=enable I_MPI_FABRICS=shm:dapl. — on Townsend, Woody and Tiny{Blue,FAT,GPU} UCM is available if the experimental “dapl” moule is loaded before any Intel MPI module.
    • bug (at least) in Intel MPI connections to MPD daemons may fail if $TMPDIR is set to something different than /tmp
    • bug in Intel MPI Intel’s mpirun as shortcut for mpdboot - mpiexec - mpdallexit cannot read properly from STDIN; the problem is the newly with this release introduced sequence of mpiexec "$@" & wait $! instead of the traditional mpiexec "$@". RRZE’s PBS-based mpirun is not affected by this problem.
    • incompatibilities between RRZE’s PBS-based mpirun (i.e. Pete Wyckoff’s mpiexec) and I_MPI_DEVICE=rdssm: from time to time we observe that processes hang either during startup (at MPI_Init or at the first larger scale communication) or even more often at MPI_Finalize. I_MPI_DEVICE=rdma usually does not have these problems but is noticeable slower for certain applications. As the behavior is non-deterministic, it’s hard/impossible to debug and as the PBS-based start mechanism is not supported by Intel we cannot file a problem report neither. And of course Intel’s extensions of the PMI startup protocol also not publically documented …
    • you have recent Qlogic Infiniband HCA? I_MPI_FABRICS=tmi I_MPI_TMI_PROVIDER=psm should be appropriate (PS: I_MPI_FABRICS is the replacement for I_MPI_DEVICE introduced with Intel-MPI 4.0)
  • HP-MPI:
    • The environment variable MPIRUN_OPTIONS can be used to pass special options to HP-MPI, e.g. "-v -prot" to be verbose and report the used interconnect; –"-TCP -netaddr" to force TCP and select a specific interface (e.g. for IPoIB).
    • HP-MPI has its own pinning mechanism; cf. page 16/17 of /apps/HP-MPI/latest/doc/hp-mpi.02.02.rn.pdf dor details; MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,2,1,3" should be fine for regular jobs on Woody. If you require a lot of memory and would like to use only one MPI process per socket, the correct option should be MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,1"
  • Parastation MPI as e.g. used on JUROPA of FZ-Jülich
    • Any pointer to pinning mechanisms for processes would highly be appreciated.
    • The latest versions (cf. psmgmt 5.0.32-0) have the environment variable __PSI_CPUMAP which can be set to individual cores or core ranges, e.g. "0,2,4,6,1,3,5,7" or "0-3,7-4".
  • to be continued

4 Responses to Defaults or special options of different MPI implementations to remember

  1. Thomas Zeiser says:

    With Intel MPI, Intel introduced a new environment variable (I_MPI_JOB_RESPECT_PROCESS_PLACEMENT) which — if manually set to OFF — will make mpiexec.hydra respect -pernode again. This new Intel MPI version will soon be the default on all RRZE cluster and the module file will set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT)=OFF for you. (See for details.)

  2. Thomas Zeiser says:

    The problems we sometimes observe with hanging processes at startup (right at the MPI_Init or the first time a larger communication occurs) or at the very end (at or after MPI_Finalise), were confirmed by an external user, too. His solution was to switch from I_MPI_FABRICS=shm:dapl to I_MPI_FABRICS=shm:ofa.

  3. Thomas Zeiser says:

    At some point Intel switched the default for I_MPI_PMI_EXTENSIONS to off. I.e. for e.g. Intel MPI 4.0.3 you have to manually set I_MPI_PMI_EXTENSIONS=on to make the default OSC mpiexec version work properly.

    In February 2012, Doug Johnson also implemented an “impi-helper” to provide the expected PMI_FD file descriptors. Cf. for the alternative “-intel-bug” workaround.

  4. Thomas Zeiser says:

    There are some new “bugs/incompatibilities” in Intel MPI (i.e. Intel MPI 4.0up1):

    MPI-IO to lustre-based parallel filesystem now requires I_MPI_EXTRA_FILESYSTEM=on and I_MPI_EXTRA_FILESYSTEM_LIST=lustre otherwise you’ll get locking-related errors. (Change in behavior confirmed by Intel.)

    OSC’s mpiexec (0.84) does not start parallel instances but only N serial copies of your code if I_MPI_PMI_EXTENSIONS=off; however, mpiexec seems to work as expected if the default of I_MPI_PMI_EXTENSIONS=on is effective.

Comments are closed.