Thomas Zeiser

Suche


16. Februar 2010

Defaults or special options of different MPI implementations to remember

Thomas Zeiser, 20:12 Uhr in HPC allgemein, HPC-Cluster@RRZE

More and more MPI implementations try to scope with the increasing complexity of modern cluster nodes. However, sometimes these automatic defaults are counterproductive when trying to pinpoint special effects …

Things to remember:

  • Open MPI:
  • Mvapich2:
    • recent version of mvapich2 default to MV2_ENABLE_AFFINITY=1 which enables internal pinning and overwrites any taskset given on the command line. Either use MV2_CPU_MAPPING (containing colon-separated CPU numbers) to make the correct pinning or set MV2_ENABLE_AFFINITY=0 and do external pinning. RRZE’s mpirun wrapper (as of Feb. 2010) disables the internal mvapich2 affinity routines to make our -pin argument work (again).
    • pinning of hybrid codes (mvapich2 + OpenMP) using RRZE’s pin_omp: PINOMP_SKIP=1,2 (cf. http://www.blogs.uni-erlangen.de/JohannesHabich/stories/3414/#4721)
      The resulting command line can be rather length: e.g. env PINOMP_SKIP=1,2 OMP_NUM_THREADS=4 KMP_AFFINITY=disabled mpirun -npernode 2 -pin 0,1,2,3_4,5,6,7 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so ./a.out
  • Intel MPI:
    • starting large jobs over Infiniband: using the socket connection manager (SCM) instead of the default connection manager (CMA) might improve the start-up sequence. One might try env I_MPI_DEVICE=rdssm:OpenIB-mthca0-1 on Woody and TinyGPU, env I_MPI_DEVICE=rdssm:ofa-v2-mthca0-1 on Townsend or I_MPI_DEVICE=rdssm:OpenIB-mlx4_0-1 on TinyBlue or TinyFat.
      Thanks to Intel for pointing this OFED/Intel-MPI option out. Unfortunately, there is only very little OFED documentation on the differences between CMA and SCM: http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_release_notes.txt (Release notes from December 2009 for OFED-1.5).
      UCM might be even more scalable — on Townsend, Woody and Tiny{Blue,FAT,GPU} UCM is available if the experimental “dapl” moule is loaded before any Intel MPI module.
    • bug (at least) in Intel MPI 3.1.0.038: connections to MPD daemons may fail if $TMPDIR is set to something different than /tmp
    • bug in Intel MPI 4.0.0.025: Intel’s mpirun as shortcut for mpdboot - mpiexec - mpdallexit cannot read properly from STDIN; the problem is the newly with this release introduced sequence of mpiexec "$@" & wait $! instead of the traditional mpiexec "$@". RRZE’s PBS-based mpirun is not affected by this problem.
    • incompatibilities between RRZE’s PBS-based mpirun (i.e. Pete Wyckoff’s mpiexec) and I_MPI_DEVICE=rdssm: from time to time we observe that processes hang either during startup (at MPI_Init or at the first larger scale communication) or even more often at MPI_Finalize. I_MPI_DEVICE=rdma usually does not have these problems but is noticeable slower for certain applications. As the behavior is non-deterministic, it’s hard/impossible to debug and as the PBS-based start mechanism is not supported by Intel we cannot file a problem report neither. And of course Intel’s extensions of the PMI startup protocol also not publically documented …
    • you have recent Qlogic Infiniband HCA? I_MPI_FABRICS=tmi I_MPI_TMI_PROVIDER=psm should be appropriate (PS: I_MPI_FABRICS is the replacement for I_MPI_DEVICE introduced with Intel-MPI 4.0)
  • HP-MPI:
    • The environment variable MPIRUN_OPTIONS can be used to pass special options to HP-MPI, e.g. "-v -prot" to be verbose and report the used interconnect; -"-TCP -netaddr 10.188.84.0/255.255.254.0" to force TCP and select a specific interface (e.g. for IPoIB).
    • HP-MPI has its own pinning mechanism; cf. page 16/17 of /apps/HP-MPI/latest/doc/hp-mpi.02.02.rn.pdf dor details; MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,2,1,3" should be fine for regular jobs on Woody. If you require a lot of memory and would like to use only one MPI process per socket, the correct option should be MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,1"
  • Parastation MPI as e.g. used on JUROPA of FZ-Jülich
    • Any pointer to pinning mechanisms for processes would highly be appreciated.
    • The latest versions (cf. psmgmt 5.0.32-0) have the environment variable __PSI_CPUMAP which can be set to individual cores or core ranges, e.g. "0,2,4,6,1,3,5,7" or "0-3,7-4".
  • to be continued

Nach oben