More and more MPI implementations try to cope with the increasing complexity of modern cluster nodes. However, these automatic defaults are sometimes counterproductive when trying to pinpoint special effects …
Things to remember:
- Open MPI:
  - mpirun --mca btl tcp,self ... switches to TCP but not necessarily to GBit Ethernet; if IPoIB (IP over InfiniBand) is available, Open MPI will take that. Thus add --mca btl_tcp_if_include eth1 to select a specific interface. Cf. http://www.open-mpi.org/faq/?category=tcp#tcp-selection
  - pinning of hybrid codes (Open MPI + OpenMP) using RRZE’s pin_omp: PINOMP_MASK=7 (cf. http://www.blogs.uni-erlangen.de/JohannesHabich/stories/3414/#3621)
 
- Mvapich2:
  - recent versions of mvapich2 default to MV2_ENABLE_AFFINITY=1, which enables internal pinning and overrides any taskset given on the command line. Either use MV2_CPU_MAPPING (containing colon-separated CPU numbers) to get the correct pinning, or set MV2_ENABLE_AFFINITY=0 and do external pinning. RRZE’s mpirun wrapper (as of Feb. 2010) disables the internal mvapich2 affinity routines to make our -pin argument work (again).
  - pinning of hybrid codes (mvapich2 + OpenMP) using RRZE’s pin_omp: PINOMP_SKIP=1,2 (cf. http://www.blogs.uni-erlangen.de/JohannesHabich/stories/3414/#4721)
    The resulting command line can be rather lengthy, e.g. env PINOMP_SKIP=1,2 OMP_NUM_THREADS=4 KMP_AFFINITY=disabled mpirun -npernode 2 -pin 0,1,2,3_4,5,6,7 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so ./a.out
 
- Intel MPI:
  - starting large jobs over InfiniBand: using the socket connection manager (SCM) instead of the default connection manager (CMA) might improve the start-up sequence. One might try env I_MPI_DEVICE=rdssm:OpenIB-mthca0-1 on Woody and TinyGPU, env I_MPI_DEVICE=rdssm:ofa-v2-mthca0-1 on Townsend, or I_MPI_DEVICE=rdssm:OpenIB-mlx4_0-1 on TinyBlue or TinyFat.
    (The newer syntax to use together with I_MPI_FABRICS is I_MPI_DAPL_PROVIDER or I_MPI_DAPL_UD_PROVIDER.)
    Thanks to Intel for pointing out this OFED/Intel MPI option. Unfortunately, there is only very little OFED documentation on the differences between CMA and SCM: http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_release_notes.txt (release notes from December 2009 for OFED-1.5).
    UCM might be even more scalable; use I_MPI_DAPL_UD=enable I_MPI_FABRICS=shm:dapl. On Townsend, Woody and Tiny{Blue,FAT,GPU}, UCM is available if the experimental “dapl” module is loaded before any Intel MPI module.
  - bug (at least) in Intel MPI 3.1.0.038: connections to MPD daemons may fail if $TMPDIR is set to something other than /tmp
  - bug in Intel MPI 4.0.0.025: Intel’s mpirun, as a shortcut for mpdboot - mpiexec - mpdallexit, cannot read properly from STDIN; the problem is the sequence mpiexec "$@" & wait $!, newly introduced with this release, instead of the traditional mpiexec "$@". RRZE’s PBS-based mpirun is not affected by this problem.
  - incompatibilities between RRZE’s PBS-based mpirun (i.e. Pete Wyckoff’s mpiexec) and I_MPI_DEVICE=rdssm: from time to time we observe that processes hang either during startup (at MPI_Init or at the first larger-scale communication) or, even more often, at MPI_Finalize. I_MPI_DEVICE=rdma usually does not have these problems but is noticeably slower for certain applications. As the behavior is non-deterministic, it is hard or impossible to debug, and as the PBS-based start mechanism is not supported by Intel, we cannot file a problem report either. And of course Intel’s extensions of the PMI startup protocol are also not publicly documented …
  - you have a recent QLogic InfiniBand HCA? I_MPI_FABRICS=tmi I_MPI_TMI_PROVIDER=psm should be appropriate (PS: I_MPI_FABRICS is the replacement for I_MPI_DEVICE introduced with Intel MPI 4.0)
 
- HP-MPI:
  - The environment variable MPIRUN_OPTIONS can be used to pass special options to HP-MPI, e.g. "-v -prot" to be verbose and report the used interconnect, or "-TCP -netaddr 10.188.84.0/255.255.254.0" to force TCP and select a specific interface (e.g. for IPoIB).
  - HP-MPI has its own pinning mechanism; cf. page 16/17 of /apps/HP-MPI/latest/doc/hp-mpi.02.02.rn.pdf for details. MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,2,1,3" should be fine for regular jobs on Woody. If you require a lot of memory and would like to use only one MPI process per socket, the correct option should be MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,1"
 
- Parastation MPI as e.g. used on JUROPA of FZ-Jülich
  - Any pointer to pinning mechanisms for processes would be highly appreciated.
  - The latest versions (cf. psmgmt 5.0.32-0) have the environment variable __PSI_CPUMAP, which can be set to individual cores or core ranges, e.g. "0,2,4,6,1,3,5,7" or "0-3,7-4".
 
- to be continued
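
The CPU lists in the notes above use different notations: MV2_CPU_MAPPING expects colon-separated CPU numbers, while __PSI_CPUMAP also accepts ranges such as "0-3,7-4". The following is a minimal shell sketch for converting the range notation into the colon-separated one. The function expand_cores is a hypothetical helper and not part of any MPI distribution, and it assumes that a descending range like 7-4 means the cores 7,6,5,4 in that order.

```shell
#!/bin/sh
# expand_cores: hypothetical helper (not shipped with any MPI library).
# Expands a comma-separated core list with ranges, e.g. "0-3,7-4",
# into a colon-separated list of single core numbers.
# Assumption: a range "a-b" with a > b counts downwards.
expand_cores() {
    out=""
    old_ifs=$IFS
    IFS=,
    for item in $1; do
        case $item in
            *-*) first=${item%-*}; last=${item#*-} ;;  # range "a-b"
            *)   first=$item; last=$item ;;            # single core
        esac
        step=1
        [ "$first" -gt "$last" ] && step=-1
        i=$first
        while :; do
            out="${out:+$out:}$i"
            [ "$i" -eq "$last" ] && break
            i=$((i + step))
        done
    done
    IFS=$old_ifs
    printf '%s\n' "$out"
}

# Build a colon-separated mapping from the range notation.
MV2_CPU_MAPPING=$(expand_cores "0-3,7-4")
echo "MV2_CPU_MAPPING=$MV2_CPU_MAPPING"
```

The result would still have to be exported together with the other settings of the job; whether a given order of cores is the right pinning depends on the node topology.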
With Intel MPI 4.1.03.048, Intel introduced a new environment variable (I_MPI_JOB_RESPECT_PROCESS_PLACEMENT) which, if manually set to OFF, will make mpiexec.hydra respect -pernode again. This new Intel MPI version will soon be the default on all RRZE clusters and the module file will set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=OFF for you. (See for details.)
The problems we sometimes observe with hanging processes at startup (right at MPI_Init or the first time a larger communication occurs) or at the very end (at or after MPI_Finalize) were confirmed by an external user, too. His solution was to switch from I_MPI_FABRICS=shm:dapl to I_MPI_FABRICS=shm:ofa.
At some point Intel switched the default for I_MPI_PMI_EXTENSIONS to off. I.e. for e.g. Intel MPI 4.0.3 you have to manually set I_MPI_PMI_EXTENSIONS=on to make the default OSC mpiexec version work properly.
In February 2012, Doug Johnson also implemented an “impi-helper” to provide the expected PMI_FD file descriptors. Cf. http://lists.osc.edu/pipermail/mpiexec/2012/001215.html for the alternative “-intel-bug” workaround.
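
In a job script, the two settings from these notes could be collected as in the following sketch. Only the variable names and values are taken from the text; printing the environment at the end is just an illustrative dry run, and the actual launch line of the job is deliberately left out.

```shell
#!/bin/sh
# Sketch of the Intel MPI settings discussed above (no job is started).

# Workaround reported by an external user for hangs at MPI_Init or
# MPI_Finalize: use the OFA fabric instead of DAPL.
export I_MPI_FABRICS=shm:ofa

# Needed for OSC mpiexec where the Intel MPI default is "off"
# (e.g. Intel MPI 4.0.3 according to the note above).
export I_MPI_PMI_EXTENSIONS=on

# Dry run: show which Intel MPI variables the job would see.
env | grep '^I_MPI_' | sort
```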
There are some new “bugs/incompatibilities” in Intel MPI 4.0.1.007 (i.e. Intel MPI 4.0up1):
- MPI-IO to a Lustre-based parallel filesystem now requires I_MPI_EXTRA_FILESYSTEM=on and I_MPI_EXTRA_FILESYSTEM_LIST=lustre; otherwise you’ll get locking-related errors. (Change in behavior confirmed by Intel.)
- OSC’s mpiexec (0.84) does not start parallel instances but only N serial copies of your code if I_MPI_PMI_EXTENSIONS=off; however, mpiexec seems to work as expected if the default of I_MPI_PMI_EXTENSIONS=on is effective.
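
Since both problems show up only at run time, it may help to check the environment before the job starts. The following sketch sets the variables named above and verifies them; check_impi_env is a hypothetical helper written for this illustration and is not part of Intel MPI or OSC mpiexec.

```shell
#!/bin/sh
# Settings for Intel MPI 4.0.1.007 as described in the notes above.
export I_MPI_EXTRA_FILESYSTEM=on
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre
export I_MPI_PMI_EXTENSIONS=on

# check_impi_env: hypothetical helper; returns non-zero and prints a
# warning for every variable that does not have the expected value.
check_impi_env() {
    status=0
    [ "${I_MPI_EXTRA_FILESYSTEM:-}" = "on" ] ||
        { echo "I_MPI_EXTRA_FILESYSTEM is not 'on'" >&2; status=1; }
    [ "${I_MPI_EXTRA_FILESYSTEM_LIST:-}" = "lustre" ] ||
        { echo "I_MPI_EXTRA_FILESYSTEM_LIST is not 'lustre'" >&2; status=1; }
    [ "${I_MPI_PMI_EXTENSIONS:-}" = "on" ] ||
        { echo "I_MPI_PMI_EXTENSIONS is not 'on'" >&2; status=1; }
    return $status
}

if check_impi_env; then
    echo "Intel MPI environment looks consistent"
fi
```

In a real job script the launch command would follow the check; it is omitted here because it depends on the local mpirun wrapper.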