More and more MPI implementations try to cope with the increasing complexity of modern cluster nodes. However, the automatic defaults they choose are sometimes counterproductive when trying to pinpoint special effects …
Things to remember:
- Open MPI:
  - `mpirun --mca btl tcp,self ...` switches to TCP but not necessarily to GBit-Ethernet; if IP-over-IB is available, Open MPI will take that. Thus, add `--mca btl_tcp_if_include eth1` to select a specific interface (see the sketch at the end of this block); cf. http://www.open-mpi.org/faq/?category=tcp#tcp-selection
  - pinning of hybrid codes (Open MPI + OpenMP) using RRZE’s `pin_omp`: `PINOMP_MASK=7` (cf. http://www.blogs.uni-erlangen.de/JohannesHabich/stories/3414/#3621)
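  A minimal sketch of the interface-selection case above (the interface name eth1 is taken from the example; process count and binary are assumptions):

  ```bash
  # restrict Open MPI to the TCP BTL (plus self for loopback) and to eth1,
  # so that an IP-over-IB interface is not picked up automatically
  mpirun --mca btl tcp,self --mca btl_tcp_if_include eth1 -np 8 ./a.out
  ```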
- Mvapich2:
  - recent versions of mvapich2 default to `MV2_ENABLE_AFFINITY=1`, which enables internal pinning and overrides any `taskset` given on the command line. Either use `MV2_CPU_MAPPING` (containing colon-separated CPU numbers) to get the correct pinning, or set `MV2_ENABLE_AFFINITY=0` and do external pinning; a sketch of both alternatives is given at the end of this block. RRZE’s mpirun wrapper (as of Feb. 2010) disables the internal mvapich2 affinity routines to make our `-pin` argument work (again).
  - pinning of hybrid codes (mvapich2 + OpenMP) using RRZE’s `pin_omp`: `PINOMP_SKIP=1,2` (cf. http://www.blogs.uni-erlangen.de/JohannesHabich/stories/3414/#4721). The resulting command line can be rather lengthy, e.g. `env PINOMP_SKIP=1,2 OMP_NUM_THREADS=4 KMP_AFFINITY=disabled mpirun -npernode 2 -pin 0,1,2,3_4,5,6,7 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so ./a.out`
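  A minimal sketch of the two plain pinning alternatives mentioned above (rank count and core IDs are just examples; how environment variables reach the ranks may depend on the launcher used):

  ```bash
  # alternative 1: keep mvapich2's internal affinity but control it explicitly;
  # MV2_CPU_MAPPING lists the core for each rank, colon-separated (4 ranks here)
  env MV2_ENABLE_AFFINITY=1 MV2_CPU_MAPPING=0:2:4:6 mpirun -np 4 ./a.out

  # alternative 2: switch the internal affinity off and pin externally; note
  # that a plain taskset prefix only confines all ranks to the given core set,
  # per-rank placement needs a wrapper (e.g. the -pin argument mentioned above)
  env MV2_ENABLE_AFFINITY=0 mpirun -np 4 taskset -c 0,2,4,6 ./a.out
  ```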
- Intel MPI:
  - starting large jobs over Infiniband: using the socket connection manager (SCM) instead of the default connection manager (CMA) might improve the start-up sequence. One might try `env I_MPI_DEVICE=rdssm:OpenIB-mthca0-1` on Woody and TinyGPU, `env I_MPI_DEVICE=rdssm:ofa-v2-mthca0-1` on Townsend, or `I_MPI_DEVICE=rdssm:OpenIB-mlx4_0-1` on TinyBlue or TinyFat. (The newer syntax to use together with `I_MPI_FABRICS` is `I_MPI_DAPL_PROVIDER` or `I_MPI_DAPL_UD_PROVIDER`.) Thanks to Intel for pointing out this OFED/Intel-MPI option. Unfortunately, there is only very little OFED documentation on the differences between CMA and SCM: http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_release_notes.txt (release notes from December 2009 for OFED-1.5). UCM might be even more scalable; use `I_MPI_DAPL_UD=enable I_MPI_FABRICS=shm:dapl`. On Townsend, Woody and Tiny{Blue,FAT,GPU}, UCM is available if the experimental “dapl” module is loaded before any Intel MPI module.
  - bug (at least) in Intel MPI 3.1.0.038: connections to MPD daemons may fail if `$TMPDIR` is set to something different from `/tmp`
  - bug in Intel MPI 4.0.0.025: Intel’s `mpirun` as shortcut for `mpdboot - mpiexec - mpdallexit` cannot read properly from STDIN; the problem is the sequence `mpiexec "$@" & wait $!` newly introduced with this release instead of the traditional `mpiexec "$@"`. RRZE’s PBS-based mpirun is not affected by this problem.
  - incompatibilities between RRZE’s PBS-based mpirun (i.e. Pete Wyckoff’s mpiexec) and `I_MPI_DEVICE=rdssm`: from time to time we observe that processes hang either during startup (at MPI_Init or at the first larger-scale communication) or, even more often, at MPI_Finalize. `I_MPI_DEVICE=rdma` usually does not have these problems but is noticeably slower for certain applications. As the behavior is non-deterministic, it is hard/impossible to debug, and as the PBS-based start mechanism is not supported by Intel, we cannot file a problem report either. And of course Intel’s extensions of the PMI startup protocol are also not publicly documented …
  - you have a recent QLogic Infiniband HCA? `I_MPI_FABRICS=tmi I_MPI_TMI_PROVIDER=psm` should be appropriate (PS: `I_MPI_FABRICS` is the replacement for `I_MPI_DEVICE`, introduced with Intel MPI 4.0). A sketch of the fabric-selection variants is given below.
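  A minimal sketch of the fabric-selection variants from this list (device/provider names are the ones quoted above; rank count and binary are assumptions):

  ```bash
  # Intel MPI before 4.0: select the DAPL provider via I_MPI_DEVICE
  # (OpenIB-mthca0-1 is the Woody/TinyGPU example from above)
  env I_MPI_DEVICE=rdssm:OpenIB-mthca0-1 mpirun -np 64 ./a.out

  # Intel MPI 4.0 and later: fabrics via I_MPI_FABRICS, here shm within the
  # node and DAPL (with UD enabled for scalability) between nodes
  env I_MPI_FABRICS=shm:dapl I_MPI_DAPL_UD=enable mpirun -np 64 ./a.out

  # recent QLogic HCAs: TMI fabric with the PSM provider
  env I_MPI_FABRICS=tmi I_MPI_TMI_PROVIDER=psm mpirun -np 64 ./a.out
  ```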
- HP-MPI:
  - The environment variable `MPIRUN_OPTIONS` can be used to pass special options to HP-MPI, e.g. `"-v -prot"` to be verbose and report the used interconnect, or `"-TCP -netaddr 10.188.84.0/255.255.254.0"` to force TCP and select a specific interface (e.g. for IPoIB).
  - HP-MPI has its own pinning mechanism; cf. page 16/17 of /apps/HP-MPI/latest/doc/hp-mpi.02.02.rn.pdf for details. `MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,2,1,3"` should be fine for regular jobs on Woody. If you require a lot of memory and would like to use only one MPI process per socket, the correct option should be `MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,1"`; see the sketch below.
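  A minimal sketch combining the options above in a job script, assuming the mpirun wrapper in use honours MPIRUN_OPTIONS (rank count and binary are assumptions):

  ```bash
  # be verbose, report the chosen interconnect, and pin one rank per socket
  # on Woody (cores 0 and 1 as in the one-process-per-socket example above)
  export MPIRUN_OPTIONS="-v -prot -cpu_bind=v,map_cpu:0,1"
  mpirun -np 4 ./a.out
  ```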
- Parastation MPI (as e.g. used on JUROPA of FZ-Jülich):
  - Any pointer to pinning mechanisms for processes would be highly appreciated.
  - The latest versions (cf. psmgmt 5.0.32-0) have the environment variable `__PSI_CPUMAP`, which can be set to individual cores or core ranges, e.g. `"0,2,4,6,1,3,5,7"` or `"0-3,7-4"`; a sketch is given below.
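  A minimal sketch of using __PSI_CPUMAP (the map value is taken from above; the mpiexec invocation and rank count are assumptions and may differ on JUROPA):

  ```bash
  # map ranks to the even cores first, then to the odd cores
  export __PSI_CPUMAP="0,2,4,6,1,3,5,7"
  mpiexec -np 8 ./a.out
  ```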
- to be continued