Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

“-perhost #” option in Intel’s mpiexec.hydra is broken

On RRZE’s HPC systems, you always have to allocate multiples of complete compute nodes (e.g. ppn=24 for LiMa). However, as SMT is enabled on all systems supporting it, it is quite common that you want fewer MPI processes running on a node than there are entries for it in $PBS_NODEFILE. Intel’s mpiexec.hydra has the “-perhost #” option, which one would assume does what its name says. Well, reality is different. Starting with version 4.1.0.030 (version 4.1.0.024 was still o.k.), Intel does not respect the “-perhost” command line argument if mpiexec.hydra detects that it is started within a job of the batch system.

Here is the answer from Intel’s support:

“I reported the case to our engineering team, this is a known issue to us.
Overriding the Resource Manager process distribution with -perhost option (-ppn) is unsafe; we had customer’s issue where Resource Manager just kills processes if we try to put them to non-allocated cores.
Currently we have no “workaround” to enable the -perhost option.
However, we are considering to add some functionality in order to allow users to force the IMPI runtime to overwrite the Resource Manager distribution with its own parameters.”

What does that mean? Well, you cannot use Intel’s mpiexec.hydra on RRZE’s clusters together with -perhost, unless you fiddle around with a custom hostfile manually. And never ever try to modify the contents of $PBS_NODEFILE! You will mess up the batch system completely. If you think you must generate your own hostfile, always create a new file and delete it once you no longer need it.
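If you do go down the custom hostfile route, the following is a minimal sketch of the "new file, delete afterwards" advice. The helper name make_hostfile, the temporary file name and the value 12 are assumptions chosen for illustration; this is not an RRZE-provided tool, and how the resulting file is handed to your MPI launcher is deliberately left out.

```shell
#!/bin/bash
# make_hostfile NODEFILE PPN  (hypothetical helper)
# Prints at most PPN entries per host, in the order of NODEFILE.
# NODEFILE itself is only read, never modified.
make_hostfile() {
    local nodefile=$1 ppn=$2
    # seen[host] counts how often a host has appeared so far;
    # a line is printed only for its first PPN occurrences.
    awk -v ppn="$ppn" '++seen[$0] <= ppn' "$nodefile"
}

# Only do something inside a batch job, where $PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # Always a NEW file; $PBS_NODEFILE stays untouched.
    hostfile=$(mktemp "${TMPDIR:-/tmp}/hosts.XXXXXX")
    # Remove the file again when the job script exits.
    trap 'rm -f "$hostfile"' EXIT
    make_hostfile "$PBS_NODEFILE" 12 > "$hostfile"
    echo "wrote $(wc -l < "$hostfile") entries to $hostfile"
fi
```

The trap makes sure the temporary file is removed even if the job script aborts early.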

If you nevertheless just use Intel’s mpiexec.hydra, your first nodes will always be filled completely (e.g. with 24 processes on LiMa) and, if you specified some -perhost, the last nodes will just sit idle. E.g. in case of LiMa and -perhost 12: half of the nodes you requested will be idle while the other half gets overloaded. RRZE may kill such resource-wasting jobs without further warning.
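The arithmetic behind this can be sketched in a few lines of shell. The function name nodes_used is made up for illustration, and the sketch assumes that you start (number of nodes) x (intended processes per node) processes in total and that the launcher fills every node up to its number of $PBS_NODEFILE entries.

```shell
# nodes_used NODES SLOTS_PER_NODE PERHOST  (hypothetical helper)
# Number of nodes that actually receive processes when the per-node
# limit is ignored and each node is filled completely.
nodes_used() {
    local nodes=$1 slots=$2 perhost=$3
    local nprocs=$(( nodes * perhost ))        # processes you intended to start
    echo $(( (nprocs + slots - 1) / slots ))   # ceiling division
}

# LiMa example: 4 nodes with 24 entries each, 12 processes per node intended
nodes_used 4 24 12   # prints 2, i.e. 2 of the 4 nodes stay idle
```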

RRZE’s recommendation for typical users is to use /apps/rrze/bin/mpirun (or, on Emmy, mpirun_rrze, which will be in the search path once an intelmpi module is loaded) together with proper pinning of processes.

 

Update Feb. 2014:

With Intel MPI 4.1.3.048, Intel introduced a new environment variable, I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, which makes mpiexec.hydra respect -perhost again if it is manually set to OFF. This new Intel MPI version will soon be the default on all RRZE clusters, and the module file will set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=OFF for you.
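Until the module file sets the variable for you, a job script fragment along the following lines should do. It is only a sketch: it assumes an Intel MPI version that already knows the variable, and the process counts (12 per node, 48 in total, i.e. 4 nodes) as well as the binary name ./a.out are placeholders.

```shell
# Tell mpiexec.hydra not to take over the process placement of the batch system.
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=OFF

# Placeholder run: 4 nodes x 12 processes per node = 48 processes.
if command -v mpiexec.hydra >/dev/null 2>&1; then
    mpiexec.hydra -perhost 12 -n 48 ./a.out
else
    echo "mpiexec.hydra not found; load an intelmpi module first" >&2
fi
```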