Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Content

No free lunch – dying the death of parallelism

The following table gives a short overview of the main HPC systems at RRZE and the development of number of cores, clock frequency, peak performance, etc.:

original Woody (w0xxx) w10xx w11xx w12xx + w13xx LiMa Emmy Meggie
Year of installation Q1/2007 Q1/2012 Q4/2013 Q1/2016 + Q4/2016 Q4/2010 Q4/2013 Q4/2016
total number of compute nodes 222 40 72 8 + 56 500 560 728
total number of cores 888 160 288 32 + 224 6000 11200 14560
double precision peak performance of the complete system 10 TFlop/s 4.5 TFlop/s 15 TFlop/s 1.8 + 12.5 TFlop/s 63 TFlop/s 197 TFlop/s ~0.5 PFlop/s (assuming the non-AVX base frequency as AVX turbo frequency)
increase of peak performance of the complete system compared to Woody 1.0 0.4 0.8 0.2 + 1.25 6.3 20 50
max. power consumption of compute nodes and interconnect 100 kW 7 kW 10 kW 2 + 6,5 kW 200 kW 225 kW ~200 kW
Intel CPU generation Woodcrest SandyBridge (E3-1280) Haswell (E3-1240 v3) Skylake (E3-1240 v5) Westmere-EP (X5650) IvyBridge-EP (E5-2660 v2) Intel Broadwell EP (E5-2630 v4)
base clock frequency 3.0 GHz 3.5 GHz 3.4 GHz 3.5 GHz 2.66 GHz 2.2 GHz 2.2 GHz
number of sockets per node 2 1 1 1 2 2 2
number of (physical) cores per node 4 4 4 4 12 20 20
SIMD vector lenth 128 bit (SSE) 256 bit (AVX) 256 bit (AVX+FMA) 256 bit (AVX+FMA) 128 bit (SSE) 256 bit (AVX) 256 bit (AVX+FMA)
maximum single precision peak performance per node 96 GFlop/s 224 GFlop/s 435 GFlop/s 448 GFlop/s 255 GFlop/s 704 GFlop/s 1408 GFlop/s
peak performance per node compared to Woody 1.0 2.3 4.5 4.5 2.6 7.3 14.7
single precision peak performance of serial, non-vectorized code 6 GFlop/s 7.0 GFlop/s 6.8 GFlop/s 7.0 GFlop/s 5.3 GFlop/s 4.4 GFlop/s 4.4 GFlop/s
performance for unoptimized serial code compared to Woody 1.0 1.17 1.13 1.17 0.88 0.73 0.73
main memory per node 8 GB 8 GB 8 GB 16 GB / 32 GB 24 GB 64 GB 64 GB
memory bandwidth per node 6.4 GB/s 20 GB/s 20 GB/s 25 GB/s 40 GB/s 80 GB/s 100 GB/s
memory bandwidth compared to Woody 1.0 3.1 3.1 3.9 6.2 13 15.6

If one only looks at the increase of peak performance of the complete Emmy systems only, the world is bright: 20x increase in 6 years. Not bad.

However, if one has an unoptimized (i.e. non-vectorized) serial code which is compute bound, the speed on the latest system coming in 2013 will only be 73% of the one bought in 2007! The unoptimized (i.e. non-vectorized) serial code can neither benefit from the wider SIMD units nor from the increased number of cores per node but suffers from the decreased clock frequency.

But also optimized parallel codes are challenged: the degree of parallelism increased from Woody to Emmy by a factor of 25. You remember Amdahl’s law (link to Wikipedia) for strong scaling? To scale up to 20 parallel processes, it is enough if 95% of the runtime can be executed in parallel, i.e. 5% can remain non-parallel. To scale up to 11200 processes, less then 0.01% must be executed non-parallel and there must be no other overhead, e.g. due to communication or HALO exchange!

Additional throughput nodes added to Woody cluster

Recently, 40 additional nodes with an aggregated AVX-Linpack performance of 4 TFlop/s have been added to RRZE’s Woody cluster. The nodes were bought by RRZE and ECAP and shall provide additional resources especially for sequential and single-node throughput calculations. Each node has a single-socket socket with Intel’s latest “SandyBridge” 4-core CPUs (Xeon E3-1200 series), 8 GB of main memory, currently no harddisk (and thus no swap) and GBit ethernet.

Current status: most of the new nodes are available for general batch processing; the configuration and software environment stabilized

Open problems:

  • no known ones

User visible changes and solved problems:

  • End of April 2012: all the new w10xx nodes got their harddisk in the meantime and have been reinstalled with SLES10 to match the old w0xx nodes
  • The module command was not available in PBS batch jobs; fixed since 2011-12-17 by patching /etc/profile to always source bashrc even in non-interactive shells
  • The environment variable $HOSTNAME was not defined. Fixed since 2011-12-19 via csh.cshrc.local and bash.bashrc.local.
  • SMT disabled on all nodes (since 2011-12-19). All visible cores are physical cores.
  • qsub is now generally wrapped – but that that should be completely transparent for users (2012-01-16).
  • /wsfs = $FASTTMP is now available, too (2012-01-23)

Configuration notes:

  • The additional nodes are named w10xx.
  • The base operating system is Ubuntu 10.04 LTS. SuSE SLES10 as on the rest of Woody.
    • The diskless images provisioned using Perceus. Autoinstall + cfengine
    • This is different to the rest of Woody which has stateful SuSE SLES10SP4.
    • However, Tiny* for example also uses Ubuntu 10.04 (but in a stateful installation) and binaries should run on SLES and Ubuntu without recompilation.
  • The w10xx nodes have python-2.6 while the other w0xxx nodes have python-2.4. You can load the python/2.7.1 module to ensure a common Python environment.
  • compilation of C++ code on the compute nodes using one of RRZE’s gcc modules will probably fail; however, we never guaranteed that compiling on any compute nodes works; either use the system g++, compile on the frontend nodes, or …
  • The PBS daemon (pbs_mom) running on the additional nodes is much newer than on the old Woody nodes (2.5.9 v.s. 2.3.x?); but the difference should not be visible for users.
  • Each PBS job runs in a cpuset. Thus, you only have access to the CPUs assigned to you by the queuing system. Memory, however, is not partitioned. Thus, make sure that you only use less than 2 GB per requested core as memory constraints cannot be imposed.
  • As the w10xx nodes currently do not have any local harddisk, they are also operated without swap. Thus, the virtual address space and the physically allocated memory of all processes must not exceed 7.2 GB in total. Also /tmp and /scratch are part of the main memory. Stdout and stderr of PBS jobs are also first spooled to main memory before they are copied to the final destination after the job ended.
  • multi-node jobs are not supported as the nodes are a throughput component

Queue configuration / how to submit jobs:

  • The old w0xx nodes got the properties :c2 (as they are Intel Core2-based) and :any.
    The addition w10xx nodes got the properties :sb (as they are Intel SandyBridge-based) and :any.
  • Multi-node jobs (-lnodes=X:ppn=4 or -lnodes=X:ppn=4:c2 with X>1) are only eligible for the old w0xx nodes. :c2 will be added automatically if not present.
    Multi-node jobs which ask for :sb or :any are rejected.
  • Single-node jobs (-lnodes=1:ppn=4) by default also will only access the old w0xx nodes, i.e. :c2 will be added automatically if no node property is given. Thus, -lnodes=1:ppn=4 is identical to requesting -lnodes=1:ppn=4:c2.
    Single-node jobs which specify :sb (i.e. -lnodes=1:ppn=4:sb) will only go to the new w10xx nodes.
    Jobs with :any (i.e. -lnodes=1:ppn=4:any) will run on any available node.
  • Single-core jobs (-lnodes=1:ppn=Y:sb with Y<4, i.e. requesting less than a complete node) are only supported on the new w10xx nodes. Specifying :sb is mandatory.

Technical details:

  • PBS routing originally did not work as expected for jobs where the resource requests are given on the command line (e.g. qsub -lnodes=1:ppn=4 job.sh caused trouble).
    Some technical background: (1) the torque-submitfilter cannot modify the resource requests given on the command line and (2) routing queues cannot add node properties to resource requests any more, thus, for this type of job routing to the old nodes does not seem to be possible … Using distinct queues for the old and new nodes has the disadvantage that jobs cannot ask for “any available CPU”. Moreover, the maui scheduler does not support multi-dimensional throttling policies, i.e. has problems if one user submits jobs to different queues at the same time.
    The solution probably is a wrapper around qsub as suggested in the Torque mailinglist back in May 2008. At RRZE we already use qsub-wrappers for e.g. qsub.tinyblue. Duplicating some of the logic of the submit filter into the submit wrapper is not really elegant but seems to be the only solution right now. (As a side node: interactive jobs do not seem to suffer from the problem as there is special handling in the qsub source code which writes the command line arguments to a temporary file which is subject to processing by the submit filter.)

Recipe for building OpenFOAM-1.7.1 with Intel Compilers and Intel MPI

Compared with other software, installing OpenFOAM is (still) a nightmare. They use their very own build system, there are tons of environment variables to set, etc. But it seems that users in academia and industry accept OpenFOAM nevertheless. For release 1.7.1, I took the time to create a little receipt (in some parts very specifically tailored to RRZE’s installation of software packages) to more or less automatically build OpenFOAM and some accompanying Third Party packages from scratch using the Intel Compilers (icc/icpc) and Intel MPI instead of Gcc and Open MPI (only Qt and Paraview are still built using gcc). The script is provided as-is without any guarantee that it works elsewhere and of course also without any support. The script assumes that the required source code packages have already been downloaded. Where necessary, the unpacked sources are patched and the compilation commands are executed. Finally, two new tar balls are created which contain the required “output” for a clean binary installation, i.e. intermediate output files (e.g. *.dep) are not included …

Compilation takes ages, but that’s not really surprising. Only extracting the tar balls with the sources amounts to 1.6 GB in almost 45k files/directories. After compilation (although neither Open MPI nor Gcc are built) the size is increased to 6.5 GB or 120k files. If all intermediate compilation files are removed, there are still about 1 GB or 30k files/directories remaining in my “clean installation” (with only the Qt/ParaView libraries/binaries in the ThirdParty tree).

RRZE users find OpenFOAM-1.7.1 as module on Woody and TinyBlue. The binaries used for Woody and TinyBlue are slightly different as both were natively compiled on SuSE SLES 10SP3 and Ubuntu 8.04, respectively. The main difference should only be in the Qt/Paraview part as SLES10 and Ubuntu 8.04 come with different Python versions. ParaView should also be compiled with MPI support.

Note (2012-06-08): to be able to compile src/finiteVolume/fields/fvPatchFields/constraint/wedge/wedgeFvPatchScalarField.C with recent versions of the Intel compiler, one has to patch this file to avoid an no instance of overloaded function “Foam:operator==” matches the argument list error message; cf. http://www.cfd-online.com/Forums/openfoam-installation/101961-compiling-2-1-0-rhel6-2-icc.html and https://github.com/OpenFOAM/OpenFOAM-2.1.x/commit/8cf1d398d16551c4931d20d9fc3e42957d0f93ca. These links are for OF-2.1.x but the fix works for OF-1.7.1 as well.

slight change of the Intel compiler modules: MKL modules loaded automatically

Starting from today, our intel64 compiler modules for version 11.0 and 11.1 of the Intel compilers will automatically load the MKL module corresponding to the bundled version in the Intel Processional Compilers, too.

  • If you do not use MKL, this change will not affect you.
  • If you want (or have) to use a different version of MKL, you have to load the MKL module before loading the compiler module OR you have to unload the automatically loaded MKL module before loading your desired version.

Motivation for this change was to make the compiler option -mkl work properly (or at least better), i.e. give the user a reasonable chance to find the MKL libraries at runtime. You still have to load the compiler (or mkl) module at runtime OR you have to add -Wl,-rpath,$MKLPATH explicitly to your linker arguments …

Defaults or special options of different MPI implementations to remember

More and more MPI implementations try to scope with the increasing complexity of modern cluster nodes. However, sometimes these automatic defaults are counterproductive when trying to pinpoint special effects …

Things to remember:

  • Open MPI:
  • Mvapich2:
    • recent version of mvapich2 default to MV2_ENABLE_AFFINITY=1 which enables internal pinning and overwrites any taskset given on the command line. Either use MV2_CPU_MAPPING (containing colon-separated CPU numbers) to make the correct pinning or set MV2_ENABLE_AFFINITY=0 and do external pinning. RRZE’s mpirun wrapper (as of Feb. 2010) disables the internal mvapich2 affinity routines to make our -pin argument work (again).
    • pinning of hybrid codes (mvapich2 + OpenMP) using RRZE’s pin_omp: PINOMP_SKIP=1,2 (cf. http://www.blogs.uni-erlangen.de/JohannesHabich/stories/3414/#4721)
      The resulting command line can be rather length: e.g. env PINOMP_SKIP=1,2 OMP_NUM_THREADS=4 KMP_AFFINITY=disabled mpirun -npernode 2 -pin 0,1,2,3_4,5,6,7 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so ./a.out
  • Intel MPI:
    • starting large jobs over Infiniband: using the socket connection manager (SCM) instead of the default connection manager (CMA) might improve the start-up sequence. One might try env I_MPI_DEVICE=rdssm:OpenIB-mthca0-1 on Woody and TinyGPU, env I_MPI_DEVICE=rdssm:ofa-v2-mthca0-1 on Townsend or I_MPI_DEVICE=rdssm:OpenIB-mlx4_0-1 on TinyBlue or TinyFat.
      (The newer syntax to use together with I_MPI_FABRICS is I_MPI_DAPL_PROVIDER or I_MPI_DAPL_UD_PROVIDER.)
      Thanks to Intel for pointing out this OFED/Intel-MPI option. Unfortunately, there is only very little OFED documentation on the differences between CMA and SCM: http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_release_notes.txt (Release notes from December 2009 for OFED-1.5).
      UCM might be even more scalable; use I_MPI_DAPL_UD=enable I_MPI_FABRICS=shm:dapl. — on Townsend, Woody and Tiny{Blue,FAT,GPU} UCM is available if the experimental “dapl” moule is loaded before any Intel MPI module.
    • bug (at least) in Intel MPI 3.1.0.038: connections to MPD daemons may fail if $TMPDIR is set to something different than /tmp
    • bug in Intel MPI 4.0.0.025: Intel’s mpirun as shortcut for mpdboot - mpiexec - mpdallexit cannot read properly from STDIN; the problem is the newly with this release introduced sequence of mpiexec "$@" & wait $! instead of the traditional mpiexec "$@". RRZE’s PBS-based mpirun is not affected by this problem.
    • incompatibilities between RRZE’s PBS-based mpirun (i.e. Pete Wyckoff’s mpiexec) and I_MPI_DEVICE=rdssm: from time to time we observe that processes hang either during startup (at MPI_Init or at the first larger scale communication) or even more often at MPI_Finalize. I_MPI_DEVICE=rdma usually does not have these problems but is noticeable slower for certain applications. As the behavior is non-deterministic, it’s hard/impossible to debug and as the PBS-based start mechanism is not supported by Intel we cannot file a problem report neither. And of course Intel’s extensions of the PMI startup protocol also not publically documented …
    • you have recent Qlogic Infiniband HCA? I_MPI_FABRICS=tmi I_MPI_TMI_PROVIDER=psm should be appropriate (PS: I_MPI_FABRICS is the replacement for I_MPI_DEVICE introduced with Intel-MPI 4.0)
  • HP-MPI:
    • The environment variable MPIRUN_OPTIONS can be used to pass special options to HP-MPI, e.g. "-v -prot" to be verbose and report the used interconnect; –"-TCP -netaddr 10.188.84.0/255.255.254.0" to force TCP and select a specific interface (e.g. for IPoIB).
    • HP-MPI has its own pinning mechanism; cf. page 16/17 of /apps/HP-MPI/latest/doc/hp-mpi.02.02.rn.pdf dor details; MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,2,1,3" should be fine for regular jobs on Woody. If you require a lot of memory and would like to use only one MPI process per socket, the correct option should be MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,1"
  • Parastation MPI as e.g. used on JUROPA of FZ-Jülich
    • Any pointer to pinning mechanisms for processes would highly be appreciated.
    • The latest versions (cf. psmgmt 5.0.32-0) have the environment variable __PSI_CPUMAP which can be set to individual cores or core ranges, e.g. "0,2,4,6,1,3,5,7" or "0-3,7-4".
  • to be continued

minor modifications to RRZE’s mpirun wrapper

For certain applications and processor numbers, there were problems with hanging MPI processes in the startup phase when using Intel MPI together with PBS’s mpiexec and I_MPI_DEVICE=rdssm. Therefore, we changed the default I_MPI_DEVICE for Woody, Townsend and TinyBlue from rdssm to rdma. In practice, one should not see any difference as rdssm with recent versions of Intel MPI did not use shm but rdma also within the node.

Single node jobs on Woody and TinyBlue using Intel MPI and our mpirun wrapper will from now on use shm by default if exactly one node was requested for the PBS jobs.

additional settings for mvapich2:

To make our -pin argument work with recent mvapich2 versions, we recently added a default of MV2_ENABLE_AFFINITY=0 to our mpirun wrapper.

combining several sequential jobs in one PBS job to fill a complete node

Sometimes trivial parallelism is the most efficient way to parallelize work, e.g. for parameter studies with a sequential program. If only complete nodes may be allocated on a certain cluster, several sequenatial runs can very easily be bundled into one job file:

[shell]
#!/bin/bash -l
# allocate 1 nodes (4 CPUs) for 8 hours
#PBS -l nodes=1:ppn=4,walltime=08:00:00
# job name
#PBS -N xyz
# first non-empty non-comment line ends PBS options

# jobs always start in HOME
# but we want to go to the directory where we submitted the job
cd $PBS_O_WORKDIR

# run 4 sequential parameter studies in parallel and bind eachone
# to a specific core
(taskset -c 0 ./a.out input1.dat ) &
(taskset -c 1 ./a.out input2.dat ) &
(taskset -c 2 ./a.out input3.dat ) &
(taskset -c 3 ./a.out input4.dat ) &

# wait for all background processes to finish (“wait” is a bash built-in)
wait
[/shell]

The bash builtin wait ensures that all background processes have finished once wait returns.

For this to work efficiently, of course all parameter runs should take about the same time …

Intel compiler and -mcmodel=...

As it seems to be come a FAQ – although it is documented as a small note in the ifort man page: if the 64-bit Intel compilers (for EM64T/Opteron but not for IA64) are used and statically allocated data (e.g. in Fortran common blocks) exceeds 2GB, the -mcmodel=medium or -mcmodel=large switches must be used. As a consequence, the -i-dynamic or -shared-intel flag depending on the compiler version must also be specified for the linking step otherwise strange error messages about relocation truncated occur for libifcore.a routines during linking.

fork and OFED Infiniband stack

Attention: OFED disallows system(const char*) or fork/exec after initializing the Infiniband libraries. Some documentation mentions about this:
… the Mellanox InfiniBand driver has ssues with buffers sharing pages when fork() is used. Pinned (locked in memory) pages are normally marked copy-on-write during a fork. If a page is pinned before a fork and subsequently written to while RDMA operations are being performed on the same page, silent data corruption can occur as RDMA operations continue to stream data to a page that has moved. To avoid this, the Mellanox driver does not use copy-on-write behavior during a fork for pinned pages. Instead, access to these pages by the child process will result in a segmentation violation.
Fork support from kernel 2.6.12 and above is available provided that applications do not use threads. The fork() is supported as long as parent process does not run before child exits or calls exec(). The former can be achieved by calling wait(childpid) the later can be achieved by application specific means. Posix system() call is supported.

Woody is running a SuSE SLES9 kernel, i.e. 2.6.5. Thus, no support for fork and similar things!

Some users already hit this problem! Even a Fortran user who had call system('some command') in his code! In the latter case, the application just hang in some (matching) MPI_send/MPI_recv calls.

building mvapich/mvapich2 for woody

Just as an internal information for ourselves: mvapich and mvapich2 heavily depend on the underlaying Infiniband stack. Therefore, both MPI libraries had to be recompiled while changing from Voltaire IBHOST-3.4.5 stack to GridStack 4.1.5 (aka OFED-1.1).

As mvapich/mvapich2 are statically linked into executables, probably those applications have to be recompiled, too.
On the other side, Intel-MPI (which is the default on woody) is not affected, thus, most of the users’ applications will not require recompilation.

The RRZE versions of mvapich-0.9.8 and mvapich2-0.9.8 were now compiled using:

module load intel64-c/9.1.049, intel64-f/9.1.045
export CC=icc
export CXX=icpc
export F77=ifort
export F90=ifort
export ARCH=_EM64T_
export OPEN_IB_HOME=/usr/local/ofed
export PREFIX=/apps/mvapich2/0.9.8-gridstack-4.1.5
./make.mvapich[2].gen2