Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

“-perhost #” option in Intel’s mpiexec.hydra is broken

On RRZE’s HPC systems, you always have to allocate multiples of complete compute nodes (e.g. ppn=24 for LiMa). However, as SMT is enabled on all systems supporting it, it is quite common to run fewer MPI processes per node than there are entries in $PBS_NODEFILE. Intel’s mpiexec.hydra does have a “-perhost #” option which is supposed to do exactly that. Well, reality is different. Starting with version 4.1.0.030 (version 4.1.0.024 was still o.k.), Intel ignores the “-perhost” command line argument if mpiexec.hydra detects that it is started within a job of the batch system.

Here is the answer from Intel’s support:

“I reported the case to our engineering team, this is a known issue to us.
Overriding the Resource Manager process distribution with -perhost option (-ppn) is unsafe; we had customer’s issue where Resource Manager just kills processes if we try to put them to non-allocated cores.
Currently we have no “workaround” to enable the -perhost option.
However, we are considering to add some functionality in order to allow users to force the IMPI runtime to overwrite the Resource Manager distribution with its own parameters.”

What does that mean? Well, you cannot use Intel’s mpiexec.hydra together with “-perhost” on RRZE’s clusters (unless you fiddle around with a custom hostfile manually. And never ever try to modify the contents of $PBS_NODEFILE! You will mess up the batch system completely. If you really must generate your own specific hostfile, always create a new file and delete it once you no longer need it.)

If you nevertheless just use Intel’s mpiexec.hydra, your first nodes will always be filled completely (e.g. with 24 processes on LiMa) and, if you specified some -perhost value, the last nodes will just sit idle doing nothing. E.g. in case of LiMa and -perhost 12: half of the nodes you requested will be idle while the other half gets overloaded. RRZE may kill such jobs wasting resources without further warning.

RRZE’s recommendation for typical users is to use /apps/rrze/bin/mpirun (or, on Emmy, mpirun_rrze, which will be in the search path once an intelmpi module is loaded) together with proper pinning of processes.

 

Update Feb. 2014:

With Intel MPI 4.1.03.048, Intel introduced a new environment variable (I_MPI_JOB_RESPECT_PROCESS_PLACEMENT) which, if manually set to OFF, makes mpiexec.hydra respect -perhost again. This new Intel MPI version will soon be the default on all RRZE clusters and the module file will set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=OFF for you.
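
For illustration, a minimal job-script sketch based on the above (the binary name and the process count per node are just placeholders):

    module load intelmpi
    # only relevant with Intel MPI >= 4.1.03.048; set manually unless the module already does it
    export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=OFF
    # start 12 processes per node instead of one per $PBS_NODEFILE entry
    mpiexec.hydra -perhost 12 ./a.out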

Xserver/VirtualGL on NVidia K20m GPGPUs

As delivered, our NVidia K20m GPGPUs were in GPU Operation Mode 1/COMPUTE. If one tries to start the Xorg server, /var/log/Xorg.0.log shows:

    [   134.516] (II) NVIDIA: Using 4095.62 MB of virtual memory for indirect memory
    [   134.516] (II) NVIDIA:     access.
    [   134.523] (EE) NVIDIA(0): Failed to allocate 2D engine
    [   134.523] (EE) NVIDIA(0):  *** Aborting ***
    [   134.523] (EE) NVIDIA(0): Failed to allocate 2D objects
    [   134.523] (EE) NVIDIA(0):  *** Aborting ***
    [   134.619] Fatal server error:
    [   134.619] AddScreen/ScreenInit failed for driver 0

After setting the GPU Operation Mode to 0/ALL_ON using nvidia-smi --gom=0 and rebooting the node, starting the Xorg server succeeds (and VirtualGL should work for remote visualization).
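
The steps in short (the grep is just a convenience filter on the nvidia-smi query output):

    nvidia-smi --gom=0                             # request GPU Operation Mode 0/ALL_ON (pending until reboot)
    reboot                                         # the new mode only becomes active after a reboot
    nvidia-smi -q | grep -A2 -i "operation mode"   # verify current and pending operation mode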

According to the documentation for NVidia driver 325.15, setting the GPU Operation Mode is only supported on  “GK110 M-class and X-class Tesla products from the Kepler family”.

Cluster2013: Racks for “Emmy” cluster delivered

On Monday, July 29, eleven new water cooled racks for our new cluster have been delivered and brought to their final position in the server room which is now quite full.

View from the entrance on one of the two rows of the new water cooled racks of the “Emmy” cluster which connect to the old “Woody” cluster in the front and which is close to the “LiMa” cluster in the back.

Eleven new water cooled racks of the “Emmy” cluster.

"Cool door" of the new water cooled racks of the "Emmy" cluster.

“Cool door” of the new water cooled racks of the “Emmy” cluster.

However, there is no firm date yet for when the new racks will be connected to the cold water supply, nor for when the compute hardware will be mounted. Let’s hope …

No free lunch – dying the death of parallelism

The following table gives a short overview of the main HPC systems at RRZE and the development of number of cores, clock frequency, peak performance, etc.:

| | original Woody (w0xxx) | w10xx | w11xx | w12xx + w13xx | LiMa | Emmy | Meggie |
| Year of installation | Q1/2007 | Q1/2012 | Q4/2013 | Q1/2016 + Q4/2016 | Q4/2010 | Q4/2013 | Q4/2016 |
| total number of compute nodes | 222 | 40 | 72 | 8 + 56 | 500 | 560 | 728 |
| total number of cores | 888 | 160 | 288 | 32 + 224 | 6000 | 11200 | 14560 |
| double precision peak performance of the complete system | 10 TFlop/s | 4.5 TFlop/s | 15 TFlop/s | 1.8 + 12.5 TFlop/s | 63 TFlop/s | 197 TFlop/s | ~0.5 PFlop/s (assuming the non-AVX base frequency as AVX turbo frequency) |
| increase of peak performance of the complete system compared to Woody | 1.0 | 0.4 | 0.8 | 0.2 + 1.25 | 6.3 | 20 | 50 |
| max. power consumption of compute nodes and interconnect | 100 kW | 7 kW | 10 kW | 2 + 6.5 kW | 200 kW | 225 kW | ~200 kW |
| Intel CPU generation | Woodcrest | SandyBridge (E3-1280) | Haswell (E3-1240 v3) | Skylake (E3-1240 v5) | Westmere-EP (X5650) | IvyBridge-EP (E5-2660 v2) | Broadwell-EP (E5-2630 v4) |
| base clock frequency | 3.0 GHz | 3.5 GHz | 3.4 GHz | 3.5 GHz | 2.66 GHz | 2.2 GHz | 2.2 GHz |
| number of sockets per node | 2 | 1 | 1 | 1 | 2 | 2 | 2 |
| number of (physical) cores per node | 4 | 4 | 4 | 4 | 12 | 20 | 20 |
| SIMD vector length | 128 bit (SSE) | 256 bit (AVX) | 256 bit (AVX+FMA) | 256 bit (AVX+FMA) | 128 bit (SSE) | 256 bit (AVX) | 256 bit (AVX+FMA) |
| maximum single precision peak performance per node | 96 GFlop/s | 224 GFlop/s | 435 GFlop/s | 448 GFlop/s | 255 GFlop/s | 704 GFlop/s | 1408 GFlop/s |
| peak performance per node compared to Woody | 1.0 | 2.3 | 4.5 | 4.5 | 2.6 | 7.3 | 14.7 |
| single precision peak performance of serial, non-vectorized code | 6 GFlop/s | 7.0 GFlop/s | 6.8 GFlop/s | 7.0 GFlop/s | 5.3 GFlop/s | 4.4 GFlop/s | 4.4 GFlop/s |
| performance of unoptimized serial code compared to Woody | 1.0 | 1.17 | 1.13 | 1.17 | 0.88 | 0.73 | 0.73 |
| main memory per node | 8 GB | 8 GB | 8 GB | 16 GB / 32 GB | 24 GB | 64 GB | 64 GB |
| memory bandwidth per node | 6.4 GB/s | 20 GB/s | 20 GB/s | 25 GB/s | 40 GB/s | 80 GB/s | 100 GB/s |
| memory bandwidth compared to Woody | 1.0 | 3.1 | 3.1 | 3.9 | 6.2 | 13 | 15.6 |

If one only looks at the increase in peak performance of the complete Emmy system, the world is bright: a 20x increase in 6 years. Not bad.

However, if one has an unoptimized (i.e. non-vectorized) serial code which is compute bound, its speed on the latest system arriving in 2013 will only be 73% of what it was on the system bought in 2007! Such a code can neither benefit from the wider SIMD units nor from the increased number of cores per node, but it suffers from the decreased clock frequency.

But optimized parallel codes are challenged as well: the degree of parallelism increased from Woody to Emmy by a factor of 25. You remember Amdahl’s law (link to Wikipedia) for strong scaling? To scale up to 20 parallel processes, it is enough if 95% of the runtime can be executed in parallel, i.e. 5% may remain non-parallel. To scale up to 11200 processes, less than 0.01% may be executed non-parallel, and there must be no other overhead, e.g. due to communication or halo exchange!
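
For reference, the Amdahl estimate behind these numbers (S is the speedup, s the serial, i.e. non-parallelizable, fraction of the runtime, N the number of processes):

    S(N) = \frac{1}{s + (1-s)/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s}

    s = 0.05 \;\Rightarrow\; S_{\max} = 20, \qquad
    S_{\max} \ge 11200 \;\Rightarrow\; s \le \tfrac{1}{11200} \approx 0.009\,\%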

Final report of the BMBF project SKALB

Funding for the SKALB project (lattice Boltzmann methods for scalable multi-physics applications), granted in the first call for research proposals in the field of “HPC software for scalable parallel computers” within the BMBF funding programme IKT 2020 – Forschung für Innovationen, ended on 31 December 2011. The official final report is now finished and can be downloaded here as a PDF file. The final report should soon also be available via the Technische Informationsbibliothek (TIB) in Hanover.

It is particularly impressive that the SKALB funding has so far resulted in more than 60 publications and nearly 100 talks at scientific meetings.

Why cfx5solve from Ansys-13.0 fails on SuSE SLES11SP2 …

Recently, the operating system of one of RRZE’s HPC clusters was upgraded from SuSE SLES10 SP4 to SuSE SLES11 SP2 … one of the few things which broke due to the OS upgrade is Ansys/CFX-13.0. cfx5solve now aborts with

    ccl2flow: * command language error *
    Message: getChildList: unable to find the requested path
    Context: returned by cclApi call

As one might expect, Ansys does not support running Ansys-13.0 on SuSE SLES11 or SLES11 SP2. There are also lots of reports on this error for various unsupported OS versions in the CFX forum at cfd-online, but no explanations or workarounds yet.

So, where does the problem come from? A long story starts …

First guess: SuSE SLES11 SP2 runs a 3.0 kernel. Thus, there might be some script which does not correctly parse the output of uname or the like. However, the problem persists if cfx5solve is run under uname26 (or the equivalent long setarch variant). On the other hand, the problem does not occur if e.g. a CentOS-5 chroot is started on the SLES11 SP2 kernel, i.e. the same kernel but old user space. This clearly indicates that it is not a kernel issue but some library or tool problem.
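
A sketch of that test (the solver arguments are just placeholders):

    # fake a 2.6.x kernel version for the solver run -> error persists
    uname26 cfx5solve -def case.def
    # equivalent long form via setarch
    setarch x86_64 --uname-2.6 cfx5solve -def case.def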

Next guess: Perl comes bundled with Ansys/CFX but it might be some other command line tool from the Linux distribution which is used by cfx5solve, e.g. sed and friends or some changed bash behavior. Using strace on cfx5solve reveals several calls of such tools. But actually, none of them is problematic.
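
One way to check which external tools get called (standard strace options; the output file name is arbitrary):

    # follow child processes and log every execve() call to a file
    strace -f -e trace=execve -o cfx5solve.strace cfx5solve -def case.def
    # list the programs that were executed
    grep execve cfx5solve.strace | less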

Thus, it must be a library issue: Ansys/CFX bundles most of the libraries it needs, but there is always the glibc, i.e. /lib64/ld-linux-x86-64.so.2, /lib64/libc.so.6, etc. SuSE SLES10 used glibc-2.4, RHEL5 uses glibc-2.5, but SLES11 SP2 uses glibc-2.11.3.

The glibc cannot simply be overridden using LD_LIBRARY_PATH like any other library. But there are ways to do it anyway …
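
One such way is to invoke the dynamic loader of the old glibc directly (a minimal sketch; the path of the unpacked CentOS-5 glibc tree is an assumption):

    # run an arbitrary binary with the loader and libc from an old glibc tree
    OLDGLIBC=/path/to/centos5-glibc
    $OLDGLIBC/lib64/ld-linux-x86-64.so.2 --library-path $OLDGLIBC/lib64:$LD_LIBRARY_PATH \
        /path/to/binary arg1 arg2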

The error message suggests that ccl2flow.exe is causing the problems. So, let’s run that with an old glibc version. As cfx5solve allows specifying a custom ccl2flow binary we can use a simple shell script to call the actual ccl2flow.exe using the loader and glibc libraries from the CentOS5 glibc-2.5. Nothing changes; still the very same getChildList error message in the out file. Does that mean that ccl2flow.exe is not the bad guy?

Interlude: Let’s see how ccl2flow.exe is called. The shell wrapper for ccl2flow was already there, thus, let’s add some echo statements to print the command line arguments and a sleep statement to allow inspecting the working directory. Et voilà: on a good system, a quite long ccl file has just been created before ccl2flow is called; on a bad system running the new OS, however, the ccl file is almost empty. Thus, we should not blame ccl2flow.exe but what happens before. Well, before, there is just the Ansys-supplied perl running.

Let’s have a closer look at the Perl script: understanding what the cfx5solve Perl script does seems to be impossible. Even if the Perl script is traced on a good and a bad system, there are no real insights. At some point, the bad system does not return an object while the good one does. Thus, let’s run perl itself using the old glibc version. That’s a bit more tricky, as cfx5solve is not a binary but a shell script which calls another shell script before finally calling an Ansys-supplied perl binary. But one can manage these additional difficulties, too. Et voilà, the error message disappears. What’s going on? Perl is running fine but produces different results depending on the glibc version.

Interlude Ansys/CFX-14.0: this version is officially only supported on SuSE SLES11 but not on SLES11 SP2, if I got it correctly. But it runs fine on SLES11 SP2, too. Which Perl version do they use? Exactly the same version, even the very same binary (i.e. both binaries have the same checksum!). Thus, it is not Perl itself but some CFX-specific library it dynamically loads.

End of the story? Not yet, but almost. Having already spent so much time on the problem, I finally wanted to know which glibc versions are good and which are evil. I already knew that Redhat’s glibc-2.5 is good and SuSE’s glibc-2.11.3 is evil. Thus, let’s try the versions in between using the official sources from ftp.gnu.org/gnu/glibc. Versions below 2.10 or so require a fix for the configure script to recognize a modern as or ld as a good version. A few versions do not compile properly at all on my system. But there is no bad version; even with 2.11.3 there is no CFX error. Only from glibc-2.12.1 onwards does the well-known ccl2flow error appear. Not really surprising: SuSE and other Linux distributors have long lists of patches they apply, including back-ports from newer releases. There are almost 100 SuSE patches included in their version of glibc-2.11.3-17.39.1; no chance to see what they are doing.

My next guess was that the problem must be caused by a commit between 2.11.3 and 2.12.1 of the official glibc versions. GNU provides a Git repository, and git bisect is your friend. This leads to commit f89d2f30 from Dec. 2009: “Enable multiarch whenever possible”. This commit did not change any actual code but only the default configuration parameters. That means the code causing the fault must have been in the sources much earlier already; it only debuted once multi-arch was switched on, i.e. in 2.12.1 of the vanilla version or earlier in the SuSE version (the spec file contains an --enable-multi-arch line; verified).
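
In case someone wants to repeat such a hunt, a rough sketch of the bisect run (the exact tag names and the build-and-test script are assumptions, not taken from the original post):

    git clone git://sourceware.org/git/glibc.git && cd glibc
    # mark a bad and a good release tag
    git bisect start glibc-2.12.1 glibc-2.11.3
    # build-and-test-cfx.sh is a hypothetical script that builds glibc and runs the CFX test case,
    # returning 0 on success and non-zero when the ccl2flow error appears
    git bisect run ../build-and-test-cfx.sh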

Going back in history, it finally turns out that glibc commit ab6a873f from Jun 2009 (SSSE3 strcpy/stpcpy for x86-64) is responsible for the problems leading to the failing ccl2flow.

Unfortunately, it is not possible to check whether the most recent glibc versions still cause problems, as cfx5solve then aborts even earlier with a different error message (Can’t call method “numstring” on an undefined value).

It is also not clear whether it is a glibc bug, a problem in one of the CFX libraries, or whether it is just due to the tools used when Ansys-13.0 was compiled.

End of the story: if you are willing to take the risk of getting wrong results, you may make v130/CFX/tools/perl-5.8.0-1/bin/Linux-x86_64/perl use an older glibc version (or one compiled without multi-arch support) and thus avoid the ccl2flow error. But who knows what else fails visibly or behind the scenes. There remains an unknown risk of wrong results even if cfx5solve now runs in principle on SuSE SLES11 SP2.

I fully understand that users do not want to switch versions within a running project. Thus, it is really a pity that ISVs force users (and sysadmins) to run very old OS versions. SuSE SLES10 was released in 2006 and will reach the end of general support in July 2013; SLES11 was released in March 2009, while Ansys-13.0 was released only in autumn 2010. And we are still supposed to stick to SLES10? It’s time to increase the pressure on ISVs or to start developing in-house codes again.

Additional throughput nodes added to Woody cluster

Recently, 40 additional nodes with an aggregate AVX Linpack performance of 4 TFlop/s have been added to RRZE’s Woody cluster. The nodes were bought by RRZE and ECAP and shall provide additional resources especially for sequential and single-node throughput calculations. Each node has a single socket with one of Intel’s latest “SandyBridge” 4-core CPUs (Xeon E3-1200 series), 8 GB of main memory, currently no harddisk (and thus no swap), and GBit Ethernet.

Current status: most of the new nodes are available for general batch processing; the configuration and software environment have stabilized.

Open problems:

  • no known ones

User visible changes and solved problems:

  • End of April 2012: all the new w10xx nodes got their harddisk in the meantime and have been reinstalled with SLES10 to match the old w0xx nodes
  • The module command was not available in PBS batch jobs; fixed since 2011-12-17 by patching /etc/profile to always source bashrc even in non-interactive shells
  • The environment variable $HOSTNAME was not defined. Fixed since 2011-12-19 via csh.cshrc.local and bash.bashrc.local.
  • SMT disabled on all nodes (since 2011-12-19). All visible cores are physical cores.
  • qsub is now generally wrapped – but that should be completely transparent for users (2012-01-16).
  • /wsfs = $FASTTMP is now available, too (2012-01-23)

Configuration notes:

  • The additional nodes are named w10xx.
  • The base operating system initially was Ubuntu 10.04 LTS; since the reinstallation in April 2012 it is SuSE SLES10, as on the rest of Woody.
    • The nodes initially ran diskless images provisioned using Perceus; they are now installed via autoinstall + cfengine.
    • The original setup was different from the rest of Woody, which has a stateful SuSE SLES10 SP4 installation.
    • However, Tiny* for example also uses Ubuntu 10.04 (but in a stateful installation), and binaries should run on SLES and Ubuntu without recompilation.
  • The w10xx nodes have python-2.6 while the other w0xxx nodes have python-2.4. You can load the python/2.7.1 module to ensure a common Python environment.
  • compilation of C++ code on the compute nodes using one of RRZE’s gcc modules will probably fail; however, we never guaranteed that compiling on any compute nodes works; either use the system g++, compile on the frontend nodes, or …
  • The PBS daemon (pbs_mom) running on the additional nodes is much newer than on the old Woody nodes (2.5.9 vs. 2.3.x?); but the difference should not be visible to users.
  • Each PBS job runs in a cpuset. Thus, you only have access to the CPUs assigned to you by the queuing system. Memory, however, is not partitioned. Thus, make sure that you use less than 2 GB per requested core, as memory constraints cannot be imposed.
  • As the w10xx nodes currently do not have any local harddisk, they are also operated without swap. Thus, the virtual address space and the physically allocated memory of all processes must not exceed 7.2 GB in total. Also /tmp and /scratch are part of the main memory. Stdout and stderr of PBS jobs are also first spooled to main memory before they are copied to the final destination after the job ended.
  • multi-node jobs are not supported as the nodes are a throughput component

Queue configuration / how to submit jobs (see the examples after the list):

  • The old w0xx nodes got the properties :c2 (as they are Intel Core2-based) and :any.
    The additional w10xx nodes got the properties :sb (as they are Intel SandyBridge-based) and :any.
  • Multi-node jobs (-lnodes=X:ppn=4 or -lnodes=X:ppn=4:c2 with X>1) are only eligible for the old w0xx nodes. :c2 will be added automatically if not present.
    Multi-node jobs which ask for :sb or :any are rejected.
  • Single-node jobs (-lnodes=1:ppn=4) by default also will only access the old w0xx nodes, i.e. :c2 will be added automatically if no node property is given. Thus, -lnodes=1:ppn=4 is identical to requesting -lnodes=1:ppn=4:c2.
    Single-node jobs which specify :sb (i.e. -lnodes=1:ppn=4:sb) will only go to the new w10xx nodes.
    Jobs with :any (i.e. -lnodes=1:ppn=4:any) will run on any available node.
  • Single-core jobs (-lnodes=1:ppn=Y:sb with Y<4, i.e. requesting less than a complete node) are only supported on the new w10xx nodes. Specifying :sb is mandatory.
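
A few illustrative submissions following these rules (the job script name is a placeholder):

    qsub -lnodes=2:ppn=4 job.sh         # multi-node job; :c2 is added automatically -> old w0xx nodes only
    qsub -lnodes=1:ppn=4 job.sh         # single-node job; defaults to :c2, i.e. old w0xx nodes
    qsub -lnodes=1:ppn=4:sb job.sh      # single-node job on one of the new w10xx nodes
    qsub -lnodes=1:ppn=4:any job.sh     # single-node job on any available node
    qsub -lnodes=1:ppn=1:sb job.sh      # single-core job; only supported on the w10xx nodes (:sb mandatory)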

Technical details:

  • PBS routing originally did not work as expected for jobs where the resource requests are given on the command line (e.g. qsub -lnodes=1:ppn=4 job.sh caused trouble).
    Some technical background: (1) the torque-submitfilter cannot modify the resource requests given on the command line and (2) routing queues cannot add node properties to resource requests any more, thus, for this type of job routing to the old nodes does not seem to be possible … Using distinct queues for the old and new nodes has the disadvantage that jobs cannot ask for “any available CPU”. Moreover, the maui scheduler does not support multi-dimensional throttling policies, i.e. has problems if one user submits jobs to different queues at the same time.
    The solution probably is a wrapper around qsub, as suggested on the Torque mailing list back in May 2008. At RRZE we already use qsub wrappers, e.g. qsub.tinyblue. Duplicating some of the logic of the submit filter in the submit wrapper is not really elegant but seems to be the only solution right now. (As a side note: interactive jobs do not seem to suffer from the problem, as there is special handling in the qsub source code which writes the command line arguments to a temporary file which is then processed by the submit filter.)

STAR-CCM+ fails with “mpid: Not enough shared memory”

If STAR-CCM+ fails on large shared-memory nodes with the message “mpid: Not enough shared memory”, your sysadmin might need to increase the kernel limit SHMMAX (maximum size of a shared memory segment in bytes), i.e. sysctl -w kernel.shmmax=.... In particular, the Ubuntu/Debian default of 32 MB seems to be too small even for 2-socket nodes with 8-core AMD Opteron processors, i.e. 16 cores/node …
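
For illustration, a sketch using an example value (2 GB here; the appropriate value depends on the node and the case, so treat it as a placeholder):

    # raise the limit for the running kernel
    sysctl -w kernel.shmmax=2147483648
    # make the setting persistent across reboots
    echo "kernel.shmmax = 2147483648" >> /etc/sysctl.conf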