Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things


Early usage of Emmy


This article is no longer updated, as Emmy has meanwhile entered regular production.

Please check the official documentation available at


RRZE’s new HPC system “Emmy” is open to the very first early adopters …

Measured LINPACK performance: 191 TFlop/s (using the CPUs of all 560 nodes and enabling Turbo Mode) – #210 in the Top500 of November 2013!


  • 544 regular compute nodes (e[01]XXX), each with
    • 2 sockets with Intel “Ivy Bridge” 10-core CPUs (Intel Xeon E5-2660 v2 @ 2.20GHz), i.e. 20 physical cores per node showing up as 40 virtual cores thanks to SMT
    • 64 GB main memory (approx. 60 GB usable; DDR3-1600)
    • QDR Infiniband
    • no local harddisk
  • 16 nodes with accelerators (e1[01]6X)
    • 2 sockets with Intel “Ivy Bridge” 10-core CPUs (Intel Xeon E5-2660 v2 @ 2.20GHz), i.e. 20 physical cores per node showing up as 40 virtual cores thanks to SMT
    • 64 GB main memory (DDR3-1866)
    • QDR Infiniband
    • 1 TB local harddisk
    • either one or two Intel Xeon Phi “MIC” cards or NVidia Kepler K20m GPUs
  • 2 login nodes “”
  • parallel file system (approx. 440 TB capacity)

Login nodes:

  • SSH to the login nodes to compile and submit jobs.
  • all usual HPC file systems are available; Emmy has a new parallel file system which is only available on the new system and nowhere else.

Job submission:

  • Job submission is possible from the login nodes as well as from cshpc (using qsub.emmy in the latter case).
  • ppn=40 is mandatory; other limits are RRZE-typical (e.g. 1h for devel and 24h for work).
  • Single-node jobs are discouraged on the new system.
  • specific node types can be requested via their node properties (see below for special notes on the Xeon Phi nodes!):
    • :ddr1600 (544 nodes qualify)
    • :ddr1866 (16 nodes qualify)
    • :k20m (one or two NVidia Kepler cards; 10 nodes qualify)
    • :k20m1x (exactly one NVidia Kepler card; 4 nodes qualify)
    • :k20m2x (two NVidia Kepler cards; 6 nodes qualify)
    • :phi (one or two Xeon Phi “MIC” cards; 10 nodes qualify)
    • :phi1x (exactly one Xeon Phi; 4 nodes qualify)
    • :phi2x (two Xeon Phi; 6 nodes qualify)
  • specific clock frequencies can be requested; supported are: turbo, noturbo, f2.2, f2.1, f2.0, f1.9, f1.8, f1.7, f1.6, f1.5, f1.4, f1.3, f1.2.
    By default, “ondemand” scaling up to turbo (typically 2.6 GHz) is used.
  • The typical web pages showing the system status/utilization are available at
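
Concrete requests along these lines might look as follows. This is a sketch only: node counts, walltimes, and the job script name are placeholders, and it is assumed here that the clock-frequency keywords are requested as node properties just like the others; ppn=40 is the mandatory part from above.

```shell
# Hypothetical examples -- adapt node counts and walltimes to your job.
# ppn=40 is mandatory (20 physical cores, 40 virtual cores via SMT).
qsub -l nodes=4:ppn=40,walltime=24:00:00 job.sh           # 4 regular nodes
qsub -l nodes=2:ppn=40:ddr1866,walltime=24:00:00 job.sh   # nodes with DDR3-1866 memory
qsub -l nodes=1:ppn=40:k20m2x,walltime=01:00:00 job.sh    # a node with two K20m cards
qsub -l nodes=4:ppn=40:f2.2,walltime=24:00:00 job.sh      # fixed 2.2 GHz clock frequency
```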

Software/Modules/Running MPI jobs:

  • All users have BASH as login shell. No exceptions.
  • All major software and libraries should be available; since the last update, most software is also available in the same versions on LiMa.
  • Intel’s mpiexec.hydra was broken for quite some time (i.e. across several versions). Typical users should go with RRZE’s mpirun_rrze.
  • If an intelmpi module is loaded, mpirun_rrze is available in the search path. It learned a new pinning option for pure-MPI binaries: “-pinexpr EXPR”, where EXPR can be any expression likwid-pin understands, e.g. S0:0-9@S1:0-9 to select ten cores each on socket 0 and socket 1.
  • Intel MPI is recommended, but OpenMPI is available, too.
  • Sample job scripts for Ansys/CFX and STAR-CCM+ are available in /apps/Ansys/ and /apps/STAR-CCM+/ (visible only to those who are eligible for these commercial software packages). For Ansys/CFX there is a preview of version 15.0 available which should scale much better than version 14.5. Version 15.0pre3 right now uses a separate license (expired on Oct. 17th) which must only be used for benchmarking.
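
Putting the pieces together, a minimal Intel MPI batch script might look like this. It is a sketch under assumptions: the module names and the binary ./a.out are placeholders; the -pinexpr syntax is as described above.

```shell
#!/bin/bash -l
#PBS -l nodes=2:ppn=40,walltime=24:00:00
#PBS -N my-mpi-job

# load Intel compilers and Intel MPI (module names are illustrative)
module load intel64 intelmpi

cd "$PBS_O_WORKDIR"

# 20 MPI ranks per node: pin to the 10 physical cores of each of the two sockets,
# leaving the SMT threads free (likwid-pin expression syntax, see above)
mpirun_rrze -pinexpr S0:0-9@S1:0-9 ./a.out
```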

Current limitations: (last updated 2013-11-22)

  • Every HPC user can log into the login nodes of Emmy; but only specifically enabled accounts (“early adopters”) can submit jobs. Early adopters are by invitation only.
  • Jobs cannot be submitted from cshpc due to incompatibilities of PBS versions.
  • Everyone can only see his/her own jobs in the queue (and not also the jobs of other members of the group).
  • If there are other jobs, a single user cannot get more than 192 nodes or 48 running jobs. If there are no other jobs, up to 384 nodes or 96 running jobs are possible. (updated limits as of 2013-10-07)
  • pbsdsh (and other software which uses the PBS-TM API, e.g. mpiexec.pbs) does not work reliably.
  • if mpirun_rrze within a batch job fails with “… Too many logins …”, contact hpc@rrze and, for now, try adding a “sleep 5” before calling mpirun. (2013-11-05)
  • Runtime of jobs cannot be extended by sysadmins; jobs are always killed once their initial walltime request is exceeded. This might be fixed as of 2013-10-02.
  • Directories on the parallel file system ($FASTTMP = /elxfs) are not yet created automatically. Contact RRZE to get a directory.
  • Xeon Phi “MIC” was not usable at all initially; Xeon Phis should be usable since 2013-10-25 – the configuration is still improving; some additional remarks:
    • TODO: cross-compilation for Xeon Phi does NOT work yet (as of 2013-10-25) on the login nodes; request a phi node and compile there for now.
    • TODO: users are not yet synchronized regularly to the Xeon Phis (this has to be done manually by the admins by running /root/ on the management node)
    • Host-based SSH should now (2013-10-25) also work for the Xeon Phis, similarly as for the regular compute nodes (i.e. from within the compute nodes but not from the login nodes); thus, there should be no need to have password-less SSH keys around
    • offload-mode has not been tested yet; volunteers?
    • ATTENTION as of 2013-10-25: for Intel MPI between host and Phi or between Phis, explicitly use I_MPI_DEVICE=shm:dapl; the default device behaves strangely (i.e. it fails unless I_MPI_OFA_ADAPTER_NAME=mlx4_0 is set, which is now done automatically – and with I_MPI_OFA_ADAPTER_NAME=mlx4_0 set, performance between host and Phi is not really good)
    • NOTE as of 2013-10-25: I_MPI_OFA_ADAPTER_NAME=mlx4_0 is required on nodes with Xeon Phi for communication between the normal compute nodes; otherwise Intel MPI picks the wrong device. I_MPI_OFA_ADAPTER_NAME=mlx4_0 is now set automatically by the intelmpi module file. This setting hopefully does not have any drawbacks for normal jobs.
    • Jobs with property :phi / :phi1x / :phi2x do not start automatically. Contact hpc@rrze (2013-11-18)
  • The system is moving from pre-production to regular operation. Thus, still expect outages without prior announcement – but nevertheless, always report strange behaviour to hpc@rrze!
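
For jobs on the Xeon Phi nodes, the Intel MPI settings mentioned in the notes above amount to something like the following sketch (the intelmpi module may already set the adapter name automatically):

```shell
# host<->Phi and Phi<->Phi communication: force shared-memory + DAPL (see note above)
export I_MPI_DEVICE=shm:dapl
# explicitly name the InfiniBand adapter so Intel MPI does not pick the wrong device
export I_MPI_OFA_ADAPTER_NAME=mlx4_0
```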

Xserver/VirtualGL on NVidia K20m GPGPUs

As delivered, our NVidia K20m GPGPUs were in GPU Operation Mode 1/COMPUTE. If one tries to start the Xorg server, /var/log/Xorg.0.log shows:

    [   134.516] (II) NVIDIA: Using 4095.62 MB of virtual memory for indirect memory
    [   134.516] (II) NVIDIA:     access.
    [   134.523] (EE) NVIDIA(0): Failed to allocate 2D engine
    [   134.523] (EE) NVIDIA(0):  *** Aborting ***
    [   134.523] (EE) NVIDIA(0): Failed to allocate 2D objects
    [   134.523] (EE) NVIDIA(0):  *** Aborting ***
    [   134.619] Fatal server error:
    [   134.619] AddScreen/ScreenInit failed for driver 0

Using nvidia-smi --gom=0 and rebooting the node, the GPU Operation Mode can be set to 0/ALL_ON. Starting the Xorg server then succeeds (and VirtualGL should work for remote visualization).
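
The steps above, as a sketch (requires root on the GPU node; the --gom flag is as used above, and the query output format depends on the driver version):

```shell
nvidia-smi -q | grep -A 2 "GPU Operation Mode"   # inspect current/pending mode
nvidia-smi --gom=0                               # request mode 0/ALL_ON
# the new GPU Operation Mode only takes effect after a reboot of the node
reboot
```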

According to the documentation for NVidia driver 325.15, setting the GPU Operation Mode is only supported on “GK110 M-class and X-class Tesla products from the Kepler family”.

Cluster2013: Racks for “Emmy” cluster delivered

On Monday, July 29, eleven new water-cooled racks for our new cluster were delivered and brought to their final position in the server room, which is now quite full.

View from the entrance on one of the two rows of the new water cooled racks of the “Emmy” cluster which connect to the old “Woody” cluster in the front and which is close to the “LiMa” cluster in the back.

Eleven new water cooled racks of the “Emmy” cluster.

"Cool door" of the new water cooled racks of the "Emmy" cluster.

“Cool door” of the new water cooled racks of the “Emmy” cluster.

However, there is no clear date yet for connecting the new racks to the cold water supply pipe, nor for mounting the compute hardware. Let’s hope …

No free lunch – dying the death of parallelism

The following table gives a short overview of the main HPC systems at RRZE and the development of number of cores, clock frequency, peak performance, etc.:

| | original Woody (w0xxx) | w10xx | w11xx | w12xx + w13xx | LiMa | Emmy | Meggie |
|---|---|---|---|---|---|---|---|
| Year of installation | Q1/2007 | Q1/2012 | Q4/2013 | Q1/2016 + Q4/2016 | Q4/2010 | Q4/2013 | Q4/2016 |
| total number of compute nodes | 222 | 40 | 72 | 8 + 56 | 500 | 560 | 728 |
| total number of cores | 888 | 160 | 288 | 32 + 224 | 6000 | 11200 | 14560 |
| double precision peak performance of the complete system | 10 TFlop/s | 4.5 TFlop/s | 15 TFlop/s | 1.8 + 12.5 TFlop/s | 63 TFlop/s | 197 TFlop/s | ~0.5 PFlop/s (assuming the non-AVX base frequency as AVX turbo frequency) |
| increase of peak performance of the complete system compared to Woody | 1.0 | 0.4 | 0.8 | 0.2 + 1.25 | 6.3 | 20 | 50 |
| max. power consumption of compute nodes and interconnect | 100 kW | 7 kW | 10 kW | 2 + 6.5 kW | 200 kW | 225 kW | ~200 kW |
| Intel CPU generation | Woodcrest | SandyBridge (E3-1280) | Haswell (E3-1240 v3) | Skylake (E3-1240 v5) | Westmere-EP (X5650) | IvyBridge-EP (E5-2660 v2) | Broadwell-EP (E5-2630 v4) |
| base clock frequency | 3.0 GHz | 3.5 GHz | 3.4 GHz | 3.5 GHz | 2.66 GHz | 2.2 GHz | 2.2 GHz |
| number of sockets per node | 2 | 1 | 1 | 1 | 2 | 2 | 2 |
| number of (physical) cores per node | 4 | 4 | 4 | 4 | 12 | 20 | 20 |
| SIMD vector length | 128 bit (SSE) | 256 bit (AVX) | 256 bit (AVX+FMA) | 256 bit (AVX+FMA) | 128 bit (SSE) | 256 bit (AVX) | 256 bit (AVX+FMA) |
| maximum single precision peak performance per node | 96 GFlop/s | 224 GFlop/s | 435 GFlop/s | 448 GFlop/s | 255 GFlop/s | 704 GFlop/s | 1408 GFlop/s |
| peak performance per node compared to Woody | 1.0 | 2.3 | 4.5 | 4.5 | 2.6 | 7.3 | 14.7 |
| single precision peak performance of serial, non-vectorized code | 6 GFlop/s | 7.0 GFlop/s | 6.8 GFlop/s | 7.0 GFlop/s | 5.3 GFlop/s | 4.4 GFlop/s | 4.4 GFlop/s |
| performance for unoptimized serial code compared to Woody | 1.0 | 1.17 | 1.13 | 1.17 | 0.88 | 0.73 | 0.73 |
| main memory per node | 8 GB | 8 GB | 8 GB | 16 GB / 32 GB | 24 GB | 64 GB | 64 GB |
| memory bandwidth per node | 6.4 GB/s | 20 GB/s | 20 GB/s | 25 GB/s | 40 GB/s | 80 GB/s | 100 GB/s |
| memory bandwidth compared to Woody | 1.0 | 3.1 | 3.1 | 3.9 | 6.2 | 13 | 15.6 |

If one looks only at the increase of peak performance of the complete Emmy system, the world is bright: a 20x increase in six years. Not bad.

However, if one has an unoptimized (i.e. non-vectorized) serial code which is compute bound, its speed on the latest system coming in 2013 will only be 73% of that on the one bought in 2007! Such code can neither benefit from the wider SIMD units nor from the increased number of cores per node, but it does suffer from the decreased clock frequency.

But optimized parallel codes are challenged, too: the degree of parallelism increased from Woody to Emmy by a factor of 25 (counting single precision SIMD lanes: 888 cores × 4-wide SSE vs. 11200 cores × 8-wide AVX). You remember Amdahl’s law (link to Wikipedia) for strong scaling? To scale up to 20 parallel processes, it is enough if 95% of the runtime can be executed in parallel, i.e. 5% may remain serial. To scale up to 11200 processes, less than 0.01% may be executed serially – and there must be no other overhead, e.g. due to communication or halo exchange!
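
To put numbers on this, Amdahl’s law gives speedup(N) = 1 / (s + (1 − s)/N) for a serial fraction s on N processes; a quick check:

```shell
# Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N), with serial fraction s.
awk 'BEGIN {
    s = 0.05; printf "max speedup with 5%% serial: %g\n", 1 / s
    s = 0.0001; N = 11200
    printf "speedup of %d processes with 0.01%% serial: %.0f\n", N, 1 / (s + (1 - s) / N)
}'
```

So a 5% serial fraction caps the speedup at 20 regardless of core count, and even a 0.01% serial fraction already costs roughly half of the ideal speedup on 11200 processes.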