Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Additional throughput nodes added to Woody cluster

Recently, 40 additional nodes with an aggregated AVX-Linpack performance of 4 TFlop/s have been added to RRZE’s Woody cluster. The nodes were bought by RRZE and ECAP and shall provide additional resources especially for sequential and single-node throughput calculations. Each node has a single socket with one of Intel’s latest “SandyBridge” 4-core CPUs (Xeon E3-1200 series), 8 GB of main memory, currently no harddisk (and thus no swap), and GBit Ethernet.

Current status: most of the new nodes are available for general batch processing; the configuration and software environment have stabilized.

Open problems:

  • no known ones

User visible changes and solved problems:

  • End of April 2012: all new w10xx nodes have received their harddisks in the meantime and have been reinstalled with SLES10 to match the old w0xx nodes.
  • The module command was not available in PBS batch jobs; fixed since 2011-12-17 by patching /etc/profile to always source the system-wide bashrc even in non-interactive shells (see the sketch below this list).
  • The environment variable $HOSTNAME was not defined; fixed since 2011-12-19 via csh.cshrc.local and bash.bashrc.local.
  • SMT disabled on all nodes (since 2011-12-19). All visible cores are physical cores.
  • qsub is now generally wrapped – but that should be completely transparent for users (2012-01-16).
  • /wsfs = $FASTTMP is now available, too (2012-01-23)
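
For illustration, the module/$HOSTNAME fixes mentioned above might conceptually look like the following excerpt; the exact guards and file locations are assumptions, and the actual patch on the cluster may differ:

    # /etc/profile (excerpt) -- hypothetical sketch, not the actual RRZE patch.
    # Source the system-wide bashrc even in non-interactive shells (such as
    # PBS batch jobs) so that the module command is defined there as well.
    if [ -n "$BASH_VERSION" ] && [ -r /etc/bash.bashrc.local ]; then
        . /etc/bash.bashrc.local
    fi

    # bash.bashrc.local / csh.cshrc.local additionally define $HOSTNAME, e.g.:
    #   export HOSTNAME=$(hostname)       # bash
    #   setenv HOSTNAME `hostname`        # csh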

Configuration notes:

  • The additional nodes are named w10xx.
  • The base operating system initially was Ubuntu 10.04 LTS; since the reinstallation at the end of April 2012 it is SuSE SLES10, as on the rest of Woody.
    • The diskless images were initially provisioned using Perceus; the current stateful installation is managed via Autoinstall + cfengine.
    • Initially this was different from the rest of Woody, which has a stateful SuSE SLES10SP4 installation.
    • However, Tiny* for example also uses Ubuntu 10.04 (but in a stateful installation), and binaries should run on SLES and Ubuntu without recompilation.
  • The w10xx nodes have python-2.6 while the other w0xxx nodes have python-2.4. You can load the python/2.7.1 module to ensure a common Python environment.
  • Compilation of C++ code on the compute nodes using one of RRZE’s gcc modules will probably fail; however, we never guaranteed that compiling on any compute node works. Either use the system g++, compile on the frontend nodes, or …
  • The PBS daemon (pbs_mom) running on the additional nodes is much newer than on the old Woody nodes (2.5.9 vs. 2.3.x?); but the difference should not be visible to users.
  • Each PBS job runs in a cpuset. Thus, you only have access to the CPUs assigned to you by the queuing system. Memory, however, is not partitioned; make sure that you use less than 2 GB per requested core, as memory constraints cannot be imposed. (Commands to inspect these limits from within a job are sketched below this list.)
  • As the w10xx nodes currently do not have any local harddisk, they are also operated without swap. Thus, the virtual address space and the physically allocated memory of all processes must not exceed 7.2 GB in total. Also /tmp and /scratch are part of the main memory. Stdout and stderr of PBS jobs are also first spooled to main memory before they are copied to the final destination after the job has ended.
  • multi-node jobs are not supported as the nodes are a throughput component
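
The cpuset and memory constraints mentioned above can be inspected from inside a running job with standard commands; a small sketch (the module name is the one mentioned above, everything else is plain Linux):

    # inside a running PBS job on a w10xx node
    cat /proc/self/cpuset                      # cpuset the job was placed in
    grep Cpus_allowed_list /proc/self/status   # CPU cores assigned to the job
    free -g                                    # memory use; remember that /tmp and /scratch also consume main memory
    module load python/2.7.1                   # common Python environment on old and new nodes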

Queue configuration / how to submit jobs:

  • The old w0xx nodes got the properties :c2 (as they are Intel Core2-based) and :any.
    The additional w10xx nodes got the properties :sb (as they are Intel SandyBridge-based) and :any.
  • Multi-node jobs (-lnodes=X:ppn=4 or -lnodes=X:ppn=4:c2 with X>1) are only eligible for the old w0xx nodes. :c2 will be added automatically if not present.
    Multi-node jobs which ask for :sb or :any are rejected.
  • Single-node jobs (-lnodes=1:ppn=4) by default also will only access the old w0xx nodes, i.e. :c2 will be added automatically if no node property is given. Thus, -lnodes=1:ppn=4 is identical to requesting -lnodes=1:ppn=4:c2.
    Single-node jobs which specify :sb (i.e. -lnodes=1:ppn=4:sb) will only go to the new w10xx nodes.
    Jobs with :any (i.e. -lnodes=1:ppn=4:any) will run on any available node.
  • Single-core jobs and other jobs requesting less than a complete node (-lnodes=1:ppn=Y:sb with Y<4) are only supported on the new w10xx nodes; specifying :sb is mandatory. Example submit commands are shown below this list.
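
To illustrate these rules, a few typical submit commands (job.sh stands for any job script; further resource requests such as walltime are omitted):

    qsub -lnodes=1:ppn=4 job.sh        # single node; :c2 is added automatically, i.e. old w0xx nodes
    qsub -lnodes=1:ppn=4:sb job.sh     # single node on the new w10xx (SandyBridge) nodes
    qsub -lnodes=1:ppn=4:any job.sh    # single node on whichever node type becomes available first
    qsub -lnodes=4:ppn=4 job.sh        # multi-node job; only eligible for the old w0xx nodes
    qsub -lnodes=1:ppn=1:sb job.sh     # single-core job; only supported on the new w10xx nodes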

Technical details:

  • PBS routing originally did not work as expected for jobs where the resource requests are given on the command line (e.g. qsub -lnodes=1:ppn=4 job.sh caused trouble).
    Some technical background: (1) the torque submit filter cannot modify resource requests given on the command line, and (2) routing queues cannot add node properties to resource requests any more; thus, for this type of job, routing to the old nodes does not seem to be possible … Using distinct queues for the old and new nodes has the disadvantage that jobs cannot ask for “any available CPU”. Moreover, the maui scheduler does not support multi-dimensional throttling policies, i.e. it has problems if one user submits jobs to different queues at the same time.
    The solution probably is a wrapper around qsub, as suggested on the Torque mailing list back in May 2008. At RRZE we already use qsub wrappers for e.g. qsub.tinyblue. Duplicating some of the logic of the submit filter into the submit wrapper is not really elegant but seems to be the only solution right now; a minimal sketch of such a wrapper follows below. (As a side note: interactive jobs do not seem to suffer from the problem as there is special handling in the qsub source code which writes the command line arguments to a temporary file that is then processed by the submit filter.)
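
For illustration only, a heavily simplified sketch of what such a qsub wrapper could look like; the path of the real qsub binary is an assumption, and the actual RRZE wrapper contains much more logic (the :sb/:any rules, multi-node checks, sub-node jobs, combined resource lists, interplay with the submit filter, …):

    #!/bin/bash
    # Hypothetical qsub wrapper sketch -- not the actual RRZE wrapper.
    # Adds the default :c2 node property to a -lnodes=... request given on
    # the command line if no node property (:c2/:sb/:any) is present.
    # For simplicity it assumes the nodes request is the only resource in
    # that -l argument (no "nodes=...,walltime=..." combinations).
    REAL_QSUB=/usr/bin/qsub.orig     # assumed location of the unwrapped qsub

    args=()
    for arg in "$@"; do
        if [[ "$arg" == -lnodes=* && "$arg" != *:c2* && "$arg" != *:sb* && "$arg" != *:any* ]]; then
            arg="${arg}:c2"          # e.g. -lnodes=2:ppn=4 -> -lnodes=2:ppn=4:c2
        fi
        args+=("$arg")
    done

    exec "$REAL_QSUB" "${args[@]}"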