Thomas Zeiser, 11:49 Uhr in HPC-Cluster@RRZE
Recently, 40 additional nodes with an aggregated AVX-Linpack performance of 4 TFlop/s have been added to RRZE’s Woody cluster. The nodes were bought by RRZE and ECAP and shall provide additional resources especially for sequential and single-node throughput calculations. Each node has a single-socket socket with Intel’s latest “SandyBridge” 4-core CPUs (Xeon E3-1200 series), 8 GB of main memory, currently no harddisk (and thus no swap) and GBit ethernet.
Current status: most of the new nodes are available for general batch processing; the configuration and software environment stabilized
Open problems:
User visible changes and solved problems:
- End of April 2012: all the new w10xx nodes got their harddisk in the meantime and have been reinstalled with SLES10 to match the old w0xx nodes
- The
module command was not available in PBS batch jobs; fixed since 2011-12-17 by patching /etc/profile to always source bashrc even in non-interactive shells
- The environment variable
$HOSTNAME was not defined. Fixed since 2011-12-19 via csh.cshrc.local and bash.bashrc.local.
- SMT disabled on all nodes (since 2011-12-19). All visible cores are physical cores.
qsub is now generally wrapped – but that that should be completely transparent for users (2012-01-16).
/wsfs = $FASTTMP is now available, too (2012-01-23)
Configuration notes:
- The additional nodes are named
w10xx.
- The base operating system is
Ubuntu 10.04 LTS. SuSE SLES10 as on the rest of Woody.
The diskless images provisioned using Perceus. Autoinstall + cfengine
This is different to the rest of Woody which has stateful SuSE SLES10SP4.
However, Tiny* for example also uses Ubuntu 10.04 (but in a stateful installation) and binaries should run on SLES and Ubuntu without recompilation.
The w10xx nodes have python-2.6 while the other w0xxx nodes have python-2.4. You can load the python/2.7.1 module to ensure a common Python environment.
compilation of C++ code on the compute nodes using one of RRZE’s gcc modules will probably fail; however, we never guaranteed that compiling on any compute nodes works; either use the system g++, compile on the frontend nodes, or …
- The PBS daemon (
pbs_mom) running on the additional nodes is much newer than on the old Woody nodes (2.5.9 v.s. 2.3.x?); but the difference should not be visible for users.
- Each PBS job runs in a cpuset. Thus, you only have access to the CPUs assigned to you by the queuing system. Memory, however, is not partitioned. Thus, make sure that you only use less than 2 GB per requested core as memory constraints cannot be imposed.
As the w10xx nodes currently do not have any local harddisk, they are also operated without swap. Thus, the virtual address space and the physically allocated memory of all processes must not exceed 7.2 GB in total. Also /tmp and /scratch are part of the main memory. Stdout and stderr of PBS jobs are also first spooled to main memory before they are copied to the final destination after the job ended.
- multi-node jobs are not supported as the nodes are a throughput component
Queue configuration / how to submit jobs:
- The old w0xx nodes got the properties
:c2 (as they are Intel Core2-based) and :any.
The addition w10xx nodes got the properties :sb (as they are Intel SandyBridge-based) and :any.
- Multi-node jobs (
-lnodes=X:ppn=4 or -lnodes=X:ppn=4:c2 with X>1) are only eligible for the old w0xx nodes. :c2 will be added automatically if not present.
Multi-node jobs which ask for :sb or :any are rejected.
- Single-node jobs (
-lnodes=1:ppn=4) by default also will only access the old w0xx nodes, i.e. :c2 will be added automatically if no node property is given. Thus, -lnodes=1:ppn=4 is identical to requesting -lnodes=1:ppn=4:c2.
Single-node jobs which specify :sb (i.e. -lnodes=1:ppn=4:sb) will only go to the new w10xx nodes.
Jobs with :any (i.e. -lnodes=1:ppn=4:any) will run on any available node.
- Single-core jobs (
-lnodes=1:ppn=Y:sb with Y<4, i.e. requesting less than a complete node) are only supported on the new w10xx nodes. Specifying :sb is mandatory.
Technical details:
- PBS routing originally did not work as expected for jobs where the resource requests are given on the command line (e.g.
qsub -lnodes=1:ppn=4 job.sh caused trouble).
Some technical background: (1) the torque-submitfilter cannot modify the resource requests given on the command line and (2) routing queues cannot add node properties to resource requests any more, thus, for this type of job routing to the old nodes does not seem to be possible … Using distinct queues for the old and new nodes has the disadvantage that jobs cannot ask for “any available CPU”. Moreover, the maui scheduler does not support multi-dimensional throttling policies, i.e. has problems if one user submits jobs to different queues at the same time.
The solution probably is a wrapper around qsub as suggested in the Torque mailinglist back in May 2008. At RRZE we already use qsub-wrappers for e.g. qsub.tinyblue. Duplicating some of the logic of the submit filter into the submit wrapper is not really elegant but seems to be the only solution right now. (As a side node: interactive jobs do not seem to suffer from the problem as there is special handling in the qsub source code which writes the command line arguments to a temporary file which is subject to processing by the submit filter.)
Kommentare deaktiviert
Thomas Zeiser, 18:17 Uhr in OpenFOAM
Compared with other software, installing OpenFOAM is (still) a nightmare. They use their very own build system, there are tons of environment variables to set, etc. But it seems that users in academia and industry accept OpenFOAM nevertheless. For release 1.7.1, I took the time to create a little receipt (in some parts very specifically tailored to RRZE’s installation of software packages) to more or less automatically build OpenFOAM and some accompanying Third Party packages from scratch using the Intel Compilers (icc/icpc) and Intel MPI instead of Gcc and Open MPI (only Qt and Paraview are still built using gcc). The script is provided as-is without any guarantee that it works elsewhere and of course also without any support. The script assumes that the required source code packages have already been downloaded. Where necessary, the unpacked sources are patched and the compilation commands are executed. Finally, two new tar balls are created which contain the required “output” for a clean binary installation, i.e. intermediate output files (e.g. *.dep) are not included …
Compilation takes ages, but that’s not really surprising. Only extracting the tar balls with the sources amounts to 1.6 GB in almost 45k files/directories. After compilation (although neither Open MPI nor Gcc are built) the size is increased to 6.5 GB or 120k files. If all intermediate compilation files are removed, there are still about 1 GB or 30k files/directories remaining in my “clean installation” (with only the Qt/ParaView libraries/binaries in the ThirdParty tree).
RRZE users find OpenFOAM-1.7.1 as module on Woody and TinyBlue. The binaries used for Woody and TinyBlue are slightly different as both were natively compiled on SuSE SLES 10SP3 and Ubuntu 8.04, respectively. The main difference should only be in the Qt/Paraview part as SLES10 and Ubuntu 8.04 come with different Python versions. ParaView should also be compiled with MPI support.
Note (2012-06-08): to be able to compile src/finiteVolume/fields/fvPatchFields/constraint/wedge/wedgeFvPatchScalarField.C with recent versions of the Intel compiler, one has to patch this file to avoid an no instance of overloaded function “Foam:operator==” matches the argument list error message; cf. http://www.cfd-online.com/Forums/openfoam-installation/101961-compiling-2-1-0-rhel6-2-icc.html and https://github.com/OpenFOAM/OpenFOAM-2.1.x/commit/8cf1d398d16551c4931d20d9fc3e42957d0f93ca. These links are for OF-2.1.x but the fix works for OF-1.7.1 as well.
Kommentare deaktiviert
Thomas Zeiser, 16:07 Uhr in HPC-Cluster@RRZE
Starting from today, our intel64 compiler modules for version 11.0 and 11.1 of the Intel compilers will automatically load the MKL module corresponding to the bundled version in the Intel Processional Compilers, too.
- If you do not use MKL, this change will not affect you.
- If you want (or have) to use a different version of MKL, you have to load the MKL module before loading the compiler module OR you have to unload the automatically loaded MKL module before loading your desired version.
Motivation for this change was to make the compiler option -mkl work properly (or at least better), i.e. give the user a reasonable chance to find the MKL libraries at runtime. You still have to load the compiler (or mkl) module at runtime OR you have to add -Wl,-rpath,$MKLPATH explicitly to your linker arguments …
Kommentare deaktiviert
Thomas Zeiser, 20:12 Uhr in HPC allgemein, HPC-Cluster@RRZE
More and more MPI implementations try to scope with the increasing complexity of modern cluster nodes. However, sometimes these automatic defaults are counterproductive when trying to pinpoint special effects …
Things to remember:
- Open MPI:
- Mvapich2:
- recent version of mvapich2 default to
MV2_ENABLE_AFFINITY=1 which enables internal pinning and overwrites any taskset given on the command line. Either use MV2_CPU_MAPPING (containing colon-separated CPU numbers) to make the correct pinning or set MV2_ENABLE_AFFINITY=0 and do external pinning. RRZE’s mpirun wrapper (as of Feb. 2010) disables the internal mvapich2 affinity routines to make our -pin argument work (again).
- pinning of hybrid codes (mvapich2 + OpenMP) using RRZE’s
pin_omp: PINOMP_SKIP=1,2 (cf. http://www.blogs.uni-erlangen.de/JohannesHabich/stories/3414/#4721)
The resulting command line can be rather length: e.g. env PINOMP_SKIP=1,2 OMP_NUM_THREADS=4 KMP_AFFINITY=disabled mpirun -npernode 2 -pin 0,1,2,3_4,5,6,7 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so ./a.out
- Intel MPI:
- starting large jobs over Infiniband: using the socket connection manager (SCM) instead of the default connection manager (CMA) might improve the start-up sequence. One might try
env I_MPI_DEVICE=rdssm:OpenIB-mthca0-1 on Woody and TinyGPU, env I_MPI_DEVICE=rdssm:ofa-v2-mthca0-1 on Townsend or I_MPI_DEVICE=rdssm:OpenIB-mlx4_0-1 on TinyBlue or TinyFat.
Thanks to Intel for pointing this OFED/Intel-MPI option out. Unfortunately, there is only very little OFED documentation on the differences between CMA and SCM: http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_release_notes.txt (Release notes from December 2009 for OFED-1.5).
UCM might be even more scalable — on Townsend, Woody and Tiny{Blue,FAT,GPU} UCM is available if the experimental “dapl” moule is loaded before any Intel MPI module.
- bug (at least) in Intel MPI 3.1.0.038: connections to MPD daemons may fail if
$TMPDIR is set to something different than /tmp
- bug in Intel MPI 4.0.0.025: Intel’s
mpirun as shortcut for mpdboot - mpiexec - mpdallexit cannot read properly from STDIN; the problem is the newly with this release introduced sequence of mpiexec "$@" & wait $! instead of the traditional mpiexec "$@". RRZE’s PBS-based mpirun is not affected by this problem.
- incompatibilities between RRZE’s PBS-based mpirun (i.e. Pete Wyckoff’s mpiexec) and
I_MPI_DEVICE=rdssm: from time to time we observe that processes hang either during startup (at MPI_Init or at the first larger scale communication) or even more often at MPI_Finalize. I_MPI_DEVICE=rdma usually does not have these problems but is noticeable slower for certain applications. As the behavior is non-deterministic, it’s hard/impossible to debug and as the PBS-based start mechanism is not supported by Intel we cannot file a problem report neither. And of course Intel’s extensions of the PMI startup protocol also not publically documented …
- you have recent Qlogic Infiniband HCA?
I_MPI_FABRICS=tmi I_MPI_TMI_PROVIDER=psm should be appropriate (PS: I_MPI_FABRICS is the replacement for I_MPI_DEVICE introduced with Intel-MPI 4.0)
- HP-MPI:
- The environment variable
MPIRUN_OPTIONS can be used to pass special options to HP-MPI, e.g. "-v -prot" to be verbose and report the used interconnect; -"-TCP -netaddr 10.188.84.0/255.255.254.0" to force TCP and select a specific interface (e.g. for IPoIB).
- HP-MPI has its own pinning mechanism; cf. page 16/17 of /apps/HP-MPI/latest/doc/hp-mpi.02.02.rn.pdf dor details;
MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,2,1,3" should be fine for regular jobs on Woody. If you require a lot of memory and would like to use only one MPI process per socket, the correct option should be MPIRUN_OPTIONS="-cpu_bind=v,map_cpu:0,1"
- Parastation MPI as e.g. used on JUROPA of FZ-Jülich
- Any pointer to pinning mechanisms for processes would highly be appreciated.
- The latest versions (cf. psmgmt 5.0.32-0) have the environment variable
__PSI_CPUMAP which can be set to individual cores or core ranges, e.g. "0,2,4,6,1,3,5,7" or "0-3,7-4".
- to be continued
Kommentare (3)
Thomas Zeiser, 15:52 Uhr in HPC-Cluster@RRZE
For certain applications and processor numbers, there were problems with hanging MPI processes in the startup phase when using Intel MPI together with PBS’s mpiexec and I_MPI_DEVICE=rdssm. Therefore, we changed the default I_MPI_DEVICE for Woody, Townsend and TinyBlue from rdssm to rdma. In practice, one should not see any difference as rdssm with recent versions of Intel MPI did not use shm but rdma also within the node.
Single node jobs on Woody and TinyBlue using Intel MPI and our mpirun wrapper will from now on use shm by default if exactly one node was requested for the PBS jobs.
additional settings for mvapich2:
To make our -pin argument work with recent mvapich2 versions, we recently added a default of MV2_ENABLE_AFFINITY=0 to our mpirun wrapper.
Kommentare deaktiviert
Thomas Zeiser, 18:40 Uhr in HPC-Cluster@RRZE
Sometimes trivial parallelism is the most efficient way to parallelize work, e.g. for parameter studies with a sequential program. If only complete nodes may be allocated on a certain cluster, several sequenatial runs can very easily be bundled into one job file:
#!/bin/bash -l
# allocate 1 nodes (4 CPUs) for 8 hours
#PBS -l nodes=1:ppn=4,walltime=08:00:00
# job name
#PBS -N xyz
# first non-empty non-comment line ends PBS options
# jobs always start in HOME
# but we want to go to the directory where we submitted the job
cd $PBS_O_WORKDIR
# run 4 sequential parameter studies in parallel and bind eachone
# to a specific core
(taskset -c 0 ./a.out input1.dat ) &
(taskset -c 1 ./a.out input2.dat ) &
(taskset -c 2 ./a.out input3.dat ) &
(taskset -c 3 ./a.out input4.dat ) &
# wait for all background processes to finish ("wait" is a bash built-in)
wait
The bash builtin wait ensures that all background processes have finished once wait returns.
For this to work efficiently, of course all parameter runs should take about the same time …
Kommentare deaktiviert
Thomas Zeiser, 09:45 Uhr in HPC-Cluster@RRZE
As it seems to be come a FAQ – although it is documented as a small note in the ifort man page: if the 64-bit Intel compilers (for EM64T/Opteron but not for IA64) are used and statically allocated data (e.g. in Fortran common blocks) exceeds 2GB, the -mcmodel=medium or -mcmodel=large switches must be used. As a consequence, the -i-dynamic or -shared-intel flag depending on the compiler version must also be specified for the linking step otherwise strange error messages about relocation truncated occur for libifcore.a routines during linking.
Kommentare deaktiviert
Thomas Zeiser, 11:20 Uhr in HPC-Cluster@RRZE
Attention: OFED disallows system(const char*) or fork/exec after initializing the Infiniband libraries. Some documentation mentions about this:
… the Mellanox InfiniBand driver has ssues with buffers sharing pages when fork() is used. Pinned (locked in memory) pages are normally marked copy-on-write during a fork. If a page is pinned before a fork and subsequently written to while RDMA operations are being performed on the same page, silent data corruption can occur as RDMA operations continue to stream data to a page that has moved. To avoid this, the Mellanox driver does not use copy-on-write behavior during a fork for pinned pages. Instead, access to these pages by the child process will result in a segmentation violation.
Fork support from kernel 2.6.12 and above is available provided that applications do not use threads. The fork() is supported as long as parent process does not run before child exits or calls exec(). The former can be achieved by calling wait(childpid) the later can be achieved by application specific means. Posix system() call is supported.
Woody is running a SuSE SLES9 kernel, i.e. 2.6.5. Thus, no support for fork and similar things!
Some users already hit this problem! Even a Fortran user who had call system('some command') in his code! In the latter case, the application just hang in some (matching) MPI_send/MPI_recv calls.
Kommentare deaktiviert
Thomas Zeiser, 16:17 Uhr in HPC-Cluster@RRZE
Just as an internal information for ourselves: mvapich and mvapich2 heavily depend on the underlaying Infiniband stack. Therefore, both MPI libraries had to be recompiled while changing from Voltaire IBHOST-3.4.5 stack to GridStack 4.1.5 (aka OFED-1.1).
As mvapich/mvapich2 are statically linked into executables, probably those applications have to be recompiled, too.
On the other side, Intel-MPI (which is the default on woody) is not affected, thus, most of the users’ applications will not require recompilation.
The RRZE versions of mvapich-0.9.8 and mvapich2-0.9.8 were now compiled using:
module load intel64-c/9.1.049, intel64-f/9.1.045
export CC=icc
export CXX=icpc
export F77=ifort
export F90=ifort
export ARCH=_EM64T_
export OPEN_IB_HOME=/usr/local/ofed
export PREFIX=/apps/mvapich2/0.9.8-gridstack-4.1.5
./make.mvapich[2].gen2
Kommentare deaktiviert
Thomas Zeiser, 19:48 Uhr in HPC-Cluster@RRZE
some notes on using Intel Trace Collector (ITC/ITA)
Preliminary note: these are just some “random” notes for myself …
- Intel 10.0 (beta) compilers introduced
-tcollect switch
- inserts instrumentation probes calling the ITC API, i.e. shows functions names instead of just “user code” in the trace
- has to be used for compilation and linking
- not sure yet if ITV libraries have to be specified during linking – but probably one wants to add “$ITC_LIB” (defined by RRZE) anyway
- works fine (at least in initial tests)
- source code locations are not inclides in the trace file by default; compile with
-g and set VT_PCTRACE to some value (either ON or the call level to be recorded (with optional skip levels); see ITC-RefGuide Sec. 3.8, p. 19/20 and 95
- OS counters can be recorded;
VT_COUNTER variable has to be set accordingly; see ITC-RefGuide Sec. 3.10, p. 22
VT_LOGFILE_NAME can be used to specify the name of the tracefile
- to change colors, one has to the “all functions” item as the color toggle is deactivated in the default “generated groups/major functions” item
- unfortunately, I did not yet find a “redraw” button – and chaning the colors only redisplays the current chart automatically
Kommentare deaktiviert