Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things

Content

fork and OFED Infiniband stack

Attention: OFED does not allow system(const char*) or fork/exec after the InfiniBand libraries have been initialized. The documentation says about this:
… the Mellanox InfiniBand driver has issues with buffers sharing pages when fork() is used. Pinned (locked in memory) pages are normally marked copy-on-write during a fork. If a page is pinned before a fork and subsequently written to while RDMA operations are being performed on the same page, silent data corruption can occur as RDMA operations continue to stream data to a page that has moved. To avoid this, the Mellanox driver does not use copy-on-write behavior during a fork for pinned pages. Instead, access to these pages by the child process will result in a segmentation violation.
Fork support is available from kernel 2.6.12 onwards, provided that applications do not use threads. fork() is supported as long as the parent process does not run before the child exits or calls exec(). The former can be achieved by calling wait(childpid), the latter by application-specific means. The POSIX system() call is supported.

Woody is running a SuSE SLES9 kernel, i.e. 2.6.5. Thus, there is no support for fork() and similar calls!

Several users have already hit this problem, among them a Fortran user who had call system('some command') in his code. In that case, the application simply hung in some (matching) MPI_Send/MPI_Recv calls.
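
As a quick check when logging in to a compute node, one can look at the running kernel to see whether the limited fork() support quoted above could apply at all (a trivial sketch, nothing woody-specific assumed):

[shell]
# OFED's limited fork() support requires at least kernel 2.6.12;
# woody (SLES9) reports 2.6.5-..., so fork()/system() after the
# InfiniBand libraries have been initialized must be avoided there.
uname -r
[/shell]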

Common license pool for STAR-CD probably continues for next three years

It took quite a while to find a solution for extending the joint license pool with an increased number of licenses for parallel runs, but everything now seems to be settled for the next three years.

Further chairs can join at any time – of course, the license may only be used for education and scientific research (and not for industrial research or projects). If additional groups join, this will not increase the total costs (unless additional license features are required) but will reduce the amount the individual groups have to pay annually …

Also check the notes on using STAR-CD on RRZE’s new parallel computer Woody.

Invitation: KONWIHR-Workshop and 3rd High-End-Computing Symposium on July 2nd at FAU

Dear HPC users,

we are glad to invite you on July 2nd to the

KONWIHR Results Workshop
and
3rd Erlangen International High-End-Computing Symposium

which take place at the University of Erlangen.

The event is jointly organized by Lehrstuhl für Systemsimulation (LSS), Regionales Rechenzentrum Erlangen (RRZE), Bavarian Graduate School in Computational Engineering (BGCE) and Competence Network for Technical, Scientific High Performance Computing in Bavaria (KONWIHR).

The program is as follows:

In the morning, nine Bavarian groups will report about the wide spectrum of HPC activities carried out in the framework of the Competence Network for Technical, Scientific High Performance Computing in Bavaria (KONWIHR).

In the afternoon, five internationally renowned experts will review High-End-Computing from an international perspective and give an outlook on future developments.

Participation is free of charge, but registration is requested to allow for better planning. For further and updated information as well as the registration form, visit http://www10.informatik.uni-erlangen.de/Misc/EIHECS3/

We are looking forward to seeing you on July 2nd at RRZE!

HPC@RRZE, also on behalf of the other organizers

PS: Additional information, in particular for speakers, can be found in another post of this blog.

KONWIHR-Workshop and IHECS3 on July 2, 2007

Contributors to the KONWIHR-Workshop:

  1. 09:00-09:10 Opening remarks and Greetings
  2. 09:10-10:10 Computer Science and Physics
    1. G. Wellein: HQS@HPC – Improving Scalability of large sparse ED studies on HLRB-II; RRZE, Uni-Erlangen
    2. T. Gradl: HHG: Multigrid for Finite Elements on up to 9728 Processors; LS Systemsimulation, Uni-Erlangen
    3. W. Hanke: CUHE – From Supercomputing to High-Temperature Superconductivity: Where are we at?; Chair of Theoretical Physics I, Uni Würzburg
  3. 10:20-11:20 Geo and Life Sciences
    1. H. Igel, H. Wang, M. Ewald: NBW – Strong Ground Motion Variations in the Los Angeles Basin due to Earthquake Processes; Department of Earth Sciences – Geophysics, LMU Munich
    2. M. Ott: ParBaum – Reconstructing the Tree of Life on HPC systems; LS Rechnertechnik und Rechnerorganisation, TU München
    3. T. Clark: QM-CI/MM FRET Simulations; Center for Computational Chemistry, Uni-Erlangen
  4. 11:30-12:30 Computational Fluid Dynamics
    1. M. Mehl: SkvG – Efficient Parallel Simulation of Flow Processes on Cartesian Grids; LS Informatik mit Schwerpunkt Wissenschaftliches Rechnen, TU München
    2. P. Wenisch: VISimLab – Computational Steering on the Altix 4700: Towards Interactive Comfort Analysis in Civil Engineering; LS Bauinformatik, TU München
    3. F. Durst: BESTWIHR – Complex flow predictions using lattice Boltzmann computations; LS Strömungsmechanik, Uni-Erlangen
  5. 12:30-13:30 Lunch Break
  6. 13:30-18:00 High-End-Computing Symposium in Lecture Room H4 (ground & 1st floor of the RRZE building)

Notes:

  • the KONWIHR Results Workshop takes place in RRZE’s seminar room, 2.049, Martensstr. 1 (2nd floor)
  • each presentation is 15+5 minutes
  • the presentations can be given either in German or English
  • we will provide one central notebook for all presentations to save time when switching between talks!
  • video conference transmission to Munich or other sites may be arranged upon request

More information and a registration form can be found at the official KONWIHR/EIHECS web page!

The collected presentations of the KONWIHR Results Workshop are available online as PDF file (10.5 MB)!

building mvapich/mvapich2 for woody

Just as internal information for ourselves: mvapich and mvapich2 heavily depend on the underlying InfiniBand stack. Therefore, both MPI libraries had to be recompiled when changing from the Voltaire IBHOST-3.4.5 stack to GridStack 4.1.5 (aka OFED-1.1).

As mvapich/mvapich2 are statically linked into the executables, those applications probably have to be recompiled, too.
Intel MPI (the default on woody), on the other hand, is not affected; thus, most users’ applications will not require recompilation.

The RRZE versions of mvapich-0.9.8 and mvapich2-0.9.8 were now compiled using:

# select the Intel EM64T compilers
module load intel64-c/9.1.049 intel64-f/9.1.045
export CC=icc
export CXX=icpc
export F77=ifort
export F90=ifort
export ARCH=_EM64T_
# location of the OFED/GridStack installation
export OPEN_IB_HOME=/usr/local/ofed
export PREFIX=/apps/mvapich2/0.9.8-gridstack-4.1.5
# run make.mvapich.gen2 or make.mvapich2.gen2, respectively
./make.mvapich[2].gen2
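
A hedged usage sketch for relinking an application against the rebuilt library (the mpif90 wrapper and the -O2 level are just the usual defaults, not woody-specific settings; the install prefix is the PREFIX from above):

[shell]
# put the rebuilt mvapich2 first in the PATH and relink the application;
# as the MPI library is linked statically, relinking picks up the new
# GridStack/OFED-based code
export PATH=/apps/mvapich2/0.9.8-gridstack-4.1.5/bin:$PATH
mpif90 -O2 -o myapp.exe myapp.f90
[/shell]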

Short Course “Lattice Boltzmann” at ParCFD2007 and ICMMES2007

During ParCFD2007 there was a short course on lattice Boltzmann methods. The slides are made available on the tutorials web page of Parallel CFD 2007. Please come back and check for updates as some parts are still missing.

The next lattice Boltzmann short course will be held at the beginning of ICMMES 2007, which takes place in Munich from June 16-20.

At both conferences, RRZE is in charge of the HPC aspects of implementing lattice Boltzmann methods.

some notes on using Intel Trace Collector (ITC/ITA)

Preliminary note: these are just some “random” notes for myself …

  • Intel 10.0 (beta) compilers introduced -tcollect switch
    • inserts instrumentation probes calling the ITC API, i.e. shows functions names instead of just “user code” in the trace
    • has to be used for compilation and linking
    • not sure yet if the ITC libraries have to be specified during linking – but probably one wants to add “$ITC_LIB” (defined by RRZE) anyway
    • works fine (at least in initial tests)
  • source code locations are not included in the trace file by default; compile with -g and set VT_PCTRACE to some value (either ON or the call level to be recorded, with optional skip levels); see ITC-RefGuide Sec. 3.8, p. 19/20 and 95 (a command-line sketch follows after this list)
  • OS counters can be recorded; VT_COUNTER variable has to be set accordingly; see ITC-RefGuide Sec. 3.10, p. 22
  • VT_LOGFILE_NAME can be used to specify the name of the tracefile
  • to change colors, one has to use the “all functions” item, as the color toggle is deactivated in the default “generated groups/major functions” item
  • unfortunately, I have not yet found a “redraw” button – and changing the colors only redisplays the current chart automatically
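
Putting the notes above together, an instrumented build and run might look like this (a minimal sketch; the mpiifort/mpirun wrappers of Intel MPI are assumptions, $ITC_LIB is the RRZE-defined link variable mentioned above, and mycode.f90/myrun.stf are placeholders):

[shell]
# compile AND link with -tcollect (Intel 10.0); -g keeps source locations
mpiifort -g -tcollect $ITC_LIB -o a.out_traced mycode.f90
# record call stacks up to 5 levels and set the trace file name
export VT_PCTRACE=5
export VT_LOGFILE_NAME=myrun.stf
mpirun -np 4 ./a.out_traced
# inspect the resulting trace in the Intel Trace Analyzer GUI
traceanalyzer myrun.stf
[/shell]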

mpiexec + taskset: code snippet to generate config file

We usually use Pete Wyckoff’s mpiexec to start MPI processes from within batch jobs. The following code snippet may be used to pin individual MPI processes to their CPU. A similar snippet could be used to allow round-robin distribution of the MPI processes (a sketch of that variant follows after the script) …

[shell]
#!/bin/bash
# usage: run.sh 8 "0 2 4 6" ./a.out "-b -c"
#   start a.out with arguments -b -c
#   use 8 MPI processes in total (4 per node)
#   pin the processes on each node to CPUs 0,2,4,6

NUMPROC=$1
TASKCPUS="$2"
EXE=$3
ARGS=$4

TASKS=`echo $TASKCPUS | wc -w`
echo "running $NUMPROC MPI processes with $TASKS per node"

cat /dev/null > cfg.$PBS_JOBID
for node in `uniq $PBS_NODEFILE`; do
  for cpu in $TASKCPUS; do
    echo "$node : taskset -c $cpu $EXE $ARGS" >> cfg.$PBS_JOBID
    NUMPROC=$((NUMPROC - 1))
    if [ $NUMPROC -eq 0 ]; then
      break 2
    fi
  done
done
mpiexec -comm pmi -kill -config cfg.$PBS_JOBID
[/shell]
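
The round-robin variant mentioned above could then look like this (a sketch only, replacing the nested loops of the script; the surrounding variables are the same as before): swapping the two loops makes consecutive MPI ranks land on different nodes instead of filling up one node first.

[shell]
# round-robin distribution: CPUs in the outer loop, nodes in the inner loop,
# so rank 0 goes to the first node/CPU 0, rank 1 to the second node/CPU 0, etc.
for cpu in $TASKCPUS; do
  for node in `uniq $PBS_NODEFILE`; do
    echo "$node : taskset -c $cpu $EXE $ARGS" >> cfg.$PBS_JOBID
    NUMPROC=$((NUMPROC - 1))
    if [ $NUMPROC -eq 0 ]; then
      break 2
    fi
  done
done
[/shell]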

Vector-TRIAD on woody using different versions of the Intel EM64T Fortran compiler

Switching from one compiler version to another can have a significant influence on performance, but even moving one patch level ahead may change your performance …

The Vector-TRIAD benchmark (a(:)=b(:)+c(:)*d(:), according to Schoenauer) was run on the new Woodcrest cluster at RRZE, which consists of HP DL140G3 boxes. The performance is given in MFlop/s for a loop length of 8388608. The value is the aggregate performance of 4 MPI processes running on one node in saturation mode.

SNOOP filter of the 5000X chipset enabled

Performance in MFlop/s for loop length 8388608
on 2-socket Woodcrest node with 4 MPI processes
compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 374.0   | 352.5
9.1-039 | 374.1   | 352.3
9.1-045 | 359.0 ! | 352.4
10.0-13 | 373.4   | 377.6 !
10.0-17 | 373.7   | 352.0

SNOOP filter of the 5000X chipset disabled (switching the snoop filter off only works with the latest BIOS (v1.12) released on April 16!)

compiler| default | USE_COMMON
--------+---------+-----------
9.1-033 | 376     | 332
9.1-039 | 376     | 331
9.1-045 | 341  !! | 331
10.0-13 | 376     | 380 !!
10.0-17 | 376     | 331

The “default” version always refers to arrays which were known at compile time; “USE_COMMON” means that the arrays have additionally been put into a common block.

 

For reference, the STREAM values in MB/s for 4 OpenMP threads (with added NONTEMPORAL directives, array size = 20000000, offset = 0) are also given:

Function | Snoop filter on | Snoop filter off
         |     Rate (MB/s) |      Rate (MB/s)
---------+-----------------+-----------------
Copy     |       7492.0444 |        6178.7991
Scale    |       7485.3591 |        6174.8763
Add      |       6145.5004 |        6180.6296
Triad    |       6152.6559 |        6189.2369

The STREAM results were more or less identical when using fce-9.1.039 and fce-9.1.045!

Reasons for performance differences: The main reason for the performance differences is unrolling (add -unroll0 to avoid it – thanks to Intel for pointing this out). The 10.0 compilers seem to be much more aggressive when doing optimizations and vectorization. By default, non-temporal stores may now be used automatically in certain cases by the 10.0 compiler. If you want to avoid that, use -opt-streaming-stores never. Even if non-temporal stores are disabled on the command line, the compiler directive vector nontemporal will still be respected. A directive to disable non-temporal stores for a specific loop only is not (yet) available.
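
For the record, a compile line that rules out both effects with a 10.0 compiler could look like this (a sketch; the source file name, output name and -O3 level are placeholders):

[shell]
# Intel Fortran 10.0: switch off unrolling and automatic non-temporal stores;
# a source-level directive such as "vector nontemporal" would still be honored
ifort -O3 -unroll0 -opt-streaming-stores never -o triad triad.f90
[/shell]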