21. March 2012

HIWI Positions available at RRZE’s HPC group

Johannes Habich, 08:00 Uhr in Allgemeines, HPC

Are you a student at FAU or OHN and interested in HPC? There are open HiWi positions in our group!


Current topics:


Just write an e-mail to hpc “at” rrze.uni-erlangen.de and ask for our current research topics.




Status: Open; last update 21.03.2012


10. February 2012

Future of this Blog

Johannes Habich, 06:35 Uhr in Allgemeines, Contact, CUDA, Home, HPC, Parallel Computing MPI/OpenMP, PRACE, Publications, SKALB, Tools, Windows HPC

Please find new posts to this blog here:



There will be no more posts here.

22. December 2011

GETOPT for Windows

Johannes Habich, 11:47 Uhr in HPC, Tools, Windows HPC

The portability of programs between *nix and Windows is often limited not by the program itself but by certain helper tools. One of those is getopt, which provides a fast and easy way to read in command-line arguments.

Please find a Windows version here: http://suacommunity.com/dictionary/getopt-entry.php


You will find other useful ports there as well.
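Since such ports provide the familiar C-style interface, the parsing flow can be illustrated with Python's standard getopt module, which deliberately mimics the C function. The option string "vo:" and the file names below are made-up examples, not anything from the port itself:

```python
# Minimal sketch of C-style option parsing, using Python's getopt
# module (which mirrors the C getopt() semantics).
import getopt

def parse(argv):
    # "vo:": -v is a plain flag, -o expects an argument (the trailing
    # colon means "takes a value", exactly as in C getopt).
    opts, args = getopt.getopt(argv, "vo:")
    return dict(opts), args

opts, args = parse(["-v", "-o", "out.txt", "input.dat"])
print(opts)  # {'-v': '', '-o': 'out.txt'}
print(args)  # ['input.dat']
```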

8. November 2011

WINDOWS: Build Boost for Visual Studio 2010

Johannes Habich, 12:04 Uhr in HPC, Windows HPC



If you are using header-only libraries, then all you need to do is unarchive the Boost download and set up the environment variables. The instructions below set the environment variables for Visual Studio only, not across the system as a whole. Note that you only have to do this once.

  1. Unarchive the latest version of Boost (1.47.0 as of writing) into a directory of your choice (e.g. C:\boost_1_47_0).
  2. Create a new empty project in Visual Studio.
  3. Open the Property Manager and expand one of the configurations for the platform of your choice.
  4. Select & right click Microsoft.Cpp.<Platform>.user, and select Properties to open the Property Page for edit.
  5. Select VC++ Directories on the left.
  6. Edit the Include Directories section to include the path to your boost source files.
  7. Repeat steps 3 – 6 for other platforms of your choice if needed.

If you want to use the parts of Boost that require building, but none of the features that require external dependencies, then building it is fairly simple.

  1. Unarchive the latest version of Boost (1.47.0 as of writing) into a directory of your choice (e.g. C:\boost_1_47_0).
  2. Start the Visual Studio Command Prompt for the platform of your choice and navigate to where boost is.
  3. Run: bootstrap.bat to build b2.exe (previously named bjam).
  4. Run b2: (Win32) b2 --toolset=msvc-10.0 --build-type=complete stage ; (x64) b2 --toolset=msvc-10.0 --build-type=complete architecture=x86 address-model=64 stage. Go for a walk / watch a movie or 2 / ….
  5. Go through steps 2 – 6 from the set of instructions above to set the environment variables.
  6. Edit the Library Directories section to include the path to your boost libraries output. (The default for the example and instructions above would be C:\boost_1_47_0\stage\lib.) Rename and move the directory first if you want to have x86 & x64 side by side (such as to <BOOST_PATH>\lib\x86 & <BOOST_PATH>\lib\x64).
  7. Repeat steps 2 – 6 for different platform of your choice if needed.
If you want both x64 & Win32 side by side, add "--stagedir=lib/win32" and "--stagedir=lib/x64" to the respective builds.

9. August 2011

OpenCL GPU computing tutorial

Johannes Habich, 14:45 Uhr in CUDA, HPC

A very nice hands-on tutorial covering installation, setup, and programming:


5. August 2011

NVIDIA CUDA disable/enable ECC , new commands

Johannes Habich, 07:50 Uhr in CUDA, HPC

In order to query the ECC status in CUDA 4.0 (and hopefully the next versions), nvidia-smi works a little bit differently:

nvidia-smi -q | grep -e "Ecc Mode" -e "Product" -A2

will give you an excerpt.
Just use nvidia-smi -q for the full output.

To enable ECC:

nvidia-smi -i 0 -e 1

To disable ECC:

nvidia-smi -i 0 -e 0

where -i specifies the GPU ID.

4. August 2011

TinyGPU upgrade to CUDA Toolkit 4.0

Johannes Habich, 07:07 Uhr in CUDA, HPC

All nodes of the TinyGPU cluster are now on the current CUDA Toolkit 4.0 and the appropriate driver.

21. June 2011

Disable Fermi Cache

Johannes Habich, 10:33 Uhr in CUDA, HPC

To disable the Fermi L1 cache in CUDA, just compile with: -Xptxas -dlcm=cg

Any idea on how to do this with OpenCL?

20. June 2011

7th Erlangen International High-End-Computing Symposium

Johannes Habich, 15:23 Uhr in Allgemeines, HPC

Please note the upcoming 7th EIHECS this week on Friday, the 24th of June.

17. December 2010

CUDA Windows Development Environment

Johannes Habich, 11:03 Uhr in Allgemeines, CUDA, HPC

Supported by the new NVIDIA Tesla Compute Cluster (TCC) driver, we now offer an integrated development environment for CUDA on the basis of Windows HPC 2008 and Visual Studio.

For nearly 2 years now, RRZE has provided development resources for CUDA GPGPU computing under the Linux OS. In December 2009 a GPU cluster joined the two development machines in order to support production runs on up to 16 GPUs. Please contact us if you are interested in GPGPU computing, whether for Linux- or Windows-based development or production runs.

9. December 2010

Win2008 HPC Server and CUDA TCC revisited

Johannes Habich, 14:07 Uhr in CUDA, HPC, Windows HPC

The release of the stable NVIDIA driver 260.83 broke my Windows CUDA programming environment.
With the currently newest driver, 263.06, I gave it another shot. Initially the CUDA SDK sample programs did not recognize the GPU as CUDA-capable, and there was just some babbling about a DRIVER and TK mismatch.
However, this time searching the web led me to an IBM webpage which had a solution for their servers running Windows 2008 R2.
I tried this in Win2008 and it works like a charm:

  • Enter the registry edit utility by typing regedit in the run dialog and navigate to:


  • You will find subfolders named 0001, 0002, and so on, depending on the number of GPUs in your system.
  • For each card on which you want to enable CUDA, go to that 000X directory and add the following reg key (a 32-bit dword worked for me):


If you access the system via RDP, read my blog entry on Using nvidia-smi for TCC to learn how to set this up!

The source of this information is IBM; it can be found here for further reference and even more details: IBM Support Site

13. October 2010

NVIDIA CUDA TCC Driver Released 260.83

Johannes Habich, 08:03 Uhr in CUDA, HPC, Windows HPC

Just today NVIDIA released the WHQL-certified Tesla Compute Cluster driver TCC 260.83 for usage in e.g. Windows 2008 Server/HPC.
Till now only a beta version was available.
With that special driver you have the ability to use GPGPU compute resources via RDP or via the Windows HPC batch processing mode.

Download the driver here

Unfortunately, installing this driver broke my working environment, so be sure to keep a backup of the system. Even reinstalling the beta version did not solve the problem.

12. October 2010

Win2008 HPC Server and CUDA TCC

Johannes Habich, 13:54 Uhr in CUDA, HPC, Windows HPC

NVIDIA now provides a beta driver called Tesla Compute Cluster (TCC) in order to use CUDA GPUs within a Windows cluster environment, not only remotely via RDP but also in batch processing. Till now, the HPC Server lacked this ability, as Windows did not fire up the graphics driver inside the limited batch logon mode.

My first steps with TCC took a little bit longer than estimated.

First of all, it is not possible to have an NVIDIA and an AMD or Intel GPU side by side, as Windows needs to use one unified WDDM driver, and that's either one vendor or the other. This was completely new to me.

After this first minor setback, and re-equipped with only the Tesla C2050, the BIOS did not finish, so be sure to be up to date with your BIOS revision.
Another NVIDIA card was the quick fix on my side.

The next thing is the setup. Install the 260 (currently beta) drivers, and the exclamation mark in the Device Manager should vanish.
After that, install the toolkit and the SDK if you like.
With the nvidia-smi tool, which you will find in one of the uncountable NVIDIA folders that are there now, you can check whether the card is initially recognized correctly.
Also, set the TCC mode of the Tesla card to enabled if you want to have remote CUDA capabilities:

nvidia-smi -s –> shows the current status
nvidia-smi -g 0 -c 1 –> enables TCC on GPU 0

Next you will want to test the device query coming with the SDK.
If it runs and everything looks fine, feel gifted!

Nothing ran on my setup, so I tried to build the SDK example myself. To do that, first build the CUDA utilities, lying somewhere in the SDK within the folder “common”.
Depending on the Nsight or TK version you have installed, you get an error when opening the VS project files. Then you need to edit the Visual Studio project file with a text editor of your choice and replace the outdated build rule with the one actually installed.

  • From the error message, get the folder where VS does not find the file.
  • Copy the path and go there with your file browser.
  • Find the file most similar to the one in the VS error message.
  • Once found, open the VS file and replace the wrong filename there with the correct one.
  • VS should now open.

In order to compile, add the correct include and library directories to the VS project.
Finally you can build deviceQuery or any other program.

Still this setup gave me the same error as the precompiled deviceQuery:
cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.

With the help of the Dependency Walker I found out that a missing DLL was the problem, namely:

You can get this DLL by adding the feature named “Desktop Experience” through the Server Manager.
Once it was installed and the system rebooted, the device query worked.

3. September 2010

TinyGPU offers new hardware

Johannes Habich, 13:20 Uhr in CUDA, HPC

TinyGPU has new hardware: tg010. The hardware configuration and the currently deployed software differ from the non-Fermi nodes:

  • Ubuntu 10.04 LTS (instead of 8.04 LTS) as OS.
    Note: For using Intel compilers <= 11.1 locally on tg010, you currently have to use the gcc/3.3.6 module. If not, libstdc++.so.5 is missing, as Ubuntu 10.04 no longer contains this version. This is necessary only for compilation; compiled Intel binaries will run as expected.
  • /home/hpc and /home/vault are mounted through NFS only (and not natively via GPFS cross-cluster mount)
  • Dual-socket system with Intel Westmere X5650 (2.66 GHz) processors with 6 native cores per socket (instead of a dual-socket system with Intel Nehalem X5550 (2.66 GHz) with 4 native cores per socket)
  • 48 GB DDR3 RAM (instead of 24 GB DDR3 RAM)
  • 1x NVIDIA Tesla C2050 (“Fermi” with 3 GB GDDR5 featuring ECC)
  • 1x NVIDIA GTX 280 (consumer card with 1 GB RAM – formerly known as F22)
  • 2 further PCIe 2.0 16x slots will be equipped with NVIDIA C2070 cards (“Fermi” with 6 GB GDDR5 featuring ECC) in Q4, instead of 2x NVIDIA Tesla M1060 (“Tesla” with 4 GB RAM) as in the remaining cluster nodes
  • SuperServer 7046GT-TRF / X8DTG-QF with dual Intel 5520 (Tylersburg) chipset instead of SuperServer 6016GT-TF-TM2 / X8DTG-DF with Intel 5520 (Tylersburg) chipset

To allocate the Fermi node, specify :ppn=24 with your job (instead of :ppn=16) and explicitly submit to the TinyGPU queue fermi. The wallclock limit is set to the default of 24h. The ECC memory status is shown on job startup.
This article is intended as a translation of the original German post: Zuwachs im TinyGPU-Cluster

11. June 2010

Thread Pinning/Affinity

Johannes Habich, 09:43 Uhr in HPC

Thread pinning is very important in order to get feasible and reliable results on today's multi- and manycore architectures. Otherwise threads will migrate from one core to another, wasting clock cycles. Even more important: if you placed your memory correctly by first touch on ccNUMA systems, e.g. SGI Altix or any dual-socket Intel Xeon Core i7 system, a thread that is migrated to the other socket has to access its memory over the QPI interface connecting the two sockets.
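To make the migration point concrete, here is a minimal sketch of what pinning means at the operating-system level, using Linux's affinity interface from the Python standard library. This is not likwid-pin itself, only the underlying mechanism such tools build on, and core 0 is an arbitrary choice:

```python
# Pin the calling process to a single core so the scheduler can no
# longer migrate it (Linux only; os.sched_setaffinity wraps the
# sched_setaffinity(2) syscall).
import os

def pin_to_core(core):
    os.sched_setaffinity(0, {core})   # 0 = the calling process
    return os.sched_getaffinity(0)    # verify the new CPU set

print(pin_to_core(0))  # {0}
```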

Jan Treibig developed a tool for that called likwid-pin.

A sample usage would be as follows:

likwid-pin -s 0 -c "0,1,2,3,4,5,6,7" ./Exec

This would pin 8 threads of the executable to cores 0 to 7.
For information about the topology, just use the companion tool, called likwid-topology, which gives you the cache and core hierarchy.
The skip mask is important and specific to the threading implementation. Also consider that in hybrid programs, e.g. OpenMP plus MPI, multiple shepherd threads are present.

28. May 2010

Single Precision: Friend or Foe

Johannes Habich, 14:14 Uhr in HPC

The recent development of so-called disruptive technologies always leads to some kind of everlasting discussion.
Today I want to say something about the debate over whether GPUs are feasible in any way for scientific computing, as their double-precision performance is nowadays not too far away from that of standard CPUs. And single precision is not worth the discussion, as nobody wants to board a plane or a ship that was simulated in just single precision.

So, for non-simulators, first some explanation: single precision means a floating-point representation of a given number using up to 4 bytes. Double precision uses up to 8 bytes and can therefore provide much more accuracy.
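The difference between the two formats can be made concrete with a small sketch using Python's standard struct module; the value 0.1 is just an arbitrary example:

```python
# Round-trip a value through 4-byte (single) and 8-byte (double)
# IEEE representations to show the sizes and the accuracy gap.
import struct

def roundtrip(fmt, x):
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

print(struct.calcsize("f"), struct.calcsize("d"))  # 4 8
single = roundtrip("f", 0.1)
double = roundtrip("d", 0.1)
# Single precision picks up a visible rounding error; double stays
# exact here, since Python floats are already doubles.
print(abs(single - 0.1) > abs(double - 0.1))  # True
```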

GPUs were originally designed for graphics applications, which do not actually need single precision. There is a bunch of very fast FLOP commands working on just 24 bits instead of 32 bits (again, 32 bits = 4 bytes = single precision).
E.g., current NVIDIA cards have just 1 double-precision FLOP unit per 8 single-precision FLOP units.

So far it is obvious why everyone complains about the worse DP performance in contrast to SP performance. However, nobody (well, I do) complains about the low DP performance one actually gets out of a current x86 processor. With some kinds of system configuration you will get just about 10% of the nominal performance, or even less.

This is because data is brought to the computing units much more slowly than it is processed there.
This is true for most scientific codes, e.g. stencil codes. Therefore you will see the usual breakdown to 50% of performance when switching from SP to DP on GPUs, just as you see on CPUs, because you simply transfer twice the data over the same system bus.
So the DP units are most often not the limit of compute performance.

10. May 2010

JUROPA MPI Buffer on demand

Johannes Habich, 15:36 Uhr in HPC

To enable huge runs with lots of MPI ranks, you have to disable the all-to-all send buffers that are allocated by default on the NEC Nehalem cluster Juropa at FZ Jülich.

Here is an excerpt from the official documentation:

Most MPI programs do not need every connection:
  • Nearest-neighbor communication
  • Scatter/gather and allreduce based on binary trees
  • Typically just a few dozen connections when having hundreds of
  • ParaStation MPI supports this with "on demand connections":
  • export PSP_ONDEMAND=1
  • was used for the Linpack runs (np > 24000)
  • mpiexec --ondemand
  • Drawback:
  • Late all-to-all communication might fail due to short memory
  • The default on JuRoPA is not to use "on demand connections"

Juropa Introduction @ FZJ

16. July 2009

Tracing of MPI Programs

Johannes Habich, 12:11 Uhr in HPC



To trace MPI programs with the Intel MPI tracing capabilities, at least the following steps are necessary.
(Note that this guide claims neither to be the only way nor to be complete and error-proof!)

  1. module load itac
  2. env LD_PRELOAD=/apps/intel/itac/ mpirun -pernode ./bin/solver ./examples/2X_2Y_2Z_200X200X200c_file.prm
  3. Watch out: additional LD_PRELOAD commands might override this one!

  4. e.g: env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so:/apps/intel/itac/ mpirun -npernode 2 $MPIPINNING ./bin/solver ./examples/8X_8Y_4Z_800X800X400c_file.prm
  5. Another way of doing this is to run mpiexec -trace ….. (remember, this is true for Intel MPI)

env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so:/apps/intel/itac/ mpirun -npernode 2 $MPIPINNING ./bin/solver ./examples/8X_8Y_4Z_800X800X400c_file.prm

Official Intel Docu on that matter

Intel® Trace Analyzer and Collector for Linux* OS
Getting Started Guide
To simplify the use of the Intel® Trace Analyzer and Collector, a set of environment scripts is provided to you. Source/execute the appropriate script (/bin/itacvars.sh) in your shell before using the software. For example, if using the Bash shell:

$ source /bin/itacvars.sh # better added to $HOME/.profile or similar

The typical use of the Trace Analyzer and Collector is as follows:

* Let your application run together with the Trace Collector to generate one (or more) trace file(s).
* Start the Trace Analyzer to load the generated trace for analysis.

Generating a Trace File
Generating a trace file from an MPI application can be as simple as setting just one environment variable or adding an argument to mpiexec. Assume you start your application with the following command:

$ mpiexec -n 4 myApp

Then generating a trace can be accomplished by adding:

$ LD_PRELOAD=/slib/libVT.so mpiexec -n 4 myApp

or even simpler (for the Intel® MPI Library)

$ mpiexec -trace -n 4 myApp

This will create a set of trace files named myApp.stf* containing trace information for all MPI calls issued by the application.

If your application is statically linked against the Intel® MPI Library, you have to re-link your binary like this:

$ mpiicc -trace -o myApp # when using the Intel® C++ Compiler


$ mpiifort -trace -o myApp # when using the Intel® Fortran Compiler

Normal execution of your application:

$ mpiexec -n 4 myApp

will then create the trace files named myApp.stf*.
Analyzing a Trace File
To analyze the generated trace, invoke the graphical user interface:

$ traceanalyzer myApp.stf

Read section For the Impatient in the Trace Analyzer Reference Guide to get guidance on the first steps with this tool.

24. June 2009

EIHECS roundup

Johannes Habich, 14:29 Uhr in HPC

Geballte HPC-Expertise in Erlangen
I myself got the nice little Lenovo S10e IdeaPad for being second in the Call for Innovative Multi- and Many-Core Programming.

Currently I'm running Windows 7 RC1 with no problems at all, but I am moving away from it.

19. June 2009

MPI/OpenMP Hybrid pinning or no pinning that is the question

Johannes Habich, 14:34 Uhr in HPC, Parallel Computing MPI/OpenMP


Recently performed benchmarks of a hybrid-parallelized flow solver showed what one has to consider in order to get the best performance.
On the theoretical side, hybrid implementations are thought to be the most flexible while still maintaining high performance, because one assumes that OpenMP is perfect for intranode communication and faster than MPI there.
Between nodes, MPI is in any case the choice for portable distributed-memory parallelization.

In reality, however, quite a few MPI implementations already use shared-memory buffers when communicating with other ranks in the same shared-memory system. So basically there is no advantage to parallelizing with MPI and OpenMP on the same level when MPI is used for internode communication anyway.

Quite the contrary: apart from the additional implementation effort, it requires much more understanding of the processor and memory hierarchy layout, and of thread and process affinity to cores, than a pure MPI implementation.

Nevertheless, there are scenarios where hybrid really can pay off, as MPI lacks the OpenMP feature of accessing shared data in shared caches, for example.

Finally, if you want to run your hybrid code on RRZE systems, the following features are available.

Pinning of MPI/OpenMP hybrids

I assume you use the mpirun wrapper provided.

  • mpirun -pernode issues just one MPI process per node regardless of the nodefile content
  • mpirun -npernode n issues just n MPI processes per node regardless of the nodefile content
  • mpirun -npernode 2 -pin 0,1_2,3 env LD_PRELOAD=/apps/rrze/lib/ptoverride-ubuntu64.so issues 2 MPI processes per node and gives the threads of MPI process 0 access only to cores 0 and 1, and the threads of MPI process 1 access to cores 2 and 3 (of course the MPI processes themselves are also limited to those cores). Furthermore, the (e.g. OpenMP) threads are each pinned to one core only, so that migration is no longer an issue
  • mpirun -npernode 2 -pin 0_1_2_3 is your choice if you would like to test 1 OpenMP thread per MPI process and 4 MPI processes in total per node. Adding the LD_PRELOAD from above, however, decreases performance a lot. This is currently under investigation.
  • export PINOMP_MASK=2 changes the skip mask of the pinning tool

OpenMP spawns not only worker threads but also threads for administrative business such as synchronization. Usually you would pin only the threads contributing to the computation. The default skip mask, which skips the non-computational threads, might not be correct in the case of hybrid programming, as MPI also spawns non-worker threads. The PINOMP_MASK variable is interpreted as a bitmask, e.g. 2 –> 10 and 6 –> 110. A zero means pin the thread and a one means skip pinning the thread. The least significant bit corresponds to thread zero (bit 0 is 0 in the examples above).

    A mask of 6 was used in the algorithm under investigation as soon as one MPI process and 4 OpenMP worker threads were used per node, to get the correct thread pinning.
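The bitmask interpretation above can be sketched in a few lines of Python; decode_mask is a hypothetical helper for illustration, not part of any pinning tool:

```python
# Decode a PINOMP_MASK-style bitmask: bit i belongs to thread i,
# 0 means "pin this thread", 1 means "skip it".
def decode_mask(mask, nthreads):
    return {t: "skip" if (mask >> t) & 1 else "pin"
            for t in range(nthreads)}

# mask 6 = binary 110: pin worker thread 0, skip threads 1 and 2
print(decode_mask(6, 3))  # {0: 'pin', 1: 'skip', 2: 'skip'}
```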

The usage of the rankfile for pinning hybrid jobs is described in Thomas Zeiser's blog.

Thanks to Thomas Zeiser and Michael Meier for their help in resolving this issue.

Keywords: Thread Pinning Hybrid Affinity

Incorporated Comments of Thomas Zeiser

Thomas Zeiser, Thursday, 23 July 2009, 18:15

PINOMP_MASK for hybrid codes using Open-MPI

If recent Intel compilers and Open-MPI are used for hybrid OpenMP-MPI programming, the correct PINOMP_MASK seems to be 7 (instead of 6 for hybrid codes using Intel-MPI).

Thomas Zeiser, Monday, 22 February 2010, 20:41
PIN OMP and recent mvapich2

Recent mvapich2 also requires special handling for pinning hybrid codes: PINOMP_SKIP=1,2 seems to be appropriate.

20. May 2009

OpenMP Fortran

Johannes Habich, 12:12 Uhr in HPC

I'm currently investigating the fact that a scalar (4-byte integer) variable cannot be defined as FIRSTPRIVATE inside an !$OMP PARALLEL section.
This will cause abnormal program termination, seg faults, and undefined behavior.
However, defining the variable as PRIVATE works, and SHARED of course, too.
Hopefully a small code snippet will provide more insight.

16. October 2008

Co-array Fortran and UPC

Johannes Habich, 08:38 Uhr in HPC

CAF and UPC are Fortran and C extensions for the Partitioned Global Address Space (PGAS) model.
So, independent of the hardware restrictions, each processor can access (read and write) data of other processors without the need for additional communication libraries, e.g. MPI.

HLRS provided an introductory course about this.
At the current development stage I do not clearly see the benefit for production codes. However, some ideas might be implemented more quickly with these paradigms than with ordinary MPI for testing purposes.
