Johannes Habichs Blog

11. August 2011

OpenCL programming

Johannes Habich, 08:20 Uhr in Allgemeines

OpenCL Kernels get compiled at execution time (Just in Time, JIT).
This means that any error inside the kernel is discovered at that time.
The error message I get till now are very silent about what the problem of the kernel is.
So there is just one method of commenting and uncommenting to get a kernel debugged just for compilation.

Intel however recently released it’s first beta version of OpenCL with a lot of tools.
As with early Parallel Studio, these tools are only available for Windows (in particular Win Vista + and Server 2008) and not for Linux.
Note, that the runtime for compiling and running OpenCL is available for Linux, Mac and Windows!

Included is the Intel OpenCL Offline Compiler where you can load your kernel and precompile it.
Here the error messages are much more helpfull (of course, helpfull in a way ordinary compiler messages are helpfull :-)).
Nevertheless, a great tool which makes daily programming a lot easier.

4. August 2011

TinyGPU upgrade to CUDA Toolkit 4.0

Johannes Habich, 07:07 Uhr in CUDA, HPC

All nodes of the TinyGPU Cluster is now on the current CUDA Toolkit 4.0 and the appropriate driver.

21. June 2011

Disable Fermi Cache

Johannes Habich, 10:33 Uhr in CUDA, HPC

To disable Fermi L1 Cache in CUDA just compile with: -Xptxas -dlcm=cg

Any idea on how to do this with OpenCL?

10. December 2010

Windows and CUDA; enabling TCC with nvidia-smi

Johannes Habich, 08:03 Uhr in Allgemeines, CUDA, Windows HPC

Like in Linux you can use nvidia-smi to set different modes on the TESLA GPUs.
nvidia-smi is located usually at: C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe

Go there with a command prompt and administrative privileges and type nvidia-smi -s. This gives you the current status and the status of TCC mode after reboot.
Set exclusive compute mode enabled for the first GPU by nvidia-smi -g 0 -c 1
Set exclusive compute mode disabled for the first GPU by nvidia-smi -g 0 -c 0
For other GPUs increment the number after -g

/edit 24.12.2010:
Also look at the first comment on how to change between WDDM and TCC driver model.
Thanky Francoise for reporting my mistake. I corrected it above

9. December 2010

Win2008 HPC Server and CUDA TCC revisited

Johannes Habich, 14:07 Uhr in CUDA, HPC, Windows HPC

The release of the stable NVIDIA Driver 260.83 broke my Windows CUDA programming environment.
With the currently newst driver, 263.06, I gave it another shoot. Initially the CUDASDK sample programs did not recognize the GPU as CUDA capable and there was just some babbling about DRIVER and TK mismatch.
However this time searching the web got me to an IBM webpage which got a solution for their servers running Windows 2008 R2.
I tried this in Win2008 and it works like charm:

• Enter the registry edit utility typing regedit in the run dialog and navigate to:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E968-E325-11CE-BFC1-08002BE10318}\

• You will find subfolders named 0001 0002 aso. depending on the number of GPUs in your system.
• For each card you want to enable CUDA go to that 000X directory and add the following reg key (32bit dword worked for me):

If you access the system via RDP read my blog entry on Using nvidia-smi for TCC on how to set this up!

Source of this information is IBM and can be found here for further reference and even more details: IBM Support Site

13. October 2010

NVIDIA CUDA TCC Driver Released 260.83

Johannes Habich, 08:03 Uhr in CUDA, HPC, Windows HPC

Just today Nvidia released the WHQL certified Tesla Compute Cluster driver TCC 260.83 for usage in e.g. Windows 2008 Server/HPC.
Till now only a beta version was available
With that special driver you have the ability to use GPGPU compute resources via RDP or via WindowsHPC batch processing mode.

/edit:
Actually installing this driver broke my working environment. So be sure to keep a backup of the system. Even reinstalling the beta version did not solve the problem.

12. October 2010

Win2008 HPC Server and CUDA TCC

Johannes Habich, 13:54 Uhr in CUDA, HPC, Windows HPC

Nvidia now provides a beta driver called Tesla Compute Cluster (TCC) in order to use CUDA GPUs within a windows cluster environment. Not only remotely via RDP but also in batch processing. Till now, the HPCServer lacked this ability, as Windows did not fire up the graphics driver inside the limited batch logon mode.

My first steps with TCC took a little bit longer than estimated.

First of all It is not possible to have a NVIDIA and AMD or INTEL GPU side by side as Windows needs to use one unified WDM and thats either one or the other vendor. This was completely new to me.

After this first minor setback and reequipped with only the tesla C2050 the BIOS did not finish, so be sure to be up to date with your BIOS revision.
Another NVIDIA card was the quick fix on my side.

Next thing is the setup. Install the 260 (currently beta) drivers and the exclamation mark in the device manager should vanish.
After that install the toolkit and SDK if you like.
With the nvidia-smi tool, which you find in one of the uncountable NVIDIA folders which are there now, you can have a look if the card is initally correctly recognized.
As well set the TCC mode of the Tesla card to enabled if you want to have remote cuda capabilities:

nvidia-smi -s –> shows the current status
nvidia-smi -g 0 -c 1 –> enables TCC on GPU 0

Next thing you want to test the device query coming with the SDK.
If it runs and everything looks fine, feel gifted!

Nothing did run on my setup. So first of all I tried to build the SDK example myself. Therefore first of all build the Cuda utilities, lying somewhere in the SDK within the folder “common”.
Depending on the Nsight or TK version you have installed you get an error when opening the VS project fles . The you need to edit the visual studio with a text editor of your choice and replace the outdated build rule with the one actually installed.

• In the error message get to the folder where VS does not find the file.
• Copy the path and go there with your file browser
• Find the file most equal to the one in the VS error message.
• Once found open the VS file and replace the wrong filename there with the correct one
• VS should open

In order to compile, add the correct include and library directories to the VS project.
Finally you can build deviceQuery or any other program.

Still this setup gave me the same error as the precompiled deviceQuery:
cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.

With the help of the DependencyWalker i found out that a missing DLL was the problem, namely:

You can get this by adding the feature named “Desktop Experience” through the server manager.
Once installed and rebooted the device query worked.

3. September 2010

TinyGPU offers new hardware

Johannes Habich, 13:20 Uhr in CUDA, HPC

TinyGPU has new hardware: tg010. The hardware configuration and the currently deployed software are different to the non-Fermi nodes:

• Ubuntu 10.04 LTS (instead of 8.04 LTS) as OS.
Note: For using the Intel Compiler <= 11.1 locally on tg010, you have to use gcc/3.3.6 Module [currently]. If not,  libstdc++.so.5 is missing , as Ubuntu 10.04 does no longer contain this version. This is necessary only for compilation. Compiled Intel binaries will run as expected.
• /home/hpc and /home/vault are mounted [only] through NFS  (and natively via GPFS-Cross-Cluster-Mount)
• Dual-Socket-System with  Intel Westmere X5650 (2.66 GHz) processor, having 6 native cores per socket (instead of Dual-Socket-System with  Intel Nehalem X5550 (2.66 GHz), having  4 native cores per socket)
• 48 GB DDR3 RAM (instead of  24 GB DDR3 RAM)
• 1x NVidia Tesla C250 (“Fermi” with  3 GB GDDR5 featuring ECC)
• 1x NVidia GTX 280 (Consumer-Card with 1 GB RAM – formerly know as F22)
• 2 further PCIe2.0 16x slots will be equipped with  NVidia C2070 Cards (“Fermi” with  6 GB GDDR5 featuring ECC) in Q4, instead of  2x NVidia Tesla M1060 (“Tesla” with  4 GB RAM) as in the remaining cluster nodes
• SuperServer 7046GT-TRF / X8DTG-QF with  dual Intel 5520 (Tylersburg) chipset instead of  SuperServer 6016GT-TF-TM2 / X8DTG-DF with  Intel 5520 (Tylersburg) chipset

To allocate the fermi node, specify  :ppn=24 with your job  (instead of  :ppn=16) and explicitly submit to  the  TinyGPU-Queue fermi. The wallclock limit is set to the default of 24h . The ECC Memory status is shown on job startup.
This article tries to be a translation from the original posted here: Zuwachs im TinyGPU-Cluster

24. September 2009

PCI express pinned Host Memory

Johannes Habich, 08:40 Uhr in CUDA

Retesting my benchmarks with the current release of Cuda 2.3 I finally incorporated new features like pinned host memory allocation. Specs say that this improves the host to device transfers and vice versa.
Due to the special allocation the arrays will stay at the same location in memory , will not be swapped and are faster available for DMA transfers. In the other case, most data is first copied to a pinned memory buffer and then to the ordinarily allocated memory space. This detour is omitted in this case here.

The performance plot shows, that pinned memory now offers a performance of up to 5.9 GB/s on the fastest currently available PCIe X16 Gen 2 Interface which has a peak transfer rate of 8 GB/s. This corresponds to 73% of Peak performance with almost no optimization applied. In contrast, optimization such as a blocked data transfer, which prooved to increase performance some time ago [PCIe revisited] have no positive effect on performance anymore.

Using only the blocked optimzations without pinned memory still is better then doing an unblocked transfer from unpinned memory, but it only transfers about 4.5 GB/s which corresponds to 56 % of peak to the device.
Reading from the device is far worst with only 2.3 GB/s.

23. July 2009

Cuda 2.3 released

Johannes Habich, 09:06 Uhr in CUDA

NVIDIA just released Cuda Version 2.3 with the corresponding driver.
F22 @RRZE has already been updated to support this Version.