CUDA


10. February 2012

Future of this Blog

Johannes Habich, 06:35 in General, Contact, CUDA, Home, HPC, Parallel Computing MPI/OpenMP, PRACE, Publications, SKALB, Tools, Windows HPC

New posts for this blog appear here:

http://johanneshabich.blogspot.com/

There will be no more posts here.

9. August 2011

OpenCL GPU computing tutorial

Johannes Habich, 14:45 in CUDA, HPC

A very nice hands-on tutorial covering installation, setup and programming:

http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/

5. August 2011

NVIDIA CUDA disable/enable ECC, new commands

Johannes Habich, 07:50 in CUDA, HPC

To query the ECC status in CUDA 4.0 (and hopefully in later versions), nvidia-smi works a little differently:

nvidia-smi -q | grep -e"Ecc Mode" -e"Product" -A2

This gives you an excerpt.
Just use nvidia-smi -q for the full output.

To enable ECC:

nvidia-smi -i 0 -e 1

To disable ECC:

nvidia-smi -i 0 -e 0

Here, the number after -i is the GPU ID!
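If you have to switch ECC on several GPUs, it can help to generate the command lines rather than type them. The following is only a minimal sketch in Python: the helper name ecc_command and the two-GPU loop are my own assumptions for illustration, and the only nvidia-smi options used are the -i and -e flags quoted above. It prints the commands instead of running them.

```python
import shlex


def ecc_command(gpu_id: int, enable: bool) -> list[str]:
    """Build the nvidia-smi call that switches ECC on or off for one GPU."""
    if gpu_id < 0:
        raise ValueError("gpu_id must be non-negative")
    # -i selects the GPU ID, -e takes 1 (enable) or 0 (disable), as in the post.
    return ["nvidia-smi", "-i", str(gpu_id), "-e", "1" if enable else "0"]


if __name__ == "__main__":
    # Hypothetical node with two GPUs: print the commands for review only.
    for gpu in range(2):
        print(shlex.join(ecc_command(gpu, enable=True)))
```

The returned list can be handed to subprocess.run on a machine where nvidia-smi is installed and you have the required privileges.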

4. August 2011

TinyGPU upgrade to CUDA Toolkit 4.0

Johannes Habich, 07:07 in CUDA, HPC

All nodes of the TinyGPU cluster are now on the current CUDA Toolkit 4.0 and the appropriate driver.

21. June 2011

Disable Fermi Cache

Johannes Habich, 10:33 in CUDA, HPC

To disable the Fermi L1 cache for global memory loads in CUDA, just compile with: -Xptxas -dlcm=cg

Any idea on how to do this with OpenCL?

18. April 2011

NVIDIA CUDA disable/enable ECC

Johannes Habich, 13:30 in CUDA

Show ECC config: nvidia-smi -r
Enable ECC on GPU 0: nvidia-smi -g 0 -e 1
Disable ECC on GPU 0: nvidia-smi -g 0 -e 0

A reboot is required for the settings to take effect.

17. December 2010

CUDA Windows Development Environment

Johannes Habich, 11:03 in General, CUDA, HPC

Supported by the new NVIDIA Tesla Compute Cluster (TCC) driver, we now offer an integrated development environment for CUDA based on Windows HPC 2008 and Visual Studio.

For nearly two years, RRZE has provided development resources for CUDA GPGPU computing under Linux. In December 2009 a GPU cluster joined the two development machines to support production runs on up to 16 GPUs. Please contact us if you are interested in GPGPU computing, whether for Linux- or Windows-based development or for production runs.

10. December 2010

Windows and CUDA; enabling TCC with nvidia-smi

Johannes Habich, 08:03 in General, CUDA, Windows HPC

As on Linux, you can use nvidia-smi to set different modes on the Tesla GPUs.
nvidia-smi is usually located at: C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe

Open a command prompt with administrative privileges in that directory and type nvidia-smi -s. This shows the current status and the status of TCC mode after the next reboot.
Enable exclusive compute mode for the first GPU with nvidia-smi -g 0 -c 1
Disable exclusive compute mode for the first GPU with nvidia-smi -g 0 -c 0
For other GPUs, increment the number after -g.

/edit 24.12.2010:
Also look at the first comment on how to change between WDDM and TCC driver model.
Thanks, Francoise, for reporting my mistake. I corrected it above.

9. December 2010

Win2008 HPC Server and CUDA TCC revisited

Johannes Habich, 14:07 in CUDA, HPC, Windows HPC

The release of the stable NVIDIA driver 260.83 broke my Windows CUDA programming environment.
With the currently newest driver, 263.06, I gave it another shot. Initially, the CUDA SDK sample programs did not recognize the GPU as CUDA capable and only reported a driver and toolkit mismatch.
This time, however, searching the web led me to an IBM web page with a solution for their servers running Windows 2008 R2.
I tried it on Win2008 and it works like a charm:

  • Start the registry editor by typing regedit in the Run dialog and navigate to:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E968-E325-11CE-BFC1-08002BE10318}\

  • You will find subfolders named 0001, 0002 and so on, depending on the number of GPUs in your system.
  • For each card on which you want to enable CUDA, go to its 000X subfolder and add the following registry value (a 32-bit DWORD worked for me):

"AdapterType"=dword:00000002
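With several cards, writing the entries by hand gets tedious. As a rough sketch only, the Python snippet below generates the text of a .reg file that sets the value for a list of subfolders. The function name adapter_type_reg and the subfolder names passed in are my own assumptions; which 000X subfolders actually belong to your GPUs has to be checked in regedit first, and the generated text should be reviewed (and the registry backed up) before importing anything.

```python
# Class key taken from the post; the subkey names are supplied by the caller.
CLASS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class"
    r"\{4D36E968-E325-11CE-BFC1-08002BE10318}"
)


def adapter_type_reg(subkeys):
    """Return .reg file text that sets AdapterType to 2 under each subkey."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for name in subkeys:
        lines.append(f"[{CLASS_KEY}\\{name}]")
        lines.append('"AdapterType"=dword:00000002')
        lines.append("")  # blank line between sections
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical system with two cards in subfolders 0001 and 0002.
    print(adapter_type_reg(["0001", "0002"]))
```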

If you access the system via RDP, read my blog entry Using nvidia-smi for TCC on how to set this up!

This information comes from IBM; further reference and even more details can be found here: IBM Support Site

13. October 2010

NVIDIA CUDA TCC Driver Released 260.83

Johannes Habich, 08:03 in CUDA, HPC, Windows HPC

Just today NVIDIA released the WHQL-certified Tesla Compute Cluster driver TCC 260.83 for use in, e.g., Windows 2008 Server/HPC.
Until now, only a beta version was available.
This special driver lets you use GPGPU compute resources via RDP or in Windows HPC batch processing mode.

Download the driver here

/edit:
Actually installing this driver broke my working environment. So be sure to keep a backup of the system. Even reinstalling the beta version did not solve the problem.

12. October 2010

Win2008 HPC Server and CUDA TCC

Johannes Habich, 13:54 in CUDA, HPC, Windows HPC

NVIDIA now provides a beta driver called Tesla Compute Cluster (TCC) that makes CUDA GPUs usable within a Windows cluster environment, not only remotely via RDP but also in batch processing. Until now, the HPC Server lacked this ability, as Windows did not start the graphics driver inside the limited batch logon mode.

My first steps with TCC took a little bit longer than estimated.

First of all, it is not possible to run an NVIDIA GPU side by side with an AMD or Intel GPU, as Windows needs to use one unified WDDM driver, and that comes from either one vendor or the other. This was completely new to me.

After this first minor setback, and re-equipped with only the Tesla C2050, the BIOS did not finish booting, so make sure your BIOS revision is up to date.
Another NVIDIA card was the quick fix on my side.

Next comes the setup. Install the 260 (currently beta) drivers, and the exclamation mark in the Device Manager should vanish.
After that, install the toolkit and the SDK if you like.
With the nvidia-smi tool, which you will find in one of the countless NVIDIA folders that now exist, you can check whether the card is initially recognized correctly.
You can also use it to check the TCC status and to set the compute mode of the Tesla card if you want remote CUDA capabilities:

nvidia-smi -s -> shows the current status
nvidia-smi -g 0 -c 1 -> enables exclusive compute mode on GPU 0 (not TCC itself; see the correction in the entry of 10 December 2010)

Next, test the deviceQuery sample that comes with the SDK.
If it runs and everything looks fine, consider yourself lucky!

Nothing ran on my setup, so I first tried to build the SDK example myself. To do so, first build the CUDA utilities, which are located somewhere in the SDK inside the folder “common”.
Depending on the Nsight or toolkit version you have installed, you may get an error when opening the VS project files. In that case you need to edit the Visual Studio project file with a text editor of your choice and replace the outdated build rule with the one actually installed:

  • In the error message, find the folder in which VS cannot find the file.
  • Copy the path and go there with your file browser.
  • Find the file whose name is most similar to the one in the VS error message.
  • Once found, open the VS project file and replace the wrong filename there with the correct one.
  • VS should now open the project.
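If several project files reference the same outdated build rule, the replacement step can be scripted. This is just a sketch in Python under my own assumptions: the function name fix_build_rule and the rule filenames in the example are hypothetical placeholders, not the real names from the SDK or toolkit, and the one-line project snippet only stands in for a real project file.

```python
def fix_build_rule(project_text: str, missing_rule: str, installed_rule: str) -> str:
    """Replace every reference to a missing build-rule file with the installed one."""
    if missing_rule not in project_text:
        # Fail loudly instead of silently returning an unchanged project file.
        raise ValueError(f"{missing_rule!r} is not referenced in the project file")
    return project_text.replace(missing_rule, installed_rule)


if __name__ == "__main__":
    # Hypothetical project line and hypothetical rule filenames.
    snippet = '<ToolFile RelativePath="C:\\rules\\old_cuda.rules"/>'
    print(fix_build_rule(snippet, "old_cuda.rules", "installed_cuda.rules"))
```

Keep a copy of the original project file before overwriting it with the result.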

In order to compile, add the correct include and library directories to the VS project.
Finally you can build deviceQuery or any other program.

Still this setup gave me the same error as the precompiled deviceQuery:
cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.

With the help of Dependency Walker I found out that a missing DLL was the problem, namely:
linkinfo.dll.

You can get it by adding the feature named “Desktop Experience” through the Server Manager.
Once it was installed and the system rebooted, deviceQuery worked.

3. September 2010

TinyGPU offers new hardware

Johannes Habich, 13:20 in CUDA, HPC

TinyGPU has new hardware: tg010. Its hardware configuration and the currently deployed software differ from those of the non-Fermi nodes:

  • Ubuntu 10.04 LTS (instead of 8.04 LTS) as OS.
    Note: To use the Intel compiler <= 11.1 locally on tg010, you [currently] have to load the gcc/3.3.6 module. Otherwise, libstdc++.so.5 is missing, as Ubuntu 10.04 no longer ships this version. This is necessary only for compilation; compiled Intel binaries will run as expected.
  • /home/hpc and /home/vault are mounted [only] through NFS (and not natively via GPFS cross-cluster mount)
  • Dual-socket system with Intel Westmere X5650 (2.66 GHz) processors with 6 native cores per socket (instead of a dual-socket system with Intel Nehalem X5550 (2.66 GHz) processors with 4 native cores per socket)
  • 48 GB DDR3 RAM (instead of 24 GB DDR3 RAM)
  • 1x NVIDIA Tesla C2050 (“Fermi” with 3 GB GDDR5 featuring ECC)
  • 1x NVIDIA GTX 280 (consumer card with 1 GB RAM, formerly known as F22)
  • 2 further PCIe 2.0 x16 slots will be equipped with NVIDIA C2070 cards (“Fermi” with 6 GB GDDR5 featuring ECC) in Q4, instead of the 2x NVIDIA Tesla M1060 (“Tesla” with 4 GB RAM) in the remaining cluster nodes
  • SuperServer 7046GT-TRF / X8DTG-QF with dual Intel 5520 (Tylersburg) chipset instead of SuperServer 6016GT-TF-TM2 / X8DTG-DF with Intel 5520 (Tylersburg) chipset

To allocate the Fermi node, specify :ppn=24 with your job (instead of :ppn=16) and explicitly submit to the TinyGPU queue fermi. The wallclock limit is set to the default of 24h. The ECC memory status is shown on job startup.
This article is an attempted translation of the original post: Zuwachs im TinyGPU-Cluster

24. September 2009

PCI express pinned Host Memory

Johannes Habich, 08:40 in CUDA

While retesting my benchmarks with the current release, CUDA 2.3, I finally incorporated new features such as pinned host memory allocation. According to the specs, this improves host-to-device transfers and vice versa.
Thanks to the special allocation, the arrays stay at the same location in memory, are never swapped out and are more quickly available for DMA transfers. Without it, most data takes a detour through a pinned memory buffer on its way between the ordinarily allocated memory and the device. With pinned allocation this detour is omitted.

The performance plot shows that pinned memory now reaches up to 5.9 GB/s on the fastest currently available PCIe x16 Gen 2 interface, which has a peak transfer rate of 8 GB/s. This corresponds to 73% of peak performance with almost no optimization applied. In contrast, optimizations such as blocked data transfer, which proved to increase performance some time ago [PCIe revisited], no longer have any positive effect on performance.

Using only the blocked optimization without pinned memory is still better than an unblocked transfer from unpinned memory, but it only transfers about 4.5 GB/s to the device, which corresponds to 56% of peak.
Reading from the device is far worse, at only 2.3 GB/s.
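For reference, the peak and the percentages quoted above can be reproduced with a few lines of arithmetic. The sketch below is in Python; the function names are my own, and it assumes the usual PCIe Gen 2 figures of 5 GT/s per lane with 8b/10b encoding, which give 8 GB/s per direction for 16 lanes. Note that 5.9 GB/s works out to 73.75% of that peak, which the text quotes as 73%.

```python
def pcie_peak_gb_per_s(lanes: int, gigatransfers_per_s: float,
                       encoding_efficiency: float) -> float:
    """Peak payload bandwidth of a PCIe link in GB/s, one direction."""
    gbit_per_s = lanes * gigatransfers_per_s * encoding_efficiency
    return gbit_per_s / 8.0  # 8 bits per byte


def percent_of_peak(measured_gb_per_s: float, peak_gb_per_s: float) -> float:
    """Measured bandwidth as a percentage of the peak."""
    return 100.0 * measured_gb_per_s / peak_gb_per_s


if __name__ == "__main__":
    # Assumed Gen 2 parameters: 16 lanes, 5 GT/s per lane, 8b/10b encoding.
    peak = pcie_peak_gb_per_s(16, 5.0, 8 / 10)
    for label, measured in [("pinned", 5.9), ("blocked, unpinned", 4.5)]:
        print(f"{label}: {percent_of_peak(measured, peak):.2f}% of {peak:.1f} GB/s")
```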

PCIe Bandwidth Measurements GTX280 using pinned Memory

23. July 2009

Cuda 2.3 released

Johannes Habich, 09:06 in CUDA

NVIDIA just released CUDA version 2.3 with the corresponding driver.
F22 @RRZE has already been updated to support this version.

8. July 2009

Cuda Machines @ RRZE

Johannes Habich, 08:23 in CUDA

This information will not be updated any more. Please visit our official page, as we now provide GPU computing as a cluster resource:

RRZE HPC Services

Currently the available CUDA test systems @ RRZE are:

lightning: (available with upgraded hardware)
Ubuntu 8.04 x86_64
2x Quadcore Intel Clovertown (2.33 GHz), 4 MB L2 per 2 cores
GeForce 8800 Ultra (768 MB) (G80 core)
Cuda Driver Version: 180.22
Cuda Toolkit: 2.0

f22: (Last Update 29.09.09)
Ubuntu 8.04 x86_64
2x Quadcore Intel Xeon L5420 (2.5 GHz)
GeForce GTX 280 SC (1 GB) (GT200 Core)
Current: Cuda Driver Version: 190.29 (Cuda 2.3) -> with OpenCL Support!
Before: Cuda Driver Version: 190.16 (Cuda2.3)
Cuda Toolkit: 2.3

7. July 2009

Cuda Tutorial @ RRZE

Johannes Habich, 17:10 in CUDA

Currently we have two test systems running different NVIDIA GPUs inside the test cluster environment.

  • Please apply for an HPC account at RRZE (ask your local administrator).
  • You get access to one of the machines either by submitting a job script or by requesting an interactive shell, e.g.:
  • qsub -I -lnodes=f22:ppn=8,walltime=01:00:00

  • Note that interactive sessions are limited to one hour, but they are the recommended way to try things out in the beginning.
  • The module system supplies you with various versions of compilers and of CUDA, e.g.
  • module load cuda/2.2 will give you CUDA version 2.2 64bit

  • The next thing you will want to try is compiling the SDK examples.
    • To do so, download the SDK matching the CUDA version you want to use (please check whether that version is available, too!) and extract it to some directory by running it.
    • The CUDA path you have to specify (not the install path!) is /usr/local/cudaXX, where XX is the version and the architecture (e.g. -32).
    • Then enter the directory you extracted to and type make. It should compile; if it doesn’t, please look into /usr/local/cudaXX/bin/linux/release/. If you find executables there and you can actually run them, then there is a mistake somewhere in your settings. If you are trying to compile in 32-bit mode, please contact us at hpc@rrze.uni-erlangen.de, because you will need further assistance.
  • Assuming compilation went well (went well = no errors; we neglect the warnings here), you should have runnable SDK examples in bin/linux/release/ inside the SDK directory.
  • Now your basic CUDA environment is set up and ready for your own code.
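If you request interactive shells often, a tiny helper can assemble the qsub line from the list above and catch walltimes beyond the one-hour interactive limit before the request is submitted. This is only a sketch in Python: the function names are my own, and it uses nothing beyond the qsub options shown in the example above.

```python
import shlex

INTERACTIVE_LIMIT_S = 3600  # one hour, the interactive limit stated in the post


def walltime_seconds(walltime: str) -> int:
    """Convert an hh:mm:ss walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def interactive_qsub(node: str, ppn: int, walltime: str) -> list[str]:
    """Build the interactive qsub call for one node, as in the example above."""
    if walltime_seconds(walltime) > INTERACTIVE_LIMIT_S:
        raise ValueError("interactive sessions are limited to one hour")
    return ["qsub", "-I", f"-lnodes={node}:ppn={ppn},walltime={walltime}"]


if __name__ == "__main__":
    # Prints the command for review; it is not executed here.
    print(shlex.join(interactive_qsub("f22", 8, "01:00:00")))
```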

15. December 2008

PCI express revisited

Johannes Habich, 11:42 in CUDA

Test results with the new generation, i.e. GT200-based cards and PCIe Generation 2.0 with doubled performance, show that naively implemented copies do not gain any speedup.
Blocked copies, however, climb up to 4.5 GB/s when writing data to GPU memory.
Copying data back to the host is still relatively slow at 2 GB/s.

PCIe bandwidth measurements: 8800 GTX vs. GTX 280

Link to first article

23. November 2008

Yeehhaa: NVIDIA GT200 rocks

Johannes Habich, 23:33 in CUDA

A sample of the new GT200-based NVIDIA GTX 280 graphics card arrived at our computing center last Friday. The card was installed and set up right away; the first benchmark started on Saturday, 22 November, and finished today.

Some preliminary figures show the great improvement of this new generation, as I expected from the data sheets. Soon I will post some verified results here, along with some notes on the changes from the G80 generation to the current GT200 chip.

10. September 2008

PCI express bandwidth measurements

Johannes Habich, 14:26 in CUDA

While benchmarking the PCI Express capabilities with CUDA, I stumbled across the weird behaviour that a 4 MB block seems to achieve the best sustainable bandwidth, at least when writing to the host.
However, transmitting more than 4 MB in 4 MB data packets (let’s call it blocked copy) does leave a gap in performance.
Although the performance is regained at the end, when almost the whole GPU memory is filled, the question is what causes the performance to drop to 2 GB/s in the first place.
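To make the term blocked copy concrete: a large transfer is split into pieces of at most 4 MB, and each piece is copied on its own. The sketch below shows only the splitting logic in Python; the function name blocks is hypothetical, taking 4 MB as 4 * 1024 * 1024 bytes is my assumption, and in the CUDA benchmark each (offset, size) pair would correspond to one copy call such as cudaMemcpy on the matching slice of the buffer.

```python
BLOCK_BYTES = 4 * 1024 * 1024  # 4 MB block size (binary megabytes assumed)


def blocks(total_bytes: int, block_bytes: int = BLOCK_BYTES):
    """Yield (offset, size) pairs that cover total_bytes in block_bytes pieces."""
    if total_bytes < 0 or block_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and block_bytes > 0")
    offset = 0
    while offset < total_bytes:
        # The last block may be shorter than block_bytes.
        size = min(block_bytes, total_bytes - offset)
        yield offset, size
        offset += size


if __name__ == "__main__":
    # A 10 MB transfer is split into two full blocks and one 2 MB remainder.
    for offset, size in blocks(10 * 1024 * 1024):
        print(offset, size)
```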

Another interesting question is the jump in performance at 1e6 bytes, possibly caused by a switch in protocols.
Performance of PCI Express transfers to NVIDIA G80 8800 GTX card

2. September 2008

Towards Teraflops for Games

Johannes Habich, 12:04 in CUDA

With the release of the next generation of GPUs, NVIDIA and AMD (formerly ATI) graphics boards now deliver performance on the order of one teraflop in single precision. NVIDIA nearly doubled both the number of processors and the memory bus width. The interesting question for research is now how the sustainable performance of programs and algorithms scales on the new platform.
Until now I have not been able to test my own algorithms, the stream benchmarks and the lattice Boltzmann method (see my thesis for more details), on the new NVIDIA GPUs.

Double precision has also made its way into the GPU circuits, unfortunately with a huge performance loss, down to around a tenth of single precision performance.
In contrast, current CPUs lose only about 50% of their performance, which follows directly from the doubled computational work.

Here is a little demonstration of the key difference between CPU and GPU: NVISION

1. September 2008

First Shot

Johannes Habich, 09:21 in CUDA

I’m currently with the HPC group @ RRZE, working on my master’s thesis about HPC on graphics cards, covering benchmark kernels and flow solvers.

So any remarks or hints? Drop them here!

Thanks

/edit 01.07.08

Thesis finished :-)
