10. February 2012
Future of this Blog
Please find new posts to this blog here:
http://johanneshabich.blogspot.com/
There will be no more posts here
9. August 2011
A very nice hands-on tutorial covering installation, setup and programming:
http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/
5. August 2011
To query the ECC status in CUDA 4.0 (and hopefully later versions), nvidia-smi works a little differently:
nvidia-smi -q | grep -e"Ecc Mode" -e"Product" -A2
will give you an excerpt.
Just use nvidia-smi -q for the full output.
To enable ECC:
nvidia-smi -i 0 -e 1
To disable ECC:
nvidia-smi -i 0 -e 0
The number after -i is the GPU ID!
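The grep filter above can also be sketched in Python, for example to check the ECC state from a script. This is only a minimal sketch: the helper names grep_with_context and ecc_excerpt are made up for illustration, the matching is plain substring search rather than grep's regular expressions, and no "--" group separators are printed. It assumes nvidia-smi is on the PATH.

```python
import subprocess


def grep_with_context(text, patterns, after=2):
    """Return every line containing one of the patterns, plus the
    `after` lines that follow each match (roughly `grep -A2`)."""
    lines = text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if any(p in line for p in patterns):
            # Keep the matching line and up to `after` following lines.
            keep.update(range(i, min(i + after + 1, len(lines))))
    return [lines[i] for i in sorted(keep)]


def ecc_excerpt():
    """Run `nvidia-smi -q` and keep only the product and ECC mode lines.

    Hypothetical helper: it assumes nvidia-smi is installed and on the PATH.
    """
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    )
    return grep_with_context(result.stdout, ["Ecc Mode", "Product"])
```

Calling print("\n".join(ecc_excerpt())) on a machine with the driver installed should print roughly the same excerpt as the shell one-liner.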
4. August 2011
All nodes of the TinyGPU Cluster are now on the current CUDA Toolkit 4.0 and the appropriate driver.
21. June 2011
To disable Fermi L1 Cache in CUDA just compile with: -Xptxas -dlcm=cg
Any idea on how to do this with OpenCL?
18. April 2011
Show ECC config: nvidia-smi -r
Enable ECC on GPU 0: nvidia-smi -g 0 -e 1
Disable ECC on GPU 0: nvidia-smi -g 0 -e 0
A reboot is required for the settings to take effect.
17. December 2010
For nearly two years now, RRZE has provided development resources for CUDA GPGPU computing under Linux. In December 2009 a GPU cluster joined the two development machines in order to support production runs on up to 16 GPUs. Please contact us if you are interested in GPGPU computing, whether for Linux/Windows-based development or production runs.
10. December 2010
As on Linux, you can use nvidia-smi to set different modes on the Tesla GPUs.
nvidia-smi is usually located at: C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe
Go there in a command prompt with administrative privileges and type nvidia-smi -s. This shows the current status and the status of TCC mode after the next reboot.
Enable exclusive compute mode for the first GPU with nvidia-smi -g 0 -c 1
Disable exclusive compute mode for the first GPU with nvidia-smi -g 0 -c 0
For other GPUs, increment the number after -g.
/edit 24.12.2010:
Also look at the first comment for how to switch between the WDDM and TCC driver models.
Thanks, Francoise, for reporting my mistake. I corrected it above.
9. December 2010
The release of the stable NVIDIA Driver 260.83 broke my Windows CUDA programming environment.
With the currently newest driver, 263.06, I gave it another shot. Initially the CUDA SDK sample programs did not recognize the GPU as CUDA capable, and there was just some babbling about a driver and toolkit mismatch.
However, this time searching the web led me to an IBM webpage with a solution for their servers running Windows 2008 R2.
I tried this on Win2008 and it works like a charm:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E968-E325-11CE-BFC1-08002BE10318}\
"AdapterType"=dword:00000002
If you access the system via RDP, read my blog entry on Using nvidia-smi for TCC to see how to set this up!
This information comes from IBM; further reference and more details can be found here: IBM Support Site
13. October 2010
Just today Nvidia released the WHQL-certified Tesla Compute Cluster driver TCC 260.83 for use in, e.g., Windows 2008 Server/HPC.
Until now, only a beta version was available.
With that special driver you have the ability to use GPGPU compute resources via RDP or via WindowsHPC batch processing mode.
Download the driver here
/edit:
Actually installing this driver broke my working environment. So be sure to keep a backup of the system. Even reinstalling the beta version did not solve the problem.
12. October 2010
Nvidia now provides a beta driver called Tesla Compute Cluster (TCC) in order to use CUDA GPUs within a Windows cluster environment, not only remotely via RDP but also in batch processing. Until now, the HPC Server lacked this ability, as Windows did not start the graphics driver inside the limited batch logon mode.
My first steps with TCC took a little bit longer than estimated.
First of all, it is not possible to have an NVIDIA GPU side by side with an AMD or Intel GPU, as Windows needs to use one unified display driver (WDDM), and that comes from either one vendor or the other. This was completely new to me.
After this first minor setback, and re-equipped with only the Tesla C2050, the BIOS did not finish, so make sure your BIOS revision is up to date.
Another NVIDIA card was the quick fix on my side.
Next thing is the setup. Install the 260 (currently beta) drivers and the exclamation mark in the device manager should vanish.
After that install the toolkit and SDK if you like.
With the nvidia-smi tool, which you will find in one of the countless NVIDIA folders that now exist, you can check whether the card is initially recognized correctly.
Also set the TCC mode of the Tesla card to enabled if you want remote CUDA capabilities:
nvidia-smi -s –> shows the current status
nvidia-smi -g 0 -c 1 –> enables TCC on GPU 0
Next, test the deviceQuery sample that comes with the SDK.
If it runs and everything looks fine, consider yourself lucky!
Nothing ran on my setup, so I first tried to build the SDK example myself. To do that, first build the CUDA utilities, which lie somewhere in the SDK within the folder "common".
Depending on the Nsight or toolkit version you have installed, you may get an error when opening the VS project files. In that case, edit the Visual Studio project file with a text editor of your choice and replace the outdated build rule with the one actually installed.
In order to compile, add the correct include and library directories to the VS project.
Finally you can build deviceQuery or any other program.
Still this setup gave me the same error as the precompiled deviceQuery:
cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.
With the help of Dependency Walker I found out that a missing DLL was the problem, namely:
linkinfo.dll.
You can get it by adding the feature named "Desktop Experience" through the Server Manager.
Once it was installed and the machine rebooted, deviceQuery worked.
3. September 2010
TinyGPU has new hardware: tg010. The hardware configuration and the currently deployed software are different to the non-Fermi nodes:
/home/hpc and /home/vault are mounted [only] through NFS (and natively via GPFS-Cross-Cluster-Mount).
To allocate the Fermi node, specify :ppn=24 with your job (instead of :ppn=16) and explicitly submit to the TinyGPU queue fermi.
The wallclock limit is set to the default of 24h.
The ECC memory status is shown on job startup.
This article is an attempted translation of the original post: Zuwachs im TinyGPU-Cluster
24. September 2009
Retesting my benchmarks with the current release of Cuda 2.3 I finally incorporated new features like pinned host memory allocation. Specs say that this improves the host to device transfers and vice versa.
Due to the special allocation, the arrays stay at the same location in memory, are not swapped out and are available more quickly for DMA transfers. With ordinarily allocated memory, most data is first copied to a pinned memory buffer and only then to or from the ordinary memory space. With pinned allocation this detour is omitted.
The performance plot shows that pinned memory now offers up to 5.9 GB/s on the fastest currently available PCIe x16 Gen 2 interface, which has a peak transfer rate of 8 GB/s. This corresponds to 73% of peak performance with almost no optimization applied. In contrast, optimizations such as a blocked data transfer, which proved to increase performance some time ago [PCIe revisited], no longer have any positive effect on performance.
Using only the blocked optimization without pinned memory is still better than an unblocked transfer from unpinned memory, but it only transfers about 4.5 GB/s to the device, which corresponds to 56% of peak.
Reading from the device is far worse, with only 2.3 GB/s.
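The percent-of-peak figures above are simple ratios against the 8 GB/s peak of the PCIe x16 Gen 2 interface. A minimal Python sketch of that arithmetic follows; the function name and the dictionary labels are made up for illustration, and only the bandwidth numbers come from the measurements quoted in this post.

```python
# Peak transfer rate of PCIe x16 Gen 2 as quoted in the post, in GB/s.
PCIE_GEN2_X16_PEAK_GBS = 8.0


def percent_of_peak(measured_gbs, peak_gbs=PCIE_GEN2_X16_PEAK_GBS):
    """Return the measured bandwidth as a percentage of the peak rate."""
    return 100.0 * measured_gbs / peak_gbs


# Bandwidths quoted in the post (GB/s); the labels are illustrative only.
measurements = {
    "pinned memory": 5.9,
    "blocked copy to device, unpinned": 4.5,
    "read from device": 2.3,
}

for name, gbs in measurements.items():
    print(f"{name}: {gbs} GB/s is {percent_of_peak(gbs):.2f}% of peak")
```

The first two ratios come out at about 73.75% and 56.25%, which the text above quotes as 73% and 56%.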

23. July 2009
NVIDIA just released Cuda Version 2.3 with the corresponding driver.
F22 @RRZE has already been updated to support this Version.
8. July 2009
This information will not be updated any more. Please visit our official page, as we now provide GPU computing as a cluster resource:
Currently the available CUDA test systems @ RRZE are:
lightning: (available with upgraded hardware)
Ubuntu 8.04 x86_64
2x Quadcore Intel Clovertown (2.33 GHz), 4 MB L2 per 2 cores,
GeForce 8800 Ultra (768 MB) (G80 core)
Cuda Driver Version: 180.22
Cuda Toolkit: 2.0
f22: (Last Update 29.09.09)
Ubuntu 8.04 x86_64
2x Quadcore Intel Xeon L5420 (2.5 GHz)
GeForce GTX 280 SC (1 GB) (GT200 Core)
Current: Cuda Driver Version: 190.29(Cuda2.3) –> with OpenCL Support!
Before: Cuda Driver Version: 190.16 (Cuda2.3)
Cuda Toolkit: 2.3
7. July 2009
Currently we have two test systems running different GPUs from NVIDIA inside the testcluster environment.
qsub -I -lnodes=f22:ppn=8,walltime=01:00:00
module load cuda/2.2 will give you Cuda Version 2.2 64bit
15. December 2008
Test results with the new generation, i.e. GT200 based and PCIe Generation 2.0 with doubled performance, show that naively implemented copies do not get any speedup.
Blocked copies, however, climb up to 4.5 GB/s when writing data to GPU memory.
Data copy back to the host is still relatively low at 2 GB/s.

23. November 2008
A sample of the new GT200-based NVIDIA GTX 280 graphics card arrived at our Computing Center last Friday. The card was installed and set up right away, and the first benchmark ran on Saturday, 22nd of November, and finished today.
Some preliminary figures show the great improvement of this new generation as I expected from the data sheets. Soon I will post some verified results here and some about the changes from the G80 generation to the current GT200 chip.
10. September 2008
Benchmarking the PCI express capabilities with CUDA I stumbled across the weird behaviour that a 4 MB block seems to achieve the best sustainable bandwidth. At least when writing to the host.
However, transmitting more than 4 MB but with 4 MB data packets (let’s call it blocked copy) does leave a gap in performance.
Although the performance is regained at the end, when almost the whole GPU memory is filled, the question is what causes the performance to drop to 2 GB/s in the first place.
Another interesting question is the jump in performance at 1e6 bytes. Possibly a switch in protocols.
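To make the term "blocked copy" concrete: a large transfer is split into consecutive 4 MB pieces, and each piece is copied on its own. The following Python sketch only computes the offsets and sizes of those pieces. It is an illustration, not the benchmark code; the function name is made up, and 4 MB is assumed to mean 4 * 1024 * 1024 bytes. In a CUDA benchmark each (offset, size) pair would presumably correspond to one cudaMemcpy call.

```python
MIB = 1024 * 1024  # assumption: "4 MB" in the post means 4 * 2**20 bytes


def blocked_chunks(total_bytes, block_bytes=4 * MIB):
    """Yield (offset, size) pairs that split a transfer of total_bytes
    into consecutive blocks of at most block_bytes each."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    offset = 0
    while offset < total_bytes:
        # The last block may be smaller than block_bytes.
        size = min(block_bytes, total_bytes - offset)
        yield offset, size
        offset += size


# Example: a 10 MiB transfer becomes two full blocks and one 2 MiB remainder.
for offset, size in blocked_chunks(10 * MIB):
    print(f"copy {size} bytes starting at byte offset {offset}")
```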

2. September 2008
With the release of the next generation of GPUs, NVIDIA and AMD (formerly ATI) graphics boards now deliver performance on the order of one teraflop in single precision. NVIDIA nearly doubled both the number of processors and the memory bus width. The interesting question for research now is how the sustainable performance of programs and algorithms scales with the new platform.
Until now I have not been able to test my own algorithms, the stream benchmarks and the lattice Boltzmann method (see my thesis for more details), on the new NVIDIA GPUs.
Double precision also made its way into the GPU circuits, unfortunately with a huge performance loss to around a tenth of single precision performance.
In contrast, current CPUs lose only about 50% of performance, which follows directly from the doubled computational work.
Here is a little demonstration of the key difference between CPU and GPU: NVISION
1. September 2008
I’m currently with the HPC group @ RRZE and working on my master's thesis about HPC on graphics cards, covering benchmark kernels and flow solvers.
So any remarks or hints? Drop them here!
Thanks
/edit 01.07.08
Thesis finished :-)