Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things


fork and OFED Infiniband stack

Attention: OFED does not allow system(const char*) or fork/exec once the InfiniBand libraries have been initialized. Some of the documentation has this to say:
… the Mellanox InfiniBand driver has issues with buffers sharing pages when fork() is used. Pinned (locked in memory) pages are normally marked copy-on-write during a fork. If a page is pinned before a fork and subsequently written to while RDMA operations are being performed on the same page, silent data corruption can occur as RDMA operations continue to stream data to a page that has moved. To avoid this, the Mellanox driver does not use copy-on-write behavior during a fork for pinned pages. Instead, access to these pages by the child process will result in a segmentation violation.
Fork support from kernel 2.6.12 and above is available provided that applications do not use threads. fork() is supported as long as the parent process does not run before the child exits or calls exec(). The former can be achieved by calling wait(childpid); the latter can be achieved by application-specific means. The POSIX system() call is supported.
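
The "parent does not run before the child exits" rule from the quoted documentation can be sketched in C roughly as follows. This is only a minimal illustration, not something taken from the OFED documentation: the helper name run_and_wait is made up for this example, and waitpid() is used in place of the wait(childpid) shorthand from the quote. Whether this pattern is actually sufficient depends on the OFED release and on a kernel of at least 2.6.12, and the child must not touch any pinned communication buffers.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (name invented for this sketch): run the program
 * argv[0] with arguments argv in a child process and keep the parent
 * blocked until the child has terminated.
 * Returns the child's exit status, or -1 on error or abnormal termination. */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }

    if (pid == 0) {
        /* Child: call exec() right away and do nothing else before it. */
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp() failed */
    }

    /* Parent: do no other work until the child is gone. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            perror("waitpid");
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a NULL-terminated argument vector, for example { "sh", "-c", "some command", NULL }. Because the parent sits in waitpid() the whole time, it cannot write to a pinned page while the child still exists. In a threaded application this guarantee is lost, since other threads of the parent keep running, which matches the "do not use threads" restriction in the quote.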

Woody is running a SuSE SLES9 kernel, i.e. 2.6.5, which is older than the 2.6.12 required above. Thus, no support for fork and similar things on Woody!

Some users have already hit this problem! Even a Fortran user who had call system('some command') in his code! In the latter case, the application just hung in some (matching) MPI_send/MPI_recv calls.