Markus Wittmann's Blog


18 February 2013

MPI Node-Local Rank determination

Markus Wittmann, 17:00 in General

Sometimes it is necessary to determine the number of MPI processes on a node and determine their order (like a node-local rank). This can be useful for process pinning (affinity control) or optimized I/O access patterns, where only one process per node aggregates the reads and writes for the other processes on the node.

Open MPI provides this information via the two environment variables OMPI_COMM_WORLD_LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_SIZE, which return the node-local rank of a process and the number of MPI processes on this node (for the current job), respectively.

Method I

A naive, portable solution employs MPI_Get_processor_name or gethostname to create a unique identifier for the node and exchanges it among all processes with a collective such as MPI_Alltoall (MPI_Allgather is sufficient, as every process contributes the same data to all others). Each process then walks through the received array and compares each entry with its own processor/host name. If an entry matches and its index (starting at zero) is lower than the process's MPI rank (in MPI_COMM_WORLD), the zero-initialized counter R is incremented. A second zero-initialized counter S is incremented each time the processor/host name matches. At the end, counter R contains the node-local rank and S the number of MPI processes on this node. This works fine for small process counts, but it does not scale: every process has to receive and compare the names of all other processes.
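
The counting step can be sketched as follows. This is only a minimal sketch, assuming the names of all processes have already been gathered into one array indexed by MPI_COMM_WORLD rank; the names NodeLocalInfo and countNodeLocal are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type for this sketch: node-local rank (counter R)
// and number of processes on the node (counter S).
struct NodeLocalInfo {
    int rank;
    int size;
};

// names[i] is assumed to hold the processor/host name of world rank i,
// as it would be available on every process after the collective.
NodeLocalInfo countNodeLocal(const std::vector<std::string> & names,
                             std::size_t myRank)
{
    assert(myRank < names.size());

    NodeLocalInfo info = { 0, 0 };

    for (std::size_t i = 0; i < names.size(); ++i) {
        if (names[i] != names[myRank]) continue;

        if (i < myRank) ++info.rank;   // counter R
        ++info.size;                   // counter S
    }

    return info;
}
```

Every process stores and scans the names of all processes, which is exactly the part that does not scale.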

Method II

This method relies on MPI_Comm_split, which provides an easy way to split a communicator into subgroups (sub-communicators). A color (of type integer) is used as input, and all processes with the same color end up in the same group. Additionally, a key can be specified, which determines the order of the ranks in the new sub-communicator. To determine the node-local rank, the processor/host name is mapped via a hash function (e.g. Adler-32) to an integer, which is used as the color. The MPI rank (in MPI_COMM_WORLD) can be used as the key. After the call to MPI_Comm_split all processes whose host names map to the same hash are part of the newly created sub-communicator.

As collisions of the hash can occur, a second step must be performed along the lines of the first method, but this time only on the sub-communicator, which contains (hopefully) only a small fraction of the total number of processes.
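
The mapping from host name to color might look like the following sketch. It assumes Adler-32 as the hash; hostNameToColor is a name invented here, and the sign bit is cleared because MPI_Comm_split expects a non-negative color.

```cpp
#include <cstdint>
#include <string>

// Adler-32 checksum of a processor/host name.
std::uint32_t adler32(const std::string & name)
{
    const std::uint32_t MOD = 65521;   // largest prime below 2^16
    std::uint32_t a = 1;
    std::uint32_t b = 0;

    for (unsigned char c : name) {
        a = (a + c) % MOD;
        b = (b + a) % MOD;
    }

    return (b << 16) | a;
}

// Hypothetical helper: turns the hash into a color usable with
// MPI_Comm_split by clearing the sign bit.
int hostNameToColor(const std::string & name)
{
    return static_cast<int>(adler32(name) & 0x7fffffffu);
}
```

The result would then be passed as the color argument, with the MPI_COMM_WORLD rank as the key. Different host names can still produce the same color, which is why the second step remains necessary.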

Method III

Shared memory can be utilized, if available. All processes create a shared memory segment (e.g. with shmget), which contains a zero-initialized counter. After that an MPI_Barrier is executed to ensure every process has mapped the segment. Then the shared counter is atomically incremented with fetch-and-increment by each process. The value returned from the fetch is the node-local rank. At the end a final barrier is needed to ensure every process has incremented the counter and has obtained its node-local rank. Then the counter contains the number of MPI processes on the node.
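
The fetch-and-increment idea can be tried without MPI. The following sketch makes several assumptions: forked child processes stand in for the MPI processes of one node, an anonymous shared mapping created with mmap replaces the shmget segment, waiting for the children replaces the final barrier, and std::atomic<int> is assumed to be lock-free so that it works across processes. The function name assignLocalRanks is hypothetical.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <atomic>
#include <new>
#include <vector>

// Starts nProcs child processes which each fetch-and-increment a shared
// counter. Returns the node-local ranks the children obtained and stores
// the final counter value (the process count) in *finalCount.
std::vector<int> assignLocalRanks(int nProcs, int * finalCount)
{
    void * mem = mmap(NULL, sizeof(std::atomic<int>), PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return {};

    // Zero-initialized counter inside the shared segment.
    std::atomic<int> * counter = new (mem) std::atomic<int>(0);

    std::vector<pid_t> pids;
    for (int i = 0; i < nProcs; ++i) {
        pid_t pid = fork();
        if (pid == 0) {
            // The value returned by the fetch is the node-local rank.
            int localRank = counter->fetch_add(1);
            _exit(localRank);   // reported via exit status, so < 256 only
        }
        if (pid > 0) pids.push_back(pid);
    }

    // Stand-in for the final barrier: wait until every child is done.
    std::vector<int> ranks;
    for (pid_t pid : pids) {
        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status)) ranks.push_back(WEXITSTATUS(status));
    }

    *finalCount = counter->load();
    munmap(mem, sizeof(std::atomic<int>));
    return ranks;
}
```

Which child receives which rank depends on scheduling, so unlike the first two methods the order is not tied to the MPI_COMM_WORLD rank.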

Note

The name returned by MPI_Get_processor_name is not required by the MPI standard to be the same for each process on the same node. So some care must be taken on new platforms/MPI implementations.

8 January 2013

Schoenauer Vector Triad

Markus Wittmann, 13:21 in General

In PTfS we had an assignment in which the students were given a graph of the performance of the Schönauer vector triad A(:) = B(:) * C(:) + D(:) in double precision (DP) over the vector length. From this they were asked to determine the sizes of the different cache levels.

[Figure: Performance of the Schönauer triad over the vector length]

The L1 cache size was 32 KiB (1024 * 4 * 8 byte (DP), red line), L2 cache size 256 KiB (8192 * 4 * 8 byte (DP), green line), and L3 cache size 12 MiB (393216 * 4 * 8 byte (DP), blue line).
The measurements were performed on an Intel Xeon 5660 (“Westmere”) processor, the CPU of a compute node in RRZE’s Lima cluster.
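
The arithmetic behind these sizes is simply four vectors of n doubles each. A small sketch of the kernel and the working-set calculation, with function names chosen for this illustration and all vectors assumed to have the same length:

```cpp
#include <cstddef>
#include <vector>

// Schönauer vector triad: A(:) = B(:) * C(:) + D(:)
// Assumes b, c and d have at least a.size() elements.
void triad(std::vector<double> & a, const std::vector<double> & b,
           const std::vector<double> & c, const std::vector<double> & d)
{
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] * c[i] + d[i];
    }
}

// Bytes touched by one sweep: four vectors with n doubles each.
std::size_t workingSetBytes(std::size_t n)
{
    return 4 * n * sizeof(double);
}
```

With 8-byte doubles, workingSetBytes(1024) is 32 KiB, workingSetBytes(8192) is 256 KiB, and workingSetBytes(393216) is 12 MiB.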

The question arose why there is no sharp drop when the four vectors no longer fit completely into a cache level.

The cache line length (or, more precisely, the cache block length) is 64 bytes. The L1 cache has 64 sets and is 8-way associative. In total it can hold 512 cache lines, i.e. four vectors with 1024 double precision floating-point numbers (DP) each. After one iteration of the Schönauer triad with vector length 1024 the L1 cache might look like this, if A, B, C, and D are aligned to page boundaries (with a page size of 4096 bytes):

way  set 0         set 1       ...  set 63
0    A(0-7)        A(8-15)     ...  A(504-511)
1    B(0-7)        B(8-15)     ...  B(504-511)
2    C(0-7)        C(8-15)     ...  C(504-511)
3    D(0-7)        D(8-15)     ...  D(504-511)
4    A(512-519)    A(520-527)  ...  A(1016-1023)
5    B(512-519)    B(520-527)  ...  B(1016-1023)
6    C(512-519)    C(520-527)  ...  C(1016-1023)
7    D(512-519)    D(520-527)  ...  D(1016-1023)
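
Which set an element lands in follows directly from the numbers above: the byte offset is divided by the line length, and the result is taken modulo the number of sets. A minimal sketch, with l1Set as a hypothetical helper and the vector assumed to start at byte offset baseOffset:

```cpp
#include <cstddef>

const std::size_t LINE_BYTES = 64;   // cache line length
const std::size_t N_SETS     = 64;   // number of sets in the L1 cache

// L1 set holding element i of a vector of doubles that starts at byte
// offset baseOffset (0 or any multiple of 4096 for page-aligned vectors).
std::size_t l1Set(std::size_t i, std::size_t baseOffset = 0)
{
    return ((baseOffset + i * sizeof(double)) / LINE_BYTES) % N_SETS;
}
```

Since 64 sets times 64 bytes is exactly one 4096-byte page, page-aligned vectors all start in set 0, and elements 0, 512, and 1024 of each vector compete for the same set.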

If this is performed with a vector length of 1032 DP elements, cache lines in set 0 have to be evicted when the last 8 elements of A, B, C, and D are handled. With an ideal LRU replacement strategy the cache then looks like the following table (ideal LRU is probably not what the CPU under consideration implements; it is more likely a pseudo-LRU strategy):

way  set 0         set 1       ...  set 63
0    A(1024-1031)  A(8-15)     ...  A(504-511)
1    B(1024-1031)  B(8-15)     ...  B(504-511)
2    C(1024-1031)  C(8-15)     ...  C(504-511)
3    D(1024-1031)  D(8-15)     ...  D(504-511)
4    A(512-519)    A(520-527)  ...  A(1016-1023)
5    B(512-519)    B(520-527)  ...  B(1016-1023)
6    C(512-519)    C(520-527)  ...  C(1016-1023)
7    D(512-519)    D(520-527)  ...  D(1016-1023)

When the iteration of the triad starts over, the first 8 elements of each vector have to be loaded from L2 into L1 again and evict A/B/C/D(512-519):

way  set 0         set 1       ...  set 63
0    A(1024-1031)  A(8-15)     ...  A(504-511)
1    B(1024-1031)  B(8-15)     ...  B(504-511)
2    C(1024-1031)  C(8-15)     ...  C(504-511)
3    D(1024-1031)  D(8-15)     ...  D(504-511)
4    A(0-7)        A(520-527)  ...  A(1016-1023)
5    B(0-7)        B(520-527)  ...  B(1016-1023)
6    C(0-7)        C(520-527)  ...  C(1016-1023)
7    D(0-7)        D(520-527)  ...  D(1016-1023)

Further accesses to the vectors result in cache hits until A/B/C/D(512-519) is needed. The cache line therefore has to be fetched first from L2 and evicts A/B/C/D(1024-1031):

way  set 0         set 1       ...  set 63
0    A(512-519)    A(8-15)     ...  A(504-511)
1    B(512-519)    B(8-15)     ...  B(504-511)
2    C(512-519)    C(8-15)     ...  C(504-511)
3    D(512-519)    D(8-15)     ...  D(504-511)
4    A(0-7)        A(520-527)  ...  A(1016-1023)
5    B(0-7)        B(520-527)  ...  B(1016-1023)
6    C(0-7)        C(520-527)  ...  C(1016-1023)
7    D(0-7)        D(520-527)  ...  D(1016-1023)

The last 8 elements of the vectors evict A/B/C/D(0-7), which results finally in:

way  set 0         set 1       ...  set 63
0    A(512-519)    A(8-15)     ...  A(504-511)
1    B(512-519)    B(8-15)     ...  B(504-511)
2    C(512-519)    C(8-15)     ...  C(504-511)
3    D(512-519)    D(8-15)     ...  D(504-511)
4    A(1024-1031)  A(520-527)  ...  A(1016-1023)
5    B(1024-1031)  B(520-527)  ...  B(1016-1023)
6    C(1024-1031)  C(520-527)  ...  C(1016-1023)
7    D(1024-1031)  D(520-527)  ...  D(1016-1023)

So in the end most of the vector elements still reside in L1, and only a small fraction has to be fetched from L2 during each iteration. This is why the performance declines gradually from the L1 to the L2 level, instead of dropping sharply, once the vectors no longer fit completely into the L1 cache.
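
The walk-through can be replayed in a small simulation. This is a rough sketch under the same assumptions as the tables (ideal LRU, page-aligned vectors, no prefetching), counting accesses at cache-line granularity; missesPerSweep is a name chosen for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Number of L1 line misses during one steady-state sweep of the triad
// over four page-aligned vectors with n doubles each (ideal LRU assumed).
std::size_t missesPerSweep(std::size_t n)
{
    const std::size_t nSets = 64, nWays = 8, elemsPerLine = 8;
    const std::size_t linesPerVector = (n + elemsPerLine - 1) / elemsPerLine;

    // One list of line tags per set, most recently used line in front.
    std::vector<std::list<std::size_t>> sets(nSets);
    std::size_t misses = 0;

    // Two warm-up sweeps; only the misses of the third sweep are returned.
    for (int sweep = 0; sweep < 3; ++sweep) {
        misses = 0;
        for (std::size_t line = 0; line < linesPerVector; ++line) {
            for (std::size_t vec = 0; vec < 4; ++vec) {
                const std::size_t tag = vec * linesPerVector + line;
                std::list<std::size_t> & set = sets[line % nSets];

                std::list<std::size_t>::iterator it =
                    std::find(set.begin(), set.end(), tag);
                if (it != set.end()) {
                    set.erase(it);                            // hit
                } else {
                    ++misses;                                 // miss
                    if (set.size() == nWays) set.pop_back();  // evict LRU
                }
                set.push_front(tag);
            }
        }
    }

    return misses;
}
```

In this model a vector length of 1024 causes no misses in steady state and a length of 1032 causes 12 (three lines per vector, all in set 0). Only at a length of 2048 do all 1024 line accesses of a sweep miss.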

The decline from L2 to L3 (green line) is not entirely clear, as it begins at a point where the four vectors could still fit completely into L2. One explanation might be that the L2 is a unified cache, meaning it holds both data and code. Thus the contents of the 32 KiB instruction cache could also occupy space in L2, but this is only a guess, as the exact relation between the instruction cache and the L2 cache is not documented.

24 October 2012

MPI Standard Fun

Markus Wittmann, 14:16 in General

The MPI Standard is always good for a surprise: MPI functions which return only one MPI_Status for a request (or something else) do not set the MPI_ERROR member of the MPI_Status structure. The error code returned by the function itself already reflects the error of the request. The only exceptions are functions which take more than one status (an array of statuses) and return MPI_ERR_IN_STATUS. Then the real error code is found in the MPI_ERROR member of each status.

This information is not repeated for every function to which it applies, such as MPI_Wait, MPI_Waitany, MPI_Test, etc. Instead it is found under Point-to-Point Communication, Blocking Send and Receive Operations, in Sect. 3.2.5 Return Status. OK, to be fair, it is close to the definition of MPI_Recv.

The following is the relevant excerpt from the standard:

In general, message-passing calls do not modify the value of the error code field of status variables. This field may be updated only by the functions in Section 3.7.5 which return multiple statuses. The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

Rationale. The error field in status is not needed for calls that return only one status, such as MPI_WAIT, since that would only duplicate the information returned by the function itself. The current design avoids the additional overhead of setting it, in such cases. The field is needed for calls that return multiple statuses, since each request may have had a different failure. (End of rationale.)

To make a long story short: no MPI function sets the MPI_ERROR member of the MPI_Status structure except:

  • MPI_Waitsome,
  • MPI_Waitall,
  • MPI_Testall, and
  • MPI_Testsome

if they return MPI_ERR_IN_STATUS.

15 February 2012

Intel Fortran Runtime Library Features

Markus Wittmann, 12:07 in General

The Intel Fortran runtime library (forrtl) has some nice features. If an application crashes and it was compiled with ifort -traceback … it generates a backtrace/stack trace which contains the module names and source line.

To do so, forrtl catches the SIGSEGV signal (among others) and handles the fault itself. This is inconvenient if you want to generate core dumps, but luckily forrtl provides an environment variable for that.

Create core dump:

  • setting environment variable decfort_dump_flag=1 (or =y =YES, …)
  • NOTE: environment variables are case sensitive
  • maybe also needed: ulimit -c {max core dump size in blocks|unlimited} (bash)
  • Intel documentation

Verbose stack trace

  • including register values, …
  • setting environment variable TBK_ENABLE_VERBOSE_STACK_TRACE=1

Disable printing stack trace

  • setting environment variable FOR_DISABLE_STACK_TRACE=1

Printing stack trace information at the current location

  • NOTE: the following code snippet is specific to the Intel Fortran compiler/runtime
  • use ifcore
    CALL TRACEBACKQQ(STRING="Message precedes following traceback:", USER_EXIT_CODE=-1)
  • Intel documentation

Related: Supported environment variables of Intel Composer

19 August 2011

Glibc Fancy Features

Markus Wittmann, 11:18 in General

While browsing through old magazines I came across an article in c't 9/2001 (Systemzentrale – Die C-Bibliothek in Linux-/Unix-Systemen, pages 228 – 233) about glibc and some of its features.

Interestingly the described features like memory tracing, memory statistics, and heap consistency checking are still present.

Heap Consistency Checks

Glibc provides the function mcheck(), which checks the heap for consistency. If you don’t want to modify your source code, it’s possible to link against libmcheck (-lmcheck). This way mcheck() is automatically called at the beginning of the program.

It’s also possible to just set the environment variable MALLOC_CHECK_, which requires no recompilation of the target application. The disadvantage is that some simple errors are ignored, like

  • double calls to free
  • writing one byte beyond allocated buffer.

Possible values for MALLOC_CHECK_ are:

  • 0: errors are ignored
  • 1: errors are written to stderr
  • 2: on error abort is called

Hooks for Malloc

Intercepting malloc, realloc and free calls is usually done by preloading a special library or relinking the application with the specific library. Examples are the classic libefence, DUMA, …

For this purpose glibc provides hooks which are called instead of the original malloc, realloc, and free functions. Hooks can easily be set by assigning the replacement functions to the variables __malloc_hook, __realloc_hook, __free_hook, etc.

Statistics of Malloc

The function mallinfo() returns the structure mallinfo which contains information about dynamically allocated memory.

Allocation Debugging

Memory leaks can easily be detected by calling mtrace(). The function installs hooks which log allocated and freed memory to the file named by the environment variable MALLOC_TRACE. Logging can be stopped by a call to muntrace(). These functions are declared in mcheck.h.

The generated log can be evaluated by running the mtrace command with the log's file name as parameter. If the binary that generated the log is specified as well, mtrace also prints the file name and line number where the never-freed memory was allocated.

Example:

$ cat test-mtrace.cpp
#include <stdlib.h>
#include <mcheck.h>

int main(int argc, char * argv[])
{
    mtrace();

    char * test = (char *)malloc(1024);

    test[0] = 0x00;

    muntrace();

    return 0;
}
$ g++ -g -O0 test-mtrace.cpp -o test-mtrace
$ env MALLOC_TRACE=test-mtrace.dat ./test-mtrace
$ mtrace test-mtrace test-mtrace.dat

Memory not freed:
-----------------
           Address     Size     Caller
0x0000000001542460    0x400  at /home/..../test-mtrace.cpp:8

28 February 2011

Backtrace for fun and profit

Markus Wittmann, 13:00 in General

Recently we ran into the problem that we wanted a stack trace (aka stack backtrace) of a running MPI application. One possibility would be to attach to the process with gdb to generate one. Another option is the pstack command, but for that you must allow applications to read other applications' memory.

And, at least as a computer scientist, you always have the option of doing it yourself. Based on some articles (pthread-overload.c, iX, and Linux Magazin), a library was created which prints a backtrace when SIGUSR1, SIGSEGV, etc. is received.

For this, a shared library is needed which is loaded during process startup by listing it in the LD_PRELOAD environment variable. Inside the library a function marked with the constructor attribute must install a signal handler for the signals that should generate a backtrace. The backtrace itself can be obtained with the backtrace and backtrace_symbols functions of glibc.

Essentially this is what PTrace.cpp does. Just compile it with make or make TARGET=debug, generating ptrace.so or ptrace-dbg.so, respectively.

After some digging around on the net I found that /lib/libSegFault.so does the same. So

$ env LD_PRELOAD=/lib/libSegFault.so app-which-creates-segfault

prints an even nicer backtrace than the solution discussed here.

Usage sample 1:

$ env LD_PRELOAD=./ptrace-dbg.so ./crash

gives

[ptrace] DEBUG: initializing
[ptrace] DEBUG: overriding: SIGUSR1 (10)  SIGINT (2)  SIGQUIT (3)  SIGILL (4)  SIGFPE (8)  SIGSEGV (11)  SIGBUS (7)  
[ptrace] DEBUG: aborting:   SIGINT (2)  SIGQUIT (3)  SIGILL (4)  SIGFPE (8)  SIGSEGV (11)  SIGBUS (7)  
[ptrace] DEBUG: initializing done
 -> function main
 -> function first
 -> function second
[ptrace] Received signal SIGSEGV (11).
[ptrace] 0: /xxxxxxxxxxxxxxx/ptrace/ptrace-dbg.so(PtraceSignalHandler+0x66) [0x2b22126f9c26]
[ptrace] 1: /lib64/libc.so.6 [0x2b22130cb2d0]
[ptrace] 2: ./crash(_Z6secondv+0x23) [0x40081b]
[ptrace] 3: ./crash(_Z5firstv+0x18) [0x40083c]
[ptrace] 4: ./crash(main+0x23) [0x400861]
[ptrace] 5: /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b22130b8994]
[ptrace] 6: ./crash(__gxx_personality_v0+0x41) [0x400749]
[ptrace] DEBUG: Calling exit() after handled signal SIGSEGV (11).
[ptrace] DEBUG: finalizing
[ptrace] DEBUG: finalizing done

Usage sample 2:

$ env LD_PRELOAD=./ptrace-dbg.so sleep 30
$ kill -USR1 <pid of started sleep>

gives

[ptrace] DEBUG: initializing
[ptrace] DEBUG: overriding: SIGUSR1 (10)  SIGINT (2)  SIGQUIT (3)  SIGILL (4)  SIGFPE (8)  SIGSEGV (11)  SIGBUS (7)
[ptrace] DEBUG: aborting:   SIGINT (2)  SIGQUIT (3)  SIGILL (4)  SIGFPE (8)  SIGSEGV (11)  SIGBUS (7)
[ptrace] DEBUG: initializing done
[ptrace] Received signal SIGUSR1 (10).
[ptrace] 0: /xxxxxxxxxxxxxxx/ptrace-dbg.so(PtraceSignalHandler+0x66) [0x2ab7cee65c26]
[ptrace] 1: /lib64/libc.so.6 [0x2ab7cf0a52d0]
[ptrace] 2: /lib64/libc.so.6(nanosleep+0x10) [0x2ab7cf10f3c0]
[ptrace] 3: sleep [0x4029c4]
[ptrace] 4: sleep [0x40142c]
[ptrace] 5: /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ab7cf092994]
[ptrace] 6: sleep [0x401079]
[ptrace] DEBUG: finalizing
[ptrace] DEBUG: finalizing done

PTrace.cpp

/* ===========================================================================
 *
 *      Filename:  PTrace.cpp
 *
 *      Description:  Shared library printing backtrace (stack trace) 
 *                    when signal is caught.
 *
 *      Version:  0.1
 *      Created:  2011-02-08
 *
 *      Author:     Markus Wittmann (mw), markus.wittmann@rrze.uni-erlangen.de
 *      Company:    RRZE Erlangen
 *      Project:    ptrace
 *      Copyright:  Copyright (c) 2011, Markus Wittmann
 *
 *      This program is free software; you can redistribute it and/or modify
 *      it under the terms of the GNU General Public License, v2, as
 *      published by the Free Software Foundation
 *
 *      This program is distributed in the hope that it will be useful,
 *      but WITHOUT ANY WARRANTY; without even the implied warranty of
 *      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *      GNU General Public License for more details.
 *
 *      You should have received a copy of the GNU General Public License
 *      along with this program; if not, write to the Free Software
 *      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 *
 * ===========================================================================
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <execinfo.h>


#ifdef DEBUG
	#define Debug(formatString, ...) \
		fprintf(stderr, "[ptrace] DEBUG: " formatString, ##__VA_ARGS__)
#else
	#define Debug(formatString, ...)
#endif


#define Print(formatString, ...) \
	fprintf(stderr, "[ptrace] " formatString, ##__VA_ARGS__)

#define Error(formatString, ...) \
	fprintf(stderr, "[ptrace] ERROR: " formatString, ##__VA_ARGS__)


#define N_ELEMS(x)  (sizeof(x) / sizeof(x[0]))


#define BACKTRACE_DEPTH				20

// List the signals for which a backtrace should be generated if
// received by the application.

#define OVERRIDE_SIGNALS \
	X(SIGUSR1) \
	X(SIGINT) \
	X(SIGQUIT) \
	X(SIGILL) \
	X(SIGFPE) \
	X(SIGSEGV) \
	X(SIGBUS)


// Just returning from a signal handler might cause an infinite loop
// for some signals. If such signals are overridden (i.e. listed above)
// list them below. After handling such a signal exit(EXIT_FAILURE)
// will be called to avoid this undesired behaviour.

#define ABORT_SIGNALS \
	X(SIGINT) \
	X(SIGQUIT) \
	X(SIGILL) \
	X(SIGFPE) \
	X(SIGSEGV) \
	X(SIGBUS)


#define X(sig) sig,
int g_overrideSignals[] = { OVERRIDE_SIGNALS };
int g_abortSignals[] = { ABORT_SIGNALS };
#undef X

#define X(sig) #sig,
char * g_overrideSignalsStr[] = { OVERRIDE_SIGNALS };
char * g_abortSignalsStr[] = { ABORT_SIGNALS };
#undef X


/*****************************************************************************
 * declaration of functions
 *****************************************************************************/

void PtraceInit();
extern "C" void PtraceSignalHandler(int signalNumber, siginfo_t * signalInfo, void * context);


/*****************************************************************************
 * functions called on library load and unload
 *****************************************************************************/

static void _ptrace_initialize() __attribute__((constructor));
static void _ptrace_initialize()
{
	Debug("initializing\n");

	#ifdef DEBUG

		fprintf(stderr, "[ptrace] DEBUG: overriding: ");
		for (unsigned int i = 0; i < N_ELEMS(g_overrideSignals); ++i) {
			fprintf(stderr, "%s (%d)  ", g_overrideSignalsStr[i], g_overrideSignals[i]);
		}
		fprintf(stderr, "\n");

		fprintf(stderr, "[ptrace] DEBUG: aborting:   ");
		for (unsigned int i = 0; i < N_ELEMS(g_abortSignals); ++i) {
			fprintf(stderr, "%s (%d)  ", g_abortSignalsStr[i], g_abortSignals[i]);
		}
		fprintf(stderr, "\n");

	#endif /* DEBUG */

	PtraceInit();

	Debug("initializing done\n");
}

static void _ptrace_finalize() __attribute__((destructor));
static void _ptrace_finalize()
{
	Debug("finalizing\n");
	Debug("finalizing done\n");
}


/*****************************************************************************
 * definition of functions
 *****************************************************************************/


void PtraceInit()
{
	struct sigaction action;

	memset(&action, 0, sizeof(action));
	action.sa_sigaction = PtraceSignalHandler;
	sigfillset(&action.sa_mask);
	action.sa_flags = SA_SIGINFO | SA_NODEFER;

	int error;

	for (unsigned int i = 0; i < N_ELEMS(g_overrideSignals); ++i) {
		error = sigaction(g_overrideSignals[i], &action, NULL);

		if (error == -1) {
			Error("Installing signal handler for signal %s (%d) failed: error %d.\n",
				  g_overrideSignalsStr[i], g_overrideSignals[i], error);
			exit(EXIT_FAILURE);
		}
	}

	return;
}


void PtraceSignalHandler(int signalNumber, siginfo_t * signalInfo, void * context)
{
	const char * signalName = "unknown";

	for (unsigned int i = 0; i < N_ELEMS(g_overrideSignals); ++i) {
		if (g_overrideSignals[i] == signalNumber) {
			signalName = g_overrideSignalsStr[i];
			break;
		}
	}

	Print("Received signal %s (%d).\n", signalName, signalNumber);

	int nAddresses;
	void * addresses[BACKTRACE_DEPTH];
	char ** symbols = NULL;

	nAddresses = backtrace(addresses, BACKTRACE_DEPTH);
	symbols = backtrace_symbols(addresses, nAddresses);

	if (symbols == NULL) {
		Error("Retrieving symbols failed.\n");
		exit(EXIT_FAILURE);
	}

	for (int k = 0; k < nAddresses; ++k) {
		Print("%d: %s\n", k, symbols[k]);
	}

	// backtrace_symbols() allocates the array with malloc(); release it.
	free(symbols);

	for (unsigned int i = 0; i < N_ELEMS(g_abortSignals); ++i) {
		if (signalNumber == g_abortSignals[i]) {
			Debug("Calling exit() after handled signal %s (%d).\n",
				  g_abortSignalsStr[i], signalNumber);
			exit(EXIT_FAILURE);
		}
	}

	return;
}
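
The handler above calls fprintf, backtrace_symbols (which uses malloc), and exit, none of which are async-signal-safe. A more conservative variant could write the trace with backtrace_symbols_fd, which does not call malloc. The following is only a sketch of that idea; SafeTraceHandler, installSafeTraceHandler, and g_handledSignal are names invented for this illustration and not part of PTrace.cpp.

```cpp
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

// Set by the handler so the caller can see which signal was handled.
static volatile sig_atomic_t g_handledSignal = 0;

extern "C" void SafeTraceHandler(int signalNumber)
{
    static const char header[] = "[ptrace] backtrace:\n";
    void * addresses[20];

    // write() instead of fprintf(): no buffering, no locks.
    ssize_t written = write(STDERR_FILENO, header, sizeof(header) - 1);
    (void)written;

    // Note: the first call to backtrace() may load libgcc dynamically;
    // calling it once during initialization avoids that in the handler.
    int nAddresses = backtrace(addresses, 20);

    // Writes one line per frame directly to the file descriptor.
    backtrace_symbols_fd(addresses, nAddresses, STDERR_FILENO);

    g_handledSignal = signalNumber;
}

// Installs the handler for one signal; returns true on success.
bool installSafeTraceHandler(int signalNumber)
{
    struct sigaction action;

    memset(&action, 0, sizeof(action));
    action.sa_handler = SafeTraceHandler;
    sigemptyset(&action.sa_mask);

    return sigaction(signalNumber, &action, NULL) == 0;
}
```

For signals such as SIGSEGV the handler would additionally have to terminate the process, for example with _exit(), instead of returning.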

Makefile

  
CXX        = g++
CXXFLAGS   = -O2 -Wall -shared -fPIC
PPFLAGS    =

D          = -D

EXE_SUFFIX =

ifeq (debug,$(TARGET))
  PPFLAGS += $(D)DEBUG
  EXE_SUFFIX = -dbg
endif


PTRACE = ptrace$(EXE_SUFFIX).so

.PHONY: clean


$(PTRACE): PTrace.cpp
	$(CXX) $(CXXFLAGS) $(PPFLAGS) $< -o $@


clean:
	rm -f *.o
	rm -f ptrace.so ptrace-dbg.so
  

[Update: A colleague pointed out that nearly the same can be achieved by using libSegFault.so in combination with LD_PRELOAD to get a stack trace, a register dump, and a memory map. Using catchsegv gives the same results, but without the need to set LD_PRELOAD manually.]
