Markus Wittmann's Blog

Inhalt

x87 not completely dead yet

The current Intel compilers do not generate x87 code in favor of SSE instructions for floating-point operations. According to the documentation the Intel compilers can only be forced to do so by generating code explicitly for the IA32 architecture (via -mia32 compiler switch). Surprisingly exactly these x87 instructions were found in a physics code where explicitly a SSE2 capable CPU was targeted (via -xsse4.2). Together with Georg Hager we found that the reason are complex double-precision floating-point divisions.

Complex division of two complex numbers e = a + bi and f = c + di is carried out as
  \frac{a + b i}{c + d i} =  	\frac{(a + b i) (c - d i)}{(c + d i) (c - d i)} =  	\frac{ac + bd}{c^2 + d^2} + \frac{bc - ad}{c^2 + d^2} i.
The intermediate result of c² + d² can exceed its range if the exponents of c or d are already large [1]. To increase the range the compiler performs this computations on the x87 FPU which can use (IEEE) 80-bit extended double precision instead of the (IEEE) 64-bit double precision.

If you are sure the ranges will not exceed during a complex division the usage of x87 can be turned off, so that only SSE/AVX instructions are used.

Intel Compiler Options[2]:

  • -no-complex-limited-range (default): use x87 for complex division.
  • -complex-limited-range: do not use x87 for complex division.

GCC uses Smith’s method instead [4] via a call to __divsc3 for single precision and a call to __divdc3 for double precision in libm [5]. The options are [3]:

  • -fno-cx-limited-range (default): use Smith’s method for complex division.
  • -fcx-limited-range: do not use Smith’s method for complex division; this is automatically turned on with -ffast-math.

[1] M. Baudin and R. L. Smith. A Robust Complex Division in Scilab. arXiv:1210.4539.
[2] https://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/win/copts/common_options/option_complex_limited_range.htm
[3] https://gcc.gnu.org/onlinedocs/gcc-4.9.1/gcc/Optimize-Options.html#Optimize-Options
[4] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, 5(8):435, 1962, doi:10.1145/368637.368661.
[5] https://sourceware.org/git/?p=glibc.git;a=blob;f=math/divtc3.c;hb=HEAD

OSU Micro-Benchmarks

The OSU Micro-Benchmarks (OMB) are used like the Intel MPI Benchmarks (IMB) the measure the achievable latency, bandwidth, … of MPI libraries and interconnects. In contrast to IMB the OSU micro-benchmarks exhibit for several benchmark types a different communication pattern. In the following the patterns used by the latency, bandwidth, and bi-directional bandwidth point-to-point benchmarks is described.

Latency (osu_latency)

In this benchmark latency denotes the time it takes to transfer a message of a certain size from one MPI rank to another. For sending and receiving separate buffers are used, but stay the same during each iteration.

Osu-Latency

Bandwidth (osu_bw)

Measures the uni-directional bandwidth from one to another MPI rank. Hereby several MPI_Isends are started followed by a MPI_Waitall. There receiving side uses matching MPI_Irecvs with MPI_Waitall. The number of started MPI_Isends is defined as the window_size. The send and receive buffer for each MPI_Isend/MPI_Irecv is the same buffer. One iteration ends when the sending sides gets the receive of all messages acknowledged.

Osu_bw

Bi-directional Bandwidth (osu_bibw)

This benchmark works as the uni-directional bandwidth benchmark only that both sides issue first MPI_IRecvs followed by MPI_Isends and MPI_Waitalls. Again send/receive buffers are the same, respectively.

Osu_bibw

Naming Threads under Linux

Under Linux ( and BSD, MacOS, …) it is possible to name threads of a process via pthread_setname_np or prctl(PR_SET_NAME). This can be handy for example when debugging multi-threaded code in gdb or looking at top and figuring out which thread consumes 100% CPU.

In the following this feature is used to name OpenMP threads and make them distinguishable inside gdb. For OpenMP it has to be performed inside a parallel region:

//      This program is free software; you can redistribute it and/or modify
//      it under the terms of the GNU General Public License, v2, as
//      published by the Free Software Foundation
//
//      This program is distributed in the hope that it will be useful,
//      but WITHOUT ANY WARRANTY; without even the implied warranty of
//      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
//      GNU General Public License for more details.
//
//      You should have received a copy of the GNU General Public License
//      along with this program; if not, write to the Free Software
//      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
//
// File: ThreadName.c
// Setting thread name via pthread_setname_np.
// Compile with:
//		gcc -g -fopenmp ThreadName.c -o thread-name
//    icc -g -openmp  ThreadName.c -o thread-name
//
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#ifdef _OPENMP
	#include <openmp.h>
#endif

int main(int argc, char * argv[])
{

	#pragma omp parallel
	{
		int threadId = 0;
		char name[16] = { 0 };

		#ifdef _OPENMP
			threadId = omp_get_thread_num();
		#endif

		snprintf(name, sizeof(name), "omp-%02d", threadId);

		pthread_setname_np(pthread_self(), name);

		sleep(10);
	}

	return 0;
}
//      This program is free software; you can redistribute it and/or modify
//      it under the terms of the GNU General Public License, v2, as
//      published by the Free Software Foundation
//
//      This program is distributed in the hope that it will be useful,
//      but WITHOUT ANY WARRANTY; without even the implied warranty of
//      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
//      GNU General Public License for more details.
//
//      You should have received a copy of the GNU General Public License
//      along with this program; if not, write to the Free Software
//      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
//
// File: ThreadNamePrctl.c
// Setting thread name via prctl(PR_SET_NAME).
// Compile with:
//  gcc -g -fopenmp ThreadName.c -o thread-name-prctl
//  icc -g -openmp  ThreadName.c -o thread-name-prctl
//
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#ifdef _OPENMP
    #include <omp.h>
#endif
#include <sys/prctl.h>

int main(int argc, char * argv[])
{
    #pragma omp parallel
    {
        int threadId = 0;
        char name[16] = { 0 };

        #ifdef _OPENMP
            threadId = omp_get_thread_num();
        #endif

        snprintf(name, sizeof(name) - 1, "omp-%02d", threadId);

        prctl(PR_SET_NAME, name, 0, 0, 0);

        sleep(10);
    }

    return 0;
}

Compile ThreadNamePrctl.c with gcc via:
gcc -g -fopenmp ThreadName.c -o thread-name-prctl
Note: -g is only needed for convience when using gdb.

$ env OMP_NUM_THREADS=4 gdb ./thread-name-prctl
> break sleep
> r
> info threads
  Id   Target Id         Frame
  4    Thread 0x7ffff642d700 (LWP 26499) "omp-03" 0x00007ffff76e1590 in sleep () from /lib64/libc.so.6
  3    Thread 0x7ffff6c2e700 (LWP 26498) "omp-02" 0x00007ffff76e1590 in sleep () from /lib64/libc.so.6
  2    Thread 0x7ffff742f700 (LWP 26497) "omp-01" 0x00007ffff76e1590 in sleep () from /lib64/libc.so.6
* 1    Thread 0x7ffff7fb9760 (LWP 26493) "omp-00" 0x00007ffff76e1590 in sleep () from /lib64/libc.so.6

Thne named threads will also show up with the same name in top (when pressing „H“):

...
26504 user-name  20   0 34940  636  500 S      0  0.0   0:00.00 omp-00
26508 user-name  20   0 34940  636  500 S      0  0.0   0:00.00 omp-01
26509 user-name  20   0 34940  636  500 S      0  0.0   0:00.00 omp-02
26510 user-name  20   0 34940  636  500 S      0  0.0   0:00.00 omp-03
...

Notes:

  • The length of a thread name is limited to 16 characters.
  • Not all versions of gdb support this feature.
  • pthread_setname_np
    • It is only supported in glibc since version 2.12.
    • Compile with -pthread if no OpenMP is used.
  • prctl(PR_SET_NAME) could be used as a fallback, which is supported since Linux kernel version 2.6.9.

C Default Argument Promotions

Have you ever wondered why your application never crashes, shows strange behavior, or does not print the correct double precision numbers, when you write something like

float  f = 1.1f;
double d = 1.1;

printf("float  = %f\n", f);
printf("double = %f\n", d);

This would print following result

float  = 1.100000
double = 1.100000

and works like you normally expect it.

But then why the question above? The %f format specifier for the printf/scanf class of functions denotes a variable of type float which consumes 32 bits. In the example above d is of type double which uses 64 bits and actually requires %lf. The provided argument d (64 bits) would be interpreted as a variable of type float (32 bits). If this would happen, then the output would look more like that (at least on architectures using little endian byte order):

float  = 1.100000
double = -0.000000

The double variable d is stored in memory as:

0x9a 99 99 99 99 99 f1 3f 

As little endian byte order is used dwill finally be interpreted as:

0x3f f1 99 99 99 99 99 9a

When only the first four bytes would be considered by printf then these bytes would be 0x9a 99 99 99 (from memory) and interpreted as float which would represent the number (ca.) -1.58818684e-23 (0x99 99 99 9a). Printed with %f gives -0.000000. But nothing of this happens and everything works like expected.

The reason is that ONLY the printf class of functions treat %f as %lf, which denotes a double precision argument. Accordingly this is also done for %g and %e. So why can printf safely do this?

The reason are the default argument promotions from C. During this following casts take place (without guarantee to be complete):

char, short   to   int
unsigned char, unsigned short   to   unsigned int
float   to   double

The promotions are applied to

  • all parameters of a function, when no prototype for this function exists and
  • to all non-declared parameters of variadic functions, i.e. functions which take a variable number of arguments.

As the printf class of functions are variadic (denoted by the elipsis „…“ in the function signature) the default argument promotions take place. Finally the call printf("float = %f\n", f) will after promotion look like printf("float = %f\n", (double)f) and printf can safely assume that every %f, %g, and %e denotes a parameter of type double.

References:

  • Michael L. Overton. Numerical Computing with IEEE Floating Point Arithmetic, SIAM, 2001. Chapter 10: Floating Point in C.
  • C Standard. Chapter 6.5.2.2 (http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf)
  • Stackoverflow – Default argument promotions in C function calls (http://stackoverflow.com/questions/1255775).