Inhalt

x87 not completely dead yet

18. September 2014

The current Intel compilers do not generate x87 code in favor of SSE instructions for floating-point operations. According to the documentation the Intel compilers can only be forced to do so by generating code explicitly for the IA32 architecture (via -mia32 compiler switch). Surprisingly exactly these x87 instructions were found in a physics code where explicitly a SSE2 capable CPU was targeted (via -xsse4.2). Together with Georg Hager we found that the reason are complex double-precision floating-point divisions.

Complex division of two complex numbers e = a + bi and f = c + di is carried out as
[latex]
\frac{a + b i}{c + d i} =
\frac{(a + b i) (c – d i)}{(c + d i) (c – d i)} =
\frac{ac + bd}{c^2 + d^2} + \frac{bc – ad}{c^2 + d^2} i.
[/latex]
The intermediate result of c² + d² can exceed its range if the exponents of c or d are already large [1]. To increase the range the compiler performs this computations on the x87 FPU which can use (IEEE) 80-bit extended double precision instead of the (IEEE) 64-bit double precision.

If you are sure the ranges will not exceed during a complex division the usage of x87 can be turned off, so that only SSE/AVX instructions are used.

Intel Compiler Options[2]:

-no-complex-limited-range (default): use x87 for complex division.
-complex-limited-range: do not use x87 for complex division.

GCC uses Smith’s method instead [4] via a call to __divsc3 for single precision and a call to __divdc3 for double precision in libm [5]. The options are [3]:

-fno-cx-limited-range (default): use Smith’s method for complex division.
-fcx-limited-range: do not use Smith’s method for complex division; this is automatically turned on with -ffast-math.

[1] M. Baudin and R. L. Smith. A Robust Complex Division in Scilab. arXiv:1210.4539.
[2] https://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/win/copts/common_options/option_complex_limited_range.htm
[3] https://gcc.gnu.org/onlinedocs/gcc-4.9.1/gcc/Optimize-Options.html#Optimize-Options
[4] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, 5(8):435, 1962, doi:10.1145/368637.368661.
[5] https://sourceware.org/git/?p=glibc.git;a=blob;f=math/divtc3.c;hb=HEAD

OSU Micro-Benchmarks

8. September 2014

The OSU Micro-Benchmarks (OMB) are used like the Intel MPI Benchmarks (IMB) the measure the achievable latency, bandwidth, … of MPI libraries and interconnects. In contrast to IMB the OSU micro-benchmarks exhibit for several benchmark types a different communication pattern. In the following the patterns used by the latency, bandwidth, and bi-directional bandwidth point-to-point benchmarks is described.

Latency (osu_latency)

In this benchmark latency denotes the time it takes to transfer a message of a certain size from one MPI rank to another. For sending and receiving separate buffers are used, but stay the same during each iteration.

Bandwidth (osu_bw)

Measures the uni-directional bandwidth from one to another MPI rank. Hereby several MPI_Isends are started followed by a MPI_Waitall. There receiving side uses matching MPI_Irecvs with MPI_Waitall. The number of started MPI_Isends is defined as the window_size. The send and receive buffer for each MPI_Isend/MPI_Irecv is the same buffer. One iteration ends when the sending sides gets the receive of all messages acknowledged.

Bi-directional Bandwidth (osu_bibw)

This benchmark works as the uni-directional bandwidth benchmark only that both sides issue first MPI_IRecvs followed by MPI_Isends and MPI_Waitalls. Again send/receive buffers are the same, respectively.

C Default Argument Promotions

15. November 2013

Have you ever wondered why your application never crashes, shows strange behavior, or does not print the correct double precision numbers, when you write something like

float  f = 1.1f;
double d = 1.1;

printf("float  = %f\n", f);
printf("double = %f\n", d);

This would print following result

float  = 1.100000
double = 1.100000

and works like you normally expect it.

But then why the question above? The %f format specifier for the printf/scanf class of functions denotes a variable of type float which consumes 32 bits. In the example above d is of type double which uses 64 bits and actually requires %lf. The provided argument d (64 bits) would be interpreted as a variable of type float (32 bits). If this would happen, then the output would look more like that (at least on architectures using little endian byte order):

float  = 1.100000
double = -0.000000

The double variable d is stored in memory as:

0x9a 99 99 99 99 99 f1 3f

As little endian byte order is used dwill finally be interpreted as:

0x3f f1 99 99 99 99 99 9a

When only the first four bytes would be considered by printf then these bytes would be 0x9a 99 99 99 (from memory) and interpreted as float which would represent the number (ca.) -1.58818684e-23 (0x99 99 99 9a). Printed with %f gives -0.000000. But nothing of this happens and everything works like expected.

The reason is that ONLY the printf class of functions treat %f as %lf, which denotes a double precision argument. Accordingly this is also done for %g and %e. So why can printf safely do this?

The reason are the default argument promotions from C. During this following casts take place (without guarantee to be complete):

`char, short`	to	`int`
`unsigned char, unsigned short`	to	`unsigned int`
`float`	to	`double`

The promotions are applied to

all parameters of a function, when no prototype for this function exists and
to all non-declared parameters of variadic functions, i.e. functions which take a variable number of arguments.

As the printf class of functions are variadic (denoted by the elipsis „…“ in the function signature) the default argument promotions take place. Finally the call printf("float = %f\n", f) will after promotion look like printf("float = %f\n", (double)f) and printf can safely assume that every %f, %g, and %e denotes a parameter of type double.

References:

Michael L. Overton. Numerical Computing with IEEE Floating Point Arithmetic, SIAM, 2001. Chapter 10: Floating Point in C.
C Standard. Chapter 6.5.2.2 (http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf)
Stackoverflow – Default argument promotions in C function calls (http://stackoverflow.com/questions/1255775).