# x87 not completely dead yet

Current Intel compilers generate SSE instructions rather than x87 code for floating-point operations. According to the documentation, they can only be forced to emit x87 code by explicitly targeting the IA32 architecture (via the -mia32 compiler switch). Surprisingly, exactly these x87 instructions showed up in a physics code that was compiled explicitly for an SSE2-capable CPU (via -xsse4.2). Together with Georg Hager we found that the cause is double-precision complex division.

The division of two complex numbers e = a + bi and f = c + di is carried out as
$\frac{a + b i}{c + d i} = \frac{(a + b i) (c - d i)}{(c + d i) (c - d i)} = \frac{ac + bd}{c^2 + d^2} + \frac{bc - ad}{c^2 + d^2} i.$
The intermediate result c² + d² can overflow the double-precision range if the exponents of c or d are already large [1]. To extend the range, the compiler performs this computation on the x87 FPU, which can use 80-bit extended double precision instead of the (IEEE) 64-bit double precision.

If you are sure that no intermediate result will overflow during a complex division, the use of x87 can be turned off, so that only SSE/AVX instructions are used.

Intel compiler options [2]:

• -no-complex-limited-range (default): use x87 for complex division.
• -complex-limited-range: do not use x87 for complex division.

GCC uses Smith’s method [4] instead, via a call to __divsc3 for single precision and a call to __divdc3 for double precision; both routines live in libgcc [5]. The options are [3]:

• -fno-cx-limited-range (default): use Smith’s method for complex division.
• -fcx-limited-range: do not use Smith’s method for complex division; this is automatically turned on with -ffast-math.
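To make the idea behind Smith’s method concrete, here is a minimal sketch. `cdiv_smith` and `cdiv_builtin` are hypothetical helper names. The first is a plain transcription of the 1962 algorithm and not the actual libgcc source, which may apply additional scaling. The second only wraps the C99 `/` operator so that the compiler's own complex division can be compared against it.

```c
#include <complex.h>
#include <math.h>

/* Smith's method: divide numerator and denominator by the larger of
 * |c| and |d| first, so that c*c + d*d is never formed explicitly. */
void cdiv_smith(double a, double b, double c, double d,
                double *re, double *im)
{
    if (fabs(d) <= fabs(c)) {
        double r   = d / c;          /* |r| <= 1 */
        double den = c + d * r;      /* equals (c*c + d*d) / c */
        *re = (a + b * r) / den;
        *im = (b - a * r) / den;
    } else {
        double r   = c / d;          /* |r| < 1 */
        double den = c * r + d;      /* equals (c*c + d*d) / d */
        *re = (a * r + b) / den;
        *im = (b * r - a) / den;
    }
}

/* The compiler's own complex division. With GCC's default
 * -fno-cx-limited-range this is lowered to a call to __divdc3. */
double complex cdiv_builtin(double complex e, double complex f)
{
    return e / f;
}
```

For a = b = c = d = 1e200 both routines return 1 + 0i when compiled with default options. Compiling with -fcx-limited-range (or -ffast-math) replaces the `/` in `cdiv_builtin` by the textbook formula, which overflows for these inputs.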

[1] M. Baudin and R. L. Smith. A Robust Complex Division in Scilab. arXiv:1210.4539.
[2] https://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/win/copts/common_options/option_complex_limited_range.htm
[3] https://gcc.gnu.org/onlinedocs/gcc-4.9.1/gcc/Optimize-Options.html#Optimize-Options
[4] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, 5(8):435, 1962, doi:10.1145/368637.368661.

# OSU Micro-Benchmarks

The OSU Micro-Benchmarks (OMB) are used, like the Intel MPI Benchmarks (IMB), to measure the achievable latency, bandwidth, … of MPI libraries and interconnects. In contrast to IMB, the OSU micro-benchmarks use a different communication pattern for several benchmark types. The following describes the patterns used by the latency, bandwidth, and bi-directional bandwidth point-to-point benchmarks.

Latency (osu_latency)

In this benchmark, latency denotes the time it takes to transfer a message of a certain size from one MPI rank to another. Separate buffers are used for sending and receiving, but the same two buffers are reused in every iteration.
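The following is a rough sketch of such a measurement, not the OMB source. It assumes a ping-pong exchange between ranks 0 and 1, which the description above does not spell out, and the message size and iteration count are arbitrary placeholder values. It needs an MPI installation and two ranks (e.g. `mpirun -np 2`).

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const int size = 1024;        /* message size in bytes (placeholder) */
    const int iterations = 1000;  /* number of round trips (placeholder) */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Separate send and receive buffers, reused in every iteration. */
    char *sbuf = calloc(size, 1);
    char *rbuf = calloc(size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < iterations; ++i) {
        if (rank == 0) {
            MPI_Send(sbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(rbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(rbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(sbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0) {
        /* One round trip contains two transfers, hence the factor 2. */
        printf("%d bytes: %.2f us\n", size, elapsed * 1e6 / (2.0 * iterations));
    }

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```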

Bandwidth (osu_bw)

Measures the uni-directional bandwidth from one MPI rank to another. The sending side starts several MPI_Isends followed by an MPI_Waitall. The receiving side posts matching MPI_Irecvs followed by an MPI_Waitall. The number of MPI_Isends started is given by window_size. Every MPI_Isend uses the same send buffer, and every MPI_Irecv uses the same receive buffer. One iteration ends when the sending side has received the acknowledgment that all messages have arrived.
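The pattern described above could look roughly like the sketch below. It is an illustration under assumptions, not the OMB source: `WINDOW_SIZE`, the message size, the iteration count, and the tag values are placeholders, and the acknowledgment is modelled as a one-byte message. It requires an MPI installation and two ranks.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW_SIZE 64            /* messages in flight per iteration (placeholder) */

int main(int argc, char *argv[])
{
    const int size = 1 << 20;     /* message size in bytes (placeholder) */
    const int iterations = 20;    /* placeholder */
    MPI_Request req[WINDOW_SIZE];
    char ack = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One buffer per side, shared by all requests of a window. The payload
     * is never inspected, only the transfer time matters here. */
    char *buf = calloc(size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < iterations; ++i) {
        if (rank == 0) {
            for (int j = 0; j < WINDOW_SIZE; ++j)
                MPI_Isend(buf, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD, &req[j]);
            MPI_Waitall(WINDOW_SIZE, req, MPI_STATUSES_IGNORE);
            /* The iteration ends once the receiver has acknowledged the window. */
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 101, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int j = 0; j < WINDOW_SIZE; ++j)
                MPI_Irecv(buf, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &req[j]);
            MPI_Waitall(WINDOW_SIZE, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0) {
        double bytes = (double)size * WINDOW_SIZE * iterations;
        printf("%d bytes: %.2f MB/s\n", size, bytes / elapsed / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```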

Bi-directional Bandwidth (osu_bibw)

This benchmark works like the uni-directional bandwidth benchmark, except that both sides first issue MPI_Irecvs, followed by MPI_Isends and MPI_Waitalls. Again, all sends share one send buffer and all receives share one receive buffer.

# Naming Threads under Linux

Under Linux (and BSD, MacOS, …) it is possible to name the threads of a process via pthread_setname_np or prctl(PR_SET_NAME). This can be handy, for example, when debugging multi-threaded code in gdb or when looking at top to figure out which thread consumes 100% CPU.

In the following, this feature is used to name OpenMP threads and make them distinguishable inside gdb. With OpenMP the naming has to be performed inside a parallel region, so that every thread names itself:

[c]
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License, v2, as
// published by the Free Software Foundation
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
//
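// Setting thread name via pthread_setname_np.
// Compile with:
// gcc -g -fopenmp ThreadName.c -o thread-name
// icc -g -openmp ThreadName.c -o thread-name
//
#define _GNU_SOURCE  // required for pthread_setname_np
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char * argv[])
{
    #pragma omp parallel
    {
        int threadId = 0;
        char name[16] = { 0 };

#ifdef _OPENMP
        // Use the OpenMP thread number to build a distinct name.
        threadId = omp_get_thread_num();
#endif

        snprintf(name, sizeof(name), "omp-%02d", threadId);

        // Every thread names itself.
        pthread_setname_np(pthread_self(), name);

        sleep(10);
    }

    return 0;
}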
[/c]

[c]
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License, v2, as
// published by the Free Software Foundation
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
//
// Setting thread name via prctl(PR_SET_NAME).
// Compile with:
// gcc -g -fopenmp ThreadNamePrctl.c -o thread-name-prctl
// icc -g -openmp ThreadNamePrctl.c -o thread-name-prctl
//
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#include <sys/prctl.h>

int main(int argc, char * argv[])
{
    #pragma omp parallel
    {
        int threadId = 0;
        char name[16] = { 0 };

#ifdef _OPENMP
        // Use the OpenMP thread number to build a distinct name.
        threadId = omp_get_thread_num();
#endif

        // snprintf always leaves room for the terminating null byte.
        snprintf(name, sizeof(name), "omp-%02d", threadId);

        // PR_SET_NAME names the calling thread.
        prctl(PR_SET_NAME, name, 0, 0, 0);

        sleep(10);
    }

    return 0;
}
[/c]

Compile ThreadNamePrctl.c with gcc via:
gcc -g -fopenmp ThreadNamePrctl.c -o thread-name-prctl
Note: -g is only needed for convenience when using gdb.

$ env OMP_NUM_THREADS=4 gdb ./thread-name-prctl
> break sleep
> r
> info threads
Id   Target Id         Frame
4    Thread 0x7ffff642d700 (LWP 26499) "omp-03" 0x00007ffff76e1590 in sleep () from /lib64/libc.so.6
3    Thread 0x7ffff6c2e700 (LWP 26498) "omp-02" 0x00007ffff76e1590 in sleep () from /lib64/libc.so.6
2    Thread 0x7ffff742f700 (LWP 26497) "omp-01" 0x00007ffff76e1590 in sleep () from /lib64/libc.so.6
* 1    Thread 0x7ffff7fb9760 (LWP 26493) "omp-00" 0x00007ffff76e1590 in sleep () from /lib64/libc.so.6


The named threads will also show up with the same name in top (after pressing "H"):

...
26504 user-name  20   0 34940  636  500 S      0  0.0   0:00.00 omp-00
26508 user-name  20   0 34940  636  500 S      0  0.0   0:00.00 omp-01
26509 user-name  20   0 34940  636  500 S      0  0.0   0:00.00 omp-02
26510 user-name  20   0 34940  636  500 S      0  0.0   0:00.00 omp-03
...


Notes:

• The length of a thread name is limited to 16 bytes including the terminating null byte, i.e. 15 visible characters.
• Not all versions of gdb support this feature.
• Compile with -pthread if no OpenMP is used.
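The length limit from the notes can be checked by reading the name back with prctl(PR_GET_NAME). The helper `rename_current_thread` below is a made-up name for illustration, and the sketch is Linux-specific. It relies on the documented behaviour that PR_SET_NAME silently truncates names that do not fit into 16 bytes.

```c
#include <string.h>
#include <sys/prctl.h>

/* Set the name of the calling thread and read it back into out,
 * which must provide at least 16 bytes. Returns 0 on success. */
int rename_current_thread(const char *name, char out[16])
{
    if (prctl(PR_SET_NAME, name, 0, 0, 0) != 0)
        return -1;
    memset(out, 0, 16);
    return prctl(PR_GET_NAME, out, 0, 0, 0);
}
```

Passing "worker-00" yields the same string back, whereas a 19-character name such as "0123456789abcdefXYZ" comes back as its first 15 characters.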