

# Introduction to IA-32 and IA-64: Architectures, Tools and Libraries

G. Hager

Regionales Rechenzentrum Erlangen (RRZE) HPC Services

# Outline

## IA-32 Architecture

- Architectural basics
- Optimization with SIMD operations
- Cluster computing on IA-32 basis

## IA-64 Architecture

- Intel's EPIC concept
- Available system architectures
- Peculiarities of IA-64 application performance

## Libraries

- Optimized BLAS/LAPACK

## Tools

- Intel IA-32 and IA-64 compilers
- VTune Performance Analyzer





IA-32

## IA-32 Architecture Basics – a Little History



- IA-32 has roots dating back to the early 80s
  - Intel's first 16-bit CPU: 8086 with 8087 math coprocessor (x86 is born)
  - Even the latest Pentium 4 is still binary compatible with 8086
  - loads of advances over the last 20 years:
    - addressing range (1MB → 16MB → Terabytes)
    - protected mode (80286)
    - 32 bit GPRs and usable protected mode (80386)
    - on-chip caches (80486)
    - SIMD extensions and superscalarity (Pentium)
    - CISC-to-RISC translation and out-of-order superscalar processing (Pentium Pro)
    - floating-point SIMD with SSE and SSE2 (Pentium III/4)
- Competitive High Performance Computing was only possible starting with Pentium III
  - Pentium 4 is today rivaling all other established processor architectures

## **IA-32 Architecture Basics**





#### IA-32 Architecture Basics: Pentium 4 Block Diagram








## IA-32 Architecture Basics: What Makes the Pentium 4 so fast?



- every pipeline stage is kept as simple as possible, leading to very long pipelines (20 stages)
- L1 data cache is tiny but fast (only 2 cycles latency)
- lower instruction level parallelism (ILP) than previous IA-32 designs

- L1 instruction cache
  - long pipelines lead to large penalties for branch mispredictions
  - L1I cache takes pre-decoded instructions in order to reduce pipeline fill-up latency
- P4 can very accurately predict conditional branches
- Integer add, subtract, inc, dec, logic, cmp, test all take only 0.5 clock cycles to execute
  - latency and throughput are identical for those operations

### IA-32 Architecture Basics: Floating Point Operations



- Nobody should use this any more
- "Sensible SIMD" came about with SSE (Pentium III) and SSE2 (Pentium 4) – Streaming SIMD Extensions

#### Register Model:

| Register | Width    |
|----------|----------|
| xmm0     | 128 bits |
| xmm1     | 128 bits |
| xmm2     | 128 bits |
| xmm3     | 128 bits |
| xmm4     | 128 bits |
| xmm5     | 128 bits |
| xmm6     | 128 bits |
| xmm7     | 128 bits |

- Each register can be partitioned into several integer or FP data types
  - 8 to 128-bit integers
  - single (SSE) or double precision (SSE2) floating point
- SIMD instructions can operate on the lowest or all partitions of a register at once











IA-32/IA-64 Introduction




13.10.2003

georg.hager@rrze.uni-erlangen.de


## IA-32 Architecture Basics: Programming With SIMD Extensions

- Special C++ data types map to SSE registers
  - FP types:
    - F32vec4: 4 single-precision FP numbers
    - F32vec1: 1 single-precision FP number
    - F64vec2: 2 double-precision FP numbers
  - Integer types: Is32vec4, Is64vec2, etc.
- C++ operator+ and operator\* are overloaded to accommodate operations on those types
  - programmer must take care of remainder loops manually
- Alignment issues arise when using SSE data types
  - compiler intrinsics and command line options control alignment
  - uncontrolled unaligned access to SSE data will induce runtime exceptions!






## IA-32 Architecture Basics: Programming With SIMD Extensions

- Alignment issues
  - alignment of arrays in SSE calculations should be on 16-byte boundaries
  - other alternatives: use explicit unaligned load operations (not covered here)
  - How is manual alignment accomplished? Two alternatives:
    - manual alignment of structures and arrays with `__declspec(align(16)) <declaration>;`
    - dynamic allocation of aligned memory (align = alignment boundary):

```
void* _mm_malloc (int size, int align);
void _mm_free (void *p);
```

#### IA-32 Architecture Basics: Hyperthreading



- Hyper-Threading Technology enables multi-threaded software to execute tasks in parallel within each processor
- Duplicates architectural state allowing 1 physical processor to appear as 2 "logical" processors to software (operating system and applications)
- One set of shared execution resources (caches, FP, ALU, dispatch, etc.)
  - only registers and a few other things are duplicated








## IA-32 Architecture Basics: Hyperthreading



- What is the advantage of HT?
  - makes better use of one CPU's resources
  - CPU is faster under high workload (many processes/threads)
  - helps overall throughput, not the performance of an individual thread
- What does HT not do?
  - cannot speed up one single process/thread (can even slow it down)
  - does not give you more resources per CPU
- Who can benefit from HT?
  - workloads in which different threads use different functional units (e.g. integer & fp operations, respectively)
- Where is HT useless?
  - purely floating-point code that uses the FP units continuously
  - code which is very sensitive to cache size
  - spin waits do not free resources
- With suitable applications, speedups of 30% per node are possible

# **IA-32 System Architecture**





# IA-32 Clustering

- Due to its unrivaled price/performance ratio the Pentium 4 is very suitable for cluster computing on any scale
- "Poor man's supercomputer": Go to ALDI and buy a bunch of boxes and a Fast Ethernet switch (100 Mbit)
  - might be perfectly well suited for many applications
  - Other end: Quadrics Elan3 interconnect
    - expensive, but at least 30 times more communication bandwidth than Fast Ethernet
    - far more than \$1000 per node "just for the network"
- Common setups (compromise between speed and purse)
  - Myrinet (LRZ Linux Cluster)
  - Gbit Ethernet (RRZE Linux cluster)

## IA-32 Cluster at RRZE



IA-64




| Template | Slot 0 | Slot 1 | Slot 2 |
|----------|--------|--------|--------|
| 0E       | M      | M      | F      |
| 0F       | M      | M      | F ;;   |
| 0B       | M ;;   | M      | I ;;   |









## Itanium2: Intel's current 64-bit processor (processor specifications)





























# Libraries & Tools for Intel Architectures

## **High-Performance Libraries**



- Important functionality for every architecture: optimized dense linear algebra (BLAS, LAPACK) and FFT libs
  - "vanilla code" from http://www.netlib.org/ is unsuitable performance-wise
  - optimized versions available from Intel and other sources
- Intel's High Performance LAPACK/BLAS/FFT package: Math Kernel Library (MKL)
  - complete BLAS 1/2/3 and LAPACK3 implementation
  - FFT functions
  - commercial, but free (beer) for personal use
- Alternative: Goto's High Performance BLAS
  - approx. 10% faster than MKL for matrix-matrix operations
  - http://www.cs.utexas.edu/users/flame/goto/
- Intel's Integrated Performance Primitives (IPP)
  - special subroutines esp. for multimedia processing

# **Using Intel MKL**







## **Intel Compilers**







## Intel Compilers: Basics of Usage, Endianness Conversion





#### Intel Compilers: Important Options


| Option (not processor-specific) | Description                                                                        |
|---------------------------------|------------------------------------------------------------------------------------|
| -g                              | include debugging information in binary; can be combined with optimization options |
| -qp, -p                         | compile for profiling with gprof                                                   |
| -f[no-]alias                    | assume there is [no] aliasing in program; esp. suitable for C(++) and F90          |
| -openmp                         | enable OpenMP directives                                                           |
| -openmpS                        | compile OpenMP program as serial program; use stub OpenMP library                  |
| -syntax                         | check program syntax only; do not generate code                                    |
| -ipo                            | enable interprocedural optimizations across files                                  |

## Intel Compilers: Important Options


| Option (not processor-specific) | Description                                                                 |
|---------------------------------|-----------------------------------------------------------------------------|
| -O3                             | high-level optimizations (loop nest, prefetching, unrolling, ...)           |
| -opt_report                     | print optimization report to stderr                                         |
| -opt_report_level[min med max]  | verbosity level of optimization report                                      |
| -opt_report <i>phase</i>        | report only for certain optimization phases                                 |
| -opt_report_help                | print available optimization phases to report on                            |
| -parallel                       | enables auto-parallelization of loops                                       |
| -par_report <i>level</i>        | report on parallelization success with different verbosity (0 to 3, default: 1) |


## Intel Compilers: Important Options

| Option (IA-32 specific / IA-64 specific) | Description                                                                                                |
|------------------------------------------|------------------------------------------------------------------------------------------------------------|
| -tpp7                                    | optimize for Pentium 4 and Xeon                                                                            |
| -xW                                      | use SSE2 extensions when possible; code will only run on SSE2 capable architectures                        |
| -vec[-]                                  | enable [disable] vectorizer                                                                                |
| -vec_reportn                             | print diagnostic information about vectorization; levels 0 to 5, default is 1                              |
| -rcd                                     | use rounding instead of truncation for float-to-int conversions in C++; faster, but not standard-conforming |
| -ftz                                     | flush denormals to zero; faster, but not IEEE compliant                                                    |
| -ivdep_parallel                          | when a loop is marked by !DIR\$ IVDEP, assume there is no loop-carried dependency                          |

## Intel VTune Performance Analyzer





## References




- R. Gerber: *The Software Optimization Cookbook.* High Performance Recipes for the Intel Architecture. Intel Press (2002)
  - good introduction, must be complemented with compiler and architecture documentation
- W. Triebel et al: *Programming Itanium-based Systems.* Developing High Performance Applications for Intel's New Architecture. Intel Press (2001)
  - extremely detailed, suitable for assembler programmers
  - slightly outdated
- http://developer.intel.com/
  - tutorials, manuals, white papers, discussion forums etc.
- c't Magazine 13/2003, several articles (25th birthday of Intel's x86 architecture)