The following table gives a short overview of the main HPC systems at RRZE and how the number of cores, clock frequency, peak performance, etc. have developed over the years:
| | original Woody (w0xxx) | w10xx | w11xx | w12xx + w13xx | LiMa | Emmy | Meggie |
|---|---|---|---|---|---|---|---|
Year of installation | Q1/2007 | Q1/2012 | Q4/2013 | Q1/2016 + Q4/2016 | Q4/2010 | Q4/2013 | Q4/2016 |
total number of compute nodes | 222 | 40 | 72 | 8 + 56 | 500 | 560 | 728 |
total number of cores | 888 | 160 | 288 | 32 + 224 | 6000 | 11200 | 14560 |
double precision peak performance of the complete system | 10 TFlop/s | 4.5 TFlop/s | 15 TFlop/s | 1.8 + 12.5 TFlop/s | 63 TFlop/s | 197 TFlop/s | ~0.5 PFlop/s (assuming the non-AVX base frequency as AVX turbo frequency) |
increase of peak performance of the complete system compared to Woody | 1.0 | 0.4 | 1.5 | 0.2 + 1.25 | 6.3 | 20 | 50 |
max. power consumption of compute nodes and interconnect | 100 kW | 7 kW | 10 kW | 2 + 6.5 kW | 200 kW | 225 kW | ~200 kW |
Intel CPU generation | Woodcrest | SandyBridge (E3-1280) | Haswell (E3-1240 v3) | Skylake (E3-1240 v5) | Westmere-EP (X5650) | IvyBridge-EP (E5-2660 v2) | Broadwell-EP (E5-2630 v4) |
base clock frequency | 3.0 GHz | 3.5 GHz | 3.4 GHz | 3.5 GHz | 2.66 GHz | 2.2 GHz | 2.2 GHz |
number of sockets per node | 2 | 1 | 1 | 1 | 2 | 2 | 2 |
number of (physical) cores per node | 4 | 4 | 4 | 4 | 12 | 20 | 20 |
SIMD vector length | 128 bit (SSE) | 256 bit (AVX) | 256 bit (AVX+FMA) | 256 bit (AVX+FMA) | 128 bit (SSE) | 256 bit (AVX) | 256 bit (AVX+FMA) |
maximum single precision peak performance per node | 96 GFlop/s | 224 GFlop/s | 435 GFlop/s | 448 GFlop/s | 255 GFlop/s | 704 GFlop/s | 1408 GFlop/s |
peak performance per node compared to Woody | 1.0 | 2.3 | 4.5 | 4.7 | 2.6 | 7.3 | 14.7 |
single precision peak performance of serial, non-vectorized code | 6 GFlop/s | 7.0 GFlop/s | 6.8 GFlop/s | 7.0 GFlop/s | 5.3 GFlop/s | 4.4 GFlop/s | 4.4 GFlop/s |
performance for unoptimized serial code compared to Woody | 1.0 | 1.17 | 1.13 | 1.17 | 0.88 | 0.73 | 0.73 |
main memory per node | 8 GB | 8 GB | 8 GB | 16 GB / 32 GB | 24 GB | 64 GB | 64 GB |
memory bandwidth per node | 6.4 GB/s | 20 GB/s | 20 GB/s | 25 GB/s | 40 GB/s | 80 GB/s | 100 GB/s |
memory bandwidth compared to Woody | 1.0 | 3.1 | 3.1 | 3.9 | 6.2 | 13 | 15.6 |
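
The per-node figures in the table follow from a simple product: physical cores per node × clock frequency × single-precision flops per core per cycle (SIMD lanes × floating-point pipelines × flops per instruction); the serial, non-vectorized rate is just clock × 2 (one add and one mult per cycle). The small C sketch below reproduces a few of the table entries from these factors; the flops-per-cycle values are the usual ones for the respective microarchitectures and serve only to illustrate the arithmetic:

```c
#include <stdio.h>

/* Per-node SP peak = cores * clock (GHz) * SP flops per core per cycle.
 * Flops per cycle: SSE = 4 lanes * 2 pipes = 8, AVX = 8 * 2 = 16,
 * AVX+FMA = 8 lanes * 2 FMA units * 2 flops = 32.                      */
struct node { const char *name; int cores; double ghz; int flops_per_cycle; };

int main(void) {
    const struct node nodes[] = {
        { "Woody  (Woodcrest, SSE)",        4, 3.00,  8 },
        { "w10xx  (SandyBridge, AVX)",      4, 3.50, 16 },
        { "LiMa   (Westmere-EP, SSE)",     12, 2.66,  8 },
        { "Emmy   (IvyBridge-EP, AVX)",    20, 2.20, 16 },
        { "Meggie (Broadwell-EP, AVX+FMA)",20, 2.20, 32 },
    };
    for (size_t i = 0; i < sizeof nodes / sizeof nodes[0]; ++i)
        printf("%-32s %6.0f GFlop/s per node, %4.1f GFlop/s scalar\n",
               nodes[i].name,
               nodes[i].cores * nodes[i].ghz * nodes[i].flops_per_cycle,
               nodes[i].ghz * 2.0);   /* serial, non-vectorized: clock * 2 */
    return 0;
}
```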
If one only looks at the peak performance of the complete systems, the world looks bright for Emmy: a 20x increase over Woody within six years. Not bad.
However, if one has an unoptimized (i.e. non-vectorized) serial code which is compute bound, its speed on the system installed in 2013 will only be 73% of that on the one bought in 2007! Unoptimized serial code can benefit neither from the wider SIMD units nor from the increased number of cores per node, but it does suffer from the decreased clock frequency, as the sketch below illustrates.
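
A small, hypothetical C example to make this concrete: the first loop carries a dependency from one iteration to the next, so compilers typically leave it scalar and it runs at the per-core scalar rate from the table; the second loop has independent iterations and can be auto-vectorized (e.g. with `-O3` and a suitable `-march` option), exploiting the full SIMD width and, with OpenMP or MPI on top, all cores of a node:

```c
/* Hypothetical example: loop-carried dependency, stays scalar. */
double prefix_sum(const double *a, double *out, long n) {
    double acc = 0.0;
    for (long i = 0; i < n; ++i) {
        acc += a[i];       /* each add depends on the previous result */
        out[i] = acc;
    }
    return acc;
}

/* Independent iterations: auto-vectorizable, uses SSE/AVX/FMA if available. */
void scale_add(double *y, const double *x, double alpha, long n) {
    for (long i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```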
But optimized parallel codes are challenged as well: the degree of parallelism (cores times SIMD lanes) increased from Woody to Emmy by a factor of 25. Remember Amdahl's law for strong scaling? To scale up to 20 parallel processes, it is enough if 95% of the runtime can be executed in parallel, i.e. 5% may remain serial. To scale up to 11200 processes, however, less than 0.01% of the runtime may remain serial, and there must be no other overhead, e.g. due to communication or halo exchange!
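
The Amdahl arithmetic behind these numbers can be written down in a few lines of C (this is only the textbook formula, with serial fractions chosen to match the statements above; communication and other overheads are not modeled):

```c
#include <stdio.h>

/* Amdahl's law for strong scaling: with serial fraction s, the speedup on
 * N processes is S(N) = 1 / (s + (1 - s) / N); the limit for N -> inf is 1/s. */
static double speedup(double s, double n) { return 1.0 / (s + (1.0 - s) / n); }

int main(void) {
    /* 5% serial fraction: the achievable speedup is capped at 1/0.05 = 20. */
    printf("s = 5%%      : S(20)    = %7.1f   limit 1/s = %8.1f\n",
           speedup(0.05, 20.0), 1.0 / 0.05);
    /* For Emmy's 11200 cores, even s = 1/11200 (~0.009%) already costs half
     * of the possible speedup; approaching S = 11200 needs s far below that. */
    double s = 1.0 / 11200.0;
    printf("s = 1/11200 : S(11200) = %7.1f   limit 1/s = %8.1f\n",
           speedup(s, 11200.0), 1.0 / s);
    return 0;
}
```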