Intel’s port 7 AGU blunder

Everyone’s got their pet peeves: for Poempelfox it’s Schiphol Airport, for Andreas Stiller it’s the infamous A20 gate. Compared to those glorious fails, my favorite tech blunder is a rather measly one, and it may not be relevant to many users in practice. However, it’s not so much its importance as the total mystery of how it came to happen. So here’s the thing.

Sandy Bridge and Ivy Bridge LOAD and STORE units, AGUs, and their respective ports.

The Intel Sandy Bridge and Ivy Bridge architectures have six execution ports, two of which (#2 & #3) feed one LOAD pipeline each and one (#4) feeds a STORE pipe. These units are capable of transferring 16 bytes of data per cycle each. With AVX code, the core is thus able to sustain one full-width 32-byte LOAD (in two adjacent 16-byte chunks) and one half of a 32-byte STORE per cycle. But the LOAD and STORE ports are not the only thing that’s needed to execute these instructions – the core must also generate the corresponding memory addresses, which can be rather complicated. In a LOAD instruction like:

vmovupd ymm0, [rdx+rsi*8+32]

the memory address calculation involves two integer add operations and a shift. It is the task of the address generation units (AGUs) to do this. Each of ports 2 and 3 serves an AGU in addition to the LOAD unit, so the core can generate two addresses per cycle – more than enough to sustain the maximum LOAD and STORE throughput with AVX.

The peculiar configuration of LOAD and STORE units and AGUs causes some strange effects. For instance, if we execute the Schönauer vector triad:

A(:) = B(:) + C(:) * D(:)

with AVX on Ivy Bridge, it is clearly LOAD bound with a maximum in-core throughput of one AVX iteration in 3 cycles, i.e., 8/3 flop/cy. The three LOADs take three cycles, the STORE needs two, the arithmetic can be done in one, and the AGUs are not even 100% busy since we only require 4 addresses every 3 cycles. The compiler typically employs sufficient unrolling on top of AVX to reduce the overhead of the “loop mechanics” to nil. On our Ivy Bridge I can measure up to 98% of the theoretical limit with a working set that fits into L1 (992 elements per array).

However, with SSE (16-byte wide) instructions, the game is different: The LOAD ports now act independently, each requiring an address to execute a LOAD. The STORE port can issue one 16-byte STORE per cycle, so in principle it should be possible to execute two LOADs and one STORE per cycle, and there may even be room for another instruction before we hit the absolute limit of four micro-ops per cycle. However, with only two AGUs, that’s impossible: The core can either do two LOADs or one LOAD and one STORE per cycle. The Schönauer triad will thus execute in two cycles per SSE iteration, or 2 flop/cy, which is also easily in reach with compiler-generated code. Neither the LOAD nor the STORE units are continuously busy, and the bottleneck is actually the AGUs’ throughput. As a consequence, the AVX code is only a factor of 4/3 faster than the SSE version (and 8/3 times faster than the scalar code, for that matter), although the SIMD width alone makes a different promise.

The port 7 AGU on Haswell

Haswell has 32-byte wide LOAD/STORE units and an additional port (#7) serving a third AGU.

With that third AGU, Haswell should be able to sustain two full-width LOADs and one full-width STORE per cycle, i.e., execute the triad in 1.5 cycles per AVX iteration (16/3 flop/cy). Alas! It is not. Whatever you do, compiler-generated code (perfect as it may appear) only runs at 4 flop/cy, i.e., 2 cy per AVX2 iteration. This is the generated assembly code (Intel compiler 19.02, options -Ofast -xAVX2):

..B2.8:
vmovupd ymm2, [r13+r8*8]
vmovupd ymm4, [32+r13+r8*8]
vmovupd ymm6, [64+r13+r8*8]
vmovupd ymm8, [96+r13+r8*8]
vmovupd ymm1, [r12+r8*8]
vmovupd ymm3, [32+r12+r8*8]
vmovupd ymm5, [64+r12+r8*8]
vmovupd ymm7, [96+r12+r8*8]
vfmadd213pd ymm2, ymm1, [rbx+r8*8]      # c(i)*d(i)+b(i); rbx = base of b() (assumed)
vfmadd213pd ymm4, ymm3, [32+rbx+r8*8]
vfmadd213pd ymm6, ymm5, [64+rbx+r8*8]
vfmadd213pd ymm8, ymm7, [96+rbx+r8*8]
vmovupd [r14+r8*8], ymm2
vmovupd [32+r14+r8*8], ymm4
vmovupd [64+r14+r8*8], ymm6
vmovupd [96+r14+r8*8], ymm8
add r8, 16
cmp r8, r9
jb ..B2.8

Nothing out of the ordinary here: The compiler has unrolled the loop four times on top of AVX2, which requires the core to execute the 19 instructions in 4 × 1.5 = 6 cycles in order to attain the theoretical maximum. I’ve taken care to align the arrays to cache line boundaries, so any residual split cache line load penalty does not apply. So what’s going on here? (If you think “So what? 2 instead of 1.5 cycles? In L1? Who cares?”, you may be right. But that’s not the point here. Go away and leave the bean counters alone.)

There is a slight hint in Table 2-14 of the September 2019 Intel 64 and IA-32 Architectures Optimization Reference Manual: It lists a “Simple_AGU” connected to port 7. Some more digging in Agner’s forum and eventually asking an Intel representative got me the answer: The AGU behind port 7 can only handle “simple” addressing modes, more specifically those of the form [BASE+OFFSET]. Using an index register is ruled out. The compiler (almost) always uses indexed addressing, so the port 7 AGU is never used and we’re back to two AGUs.

“Optimized” assembly

Hacking the assembly to check this hypothesis is easy (the changed lines are the four STOREs, the pointer update, and the push/pop pair):

  push r14
..B2.8:
vmovupd ymm2, [r13+r8*8]
vmovupd ymm4, [32+r13+r8*8]
vmovupd ymm6, [64+r13+r8*8]
vmovupd ymm8, [96+r13+r8*8]
vmovupd ymm1, [r12+r8*8]
vmovupd ymm3, [32+r12+r8*8]
vmovupd ymm5, [64+r12+r8*8]
vmovupd ymm7, [96+r12+r8*8]
vfmadd213pd ymm2, ymm1, [rbx+r8*8]
vfmadd213pd ymm4, ymm3, [32+rbx+r8*8]
vfmadd213pd ymm6, ymm5, [64+rbx+r8*8]
vfmadd213pd ymm8, ymm7, [96+rbx+r8*8]
vmovupd [r14], ymm2
vmovupd [32+r14], ymm4
vmovupd [64+r14], ymm6
vmovupd [96+r14], ymm8
add r14, 128
add r8, 16
cmp r8, r9
jb ..B2.8
pop r14

Now all stores use only [BASE+OFFSET] addressing. Register r14, the base address of the target array, must be updated within the loop, and to make it compatible with the surrounding mumbo-jumbo (remainder loops and such) I save it on the stack before the loop and restore it afterwards.

Running this code I could observe two things: (1) the performance is now consistently above 4 flop/cy, so the two-AGU limit must have been overcome, and (2) although some runs attain up to 92% of the 16/3 flop/cy limit, there is a considerable run-to-run variation on the order of 5%. One can only speculate that the scheduler sometimes still puts STORE address generation on port 2 or 3, stealing cycles that the LOADs need there since LOAD addresses cannot come from the port 7 AGU. Intel’s documents provide no insight here, but LIKWID does:

$ likwid-perfctr -C 1 -g UOPS_EXECUTED_PORT_PORT_7:PMC1,AVX_INSTS_CALC:PMC2 ./a.out

This counts overall AVX computation instructions (of sorts; see this for details) and the number of µops on port 7. It turns out that low performance is strongly correlated with a lower utilization of port 7, and whenever performance is good the two numbers are very close together. I’d love to study this more quantitatively, but this is something for a future post.

Long story short, port 7 cannot handle indexed addressing, and the compiler doesn’t give a s#!t. I’ve seen it generate LEA instructions sometimes, but I have no idea about the heuristics behind that. See also this post for some more detailed info on which instructions cannot be handled, address-wise, by port 7.

There is not much more to say. All Intel architectures from Haswell to Cascade Lake have the simple-AGU problem at port 7. On Skylake and Cascade Lake, the LOAD/STORE throughput doubles anyway because of the 64-byte SIMD width, but the unmodified, compiler-generated (don’t forget the “secret weapon” -qopt-zmm-usage=high) Schönauer triad is still AGU bound. The next new thing will be Ice Lake, but I still have to get my hands on one of those. The September 2019 optimization manual has a section 2.1 about Ice Lake, which states that the new ports 8 and 9 serve a new STORE address AGU and another STORE unit, respectively. No mention of “simple addresses,” but also no clear statement about AGU capabilities.