Whoever has chosen the name "Zen" for AMD's next generation x86 core, might have had the number four in mind, which plays an important role in this philosophy (e.g. Four Dharmadhātu). At least this is what a recent patch revealed about this long awaited microarchitecture.
Andreas Stiller speculates that the term Zen as in "SuZen" might be related to Zen team leader Suzanne Plummer and possibly Lisa Su as well. An article on myStatesman, which appeared shortly after Jim Keller's leave, lists some more team member names if you magnify the photo:
Mike Clark is a true AMD veteran, being there since 1993. Some have developed the Cat cores, like Teja Singh and Joshua Bell, who presented the Jaguar microarchitecture at ISSCC 2013.
As heard earlier this year, Zen will use SMT and an improved cache subsystem while being designed from scratch with new ideas combined with reusing existing components (to reduce the effort). This might even include already existing and somewhat developed ideas not realized in previous designs. A lot of the new functionality has been filed for patenting. For example there was a mention of checkpointing, which is good for quick reversion of mispredicted branches and other reasons for restarting the pipelines. Some patents suggest, that Zen might use some slightly modified Excavator branch prediction.
And the new patch also suggests nicely low int/fp mul, fp add, int/fp div and fp square root latencies. Some of these lower latencies (div/sqrt) were introduced with Excavator, as an Aida64 instruction latency dump provided by Anandtech forum user monstercameron revealed. Due to an Aida problem with measured and reported clock frequencies (although it was fixed at 1.4GHz), you have to multiply the measured times by 1.4 to get the real number of cycles. Ok, back to Zen.
Here are some quotes of the patch file:
+;; Decoders unit has 4 decoders and all of them can decode fast path +;; and vector type instructions.
+;; Integer unit 4 ALU pipes.
+;; 2 AGU pipes.
+;; Floating point unit 4 FP pipes.
+ 32, /* size of l1 cache. */ + 512, /* size of l2 cache. */
- 4 wide decoders
- 4 integer ALUs
- 2 AGUs (for 2R 1W L1 cache according to a LinkedIn profile)
- 4 FP pipelines
There is a lot more information, which I will collect over the next days. Some stuff is copy pasted from Excavator (bdver4) or Jaguar (btver2) and modified then. But careful comparing did show some clear differences, while at other places it's not clear, if there is new information or not (e.g. div latencies). But as btver2 has 2048 kB L2 and the rest of the block is more similar to bdver4 or btver2 than btver1 (Bobcat), which has 512 kb L2, it looks like no btver1 files were used as a source. So I assume, that this is a new entry of an L2 cache size, indicating fast L2 caches per core. The L1 data cache still has the same size as that of Jaguar or Excavator. Some patents mention an 8-way 32kb L1 D$.
Interestingly, as there are two 128b FP mul and two 128b FP add units (with only 3 cycles latency for these ops), the FMA instructions will be executed by combining one FP MUL and one FP ADD unit, resulting in 2 issues and 5 cycles latency (as that of the Bulldozer family). This saves some register file ports and increases throughput and reduces latencies of the more common FP ops. It even remembers me of the bridged FMA unit.
These latencies also clearly suggest, that this is no high clock frequency design. But at 14nm (or 16nm from TSMC as some rumours suggest) clocks of 3.5 to 4 GHz should be reachable without stretching the thermal limits too much.
This should be enough for now. Here is a schematic, which should come close to what Zen might really look like:
|AMD Zen Core Microarchitecture (with some speculated parts)|