Saturday, October 3, 2015

AMD's Zen core (family 17h) to have ten pipelines per core

I have moved my writing about Zen here, since blog.de will close its service at the end of this year. That's it. Let's move on to the interesting stuff.

Whoever chose the name "Zen" for AMD's next-generation x86 core might have had the number four in mind, which plays an important role in this philosophy (e.g. the Four Dharmadhātu). At least, the number four shows up repeatedly in what a recent patch revealed about this long-awaited microarchitecture.

Andreas Stiller speculates that the term Zen as in "SuZen" might be related to Zen team leader Suzanne Plummer and possibly Lisa Su as well. An article on myStatesman, which appeared shortly after Jim Keller's departure, lists some more team member names if you magnify the photo:

Mike Clark, front left, and team leader Suzanne Plummer, and in background from left are Teja Singh, Lyndal Curry, Mike Tuuk, Farhan Rahman, Andy Halliday, Matt Crum, Mike Bates and Joshua Bell.

Mike Clark is a true AMD veteran, having been there since 1993. Some of the team members developed the Cat cores, like Teja Singh and Joshua Bell, who presented the Jaguar microarchitecture at ISSCC 2013.

As heard earlier this year, Zen will use SMT and an improved cache subsystem. It is being designed from scratch, combining new ideas with reused existing components (to reduce the effort). This might even include existing, partly developed ideas that were not realized in previous designs. Patents have been filed for a lot of the new functionality. For example, there was a mention of checkpointing, which helps to recover quickly from mispredicted branches and from other events that require restarting the pipelines. Some patents suggest that Zen might use a slightly modified Excavator branch predictor.

The new patch also suggests nicely low int/fp mul, fp add, int/fp div and fp square root latencies. Some of these lower latencies (div/sqrt) were introduced with Excavator, as an Aida64 instruction latency dump provided by Anandtech forum user monstercameron revealed. Due to an Aida problem with measured and reported clock frequencies (although the clock was fixed at 1.4 GHz), you have to multiply the measured times by 1.4 to get the real number of cycles. Ok, back to Zen.

Here are some quotes from the patch file:

+;; Decoders unit has 4 decoders and all of them can decode fast path
+;; and vector type instructions.
+;; Integer unit 4 ALU pipes.
+;; 2 AGU pipes.
+;; Floating point unit 4 FP pipes.
+  32, /* size of l1 cache.  */
+  512, /* size of l2 cache.  */

Excerpt:
  • 4 wide decoders
  • 4 integer ALUs
  • 2 AGUs (for 2R 1W L1 cache according to a LinkedIn profile)
  • 4 FP pipelines
That makes ten pipelines with a generally four-wide design.
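The count is easy to check with a minimal sketch. The pipe counts are taken from the patch comments quoted above; the dictionary and function names are made up for illustration and do not appear in the patch.

```python
# Execution pipes per core, as listed in the quoted patch comments.
# The names ZEN_PIPES, DECODE_WIDTH and total_pipelines are illustrative only.
ZEN_PIPES = {
    "ALU": 4,  # integer ALU pipes
    "AGU": 2,  # address generation pipes
    "FP": 4,   # floating point pipes
}
DECODE_WIDTH = 4  # decoders, each able to handle fast path and vector instructions


def total_pipelines(pipes):
    """Sum the execution pipes of all unit types."""
    return sum(pipes.values())


if __name__ == "__main__":
    print("pipelines per core:", total_pipelines(ZEN_PIPES))
    print("decode width:", DECODE_WIDTH)
```

Note that the four decoders are counted as front-end width here, not as execution pipelines.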

There is a lot more information, which I will collect over the next few days. Some of it was copied from Excavator (bdver4) or Jaguar (btver2) and then modified. A careful comparison did show some clear differences, while in other places it's not clear whether there is new information or not (e.g. div latencies). But as btver2 has 2048 kB of L2 and the rest of the block is more similar to bdver4 or btver2 than to btver1 (Bobcat), which has 512 kB of L2, it looks like no btver1 files were used as a source. So I assume that the 512 kB value is a new L2 cache size entry, indicating fast per-core L2 caches. The L1 data cache still has the same size as that of Jaguar or Excavator. Some patents mention an 8-way 32 kB L1 D$.

Interestingly, as there are two 128b FP mul and two 128b FP add units (with only 3 cycles latency for these ops), the FMA instructions will be executed by combining one FP MUL and one FP ADD unit, resulting in 2 issues and 5 cycles latency (the same as in the Bulldozer family). This saves some register file ports, increases throughput, and reduces the latencies of the more common FP ops. It even reminds me of the bridged FMA unit.
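To put rough numbers on this, here is a small back-of-the-envelope sketch. The unit counts, the 128 bit width and the latencies are the ones read from the patch above. Everything else is an assumption of mine: that each unit can start one op per cycle, that both MUL/ADD pairs can work on FMAs at the same time, and that wider ops are simply split into 128 bit halves. All names are illustrative.

```python
# Values as read from the patch in this post.
VECTOR_BITS = 128   # assumed width of each FP unit
FMUL_UNITS = 2
FADD_UNITS = 2
FMUL_LATENCY = 3    # cycles
FADD_LATENCY = 3    # cycles
FMA_LATENCY = 5     # cycles, one FMUL unit combined with one FADD unit


def lanes(element_bits, vector_bits=VECTOR_BITS):
    """Number of elements that fit into one FP unit."""
    return vector_bits // element_bits


def peak_flops_per_cycle(element_bits):
    """Return (separate mul/add, FMA) peak FLOPs per cycle.

    Assumes one op started per unit per cycle (not confirmed by the patch).
    """
    n = lanes(element_bits)
    separate = (FMUL_UNITS + FADD_UNITS) * n
    fma_pairs = min(FMUL_UNITS, FADD_UNITS)  # each FMA occupies one MUL and one ADD unit
    fused = fma_pairs * n * 2                # an FMA counts as two FLOPs per lane
    return separate, fused


def uops_for(op_bits, vector_bits=VECTOR_BITS):
    """Number of unit-wide pieces an op is split into (ceiling division)."""
    return -(-op_bits // vector_bits)


if __name__ == "__main__":
    print("SP FLOPs/cycle (separate, FMA):", peak_flops_per_cycle(32))
    print("DP FLOPs/cycle (separate, FMA):", peak_flops_per_cycle(64))
    print("pieces for a 256 bit op:", uops_for(256))
    print("dependent mul then add:", FMUL_LATENCY + FADD_LATENCY, "cycles vs FMA:", FMA_LATENCY)
```

Under these assumptions the peak rate is the same with or without FMA. The FMA only saves one cycle of latency compared to a dependent multiply followed by an add, while plain multiplies and adds keep their short 3 cycle latency.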

These latencies also clearly suggest that this is not a high clock frequency design. But at 14 nm (or 16 nm from TSMC, as some rumours suggest), clocks of 3.5 to 4 GHz should be reachable without stretching the thermal limits too much.

This should be enough for now. Here is a schematic that should come close to what Zen might really look like:
AMD Zen Core Microarchitecture (with some speculated parts)

14 comments:

Heikki said...

Is there any actual source which states/hints that the FPU's are (only) 128 bit wide?

ClausDK said...

AMD's own Zen slide (May 2015) shows 2x256bit and 6 integer units. Lots of speculation ;)

Heikki said...

Those May 2015 slides are fakes.

ClausDK said...

Oh, didn't notice that, but yes, can't find it among the real 2015 slides I have either, true.

Lo Absoluto said...

There are two 128-bit floating point add units and two 128-bit floating point multiply units. 256-bit operands are split into a high and a low 128-bit half.

Heikki said...

I asked for a SOURCE for this claim of the FPUs being 128 bits wide, not for it to be repeated. Repeating anything without a source does not make it any more true.

Matthias Waldhauer said...

The FPU units seem to be 128 bit wide because many common SSE and 128 bit AVX instructions are of type fast-path decode (one uop) while equivalent 256 bit AVX ops are double decode (going to 2 pipelines or maybe even to one but sequentially).

And one FMAC grouping is not 256 bit wide but does the 128 bit FMUL first, accompanied by the FADD unit for the final result. So these 2 "FMACs" are 128 bit wide too. And if there is no copy-paste error in the patch, the FPU might even just start one FMAC and not two during one cycle, as the 2 options for a FMUL are always paired with the FADD in fp pipeline #3.

I'll write a follow up soon.

Atom Symbol said...

The Haswell diagram http://www.anandtech.com/show/6355/intels-haswell-architecture/8 gives a much better overview of what-goes-where questions.

sky scraper said...

According to this: http://pastie.org/private/xrq95cijhfhrlmpeyngrq there are load and store capabilities present in fpu pipes. Can they be utilized also for integer load/store or are they strictly for fp load/store?

Matthias Waldhauer said...

sky scraper, this table by yuri shows which pipes the load or store related fp ops go to. In earlier AMD processors (K8, K10 I think), each of the three FP units could execute a load, while the cache could only provide two 64 bit values. This was only done to simplify or equalize the design of these units, and maybe also to reduce blocking.

sky scraper said...

I'm trying to compare Haswell load/store units with Zen l/s units. Haswell has 256bit l/s units, while Zen is probably going to have 128bit l/s units. What I don't quite understand is that Zen has dedicated l/s units + some l/s capabilities in fp units, while Haswell only has ports specifically for l/s operations. Does this mean that Haswell units can load whatever they want, while Zen has two dedicated l/s units for integer ops and l/s capabilities inside fp pipes for fp ops? Is this correct?

Another thing I'm not sure about is that, according to the diagram found here:

http://www.anandtech.com/show/6355/intels-haswell-architecture/8

Haswell has units that l/s data and units that l/s address. I imagine those units are different?? There is no load data unit on this diagram. How is that possible?

I'm sorry if those are noob questions, I'm not very knowledgeable about processor design.

Daniela Czajk said...

In English, "with little speculation" means you're pretty sure. Is that what you meant, or is "a little speculation" still warranted?

Matthias Waldhauer said...

Daniela, it's the result of finishing a posting and a diagram at 3 a.m. But I think that statement still fits, as about 95% of the diagram represents what is shown in the patch or patents. I left out the L1 I$ size (none given) and borrowed one idea from Jaguar. So small changes might be necessary, but the important elements are backed by the available leaks/sources.

Daniela Czajk said...

I see, thx! Hopefully it's going to become competitive again like in the good ole' days :)