Whoever chose the name "Zen" for AMD's next-generation x86 core might have had the number four in mind, which plays an important role in this philosophy (e.g. the Four Dharmadhātu). At least, that is what a recent patch revealed about this long-awaited microarchitecture.
Andreas Stiller speculates that the term Zen, as in "SuZen", might be related to Zen team leader Suzanne Plummer and possibly Lisa Su as well. An article on myStatesman, which appeared shortly after Jim Keller's departure, lists some more team member names if you magnify the photo:
Mike Clark is a true AMD veteran, having been there since 1993. Some developed the Cat cores, like Teja Singh and Joshua Bell, who presented the Jaguar microarchitecture at ISSCC 2013.
As heard earlier this year, Zen will use SMT and an improved cache subsystem. It is being designed from scratch, combining new ideas with reused existing components (to reduce the effort). This might even include ideas that already existed and were partly developed, but were not realized in previous designs. Patents have been filed for a lot of the new functionality. For example, there was a mention of checkpointing, which helps to recover quickly from mispredicted branches and from other events that require restarting the pipelines. Some patents suggest that Zen might use a slightly modified Excavator branch prediction.
The new patch also suggests nicely low latencies for int/fp mul, fp add, int/fp div and fp square root. Some of these lower latencies (div/sqrt) were introduced with Excavator, as an Aida64 instruction latency dump provided by Anandtech forum user monstercameron revealed. Due to an Aida problem with measured and reported clock frequencies (although the clock was fixed at 1.4 GHz), you have to multiply the measured times by 1.4 to get the real number of cycles. OK, back to Zen.
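The clock correction for the Excavator dump is a single multiplication. Here is a minimal sketch of it, assuming the dump reports latencies in cycles derived from the wrong clock. The function name and the sample values are hypothetical and are not taken from the dump.

```python
# Correction factor stated in the post for the Excavator Aida64 dump.
AIDA_CLOCK_CORRECTION = 1.4


def real_cycles(reported_latency: float, factor: float = AIDA_CLOCK_CORRECTION) -> float:
    """Scale a latency as reported in the dump to actual core cycles."""
    return reported_latency * factor


# Hypothetical reported values, for illustration only (not taken from the dump).
reported = {"example_op_a": 5.0, "example_op_b": 10.0}
corrected = {op: round(real_cycles(value), 2) for op, value in reported.items()}

for op, cycles in corrected.items():
    print(f"{op}: {cycles} cycles")
```

Rounding to two decimals only hides floating point noise. Since real latencies are whole cycles, corrected values should land close to integers if the factor is right.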
Here are some quotes from the patch file:
+;; Decoders unit has 4 decoders and all of them can decode fast path
+;; and vector type instructions.
+;; Integer unit 4 ALU pipes.
+;; 2 AGU pipes.
+;; Floating point unit 4 FP pipes.
+ 32, /* size of l1 cache. */
+ 512, /* size of l2 cache. */
- 4 wide decoders
- 4 integer ALUs
- 2 AGUs (for 2R 1W L1 cache according to a LinkedIn profile)
- 4 FP pipelines
There is a lot more information, which I will collect over the next few days. Some of it was copied from Excavator (bdver4) or Jaguar (btver2) and then modified. Careful comparison did show some clear differences, while in other places it's not clear whether there is new information or not (e.g. div latencies). btver2 has 2048 kB of L2, and the rest of the block is more similar to bdver4 or btver2 than to btver1 (Bobcat), which has 512 kB of L2, so it looks like no btver1 files were used as a source. So I assume that this L2 cache size is a new entry, indicating fast per-core L2 caches. The L1 data cache still has the same size as that of Jaguar or Excavator. Some patents mention an 8-way 32 kB L1 D$.
Interestingly, as there are two 128b FP mul and two 128b FP add units (with only 3 cycles latency for these ops), FMA instructions will be executed by combining one FP MUL and one FP ADD unit, resulting in 2 issues and 5 cycles latency (the same as in the Bulldozer family). This saves some register file ports, increases throughput and reduces the latencies of the more common FP ops. It even reminds me of the bridged FMA unit.
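To put numbers on this arrangement, here is a rough sketch of the peak FP throughput it would imply. It assumes the unit counts, widths and latencies given above. The class, its method names and the optional FMA-per-cycle limit are hypothetical modelling choices, not something stated in the patch.

```python
from dataclasses import dataclass
from typing import Optional

MUL_ADD_LATENCY = 3  # cycles for a separate FP mul or FP add, as read from the patch
FMA_LATENCY = 5      # cycles for an FMA built from one MUL and one ADD unit


@dataclass(frozen=True)
class FpuModel:
    """Toy model of the FPU as described in the post (speculative)."""
    mul_units: int = 2
    add_units: int = 2
    unit_width_bits: int = 128

    def lanes(self, element_bits: int) -> int:
        # Number of elements one unit processes per operation.
        return self.unit_width_bits // element_bits

    def peak_flops_separate(self, element_bits: int) -> int:
        # Independent mul and add ops keep all four units busy.
        return (self.mul_units + self.add_units) * self.lanes(element_bits)

    def peak_flops_fma(self, element_bits: int, fma_per_cycle: Optional[int] = None) -> int:
        # Each FMA occupies one MUL and one ADD unit and does 2 flops per lane.
        pairs = min(self.mul_units, self.add_units)
        if fma_per_cycle is not None:  # hypothetical issue limit
            pairs = min(pairs, fma_per_cycle)
        return pairs * 2 * self.lanes(element_bits)


fpu = FpuModel()
print("SP flops/cycle, separate mul+add:", fpu.peak_flops_separate(32))
print("SP flops/cycle, FMA:", fpu.peak_flops_fma(32))
print("SP flops/cycle, FMA limited to one per cycle:", fpu.peak_flops_fma(32, fma_per_cycle=1))
print("Dependent mul then add:", 2 * MUL_ADD_LATENCY, "cycles vs FMA:", FMA_LATENCY, "cycles")
```

Under these assumptions, FMA does not raise peak throughput over separate mul and add ops. It only shortens a dependent multiply-add chain by one cycle, which fits the idea that the design favours the more common FP ops.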
These latencies also clearly suggest that this is not a high clock frequency design. But at 14nm (or 16nm from TSMC, as some rumours suggest), clocks of 3.5 to 4 GHz should be reachable without stretching the thermal limits too much.
This should be enough for now. Here is a schematic, which should come close to what Zen might really look like:
[Figure: AMD Zen Core Microarchitecture (with some speculated parts)]
Is there any actual source which states or hints that the FPUs are (only) 128 bits wide?
AMD's own Zen slide (May 2015) shows 2x 256-bit and 6 integer units. Lots of speculation ;)
Those May 2015 slides are fakes.
Oh, didn't notice that, but yes, can't find it among the real 2015 slides I have either, true.
There are two 128-bit floating point add units and two 128-bit floating point multiply units. 256-bit operands are split into a high and a low 128-bit half.
I asked for a SOURCE for this claim of the FPUs being 128 bits wide, not for it to be repeated. Repeating anything without a source does not make it any more true.
The FPU units seem to be 128 bits wide because many common SSE and 128-bit AVX instructions are of type fast-path decode (one uop), while the equivalent 256-bit AVX ops are double decode (going to 2 pipelines, or maybe even to one but sequentially).
And one FMAC grouping is not 256 bits wide but does the 128-bit FMUL first, accompanied by the FADD unit for the final result. So these 2 "FMACs" are 128 bits wide too. And if there is no copy-paste error in the patch, the FPU might even start just one FMAC and not two during one cycle, as the 2 options for an FMUL are always paired with the FADD in fp pipeline #3.
I'll write a follow-up soon.
The Haswell diagram http://www.anandtech.com/show/6355/intels-haswell-architecture/8 gives a much better overview of what-goes-where questions.
According to this: http://pastie.org/private/xrq95cijhfhrlmpeyngrq there are load and store capabilities present in fpu pipes. Can they be utilized also for integer load/store or are they strictly for fp load/store?
sky scraper, this table by yuri shows which pipes the load- or store-related fp ops go to. In earlier AMD processors (K8, K10 I think), each of the three FP units could execute a load, while the cache could only provide two 64-bit values. This was only done to simplify or equalize the design of these units, and maybe also to reduce blocking.
I'm trying to compare Haswell load/store units with Zen l/s units. Haswell has 256-bit l/s units, while Zen is probably going to have 128-bit l/s units. What I don't quite understand is that Zen has dedicated l/s units plus some l/s capabilities in the fp units, while Haswell only has ports specifically for l/s operations. Does this mean that Haswell units can load whatever they want, while Zen has two dedicated l/s units for integer ops and l/s capabilities inside the fp pipes for fp ops? Is this correct?
Another thing I'm not sure about is that, according to the diagram found here:
Haswell has units that l/s data and units that l/s address. I imagine those units are different? There is no load data unit on this diagram. How is that possible?
I'm sorry if those are noob questions, I'm not very knowledgeable about processor design.
In English, "with little speculation" means you're pretty sure. Is that what you meant, or is "a little speculation" still warranted?
Daniela, it's the result of finishing a posting and a diagram at 3 a.m. But that statement should still fit, I think, as about 95% of the diagram represents what is shown in the patch or patents. I left out the L1 I$ size (none given) and borrowed one idea from Jaguar. So small changes might be necessary, but the important elements are backed by the available leaks/sources.
I see, thx! Hopefully it's going to become competitive again like in the good ole' days :)