Wednesday, August 17, 2016

Some last-chance pre-Hot Chips speculation about Zen

TL;DR:  I made a new (stitched) Zeppelin die photo. AMD's datacenter APU might use multiple Zeppelin and Greenland dies. Zen's FPU might have some interesting and unique capabilities.

There is less than one week left until AMD's Zen presentation at Hot Chips. A redditor set up this nice countdown. As usual, AMD will not talk about final SKUs and clock frequencies, but they will surely give more details about individual microarchitectural features. While this will be the first chance to verify what was posted on this blog five and ten months ago, it will also reduce the number of features left to speculate about. In other words: this is the last chance to post some as yet unpublished thoughts about the microarchitecture.

For a start, you get this full Zen/Zeppelin die shot, created from multiple patches of the already known photo showing a part of a Zeppelin wafer:

Labelled Zeppelin die photo (stitched)
Because a reference for scale is missing, it is difficult to find the correct aspect ratio of the die. Scaled as shown above, it looks roughly "right" and also matches Hans de Vries' corrected image.

In the past I estimated the die size to be roughly 160 mm², based on what's in the core and how other components might scale. When matching this die shot's DDR PHY to that of Skylake, I get roughly 200 mm² (assuming a good guess of the aspect ratio and a roughly similar DDR PHY area). So I wouldn't be surprised if Zeppelin ends up somewhere in the range of 160-200 mm².
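
The scaling step behind such an estimate can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and all numbers are placeholders I made up, not measurements from the die shot, and it assumes that the reference feature (here the DDR PHY) really has the same physical area on both chips.

```python
def estimate_die_area_mm2(die_px_w, die_px_h, phy_px_area, phy_ref_mm2):
    """Scale a die's pixel area to mm² using a feature of assumed known size.

    die_px_w, die_px_h: die dimensions measured in the photo (pixels)
    phy_px_area:        area of the reference feature in the same photo (pixels²)
    phy_ref_mm2:        assumed physical area of that feature (mm²)
    """
    if min(die_px_w, die_px_h, phy_px_area, phy_ref_mm2) <= 0:
        raise ValueError("all inputs must be positive")
    die_px_area = die_px_w * die_px_h
    # mm² per pixel² follows from the reference feature; apply it to the die.
    return die_px_area * phy_ref_mm2 / phy_px_area


# Placeholder values only, NOT taken from the actual die shot.
area = estimate_die_area_mm2(die_px_w=2000, die_px_h=1000,
                             phy_px_area=100_000, phy_ref_mm2=10.0)
print(f"{area:.0f} mm²")
```

The result is only as good as the assumed reference area, which is why the estimate above is given as a range.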

GMI-Links and the datacenter APU

AMD's GMI links (Global Memory Interconnect) have been known since Fudzilla mentioned them here. Soon afterwards they posted a slide, which likely shows a schematic view of AMD's planned datacenter APU. This slide was the basis for the picture below. There I noticed the placement of the orange lines in the center of the Zeppelin and Greenland dies. As "Data Fabric" is written in the same color, the horizontal lines likely represent the data fabric as well.

So what does this tell us? Well, it looks like both the CPU part and the GPU part consist of two dies each, which are also connected via GMI. If you have already heard about the distributed memory controller in Zen-based processors (mentioned in combination with a directory-based coherency protocol on LinkedIn), all this makes sense. Knowing the leaked die photo shown above, it is reasonable to assume that the two GMI link structures (GMI-Link #0 and #1) actually comprise four links. This would be enough to connect two Zeppelin dies with two GMI links, giving each die access to the two distributed memory controllers (and the two DDR4 channels they provide) on the other die. The two remaining links of each die go to the two Greenland dies, which in turn might also have four GMI links each. Each of the GPU dies might just have one HBM PHY. Of course, the Greenland GPU shown might just be a monolithic die sitting on the interposer. But while we are at it, an interposer would be a perfect way to stitch multiple dies together, something that is expected to happen to a greater extent with Navi.

This would provide a lot of flexibility in configuring different processors from a small set of dies: one 8C Zeppelin die and probably just one Greenland die. One important reason for this would be the cost of separate designs, which grows with each new process node.
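
To check that the link counts in this speculated layout add up, here is a small Python sketch. The die names (Z0, Z1, G0, G1) are made up for illustration, and the two GPU-to-GPU links are my own assumption to fill the four links per Greenland die; none of this is confirmed by AMD.

```python
from collections import Counter

# Speculated GMI topology: one tuple per point-to-point link.
LINKS = [
    ("Z0", "Z1"), ("Z0", "Z1"),   # two links between the Zeppelin dies
    ("Z0", "G0"), ("Z0", "G1"),   # one link from Z0 to each Greenland die
    ("Z1", "G0"), ("Z1", "G1"),   # one link from Z1 to each Greenland die
    ("G0", "G1"), ("G0", "G1"),   # assumption: GPU dies pair up like the CPU dies
]


def links_per_die(links):
    """Count how many link endpoints terminate on each die."""
    counts = Counter()
    for a, b in links:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


for die, n in sorted(links_per_die(LINKS).items()):
    print(die, n)
```

With these assumptions every die ends up with exactly four links, which matches the four links guessed from the die photo.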

Floating Point Unit

One of the more interesting parts of the Zen microarchitecture is the four-wide FPU. As the GCC patch suggests (by decoding type "single" or "double"), the FPU's native width is 128 bits for SIMD operations. A different patch mentioned a 3-cycle latency for cache accesses by the FPU. With a base L1D$ latency of 4 cycles indicated by the patches, this would mean a total latency of 7 cycles for FP memory accesses. This is likely the cost of going through the FX unit ("fixed point"), which contains the load/store unit responsible for L1 data cache accesses. I won't go through the full details of the patches regarding all the different instructions. Let me point you to a wonderful CPU chart found at InstLatX64, which also includes Zen (or rather Zeppelin), and to Looncraz' instruction mapping table.

But two things stood out about the MUL instructions:
  • In early patches, FMA was shown as using a combination of an FMUL pipeline (fp0/fp1) and the second FADD pipeline (fp3). This led me to the assumption that we might see an incarnation of the bridged FMA here. Later this patch info was changed, so that FMA instructions would run through the FMUL pipelines only.
  • (SSE) FMUL and SSE IMUL instructions seem to occupy specific stages in the corresponding pipelines for more than one cycle. This means that throughput would be lower than 1 per cycle per pipeline. In the GCC patches this is specified with a times symbol, for example "fp0*3". However, this is not the case for FMA instructions, which could be related to a special treatment (due to the bridge) that might skip some to-be-iterated stage. We might learn a bit more about that next week.
One reason for that might be Zen's cat core heritage. To be more power efficient, the cat cores use a "rectangular" or "iterative" multiplier. This means that it is reduced in depth, so that it can do 32-bit FP multiplications at maximum throughput, but becomes slower the wider the FP numbers are (64-bit and 80-bit). This is caused by the need for multiple iterations through the multiplier array to produce all the partial products, whose number grows with the width of the operands. This saves a lot of power and area, while still maintaining full throughput for a lot of FP/SIMD code (incl. games) that uses single precision. Double precision, which is also used often, has lower throughput, but that doesn't cause much of a performance hit with the cat cores, as a lot of code has more FADDs than FMULs anyway. It typically costs a few percent with DP code in exchange for a nice power efficiency improvement.
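
To make the "fp0*3" notation a bit more concrete, here is a small Python sketch that turns such a reservation string into a per-pipeline throughput. It is a simplification: it only understands unit names, the "*" repetition and "," for consecutive cycles, not the full reservation syntax of GCC's machine descriptions, and the function names are my own.

```python
import re
from fractions import Fraction


def occupancy(reservation):
    """Return how many cycles each unit is held, e.g. "fp0*3" -> {"fp0": 3}."""
    cycles = {}
    for step in reservation.split(","):
        match = re.fullmatch(r"([\w-]+)(?:\*(\d+))?", step.strip())
        if match is None:
            raise ValueError(f"unsupported reservation step: {step!r}")
        unit = match.group(1)
        repeat = int(match.group(2) or 1)   # no "*n" means one cycle
        cycles[unit] = cycles.get(unit, 0) + repeat
    return cycles


def throughput_per_pipe(reservation, unit):
    """Instructions per cycle one pipeline can sustain for this reservation."""
    return Fraction(1, occupancy(reservation)[unit])


print(throughput_per_pipe("fp0*3", "fp0"))  # pipeline blocked for 3 cycles
print(throughput_per_pipe("fp0", "fp0"))    # fully pipelined
```

An instruction that holds fp0 for three cycles can only be issued to that pipeline every third cycle, which is the reduced throughput described above.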

Another aspect is that with a bridged FMA (or something similar, see below) the FPU wouldn't need that many FPRF read ports: during an FMA operation, a first unit (FMUL) reads the two multiplicands, while a second unit (FADD) reads the addend with its own ports and finishes the operation by doing the addition, normalization, and rounding. I think it is interesting to note that several cat core related patents covering an FMA unit showed a delayed read of the addend.

Since Zen will come with a lot of cores (especially the datacenter variants with up to 32 cores), AMD's (or Jim Keller's?) choice might have been to cut the per-core power consumption in favor of higher total core counts. Instead of making the whole core weaker, they seemingly decided to avoid hardware support for wide SIMD. This way, there is no need for 256b datapaths from L1D$ to the execution units, 256b wide registers, and of course 256b wide execution units. This already saves some power, as can be derived from the following chart, taken from the paper "Improving the Energy Efficiency of Big Cores" (PDF):

Similar to SIMD execution width, multipliers still contribute a large part of an FPU's power consumption at full throughput, so AMD might have cut that further by using (updated) iterative multipliers as found in the cat cores. Maybe this is the reason that there are two FMUL units in a single core at all, as the construction core line has shown that AMD avoided having that many FMUL/FMA units in a single core. There are other nice effects, like a reduction in voltage droops, which were the next big thing in Steamroller, and are still being handled by Sam Naffziger's "Voltage Droop Mitigation" in Carrizo, Bristol Ridge, and even Polaris. An AMD paper described how researchers were able to increase the base clock frequency of an Orochi processor by 400 MHz or more simply by reducing the throughput of heavy FP ops like FMUL. An FMUL implementation with an iterative multiplier would have a similar effect already built in.

But that's not all. AMD Research lists a paper called "REEL: reducing effective execution latency of floating point operations", which (for many, at least as an abstract) can be found here. In this paper, researchers describe a novel FPU that contains some additional registers, located in a pipeline stage before rounding and normalization, for later reuse. With a modified scheduler, it is possible to significantly reduce the effective execution latency of a chain of dependent instructions by forwarding intermediate results from this internal micro register file. Execution latency is an important aspect of floating point performance, as many calculations found in typical code have low ILP (instruction level parallelism), so reducing these latencies matters. You may compare some of Zen's latencies in the CPU chart mentioned above. On top of that, an FPU like the one described in REEL would help even further. FMA isn't explicitly discussed in the paper, but the way the FPU works implies a kind of inherent FMA execution (a kind of fusion) for dependent FMUL/FADD instructions.
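
The effect of such forwarding on a dependent chain can be illustrated with a toy model in Python. This is not taken from the REEL paper: it simply assumes that a dependent operation can start as soon as the un-rounded intermediate result exists, so that every operation except the last one skips the trailing normalize/round stages. The stage counts are placeholders, not Zen latencies.

```python
def chain_latency(n_ops, op_latency, bypassed_stages=0):
    """Cycles to finish a chain of n dependent FP ops (toy model).

    op_latency:      full latency of one operation in cycles
    bypassed_stages: trailing normalize/round stages that a dependent
                     operation does not have to wait for (0 = no forwarding)
    """
    if n_ops < 0 or op_latency <= 0:
        raise ValueError("need n_ops >= 0 and op_latency > 0")
    if not 0 <= bypassed_stages < op_latency:
        raise ValueError("bypassed_stages must be in [0, op_latency)")
    if n_ops == 0:
        return 0
    # Every op hands over its intermediate result early; only the final
    # result has to pass through the bypassed stages.
    return n_ops * (op_latency - bypassed_stages) + bypassed_stages


baseline = chain_latency(4, op_latency=4)                      # 16 cycles
forwarded = chain_latency(4, op_latency=4, bypassed_stages=1)  # 13 cycles
print(baseline, forwarded)
```

The longer the dependent chain, the closer the effective per-operation latency gets to the shortened value, which is where low-ILP code would benefit.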

Remaining things

What else could be shown at Hot Chips for Zen? Based on presentations and patents, I wouldn't be surprised about:
  • a package level integrated voltage regulator, or maybe even a FIVR
  • per thread priorities for a more efficient SMT implementation in cases of differently prioritized threads (e.g. Prime95 in background and a game in foreground)
  • finally a working ASF/transactional memory implementation, which becomes increasingly important for higher core counts
  • more efficient address handling (esp. of stack addresses for accessing a stack cache) to reduce AGU usage
  • dual front ends in the future for an increasingly powerful execution back end
  • future application of die stacking (2.5D, 3D) and PIM
  • interesting network on chip topologies for 16 and 32 cores like "ButterDonut"
  • uOp and stack caches
  • reduced branch misprediction penalty thanks to checkpointing or fast rollback of executed instructions
  • FMUL/FADD fusion (to handle them via FMA operations)
This is what I wanted to get out before enjoying this year's Hot Chips Zen revelations!

The next article on this blog will be about Zen clock frequency and performance projections, as more information has become available. BTW, have you seen Looncraz' Zen analysis at the end of his XV article yet?


Christian H. said...

Great info... I've been waiting for this... Zen has been in many ways a closed book... I do remember hearing the caches were inclusive now... Is that the case? Well, I guess we'll find out at Hot Chips...

Sam said...

Are you going to be "covering" Hot Chips? That is, providing information and analysis on AMD's presentation when it happens?

Dresdenboy said...

Christian, yes, the caches are inclusive. The latest slides, which were published today just hours after my latest posting, contain some more details.

Somehow it seems, only German media published them.

Sam, I will write something for and also do my own analysis here.
