Monday, August 22, 2016

Two days to go until AMD's Hot Chips presentation - how about Zen's core size?

TL;DR: Zen's many integer schedulers might be related to dedicated dependency chain handling. And the Zen core might be just 4.9 mm² incl. L2 cache.

Here are some last speculative thoughts before we'll hear about Zen at Hot Chips 28, from the guy, who told you first about Bulldozer's and Zen's microarchitectures, AMD's upcoming 32 core server chip, and some other interesting things. Now I can say this, as AMD did present a first view on Zen's microarchitecture just a day after my last blog posting. Again, I was pretty close. This simply depends on the amount of data found in patches and patents.

However, some things are different: The speculated Zen microarchitecture shown here on this blog had a unified scheduler for the 4 ALUs and a second one for the AGUs, while Zen actually has 6 separate schedulers instead. The base for my assumption was the cat core heritage. But of course, a unified scheduler for 4 execution units holding lots of µOps is still a big step up from a scheduler, which only has to issue to two units (cat cores). Now what's the reason for this configuration? At first, there are many possible typical design trade-off related reasons, like area, delays, buffer sizes, power efficiency. But there are also some interesting concepts, like a dependency chain oriented handling of instruction streams. If some code has an instruction level parallelism greater than one, there are groups of instructions, which at some point could be executed independently of the main flow, until their result gets fed back. These groups and sections of the main flow could also be called dependency chains, where it is not possible to execute a newer instruction before an older one, as each of them in the chain depends on some result of its predecessing operation. Here is an example:
mov eax, [edi]
add eax, ecx
imul edi, eax
cmp [ebp+08h], edi
jnz out
This code actually can't be executed out-of-order. All the logic put into an out-of-order scheduler would be a waste of energy in this case. And multiple same type execution units for parallel issue wouldn't help to speed this up either. A single scheduler with an integer execution unit (IEU), and an address generation unit (AGU) would be enough. The latter wouldn't even need a separate scheduler, similar to the K7, K8, K10 series of CPUs. This could be one reason for Zen's individual schedulers, as one identified dependency chain could be sent to a single integer scheduler and one AGU scheduler if there are memory operands. The other schedulers might even be clock gated then.

U.S. Patent No. 8769539 covers a scheduler, which can be switched between out-of-order and in-order operation. One of the inventors is Zen project leader Suzanne Plummer, while Dan Hopper is also an important member of the Zen x86 core design team. In combination with many other patents (for example  US20120023314), which cover dependency chain related logic, there might be such a scheduler in Zen.

Knowing the dependency chain also offers several efficiency measures. One patent covered different latencies for "far source operands" and close ones, i.e. coming from a different "lane" or the same. Bypass networks could be implemented in a somewhat more relaxed way, which improves delay and power efficiency.

A Zen core size estimation

After talking about Zen's die size just days ago, there is another size, which likely will be revealed at Hot Chips: the size of a Zen core. Earlier this year I created a table to estimate that size based on Excavator module components and some scaling factors. Based on AMD's statement about a density optimized process, one of the unknowns in this calculation just became a bit smaller. Their statement could both mean dense metal layers or high density standard cell libs. For simplicity and lack of further data, assuming no density related scaling should do. Process related scaling is a different story, though.

Using die photos, it is possible to measure the size of a graphics CU. On a Polaris 10 die, the size of a graphics CU is about 2.96 mm², while Carrizo has 7.21 mm² CUs. This results in a scaling factor of about 41%. Putting this all together with some individual scaling factors based on design changes (e.g. more ALUs, smaller multiplier arrays, 64KB L1 I$ - already included in the "area 1C" number), results in a Zen core size incl. L2 cache of about 4.9 mm².

Zen core size estimation based on Excavator data


Arnawa Widagda said...

Matthias Waldhauer said...

Yep, just hours after my posting. It seems, only German media published them, BTW.

Arnawa Widagda said...

Probably cause it's early in the week.
I do wish there were more details about the front end, but I can understand if AMD decided not go into details at this moment.
Care to write up an analysis on the front end? Looks far more interesting than the execution parts.

D' Artagnan Ironlynx said...

can you do an overall post with all the info we know about Zen till now?Where Zen will lack and where will shine against Intel CPU´s.
I think the lack of 256SIMD will have a big impact on power consumption and the fast cache will deliver a great performance, but I fear the expectations are going sky high.

Tacit Murky said...

«This code actually can't be executed out-of-order» — except for loading this [ebp+08h] argument.

Generally, I think 6 RS's is a bad idea. A core would have less opportunity to issue ready mops to execution; while power-saving can be attained with less sectioning of RS's. Practically, upgraded Cat's version would've be fine: data RS (4 ports), AG/stack RS (2 ports), flags/jumps RS (1 port). Two later ones can be unified in a 3-ported RS. Then it'll be possible to execute float/vector code without waking up large data RS.

128-bit paths for vectors seems degrading. This will have negative effect for both speed and power for all AVX code. It's just repeating one of many BD mistakes…

Jeremy Miller said...

There has been more of the possible grounds and the values been emphasized in advance and surely for the future would lay down a better platform. java programming help