The New Citavia Blog

Leaked Zen/Naples benchmarks appeared in SiSoftware's database [screenshots updated]

2016-11-12T13:13:00.000+01:00

After weeks without new Zen benchmark leaks showing up (Blenchmark doesn't seem to be one), the first ones have now appeared in the SiSoftware database, dated 10/26. Have a look here.

Update: After some of the pages were removed at SiSoftware, I removed the links, so that the full resolution screenshots are available again.

Before doing some analysis, here are the raw results:

Stoney Ridge 6W Engineering Sample spotted

2016-10-28T00:25:00.002+02:00

The well-known Zauba shipment database contains an entry for a low-power Stoney Ridge E2 engineering sample. As the OPN 2E1601AOY23E2 tells us, it runs at a 1.6GHz base clock and has 2 cores (as expected for Stoney Ridge). The first "E" letter denotes an embedded part, confirmed by the ES OPN decoding sheet at CPU-World.

Interestingly, the ES entry also lists the TDP: a low 6W for this dual core, one of the last descendants of the Bulldozer or Construction Core line. Next to it, there is a 15W model, which is already available as the E2-9010 APU.

For comparison, the highest clocked AMD dual core APU with a similar TDP has a base clock of 1.2 GHz (GX-212JC). An upcoming model, likely of the same Puma core powered "Steppe Eagle" family, will be running at 1.6 GHz while coming with a TDP of 10 W (GX-216HC).

Two days to go until AMD's Hot Chips presentation - how about Zen's core size?

2016-08-22T03:42:00.002+02:00

TL;DR: Zen's many integer schedulers might be related to dedicated dependency chain handling. And the Zen core might be just 4.9 mm² incl. L2 cache.

Here are some last speculative thoughts before we hear about Zen at Hot Chips 28, from the guy who first told you about Bulldozer's and Zen's microarchitectures, AMD's upcoming 32 core server chip, and some other interesting things. Now I can say this, as AMD did present a first view of Zen's microarchitecture just a day after my last blog posting. Again, I was pretty close. This simply depends on the amount of data found in patches and patents.

However, some things are different: The speculated Zen microarchitecture shown here on this blog had a unified scheduler for the 4 ALUs and a second one for the AGUs, while Zen actually has 6 separate schedulers instead. The basis for my assumption was the cat core heritage. But of course, a unified scheduler for 4 execution units holding lots of µOps is still a big step up from a scheduler which only has to issue to two units (cat cores).

Now what's the reason for this configuration? First of all, there are many possible reasons related to typical design trade-offs, like area, delays, buffer sizes, and power efficiency. But there are also some interesting concepts, like a dependency chain oriented handling of instruction streams. If some code has an instruction level parallelism greater than one, there are groups of instructions which at some point could be executed independently of the main flow, until their result gets fed back. These groups and sections of the main flow could also be called dependency chains: within a chain it is not possible to execute a newer instruction before an older one, as each instruction depends on some result of its preceding operation. Here is an example:

mov eax, [edi]
add eax, ecx
imul edi, eax
cmp [ebp+08h], edi
jnz out

This code actually can't be executed out-of-order. All the logic put into an out-of-order scheduler would be a waste of energy in this case. And multiple same type execution units for parallel issue wouldn't help to speed this up either. A single scheduler with an integer execution unit (IEU), and an address generation unit (AGU) would be enough. The latter wouldn't even need a separate scheduler, similar to the K7, K8, K10 series of CPUs. This could be one reason for Zen's individual schedulers, as one identified dependency chain could be sent to a single integer scheduler and one AGU scheduler if there are memory operands. The other schedulers might even be clock gated then.
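To make the "no parallelism to extract" point a bit more tangible, here is a minimal Python sketch that measures the longest dependency chain in the snippet above. The register read/write sets are my own simplified model of the five instructions (memory operands reduced to their address registers, "flags" standing in for the status flags), and register renaming is assumed, so only true read-after-write dependencies count.

```python
# Each instruction is modelled as (registers written, registers read).
# "flags" stands in for the x86 status flags; memory operands are reduced
# to the registers used for addressing. This is a simplification.
CODE = [
    ({"eax"}, {"edi"}),                  # mov  eax, [edi]
    ({"eax", "flags"}, {"eax", "ecx"}),  # add  eax, ecx
    ({"edi", "flags"}, {"edi", "eax"}),  # imul edi, eax
    ({"flags"}, {"ebp", "edi"}),         # cmp  [ebp+08h], edi
    (set(), {"flags"}),                  # jnz  out
]


def chain_depths(instrs):
    """Per instruction, length of the longest read-after-write chain ending at it."""
    depth_of_last_writer = {}
    depths = []
    for writes, reads in instrs:
        # An instruction sits one step behind its deepest producer.
        depth = 1 + max((depth_of_last_writer.get(r, 0) for r in reads), default=0)
        for reg in writes:
            depth_of_last_writer[reg] = depth
        depths.append(depth)
    return depths


depths = chain_depths(CODE)
critical_path = max(depths)
ilp = len(CODE) / critical_path  # instructions per chain step
print(depths, ilp)
```

With this model every instruction ends up one step deeper than its predecessor, so the chain is as long as the code itself and the ratio of instructions to chain length is 1.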

U.S. Patent No. 8769539 covers a scheduler, which can be switched between out-of-order and in-order operation. One of the inventors is Zen project leader Suzanne Plummer, while Dan Hopper is also an important member of the Zen x86 core design team. In combination with many other patents (for example US20120023314), which cover dependency chain related logic, there might be such a scheduler in Zen.

Knowing the dependency chain also offers several efficiency measures. One patent covered different latencies for "far source operands" and close ones, i.e. coming from a different "lane" or the same. Bypass networks could be implemented in a somewhat more relaxed way, which improves delay and power efficiency.

A Zen core size estimation

After talking about Zen's die size just days ago, there is another size which will likely be revealed at Hot Chips: the size of a Zen core. Earlier this year I created a table to estimate that size based on Excavator module components and some scaling factors. Based on AMD's statement about a density optimized process, one of the unknowns in this calculation just became a bit smaller. Their statement could mean either dense metal layers or high density standard cell libs. For simplicity and lack of further data, assuming no density related scaling should do. Process related scaling is a different story, though.

Using die photos, it is possible to measure the size of a graphics CU. On a Polaris 10 die, the size of a graphics CU is about 2.96 mm², while Carrizo has 7.21 mm² CUs. This results in a scaling factor of about 41%. Putting this all together with some individual scaling factors based on design changes (e.g. more ALUs, smaller multiplier arrays, 64KB L1 I$ - already included in the "area 1C" number) results in a Zen core size incl. L2 cache of about 4.9 mm².
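The process scaling step is simple arithmetic, sketched here in Python. The two CU areas are the ones given above; the `scale_block` helper, its `design_factor` parameter and the 10 mm² example block are purely hypothetical and only illustrate how a per-block estimate could be formed. The actual per-block numbers live in the table and are not reproduced here.

```python
POLARIS10_CU_MM2 = 2.96  # CU area measured on a Polaris 10 die photo (from the text)
CARRIZO_CU_MM2 = 7.21    # CU area measured on a Carrizo die photo (from the text)


def process_scaling(new_area_mm2, old_area_mm2):
    """Area ratio of the same kind of block on the new vs. the old process."""
    return new_area_mm2 / old_area_mm2


def scale_block(old_area_mm2, process_factor, design_factor=1.0):
    """Estimate a block's new area: process shrink times a design-change factor.

    design_factor is a hypothetical knob (>1 for added logic, <1 for removed logic).
    """
    return old_area_mm2 * process_factor * design_factor


factor = process_scaling(POLARIS10_CU_MM2, CARRIZO_CU_MM2)
print(f"process scaling factor: {factor:.1%}")

# Hypothetical 10 mm² block on the old process, unchanged design:
print(f"{scale_block(10.0, factor):.2f} mm²")
```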

Zen core size estimation based on Excavator data

Some last chance pre Hot Chips speculation about Zen

2016-08-17T23:49:00.001+02:00

TL;DR: I made a new (stitched) Zeppelin die photo. AMD's datacenter APU might use multiple Zeppelin and Greenland dies. Zen's FPU might have some interesting and unique capabilities.

There is less than one week left until AMD's Zen presentation at Hot Chips. A redditor set up this nice countdown. As usual, AMD will not talk about final SKUs and clock frequencies. But they surely will give more details about individual microarchitectural features. While this would be the first chance to verify what has been posted already ten or five months ago on this blog, it will surely reduce the amount of features to be speculated about. In other words: this is a last chance to post some yet unpublished thoughts about the microarchitecture.

For a start, you get this full Zen/Zeppelin die shot, created from multiple patches of the already known photo showing a part of a Zeppelin wafer:

Labelled Zeppelin die photo (stitched)

Due to some missing reference, it is difficult to find the correct aspect ratio of the die. Scaled the way as shown above, it looks roughly "right" and also matches Hans de Vries' corrected image.

In the past I estimated the die size to be roughly 160 mm² based on what's in the core, and how other components might scale. When matching this die shot's DDR PHY to that of Skylake, I get roughly 200 mm² (assuming a good guess of the aspect ratio and roughly similar DDR PHY area). So I wouldn't be surprised if Zeppelin is somewhere in this range of 160-200 mm².

GMI-Links and the datacenter APU

AMD's GMI links (Global Memory Interconnect) are already known since Fudzilla mentioned them here. Soon afterwards they posted a slide, which likely shows a schematic view of AMD's planned datacenter APU. This slide was the base for creating the picture below. There I noticed the placement of the orange lines in the center of the Zeppelin and Greenland dies. As "Data Fabric" is written in the same color, the horizontal lines likely mean the same.

So what does this tell us? Well, it looks like both the CPU part and the GPU part consist of two dies each, which are also connected via GMI. If you already heard about the distributed memory controller in Zen based processors (mentioned in combination with a directory based coherency protocol on LinkedIn), all this makes sense. Knowing the leaked die photo, as shown above, it is reasonable to assume that the two GMI link structures (GMI-Link #0 and #1) actually comprise four links. This would be enough to connect two Zeppelin dies with two GMI links to get access to the two distributed memory controllers (and two DDR4 channels provided by them) on the other die. Two more links provided by each die go to the two Greenland dies, which in turn might also have four GMI links each. Each of the GPU dies might just have one HBM PHY. Of course, the shown Greenland GPU might just be a monolithic die sitting on the interposer. But while we are at it, an interposer would be a perfect way to stitch multiple dies together, something that is expected to come to a greater extent with Navi.

This would provide a lot of flexibility in configuring different processors from a small set of dies: one 8C Zeppelin die and probably just one Greenland die. One important reason for this would be costs for different designs, which are growing with each new process node.
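As a sanity check of the link budget speculated about above, here is a small Python sketch of the four-die arrangement. The Zeppelin-to-Zeppelin and Zeppelin-to-Greenland link counts follow the description in the text; the two links between the Greenland dies are my own assumption to fill their four-link budget, and the die names are just labels.

```python
from collections import Counter

# (die A, die B, number of GMI links between them)
LINKS = [
    ("Zeppelin0", "Zeppelin1", 2),    # CPU dies reach each other's memory controllers
    ("Zeppelin0", "Greenland0", 1),
    ("Zeppelin0", "Greenland1", 1),
    ("Zeppelin1", "Greenland0", 1),
    ("Zeppelin1", "Greenland1", 1),
    ("Greenland0", "Greenland1", 2),  # assumption: remaining GPU links connect the GPU dies
]


def links_per_die(links):
    """Count how many GMI links terminate at each die."""
    counts = Counter()
    for die_a, die_b, n in links:
        counts[die_a] += n
        counts[die_b] += n
    return dict(counts)


per_die = links_per_die(LINKS)
for die, n in sorted(per_die.items()):
    print(die, n)
```

Under these assumptions every die ends up using exactly four links, which is at least consistent with the four link structures visible on the die photo.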

Floating Point Unit

One of the more interesting parts of the Zen microarchitecture is the four-wide FPU. As the GCC patch suggests (by decoding type "direct" or "double"), the FPU's native width is 128 bit for SIMD operations. A different patch mentioned a 3 cycle latency for cache accesses by the FPU. With a base L1D$ latency of 4 cycles indicated by the patches, this would mean a total of 7 cycles latency for FP memory accesses. This is likely the cost for going through the FX unit ("fixed point"), which contains the load store unit responsible for L1 data cache accesses. I won't go through the full details of the patches regarding all the different instructions. Let me point you to a wonderful CPU chart found at InstLatX64, which also includes Zen (or rather Zeppelin), and Looncraz' instruction mapping table.

But two things stood out about the MUL instructions:

In early patches, FMA was displayed as using a combination of an FMUL pipeline (fp0/fp1) and the second FADD pipeline (fp3). This led me to the assumption that we might see an incarnation of the bridged FMA here. Later this patch info was changed, so that FMA instructions would run through the FMUL pipelines only.
(SSE) FMUL and SSE IMUL instructions seem to occupy specific stages in the corresponding pipelines for more than one cycle. This means throughput would be lower than 1 per pipeline. In the GCC patches this is specified with a times symbol, for example "fp0*3". However, this is not the case for FMA instructions, which could be related to a special treatment (due to the bridge), which might skip some to-be-iterated stage. We might learn a bit more about that next week.
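To illustrate what a reservation like "fp0*3" means for throughput, here is a small Python sketch. It only understands a tiny subset of GCC's pipeline description syntax (`|` for alternative units and `*N` for occupying a unit N cycles in a row), and apart from "fp0*3" the example strings are made up for illustration, not taken from the znver1 patch.

```python
from fractions import Fraction


def parse_reservation(reservation):
    """'fp0*3|fp1*3' -> [('fp0', 3), ('fp1', 3)] (subset of the GCC syntax only)."""
    alternatives = []
    for alt in reservation.split("|"):
        unit, _, cycles = alt.strip().partition("*")
        alternatives.append((unit, int(cycles) if cycles else 1))
    return alternatives


def peak_throughput(reservation):
    """Instructions per cycle if every alternative pipeline is kept busy."""
    return sum(Fraction(1, cycles) for _, cycles in parse_reservation(reservation))


print(peak_throughput("fp0*3"))        # one pipe, blocked for 3 cycles per op
print(peak_throughput("fp0*3|fp1*3"))  # hypothetical: two such pipes
print(peak_throughput("fp0|fp1"))      # hypothetical: two fully pipelined pipes
```

A unit that stays occupied for three cycles can only start a new operation every third cycle, which is where the "lower than 1 per pipeline" above comes from.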

One reason for that might be Zen's cat core heritage. To be more power efficient, cat cores use a "rectangular" or "iterative" multiplier. This means it is reduced in depth, so that it can do 32bit FP multiplications at max throughput, but becomes slower the wider the FP numbers are (64bit and 80bit). This is caused by the need to do multiple iterations in the multiplier array to produce all needed partial products, and their number grows with the width of the numbers. This saves a lot of power and area, while still maintaining full throughput for a lot of FP/SIMD code (incl. games), which uses single precision. Double precision, which is also often used, has a lower throughput, but this doesn't cause much of a performance hit with cat cores, as a lot of code even has more FADDs than FMULs. It typically costs a few percent with DP code for a nice power efficiency improvement.
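The effect of such an iterative multiplier can be sketched with a few lines of Python. The significand widths are the usual ones for single, double and x87 extended precision; the 27 bits handled per pass are a hypothetical array depth chosen for illustration, not a confirmed figure for any cat core or for Zen.

```python
import math

# Significand widths in bits (including the leading 1 bit).
SIGNIFICAND_BITS = {"single": 24, "double": 53, "extended": 64}


def multiplier_passes(fmt, bits_per_pass=27):
    """Passes through a rectangular multiplier array that consumes
    bits_per_pass multiplier bits per iteration (hypothetical depth)."""
    return math.ceil(SIGNIFICAND_BITS[fmt] / bits_per_pass)


for fmt in ("single", "double", "extended"):
    passes = multiplier_passes(fmt)
    print(f"{fmt}: {passes} pass(es), relative throughput 1/{passes}")
```

With this assumed depth, single precision gets through in one pass, while double and extended precision need two and three passes, which matches the "slower, the wider" behaviour described above.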

Another aspect is that in case of a bridged FMA (or something similar, see below), the FPU wouldn't need that many FPRF read ports: while performing an FMA operation, a first unit (FMUL) reads the two multiplicands of the FMA instruction, while a second unit (FADD) reads the addend with its own ports and finishes the FMA operation by doing the addition, normalization, and rounding. I think it is interesting to note that several cat core related patents covering an FMA unit did show a delayed read of the addend.

Since Zen will come with a lot of cores (especially the datacenter variants with up to 32 cores), AMD's (or Jim Keller's?) choice might have been to cut the per core power consumption for higher total core counts. Instead of making the whole core weaker, they seemingly decided to avoid hardware support for wide SIMD. This way, there is no need for 256b datapaths from L1D$ to the execution units, 256b wide registers, and of course 256b wide execution units. This already saves some power, as can be derived from the following chart, taken from the paper "Improving the Energy Efficiency of Big Cores" (PDF):

Similar to SIMD execution width, multipliers still contribute a large part of an FPU's power consumption at full throughput, so AMD might have cut that further by using (updated) iterative multipliers as found in the cat cores. Maybe this is the reason that there are two FMUL units in a single core at all, as the construction core line has shown that AMD avoided having that many FMUL/FMA units in a single core. There are other nice effects, like a reduction in voltage droops, which were the next big thing in Steamroller, and are still being handled by Sam Naffziger's "Voltage Droop Mitigation" in Carrizo, Bristol Ridge, and even Polaris. An AMD paper described how researchers were able to increase the base clock frequency of an Orochi processor by 400MHz and more simply by reducing the throughput of heavy FP ops like FMUL. An FMUL implementation with an iterative multiplier would have a similar effect already built in.

But that's not all. AMD Research lists a paper called "REEL: reducing effective execution latency of floating point operations", which (for many at least as abstract) can be found here. In this paper, researchers describe a novel FPU, which contains some additional registers located one pipeline stage before rounding and normalization for later reuse. With a modified scheduler, it is possible to significantly reduce the effective execution latency of a chain of dependent instructions. This happens by forwarding intermediate results from the internal micro register file. One important aspect of floating point performance is execution latency, as many calculations found in typical code have a low ILP (instruction level parallelism). In these cases, reducing these latencies is important. You may compare some of Zen's latencies in the CPU chart mentioned above. But on top of that, an FPU like the one described in REEL would help even further. FMA hasn't explicitly been discussed in the paper, but given the way the FPU works, there is even a kind of inherent FMA execution (a kind of fusion) for dependent FMUL/FADD instructions.

Remaining things

What else could be shown at Hot Chips for Zen? Based on presentations and patents, I wouldn't be surprised about:

a package level integrated voltage regulator, or maybe even a FIVR
per thread priorities for a more efficient SMT implementation in cases of differently prioritized threads (e.g. Prime95 in background and a game in foreground)
finally a working ASF/transactional memory implementation, which becomes increasingly important for higher core counts
more efficient address handling (esp. of stack addresses for accessing a stack cache) to reduce AGU usage
dual front ends in the future for an increasingly powerful execution back end
future application of die stacking (2.5D, 3D) and PIM
interesting network on chip topologies for 16 and 32 cores like "ButterDonut"
uOp and stack caches
reduced branch misprediction penalty thanks to checkpointing or fast rollback of executed instructions
FMUL/FADD fusion (to handle them via FMA operations)

This is what I wanted to get out before enjoying this year's Hot Chips Zen revelations!

The next article on this blog will be about Zen clock frequency and performance projections, as more information became available. BTW, have you seen Looncraz' Zen analysis at the end of his XV article yet?

Some AMD Zen leaks (ES clocks, PCI info, Rambus DDR PHY, Roadmaps) [UPDATED]

2016-08-08T17:30:00.000+02:00

Update: Small corrections, comments on PCIe dump, AoTS leaks, link to more slides added.
TL;DR: First Zen OPN and PCIe info leaked. Additionally I do a recap of some other recent leaks.

Planet3DNow! forum user "Crashtest" posted a Zen ES OPN and PCI device info (behind the spoiler button), which somehow landed in the CPU-Z and SIV databases.

So the OPN is "1D2801A2M88E4_32/28_N". Using the already known schema, "D" stands for desktop, "28" for the base clock frequency (2.8GHz), the first "8" of "88" for the number of cores. And more clearly, "32/28" stand for turbo boost frequency (3.2GHz), and the base clock again (2.8GHz). This matches the information given for the 8C DT ES in an AnandTech forum posting recently:

"The most exciting part is core clock. The 8c/95W variant's base clock is 2.8GHz, all core boost is 3.05GHz and maximum boost is 3.2GHz."

This hasn't been confirmed in this way before (except that I heard it is true). So until now there was nothing available, that actually supported the information posted there.
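The decoding steps above can be put into a few lines of Python. This is only a sketch: it assumes the fixed character positions implied by the two ES OPNs discussed on this blog (the desktop one above and the Stoney Ridge one, 2E1601AOY23E2) and decodes nothing beyond the fields mentioned in the text. The field names are my own.

```python
# Segment letters as mentioned in the text; other letters are not covered here.
SEGMENTS = {"D": "desktop", "E": "embedded"}


def decode_opn(opn):
    """Decode the few ES OPN fields discussed in the text (positions assumed)."""
    body, *suffix = opn.split("_")
    info = {
        "segment": SEGMENTS.get(body[1], "unknown"),
        "base_ghz": int(body[2:4]) / 10,  # "28" -> 2.8 GHz
        "cores": int(body[9]),            # first digit of the "88" / "23" pair
    }
    # Optional "_32/28_N" style suffix: turbo clock / base clock.
    if suffix and "/" in suffix[0]:
        turbo, base = suffix[0].split("/")
        info["turbo_ghz"] = int(turbo) / 10
        info["suffix_base_ghz"] = int(base) / 10
    return info


print(decode_opn("1D2801A2M88E4_32/28_N"))
print(decode_opn("2E1601AOY23E2"))
```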

The PCI information found in the SIV database looks as follows:

Bus-Numb-Fun IRQ Vendor-Dev-Sub_OEM-Rev Class (9:255) Vendor and Device Description Showing 39 of 39
[0 - 00 - 0] 1022-1450-14501022-00 Host Bridge AMD
[0 - 01 - 0] 1022-1452-00000000-00 Host Bridge AMD
[0 - 01 - 2] 1022-1453-00000000-00 PCI Bridge (0-1) x4 (x4) AMD
[0 - 02 - 0] 1022-1452-00000000-00 Host Bridge AMD
[0 - 03 - 0] 1022-1452-00000000-00 Host Bridge AMD
[0 - 04 - 0] 1022-1452-00000000-00 Host Bridge AMD
[0 - 07 - 0] 1022-1452-00000000-00 Host Bridge AMD
[0 - 07 - 1] 1022-1454-00000000-00 PCI Bridge (0- x16 (x16) AMD
[0 - 08 - 0] 1022-1452-00000000-00 Host Bridge AMD
[0 - 08 - 1] 1022-1454-00000000-00 PCI Bridge (0-9) x16 (x16) AMD
[0 - 20 - 0] 1022-790B-790B1022-59 SMBus Controller AMD
[0 - 20 - 3] 1022-790E-790E1022-51 ISA Bridge AMD
[0 - 20 - 6] 1022-7906-79061022-51 SD Host DMA Controller AMD
[0 - 24 - 0] 1022-1460-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor Link Control
[0 - 24 - 1] 1022-1461-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor Address Map Configuration
[0 - 24 - 2] 1022-1462-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor DRAM Controll
[0 - 24 - 3] 1022-1463-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor Miscellaneous Control
[0 - 24 - 4] 1022-1464-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor Link Control
[0 - 24 - 5] 1022-1465-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor Function 5 Configuration
[0 - 24 - 6] 1022-1466-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor Function 6 Configuration
[0 - 24 - 7] 1022-1467-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor Function 7 Configuration
[1 - 00 - 0] 1022-43B9-11421B21-02 XHCI Controller x4 (x4) AMD Promotory USB 3.1 XHCI Host Controller
[1 - 00 - 1] 1022-43B5-10621B21-02 SATA (AHCI 1.0) x4 (x4) AMD
[1 - 00 - 2] 1022-43B0-00000000-02 PCI Bridge (1-2) x4 (x4) AMD
[2 - 00 - 0] 1022-43B4-00000000-02 PCI Bridge (2-3) x1 (x1) AMD
[2 - 01 - 0] 1022-43B4-00000000-02 PCI Bridge (2-4) x1 (x1) AMD
[2 - 02 - 0] 1022-43B4-00000000-02 PCI Bridge (2-5) x1 (x1) AMD
[2 - 03 - 0] 1022-43B4-00000000-02 PCI Bridge (2-6) x1 (x1) AMD
[2 - 04 - 0] 1022-43B4-00000000-02 PCI Bridge (2-7) x0 (x4) AMD
[3 - 00 - 0] 14E4-1687-168714E4-10 Ethernet Controller x1 (x1)Broadcom NetXtreme BCM5762 Gigabit Ethernet PCIe
[3 - 00 - 1] 14E4-1640-164014E4-10 SD Host DMA Controller x1 (x1)Broadcom
[5 - 00 - 0] 1002-68F9-010E1002-00 VGA Controller x1 (x16) AMD Cedar Pro [Radeon HD 5450/Radeon HD 6350] [GPU-0]
[5 - 00 - 1] 1002-AA68-AA681002-00 High Def Audio x1 (x16) AMD Cedar/Park HDMI Audio
[8 - 00 - 0] 1022-145A-145A1022-00 Other (0x130000) x16 (x16) AMD
[8 - 00 - 2] 1022-1456-14561022-00 Other Encryption x16 (x16) AMD
[8 - 00 - 3] 1022-145C-145C1022-00 XHCI Controller x16 (x16) AMD
[9 - 00 - 0] 1022-1455-14551022-00 Other (0x130000) x16 (x16) AMD
[9 - 00 - 2] 1022-7901-79011022-51 SATA (AHCI 1.0) x16 (x16) AMD
[9 - 00 - 3] 1022-1457-14571022-00 High Def Audio x16 (x16) AMD

Total of 7 PCI buses and 39 PCI devices in 0.040 seconds.

Source: SIV database
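If you want to play with the dump yourself, the lines are regular enough to be parsed with a single regular expression. The following Python sketch does that for a handful of lines copied from the listing above; the field names in the pattern are my own, and the layout is inferred from this dump only, not from any SIV documentation.

```python
import re

SAMPLE = """\
[0 - 00 - 0] 1022-1450-14501022-00 Host Bridge AMD
[0 - 24 - 2] 1022-1462-00000000-00 Host Bridge AMD Summit Ridge (K17) Processor DRAM Controll
[1 - 00 - 0] 1022-43B9-11421B21-02 XHCI Controller x4 (x4) AMD Promotory USB 3.1 XHCI Host Controller
[3 - 00 - 0] 14E4-1687-168714E4-10 Ethernet Controller x1 (x1)Broadcom NetXtreme BCM5762 Gigabit Ethernet PCIe
[5 - 00 - 0] 1002-68F9-010E1002-00 VGA Controller x1 (x16) AMD Cedar Pro [Radeon HD 5450/Radeon HD 6350] [GPU-0]
"""

# [bus - device - function] vendor-device-subsystem-revision description
LINE = re.compile(
    r"\[(?P<bus>\d+) - (?P<dev>\d+) - (?P<fn>\d+)\] "
    r"(?P<vendor>[0-9A-F]{4})-(?P<device>[0-9A-F]{4})-"
    r"(?P<subsys>[0-9A-F]{8})-(?P<rev>[0-9A-F]{2}) "
    r"(?P<rest>.*)"
)


def parse_dump(text):
    """Return one dict per line that matches the assumed layout."""
    return [m.groupdict() for m in map(LINE.match, text.splitlines()) if m]


devices = parse_dump(SAMPLE)
buses = sorted({int(d["bus"]) for d in devices})
print(len(devices), "devices on buses", buses)
for d in devices:
    print(d["vendor"], d["device"], d["rest"])
```

Run over the full listing, the same approach should reproduce the bus and device totals printed at the end of the dump.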

Update #1: As "Crashtest" explains in a later posting, the respective Summit Ridge system (w/ Myrtle mainboard) seems to have at least 36 PCIe lanes. According to him, the listed configuration seems to be a bit chaotic. BTW, "Promotory" should actually be written "Promontory".

Update #2: Thanks to the OPN, Planet3DNow! user "BoMbY" identified some Ashes of the Singularity benchmark results, which were run on two different Zen engineering samples. One had the same OPN, while the other had a slightly different one ("2D" instead of "1D"). As they've been removed from the AotS database, you can find them archived here.

Next there was a BitsAndChips article about the likely provider of the DDR4 PHY found on Zen based processors: Rambus. That speculation is based on details the author learned from his sources, which fit well with what Rambus recently announced regarding their 3200 Mbps DDR4 PHY available for Globalfoundries' 14LPP process. You can learn a bit more about their technology here.

Another interesting bit of info is a pair of AMD roadmaps, which look real. But as one recently could see with this Athlon X8K die shot posted at Reddit, even the quality of die shot fakes can be rather high (also see my analysis there). In that case the creator of the die shot wrote me that he actually just tried to check the viability of such a product regarding die size and thus processing costs. Back to the roadmaps.

For 2017 they show:

Raven Ridge APUs for the FP5 socket (4C/8T, <=12 gfx CUs, 4-35W TDP)
higher TDP models (65-95W) for the AM4 socket (also 4C/8T and an unknown number of gfx CUs)
also AM4 based Summit Ridge CPUs with 8C/16T and TDPs of 65 to 95W

Most of the boxes are marked as "AMD PRO", which usually stands for a separate series of products targeted at commercial customers. Due to special certification programmes, perhaps also additional testing, and of course the integration in ready to ship OEM hardware, these products might hit the shelves a bit later than consumer variants.

There is also a notable difference in the listed processes: "14nm SoC" for Raven Ridge, and "14nm FinFET" for Summit Ridge. I assume, that the "14nm SoC" process might refer to a different metal layer density, as recently covered in an article by Hiroshige Goto (Japanese) about APUs and AMD's FinFET efforts. Thus "FinFET" could stand for less dense lower metal layers, which would allow for somewhat higher clock speeds due to lower wire delays.

A note on the slides: I saw some unusual pixel patterns and spacings in the "14nm SoC" and "14nm FinFET" boxes (aside from different sizes). I'd expect a scaled, interpolated PowerPoint slide to show subpixel positioning of single characters. But I saw only pixel exact 1 or 2 pixel spacings with exactly similar interpolated pixels around the characters. I'll try to reproduce this in PPT.

In the end, these slides (if real) might mean that there is no desktop Zen available in 2016 (not even as the promised sampling at the end of the year). But that is not certain yet. My own GCC patch based launch speculation gave a rough launch date range between 10/2016 and 05/2017.

Update #3: You can find the complete two slides and an additional one here. These don't look faked, although some details look a bit awkward. But this might be attributed to a smaller target audience (I suppose decision makers with more interest in dates and specs).

First AMD Summit Ridge Wafer Photo spotted

2016-05-23T00:59:00.000+02:00

If you are into die photos like me, you'll certainly be excited about this one, which has been spotted by SemiAccurate's Thomas Ryan, when many (including me) didn't notice it while glancing over this particular slide (lower left corner):

Source: ComputerBase

It apparently shows Zen based Summit Ridge dies on a wafer and has been grabbed by ComputerBase, which published it in an article covering AMD's Annual Stockholder Meeting and some interesting updates in their Investor Presentation slidedeck.

Here is a zoomed variant:

@Dresdenboy Okay, that would make sense. I attempted to make a better quality version of the wafer shot. pic.twitter.com/nLVPMZGTcY
- Thomas Ryan (@UncheckedError) May 22, 2016

One half of the die would look like this after some perspective correction:

There you can see one core complex with four cores and likely 8 MB of L3 cache.

The mentioned slidedeck in a somewhat older version also contained a rough comparison of the performance between Summit Ridge (8 Zen core variant) and Orochi (could be anything from Bulldozer based FX-81xx to Orochi revision C "Vishera", also known as FX-83xx with 8 cores). This has been changed to Excavator vs. Zen in the latest version of the slides, as linked above. I stitched both comparisons together:

New AMD Zen core details emerged

2016-02-29T20:31:00.002+01:00

Just one week after my last blog posting, providing a hint of the maximum number of Zen cores supported per socket, a news wave about details of Zen based server processors given in the presentation of a CERN researcher hit the web. The guy works in the institution's Platform Competence Centre (PCC) and manages integration of predominantly prototype hardware according to his CERN profile. So it can be assumed that anything he says about server platforms might have been provided by representatives of the different processor and server OEMs. The 8 memory channels haven't been mentioned before in a leak or patch. And the 32 core number is not related to my posting, as the CERN talk was held on the 29th of January while I published my posting (unaware of the talk) on the 1st of February, after first mentioning the patch already in December.

Now a new series of patches provides further information about the Zen core's IP blocks. They were posted on 16/02/16 on the Linux kernel mailing list by an AMD employee, after an earlier round of patches in January, which even mention a "ZP" target, very likely being the abbreviation for "Zeppelin". The more recent patches cover additions to AMD's implementation of a scalable Machine Check Architecture (MCA), and handling of deferred errors. This is implemented in the Linux EDAC kernel module, which is responsible for hardware error detection and correction. The most interesting patch contains the following sections, with some details highlighted:

+/*
+ * Enumerating new IP types and HWID values
+ * in ScalableMCA enabled AMD processors
+ */
+#ifdef CONFIG_X86_MCE_AMD
+enum ip_types {
+ F17H_CORE = 0, /* Core errors */
+ DF,  /* Data Fabric */
+ UMC,  /* Unified Memory Controller */
+ FUSE,  /* FUSE subsystem */
+ PSP,  /* Platform Security Processor */
+ SMU,  /* System Management Unit */
+ N_IP_TYPES
+};

+enum core_mcatypes {
+ LS = 0,  /* Load Store */
+ IF,  /* Instruction Fetch */
+ L2_CACHE, /* L2 cache */
+ DE,  /* Decoder unit */
+ RES,  /* Reserved */
+ EX,  /* Execution unit */
+ FP,  /* Floating Point */
+ L3_CACHE /* L3 cache */
+};
+
+enum df_mcatypes {
+ CS = 0,  /* Coherent Slave */
+ PIE  /* Power management, Interrupts, etc */
+};
+#endif

The interconnect subsystem is called "Data Fabric", which includes so-called coherent slaves according to the last enumeration list. The "FUSE subsystem" might be replaced by something else like "Parameter block", as it just means a block managing the processor's configuration.

The second list of enumerations contains blocks found in the Zen core or close to it. I think the highlighted "RES" element might actually stand for a real IP block, as it doesn't make much sense to have it sitting amid the other elements and not at the end. According to some other code in the patch, the L2 cache is seen as part of the core, while the L3 cache is not (as expected):

+ case F17H_CORE:
+  pr_emerg(HW_ERR "%s Error: ",
+    (mca_type == L3_CACHE) ? "L3 Cache" : "F17h Core");
+  decode_f17hcore_errors(xec, mca_type);
+  break;
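For readers who don't read kernel C every day, here is the same selection logic as a small Python model. The enum values mirror the order of `core_mcatypes` in the patch (C enums count up from the explicit 0); the Python class and function names are my own and not part of the patch.

```python
from enum import IntEnum


class CoreMcaType(IntEnum):
    """Mirror of the patch's core_mcatypes enum, in the same order."""
    LS = 0        # Load Store
    IF = 1        # Instruction Fetch
    L2_CACHE = 2  # L2 cache
    DE = 3        # Decoder unit
    RES = 4       # Reserved
    EX = 5        # Execution unit
    FP = 6        # Floating Point
    L3_CACHE = 7  # L3 cache


def error_prefix(mca_type):
    """Same decision as the ternary in the F17H_CORE case above."""
    block = "L3 Cache" if mca_type == CoreMcaType.L3_CACHE else "F17h Core"
    return f"{block} Error: "


for t in CoreMcaType:
    print(f"{t.name:9s} -> {error_prefix(t)}")
```

Only the L3 cache gets its own label; everything else, including the L2 cache, is reported as a core error.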

Now let's go through some of the error string lists, beginning with those dedicated to the load/store unit:

+/* Scalable MCA error strings */
+
+static const char * const f17h_ls_mce_desc[] = {
+ "Load queue parity",
+ "Store queue parity",
+ "Miss address buffer payload parity",
+ "L1 TLB parity",
+ "",      /* reserved */
+ "DC tag error type 6",
+ "DC tag error type 1",

This is the first of many lists containing error strings, in this case for the load/store unit. Similar to the enumeration above, there is a reserved element, possibly hiding something, as this is a public mailing list. The strings I left out don't contain any surprises compared to the Bulldozer family. But overall I get the impression, that AMD significantly improved the RAS capabilities, which are very important for server processors. The following block contains error strings related to the instruction fetch block ("if"):

+static const char * const f17h_if_mce_desc[] = {
+ "microtag probe port parity error",
+ "IC microtag or full tag multi-hit error",
+ "IC full tag parity",
+ "IC data array parity",
+ "Decoupling queue phys addr parity error",
+ "L0 ITLB parity error",
+ "L1 ITLB parity error",
+ "L2 ITLB parity error",
+ "BPQ snoop parity on Thread 0",
+ "BPQ snoop parity on Thread 1",
+ "L1 BTB multi-match error",
+ "L2 BTB multi-match error",
+};

There is a new L0 ITLB, which is the only level 0 thing being mentioned so far, while VR World mentioned level 0 caches (besides other somewhat strange rumoured facts like no L3 cache in the APU variant - while this has been shown on the leaked Fudzilla slide). The only thing resembling such a L0 cache is a uOp cache, which has clearly been named in the new patch in a section related to the decode/dispatch block (indicated by "de"):

+static const char * const f17h_de_mce_desc[] = {
+ "uop cache tag parity error",
+ "uop cache data parity error",
+ "Insn buffer parity error",
+ "Insn dispatch queue parity error",
+ "Fetch address FIFO parity",
+ "Patch RAM data parity",
+ "Patch RAM sequencer parity",
+ "uop buffer parity"
+};

There are strings for both a "uop cache" and a "uop buffer". So far I knew about this uop buffer patent filed by AMD in 2012, which describes different related techniques aimed at saving power, e.g. when executing loops or to keep the buffer physically small by leaving immediate and displacement data of decoded instructions in an instruction byte buffer ("Insn buffer") sitting between instruction fetch and decode. The "uop cache" clearly seems to be a separate unit. Even without knowing how many uops per cycle can be provided by that cache, it will help to save power and remove an occasional fetch/decode bottleneck when running two threads. The next interesting block is about the execution units:

+static const char * const f17h_ex_mce_desc[] = {
+ "Watchdog timeout error",
+ "Phy register file parity",
+ "Flag register file parity",
+ "Immediate displacement register file parity",
+ "Address generator payload parity",
+ "EX payload parity",
+ "Checkpoint queue parity",
+ "Retire dispatch queue parity",
+};

Here is a first confirmation of a checkpoint mechanism. This has been described in several patents and might also be an enabler for hardware transactional memory, which has been proposed in the form of ASF back in 2009. Another use case is the quick recovery from branch mispredictions, where program flow can be redirected to a checkpoint created right before evaluating a difficult to predict branch condition.
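The string lists quoted so far are plain lookup tables. As a sketch of how such a table might be consulted, here is a Python version using the execution unit strings from above. It assumes that an extended error code simply indexes the table and that empty strings mark reserved entries; the `describe` helper and its fallback text are my own.

```python
F17H_EX_MCE_DESC = [
    "Watchdog timeout error",
    "Phy register file parity",
    "Flag register file parity",
    "Immediate displacement register file parity",
    "Address generator payload parity",
    "EX payload parity",
    "Checkpoint queue parity",
    "Retire dispatch queue parity",
]


def describe(table, xec):
    """Look up an extended error code; out-of-range or empty entries are
    treated as reserved (assumption about how the table is used)."""
    if not 0 <= xec < len(table) or not table[xec]:
        return "reserved or unknown error code"
    return table[xec]


print(describe(F17H_EX_MCE_DESC, 6))
print(describe(F17H_EX_MCE_DESC, 8))
```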

Let me continue with some random picks:

+ "L3 victim queue parity",
...
+ "Atomic request parity",
+ "ECC error on probe filter access",

...

+ "Error on GMI link",

There is a confirmation of the "GMI link" mentioned on an already leaked slide, which mentioned a bandwidth of 25 GB/s per link. The term "Data Fabric" also has been used on that slide.

When reporting about the 32 core support, I wrote that some patents used the same wording. It's actually "core processing complex" (CPC) and can contain multiple compute units (like Zen cores). So they are not the same. AMD patent filings using the term are US20150277521, US20150120978, and US20140331069.

Last but not least I have updated the Zen core diagram based on this new information and some very likely related patents and papers:

Notable changes are:

uOp Cache has been added based on the new patch
FMUL/FADD for FMAC pairing removed, based on some corrections of the znver1 pipeline description.
4x parallel Page Table Walkers added, based on US20150121046
128b FP datapaths (also to/from the L1 D$) based on "direct" decode for 128b wide SIMD and "double" decode for 256b AVX/AVX2 instructions
32kB L1 I$ has been mentioned in some patents. With enough ways, a fast L2$ and a uOp cache this should be enough, I think.
issue port descriptions and more data paths added
2R1W and 4 cycle load-to-use latency added for the L1 D$, based on info found on a LinkedIn profile and the given cycle differences in the znver1 pipeline description
Stack Cache speculatively added based on patents and some interesting papers. This doesn't help so much with performance, but a lot with power efficiency.

It is still an open question what the first mention of the fp3 port for FMAC operations was good for. I thought it was a typo, but rather of the kind "fp3" instead of "fp2" in one case. It could still be related to register file port usage and/or a bridged FMA, but that is probably not very useful information for the compiler. Because of the correction patch I'm still looking further into the FPU topic, as promised earlier. I'll cover that in a follow-up posting.

Finally there is a hint at good hardware prefetcher performance (or bad interferences?), as AMD recommends to switch off default software prefetching for the znver1 target in GCC.

BTW have you ever heard of a processor core having 2 front ends and one shared back end?

Update: There is an update of the patches discussed here, posted on the same day as this blog entry. You can see it here. So far I didn't see any significant additions other than cleanups and fixes.

More AMD Bristol Ridge SKUs leaked

2016-02-08T01:17:00.002+01:00

Kristian Gocnik (@I_biT_MySeLf) tipped me off about new mobile Bristol Ridge SKUs, which appeared on usb.org as you can see here and here [UPDATE: the entries have been removed now - visiting the pages may delete your only cached copy in the browser]. That's the same site where the first Bristol Ridge SKU (FX-9830P) appeared. I put this together with information found in the leaked slide by Benchlife.info, which you can find in my blog post about a WEI result of an A10-9600P.

Using the mobile Carrizo SKUs, the leaked A10-9600P clock, and some sorting, it was easy to map the SKUs to the leaked slide's data. Kristian Gocnik tried it independently and we got the same mapping, except for a consumer A8-9500P he speculatively derived from the pro model, but which is missing on usb.org. So the resulting table likely represents what AMD is going to release as mobile Bristol Ridge chips for the FP4 socket later this year.

The model numbers likely simply jumped by one thousand compared to Carrizo's, plus an additional thirty points for the 35W variants. Carrizo's wide TDP ranges got split into 15W and 35W TDPs. This might help to avoid the confusion about 15W and 35W Carrizo laptops. The CPU base clocks jumped significantly, while CPU Turbo and (maximum) GPU clocks kind of matured with the fab process.

A reason for the jump has been given by AMD at ISSCC 2016, as EE Times reported:

"For its part, AMD engineers showed smart ways of squeezing as much as 15% more performance out of its Carrizo PC processor, simply by applying more aggressive power management to the 28nm design. The Bristol Ridge design was a study in using power management to overcome performance limits tied to heat, voltage and current."

Months after the first leaked WEI score, first true Bristol Ridge benchmarks will show, how this improvement translates into real world performance. Hopefully they get tested with dual channel memory, even if AMD or OEMs only provide single channel equipped/designed devices, as for the recent AnandTech Carrizo review.

BTW, there are lots of fresh Stoney Ridge Geekbench results in the Primate Labs' database.

Update: Of course, these are not OPNs, but SKUs. Added a warning as the linked usb.org entries are gone.

AMD Zeppelin CPU codename confirmed by patch and perhaps 32 cores per socket for Zen based MPUs, too

2016-02-01T01:54:00.001+01:00

The Zeppelin codename, first mentioned on a leaked slide shown by Fudzilla, has been identified as a "family 17h model 00h" CPU by a patch on LKML.org. The interesting parts of the patch are:

AMD Zeppelin (Family 17h, Model 00h) introduces an instructions retired performance counter which indicated by CPUID.8000_0008H:EBX[1]. And dedicated Instructions Retired register (MSR 0xC000_000E9) increments on once for every instruction retired.

There might even be a meaning behind the similarity of parts of the "Zen" and "Zeppelin" codenames.

An older patch on the same mailing list also gives a little more info about Zen:

On AMD Fam17h systems, the last level cache is not resident in Northbridge. Therefore, we cannot assign cpu_llc_id to same value as Node ID (as we have been doing currently)

We should rather look at the ApicID bits of the core to provide us the last level cache ID info. Doing that here.

The most interesting part describes the way, how the last level cache (LLC) ID is being calculated for Zen based MPUs:

+ core_complex_id = (apicid & ((1 << c->x86_coreid_bits) - 1)) >> 3;
+ per_cpu(cpu_llc_id, cpu) = (socket_id << 3) | core_complex_id;

"Core complex" should be similar to "compute unit" and has been used in some AMD patents already. The expression marked in red means a shift right by 3, which equals a division by 8. So with two logical cores per physical core due to SMT, a core complex should contain four Zen cores and a shared LLC.

The next line shows the socket ID being shifted left by 3, leaving 3 bits for the core complex ID, which suggests a maximum number of eight core complexes per socket, or 32 physical cores. This number should first be seen as a placeholder, but we've already seen rumours mentioning that many cores.
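The arithmetic of the two patch lines can be replayed outside the kernel to see how the numbers fall out. The following Python sketch only mirrors the shifts and masks; the function name and the example value of 6 core ID bits (64 logical CPUs per socket) are assumptions for illustration and are not taken from the patch.

```python
SMT_THREADS_PER_CORE = 2  # assumption from the text: two logical cores per Zen core


def llc_id(apicid: int, socket_id: int, coreid_bits: int) -> int:
    """Mirror the patch: derive the last level cache ID from the APIC ID."""
    # Keep only the core ID bits, then drop the low 3 bits (8 logical CPUs).
    core_complex_id = (apicid & ((1 << coreid_bits) - 1)) >> 3
    # The socket ID is placed above a 3 bit wide core complex field.
    return (socket_id << 3) | core_complex_id


# Hypothetical example: 6 core ID bits, i.e. APIC IDs 0..63 within one socket.
COREID_BITS = 6
ids = [llc_id(apicid, socket_id=0, coreid_bits=COREID_BITS) for apicid in range(64)]

logical_per_complex = ids.count(0)  # logical CPUs sharing LLC 0
cores_per_complex = logical_per_complex // SMT_THREADS_PER_CORE
complexes_per_socket = len(set(ids))
print(cores_per_complex, complexes_per_socket, cores_per_complex * complexes_per_socket)
```

With these assumed values the sketch reports four cores per complex and eight complexes per socket, which is where the 32 core figure comes from.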

AMD A10-9600P: A Bristol Ridge laptop left some traces

2016-01-05T01:54:00.000+01:00

One of my search strings got a hit today. Based on the already known model number 101 I found this Windows Experience Index CPU score of an AMD "A10-9600P" APU (Bristol Ridge). It's a quad core (2 modules) with 6 CUs and a reported base clock of 2.3GHz. A CPU score of 7.4 isn't that low, but doesn't tell us that much more, except that the system was able to finish the benchmark test. The result currently sits on page 7 of the filtered list linked by the table screenshot. You can also click the other images for the respective sources.

According to another page the listed model is an HP system, very likely a laptop.

As can be seen in the leaked slide below, the CPU found might be the quad core with a cTDP of 12-15W.

Source: https://benchlife.info/amd-will-rename-excavator-to-bristol-ridge-12082015/

After the first listing of a Bristol Ridge SKU ("FX-9830P", thx @Onkel_Dithmeyer) and some BR ES traveling through the world, we now have a first Bristol Ridge APU reporting the final product OPN instead of a typical ES string. Perhaps this is a system to be shown at CES?

AMD K12 looks to be at least a 4-wide design with SMT

2015-11-23T01:17:00.001+01:00

An article about Zen and K12 by Yusuke Ohara gives a good overview of AMD's processor plans and a new and very interesting bit of information about AMD's high performance ARM design. As machine translators still struggle to provide clearly understandable translations of Japanese texts, trying multiple translators did not help. Therefore I asked the author to make sure that I got it right. He confirmed that, according to ARM officials, who are aware of the work of their architectural licensees, AMD is using at least a 4-wide design for its K12 core.

Jim Keller already said that the smaller decoders for ARM instructions would leave room to add some performance improving features compared to x86. He also mentioned "a bigger engine" than in Zen. Looking at the microarchitecture diagram, one might ask how AMD would utilize all this execution hardware, especially if there were even more units and maybe even more than four instructions fetched and decoded per cycle. And given its target market, which is servers and datacenters, this might include one important feature: SMT. Some already speculated about that based on expectations, but there is AMD patent application US20150121046, which mentions SMT and its application in an AArch64 design very clearly and with many implementation details. This can be seen as an indicator of work being done for real products.

If K12 is a 4-wide or even wider SMT design similar to Zen (which is "only" 4-wide), this would put some substance behind Keller's announcements, which suggested many similarities between both designs. This is supported by the fact, that one of the inventors listed in the patents (Marius Evers) seemingly worked on both cores. Many other patents by him also cover both ARM and x86. He was also involved in one patent filed in 2007, which described a way to add SMT to the front end of a Bulldozer like module. SMT is not only useful to utilize execution units, if there are many of them. It also helps by keeping them busy, if there are multi-cycle FP instructions, branch mispredictions, or cache misses.

Of course, there are more differences between those two architectures than the ISAs alone, but many typical CPU components are either ISA-agnostic and reusable or could be adapted with much less effort than creating them from scratch. However, if it was done this way, such a strategy would not only have permitted AMD to make an efficient use of the limited R&D resources available, but it would have created a chance to produce a powerful ARM core for servers for an acceptable overhead. This is like applying SMT to R&D.

AMD Hierofalcon/Seattle shown at ARM TechCon

2015-11-18T18:38:00.005+01:00

AMD presented some boards at ARM TechCon and thanks to ARMdevices.net there are two videos covering that stuff.

One video shows Red Hat's Jon Masters' explanation of AMD's Huskyboard, where (even if only printed on cardboard) you can have a nice closeup view of the chip (video screenshot):

The second video shows real hardware at work, including SoftIron Overdrive 3000, and the Huskyboard in 3D:

AMD Zen and K12 (ARM) tapeouts confirmed by LinkedIn profile

2015-10-15T19:35:00.000+02:00

According to a LinkedIn profile, both Zen and K12 should have been taped out already. So this is a fact rather than speculation based on sparse information. Interestingly, the same guy (you have to find him yourself, if you need to), who only talks about CPU cores, mentions working on 16nm and 14nm FinFET designs. So there will be one design made by TSMC and one by Globalfoundries: K12 by the former and Zen by the latter, I suppose. And here is the snippet:

AMD's ARM-based "Hierofalcon" SoC sighted

2015-10-15T00:23:00.001+02:00

On the same OSADL site, which once provided some first signs of life of a 2 GHz Jaguar based APU, there is now an engineering sample of one of AMD's ARM based embedded processors, called "Hierofalcon". The processor can be found in rack #a slot #3. According to the tables and logs, it also runs at 2 GHz and has 8 cores. If you haven't heard of that processor, just check these two slides. I actually included the first only because of the bird. ;)

I prepared some charts out of the numbers given there. If you check the site, you'll find no directly linked latency chart. But their 1337 page allows comparing rack #a slot #3 to some other rack/slot and returns the missing latency chart of the "Hierofalcon", which looks like this (updated daily):

Latency plot of AMD "Hierofalcon" ES

The only available performance numbers are some daily updated Unixbench results. So I took them, combined rack names with CPU strings, sorted them, chose some CPUs for comparison, and normalized the results to the CPU in question.

The first chart already shows, that on a per clock basis AMD's other CPUs already lag behind in simple integer code of the old kDhrystone benchmark. The floating point based Whetstone benchmark draws a somewhat different picture with more equally distributed per clock performances except that of the old K10 based Phenom II. The next three benchmarks Execl (not Excel!), kCopy, and kPipe test OS functions like spawning processes, doing file copying or using the pipe. The Index is a combined result.

In the next chart we can see the raw performance of all cores, only normalized again to Hierofalcon.

Even then the 8 cores of the ARM based processor have a good standing in the first two benchmarks, while in the OS benchmark, it roughly keeps up with Kaveri and Bulldozer, both running at much higher clock speeds.

The ARM based CPUs are meant to put many lower power cores together. To have a first impression of that effect, I used the given TDP numbers as the only metric available for all CPUs. Here are the power efficiency numbers:

I think in this case, the Hierofalcon bars are really easy to spot, even though I used the max listed TDP of 30W. Only the already power optimized Sandy Bridge variants and the 9W Kabini are able to keep up in some of the tests. And of course, real power measurements would shift the numbers a bit.
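The three views used for these charts (raw, per clock, and per watt of TDP) are the same normalization applied to differently scaled scores. Here is a minimal Python sketch of that idea. All names are hypothetical and the numbers are made-up placeholders, not the OSADL results.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    name: str
    score: float      # benchmark result, higher is better (placeholder value)
    clock_ghz: float
    tdp_w: float


def normalize(values: dict[str, float], reference: str) -> dict[str, float]:
    """Scale all values so that the reference CPU ends up at 1.0."""
    ref = values[reference]
    return {name: value / ref for name, value in values.items()}


# Placeholder data, NOT measured results.
cpus = [
    Cpu("reference-arm", score=1000.0, clock_ghz=2.0, tdp_w=30.0),
    Cpu("other-x86", score=1500.0, clock_ghz=4.0, tdp_w=90.0),
]
REF = "reference-arm"

raw = normalize({c.name: c.score for c in cpus}, REF)
per_clock = normalize({c.name: c.score / c.clock_ghz for c in cpus}, REF)
per_watt = normalize({c.name: c.score / c.tdp_w for c in cpus}, REF)
```

In this made-up case the second CPU wins on raw score but falls behind the reference per clock and per watt, which is the kind of shift the charts are meant to show. Since TDP is only an upper bound, the per watt view stays a rough estimate.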

Of course, many (including me) would like to see more interesting benchmarks, but these are the first numbers we've got and they aren't bad at all.

How Many Days Until Zen?

2015-10-06T23:43:00.001+02:00

Last week's headlines went back and forth about which foundry will be selected by AMD to produce wafers containing Zen based processors. After Jim Keller's departure, Mark Papermaster assured that Zen is on track for samples in 2016 and a full year of revenue in 2017. Before he pointed that out, there was a comment on a LinkedIn profile, putting AMD's next gen x86 desktop processor straight into 2017.

So there are several data points of more or less official type. Let me add another one, which is based on the GCC patch publication pattern. This assumes, that there are work processes behind the patches and of course GCC related deadlines for inclusion of particular changes.

This chart shows the time delta in days between the publication of patches and the launch of a particular CPU containing a new core. For some launch dates only a month was given, so I took the last day of that month for the calculation.
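The delta calculation, including the fallback to the last day of a month, can be sketched in a few lines of Python. The function names and the two dates below are hypothetical and serve only to illustrate the rule. They are not the dates behind the chart.

```python
import calendar
from datetime import date
from typing import Optional


def launch_date(year: int, month: int, day: Optional[int] = None) -> date:
    """Return the launch date; fall back to the last day of the month."""
    if day is None:
        day = calendar.monthrange(year, month)[1]  # number of days in that month
    return date(year, month, day)


def days_from_patch_to_launch(patch: date, launch: date) -> int:
    """Time delta in days between patch publication and CPU launch."""
    return (launch - patch).days


# Hypothetical dates for illustration only.
patch_published = date(2015, 10, 1)
launch = launch_date(2016, 2)  # only the month is known, so use its last day
delta_days = days_from_patch_to_launch(patch_published, launch)
print(delta_days)
```

Taking the last day of the month makes the delta an upper bound for launches where only the month is known.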

The Zen bars show the timeline in months starting with publication of the specific patch. With this at hand, anyone can draw their own conclusions. The scenario of first Zen based server or desktop CPUs hitting the markets in 4Q16 doesn't seem unlikely.

AMD's Zen core (family 17h) to have ten pipelines per core

2015-10-03T02:43:00.001+02:00

I moved here with my writing about Zen, since blog.de will close its service at the end of this year. That's it. Let's move on to the interesting stuff.

Whoever has chosen the name "Zen" for AMD's next generation x86 core, might have had the number four in mind, which plays an important role in this philosophy (e.g. Four Dharmadhātu). At least this is what a recent patch revealed about this long awaited microarchitecture.

Andreas Stiller speculates that the term Zen as in "SuZen" might be related to Zen team leader Suzanne Plummer and possibly Lisa Su as well. An article on myStatesman, which appeared shortly after Jim Keller's departure, lists some more team member names if you magnify the photo:

Mike Clark, front left, and team leader Suzanne Plummer, and in background from left are Teja Singh, Lyndal Curry, Mike Tuuk, Farhan Rahman, Andy Halliday, Matt Crum, Mike Bates and Joshua Bell.

Mike Clark is a true AMD veteran, being there since 1993. Some have developed the Cat cores, like Teja Singh and Joshua Bell, who presented the Jaguar microarchitecture at ISSCC 2013.

As heard earlier this year, Zen will use SMT and an improved cache subsystem while being designed from scratch with new ideas combined with reusing existing components (to reduce the effort). This might even include already existing and somewhat developed ideas not realized in previous designs. A lot of the new functionality has been filed for patenting. For example there was a mention of checkpointing, which is good for quick reversion of mispredicted branches and other reasons for restarting the pipelines. Some patents suggest, that Zen might use some slightly modified Excavator branch prediction.

And the new patch also suggests nicely low int/fp mul, fp add, int/fp div and fp square root latencies. Some of these lower latencies (div/sqrt) were introduced with Excavator, as an Aida64 instruction latency dump provided by Anandtech forum user monstercameron revealed. Due to an Aida problem with measured and reported clock frequencies (although it was fixed at 1.4GHz), you have to multiply the measured times by 1.4 to get the real number of cycles. Ok, back to Zen.

Here are some quotes of the patch file:

+;; Decoders unit has 4 decoders and all of them can decode fast path
+;; and vector type instructions.

+;; Integer unit 4 ALU pipes.

+;; 2 AGU pipes.

+;; Floating point unit 4 FP pipes.

+  32, /* size of l1 cache.  */
+  512, /* size of l2 cache.  */

Excerpt:

4 wide decoders
4 integer ALUs
2 AGUs (for 2R 1W L1 cache according to a LinkedIn profile)
4 FP pipelines

That makes ten pipelines (4 ALU, 2 AGU and 4 FP) with a general four wide design.

There is a lot more information, which I will collect over the next days. Some stuff is copy-pasted from Excavator (bdver4) or Jaguar (btver2) and then modified. But careful comparison did show some clear differences, while at other places it's not clear whether there is new information or not (e.g. div latencies). But as btver2 has 2048 kB L2 and the rest of the block is more similar to bdver4 or btver2 than btver1 (Bobcat), which has 512 kB L2, it looks like no btver1 files were used as a source. So I assume that this is a new entry of an L2 cache size, indicating fast L2 caches per core. The L1 data cache still has the same size as that of Jaguar or Excavator. Some patents mention an 8-way 32 kB L1 D$.

Interestingly, as there are two 128b FP mul and two 128b FP add units (with only 3 cycles latency for these ops), the FMA instructions will be executed by combining one FP MUL and one FP ADD unit, resulting in 2 issues and 5 cycles latency (as that of the Bulldozer family). This saves some register file ports, increases throughput, and reduces latencies of the more common FP ops. It even reminds me of the bridged FMA unit.

These latencies also clearly suggest, that this is no high clock frequency design. But at 14nm (or 16nm from TSMC as some rumours suggest) clocks of 3.5 to 4 GHz should be reachable without stretching the thermal limits too much.

This should be enough for now. Here is a schematic, which should come close to what Zen might really look like:

AMD Zen Core Microarchitecture (with some speculated parts)