Monday, February 29, 2016

New AMD Zen core details emerged

Just one week after my last blog posting, which provided a hint at the maximum number of Zen cores supported per socket, a wave of news about details of Zen based server processors, given in the presentation of a CERN researcher, hit the web. According to his CERN profile, he works in the institution's Platform Competence Centre (PCC) and manages the integration of predominantly prototype hardware. So it can be assumed that anything he says about server platforms might have been provided by representatives of the various processor and server OEMs. The 8 memory channels haven't been mentioned before in a leak or patch. And the 32 core number is not related to my posting, as the CERN talk was held on January 29th, while I published my posting (unaware of the talk) on February 1st, after first mentioning the patch back in December.

Now a new series of patches provides further information about the Zen core's IP blocks. They were posted on the Linux kernel mailing list on 16/02/16 by an AMD employee, after an earlier round of patches in January, which even mentions a "ZP" target, very likely an abbreviation for "Zeppelin". The more recent patches cover additions to AMD's implementation of a scalable Machine Check Architecture (MCA) and the handling of deferred errors. This is implemented in the Linux EDAC kernel module, which is responsible for hardware error detection and correction. The most interesting patch contains the following sections:

+/*
+ * Enumerating new IP types and HWID values
+ * in ScalableMCA enabled AMD processors
+ */
+#ifdef CONFIG_X86_MCE_AMD
+enum ip_types {
+ F17H_CORE = 0, /* Core errors */
+ DF,  /* Data Fabric */
+ UMC,  /* Unified Memory Controller */
+ FUSE,  /* FUSE subsystem */
+ PSP,  /* Platform Security Processor */
+ SMU,  /* System Management Unit */
+ N_IP_TYPES
+};

+enum core_mcatypes {
+ LS = 0,  /* Load Store */
+ IF,  /* Instruction Fetch */
+ L2_CACHE, /* L2 cache */
+ DE,  /* Decoder unit */
+ RES,  /* Reserved */
+ EX,  /* Execution unit */
+ FP,  /* Floating Point */
+ L3_CACHE /* L3 cache */
+};
+
+enum df_mcatypes {
+ CS = 0,  /* Coherent Slave */
+ PIE  /* Power management, Interrupts, etc */
+};
+#endif
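As far as I understand the Scalable MCA scheme, each MCA bank identifies its IP block via a HardwareID field in its MCA_IPID register, so the decoder needs a lookup from that ID to the first enumeration above. A minimal sketch of how this could look; the HWID constants below are made-up placeholders, not values taken from the patch:

/*
 * Sketch: map the HardwareID reported in an MCA bank's MCA_IPID
 * register to the ip_types enum above. The hwid values here are
 * placeholders - the real ones come from AMD's documentation and
 * the full patch.
 */
static const struct {
        u16 hwid;              /* HardwareID field (placeholder values) */
        enum ip_types type;    /* IP block it belongs to */
} hwid_map[] = {
        { 0xB0, F17H_CORE },
        { 0x2E, DF },
        { 0x96, UMC },
};

static enum ip_types lookup_ip_type(u16 hwid)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(hwid_map); i++)
                if (hwid_map[i].hwid == hwid)
                        return hwid_map[i].type;

        return N_IP_TYPES;     /* unknown IP block */
}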

The interconnect subsystem is called "Data Fabric", which, according to the last enumeration list, knows so-called coherent slaves. The "FUSE subsystem" might later be replaced by another name like "Parameter block", as it just means a block managing the processor's configuration.

The second list of enumerations contains the blocks found in the Zen core or close to it. I think the "RES" element might actually stand for a real IP block, as it doesn't make much sense to have it sitting amidst the other elements rather than at the end. According to some other code in the patch, the L2 cache is seen as part of the core, while the L3 cache is not (as expected):

+ case F17H_CORE:
+  pr_emerg(HW_ERR "%s Error: ",
+    (mca_type == L3_CACHE) ? "L3 Cache" : "F17h Core");
+  decode_f17hcore_errors(xec, mca_type);
+  break;
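decode_f17hcore_errors() itself isn't quoted above, but it presumably just selects the matching string table and indexes it with the extended error code (xec). A simplified sketch under that assumption, not the verbatim patch code:

/*
 * Sketch only: pick the error string table for the reported MCA type
 * and index it with the extended error code - reconstructed from the
 * quoted excerpts, not copied from the patch.
 */
static void decode_f17hcore_errors(u8 xec, unsigned int mca_type)
{
        const char * const *error_desc_array;
        size_t len;

        switch (mca_type) {
        case LS:
                error_desc_array = f17h_ls_mce_desc;
                len = ARRAY_SIZE(f17h_ls_mce_desc);
                break;
        case IF:
                error_desc_array = f17h_if_mce_desc;
                len = ARRAY_SIZE(f17h_if_mce_desc);
                break;
        /* ... L2_CACHE, DE, EX, FP and L3_CACHE handled alike ... */
        default:
                pr_emerg(HW_ERR "Corrupted F17h core error info\n");
                return;
        }

        if (xec < len)
                pr_cont("%s.\n", error_desc_array[xec]);
}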

Now let's go through some of the error string lists, beginning with those dedicated to the load/store unit:

+/* Scalable MCA error strings */
+
+static const char * const f17h_ls_mce_desc[] = {
+ "Load queue parity",
+ "Store queue parity",
+ "Miss address buffer payload parity",
+ "L1 TLB parity",
+ "",      /* reserved */
+ "DC tag error type 6",
+ "DC tag error type 1",

This is the first of many lists containing error strings. Similar to the enumeration above, there is a reserved element, possibly hiding something, as this is a public mailing list. The strings I left out don't contain any surprises compared to the Bulldozer family. But overall I get the impression that AMD significantly improved the RAS capabilities, which are very important for server processors. The following block contains error strings related to the instruction fetch block ("if"):

+static const char * const f17h_if_mce_desc[] = {
+ "microtag probe port parity error",
+ "IC microtag or full tag multi-hit error",
+ "IC full tag parity",
+ "IC data array parity",
+ "Decoupling queue phys addr parity error",
+ "L0 ITLB parity error",
+ "L1 ITLB parity error",
+ "L2 ITLB parity error",
+ "BPQ snoop parity on Thread 0",
+ "BPQ snoop parity on Thread 1",
+ "L1 BTB multi-match error",
+ "L2 BTB multi-match error",
+};

There is a new L0 ITLB, which is the only level 0 structure mentioned so far, while VR World mentioned level 0 caches (besides other somewhat strange rumoured facts, like no L3 cache in the APU variant, although this has been shown on the leaked Fudzilla slide). The only thing resembling such an L0 cache is a uOp cache, which has clearly been named in the new patch in a section related to the decode/dispatch block (indicated by "de"):

+static const char * const f17h_de_mce_desc[] = {
+ "uop cache tag parity error",
+ "uop cache data parity error",
+ "Insn buffer parity error",
+ "Insn dispatch queue parity error",
+ "Fetch address FIFO parity",
+ "Patch RAM data parity",
+ "Patch RAM sequencer parity",
+ "uop buffer parity"
+};

There are strings for both a "uop cache" and a "uop buffer". So far I knew about this uop buffer patent filed by AMD in 2012, which describes different related techniques aimed at saving power, e.g. when executing loops, or at keeping the buffer physically small by leaving immediate and displacement data of decoded instructions in an instruction byte buffer ("Insn buffer") sitting between instruction fetch and decode. The "uop cache" clearly seems to be a separate unit. Even without knowing how many uops per cycle can be provided by that cache, it will help to save power and remove an occasional fetch/decode bottleneck when running two threads. The next interesting block is about the execution units:

+static const char * const f17h_ex_mce_desc[] = {
+ "Watchdog timeout error",
+ "Phy register file parity",
+ "Flag register file parity",
+ "Immediate displacement register file parity",
+ "Address generator payload parity",
+ "EX payload parity",
+ "Checkpoint queue parity",
+ "Retire dispatch queue parity",
+};

Here is a first confirmation of a checkpoint mechanism. This has been described in several patents and might also be an enabler for hardware transactional memory, which AMD proposed in the form of ASF back in 2009. Another use case is the quick recovery from branch mispredictions, where program flow can be redirected to a checkpoint created right before evaluating a difficult-to-predict branch condition.

Let me continue with some random picks:

+ "L3 victim queue parity",
...
+ "Atomic request parity",
+ "ECC error on probe filter access",
...
+ "Error on GMI link",

There is a confirmation of the "GMI link" mentioned on an already leaked slide, which listed a bandwidth of 25 GB/s per link. The term "Data Fabric" has also been used on that slide.

When reporting about the 32 core support, I wrote that some patents used the same wording. Their term is actually "core processing complex" (CPC), which can contain multiple compute units (like Zen cores), so the two terms are not the same. AMD patent filings using the term are US20150277521, US20150120978, and US20140331069.

Last but not least, I have updated the Zen core diagram based on this new information and some very likely related patents and papers:

[Diagram: updated Zen core block diagram]

Notable changes are:
  • uOp Cache added, based on the new patch
  • FMUL/FADD for FMAC pairing removed, based on some corrections to the znver1 pipeline description
  • 4x parallel page table walkers added, based on US20150121046
  • 128b FP datapaths (also to/from the L1 D$), based on "direct" decode for 128b wide SIMD and "double" decode for 256b AVX/AVX2 instructions
  • 32kB L1 I$, as mentioned in some patents. With enough ways, a fast L2$, and a uOp cache, this should be enough, I think.
  • Issue port descriptions and more data paths added
  • 2R1W and 4 cycle load-to-use latency added for the L1 D$, based on info found on a LinkedIn profile and the given cycle differences in the znver1 pipeline description
  • Stack cache speculatively added, based on patents and some interesting papers. This doesn't help so much with performance, but a lot with power efficiency.
It's still interesting what the first mention of an fp3 port for FMAC operations was good for. I thought it was a typo, but more of the kind "fp3" instead of "fp2" in one case. It could still be related to register file port usage and/or bridged FMA, but is probably not that useful as a hint to the compiler. Due to the correction patch I'm still looking further into the FPU topic, as promised earlier. I'll cover that in a followup posting.

Finally, there is a hint at good hardware prefetcher performance (or bad interference?), as AMD recommends switching off default software prefetching for the znver1 target in GCC.
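To illustrate what that means in practice: with -fprefetch-loop-arrays (which some tunings enable at -O3), GCC inserts software prefetches into streaming loops like the sketch below; for -march=znver1 this would stay off by default. The loop is just a made-up example, and the tuning behaviour is my assumption based on the recommendation above, not a check of the GCC sources:

/*
 * Illustration only: a streaming loop of the kind where GCC's
 * -fprefetch-loop-arrays pass would insert software prefetches.
 * For -march=znver1 this reportedly stays off by default, leaving
 * the work to the hardware prefetchers.
 */
void scale(float *dst, const float *src, long n)
{
        long i;

        for (i = 0; i < n; i++)
                dst[i] = 2.0f * src[i];
}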

BTW have you ever heard of a processor core having 2 front ends and one shared back end?

Update: There is an update to the aforementioned patches, posted on the same day as this blog entry. You can see it here. So far I haven't seen any significant additions other than cleanups and fixes.

Monday, February 8, 2016

More AMD Bristol Ridge SKUs leaked

Kristian Gocnik (@I_biT_MySeLf) tipped me off about new mobile Bristol Ridge SKUs, which appeared on usb.org, as you can see [UPDATE: the entries have been removed now - visiting the pages may delete your only cached copy in the browser] here and here. That's the same site where the first Bristol Ridge SKU (FX-9830P) appeared. I put this together with information found in the leaked slide by Benchlife.info, which you can find in my blog post about a WEI result of an A10-9600P.

[Table with leaked mobile Bristol Ridge OPNs]

Using the mobile Carrizo SKUs, the leaked A10-9600P clock, and some sorting, it was easy to map the SKUs to the leaked slide's data. Kristian Gocnik tried it independently, and we got the same mapping, except for a consumer A8-9500P, which he speculatively derived from the pro model, but which is missing on usb.org. So the resulting table likely represents what AMD is going to release as mobile Bristol Ridge chips for the FP4 socket later this year.

The model numbers likely simply jumped by one thousand from Carrizo's, plus an additional thirty points for the 35W variants (so Carrizo's FX-8800P would, for example, become the FX-9800P at 15W and the FX-9830P at 35W). Carrizo's wide TDP ranges got split into 15W and 35W TDPs. This might help to avoid the confusion about 15W and 35W Carrizo laptops. The CPU base clocks jumped significantly, while the CPU Turbo and (maximum) GPU clocks kind of matured with the fab process.

A reason for the jump has been given by AMD at ISSCC 2016, as EE Times reported:
"For its part, AMD engineers showed smart ways of squeezing as much as 15% more performance out of its Carrizo PC processor, simply by applying more aggressive power management to the 28nm design. The Bristol Ridge design was a study in using power management to overcome performance limits tied to heat, voltage and current."
Months after the first leaked WEI score, the first true Bristol Ridge benchmarks will show how this improvement translates into real-world performance. Hopefully they get tested with dual channel memory, even if AMD or the OEMs only provide single channel equipped/designed devices, as was the case for the recent AnandTech Carrizo review.

BTW, there are lots of fresh Stoney Ridge Geekbench results in the Primate Labs' database.

Update: Of course, these are not OPNs, but SKUs. Added a warning as the linked usb.org entries are gone.

Monday, February 1, 2016

AMD Zeppelin CPU codename confirmed by patch and perhaps 32 cores per socket for Zen based MPUs, too

The Zeppelin codename, first mentioned on a leaked slide shown by Fudzilla, has been identified as a "family 17h model 00h" CPU by a patch on LKML.org. The interesting parts of the patch are:
AMD Zeppelin (Family 17h, Model 00h) introduces an instructions retired performance counter which is indicated by CPUID.8000_0008H:EBX[1]. And a dedicated Instructions Retired register (MSR 0xC000_000E9) increments once for every instruction retired.
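To put that into code: a small user-space sketch (mine, not from the patch) checking the CPUID bit and reading the counter via Linux's msr driver. Note that I read the quoted MSR address as 0xC00000E9, treating the extra digit in the patch text as a typo, and the counter may additionally need to be enabled before it counts:

/*
 * Sketch: detect and read the Zen instructions-retired counter.
 * Assumes Linux with the msr module loaded (modprobe msr) and root.
 * The MSR address 0xC00000E9 is my reading of the patch text.
 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;
        uint64_t count;
        int fd;

        /* CPUID.8000_0008H:EBX[1] flags IRPERF support */
        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx) ||
            !(ebx & (1u << 1))) {
                fprintf(stderr, "IRPERF not supported\n");
                return 1;
        }

        fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0 ||
            pread(fd, &count, sizeof(count), 0xC00000E9) != sizeof(count)) {
                perror("msr");
                return 1;
        }

        printf("instructions retired on cpu0: %llu\n",
               (unsigned long long)count);
        return 0;
}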

There might even be a meaning behind the similarity of parts of the "Zen" and "Zeppelin" codenames.

An older patch on the same mailing list also gives a little more info about Zen:

On AMD Fam17h systems, the last level cache is not resident in the Northbridge. Therefore, we cannot assign cpu_llc_id to the same value as the Node ID (as we have been doing currently).
We should rather look at the ApicID bits of the core to provide us the last level cache ID info. Doing that here.
The most interesting part describes how the last level cache (LLC) ID is calculated for Zen based MPUs:

+ core_complex_id = (apicid & ((1 << c->x86_coreid_bits) - 1)) >> 3;
+ per_cpu(cpu_llc_id, cpu) = (socket_id << 3) | core_complex_id;

"Core complex" should be similar to "compute unit" and has been used in some AMD patents already. The expression marked in red means a shift right by 3, which equals a division by 8. So with two logical cores per physical core due to SMT, a core complex should contain four Zen cores and a shared LLC.


The next line shows the socket ID being shifted left by 3, leaving 3 bits for the core complex ID, which suggests a maximum of eight core complexes per socket, or 32 physical cores. This number should for now be seen as a placeholder, but we've already seen rumours mentioning that many cores.
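To make the bit arithmetic concrete, here is a small stand-alone example. The value x86_coreid_bits = 6 is my assumption for a full 32 core part (64 logical CPUs per socket: 8 core complexes x 4 cores x 2 SMT threads), and the APIC ID layout is simplified:

/*
 * Worked example of the LLC ID calculation from the patch, outside
 * the kernel. With x86_coreid_bits = 6 (assumed), every 8 consecutive
 * APIC IDs (4 cores x 2 threads) share one core complex and thus one
 * last level cache.
 */
#include <stdio.h>

int main(void)
{
        const unsigned int x86_coreid_bits = 6;     /* assumed */
        const unsigned int socket_id = 1;           /* example socket */
        unsigned int apicid;

        for (apicid = 0; apicid < 64; apicid += 8) {
                unsigned int core_complex_id =
                        (apicid & ((1u << x86_coreid_bits) - 1)) >> 3;
                unsigned int llc_id = (socket_id << 3) | core_complex_id;

                printf("apicid %2u -> core complex %u, llc_id %u\n",
                       apicid, core_complex_id, llc_id);
        }
        return 0;
}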