Monday, February 29, 2016

New AMD Zen core details emerged

Just one week after my last blog posting, which provided a hint at the maximum number of Zen cores supported per socket, a wave of news hit the web about details of Zen-based server processors given in a presentation by a CERN researcher. According to his CERN profile, he works in the institution's Platform Competence Centre (PCC) and manages the integration of predominantly prototype hardware. So it can be assumed that anything he says about server platforms might have been provided by representatives of the different processor and server OEMs. The 8 memory channels haven't been mentioned before in a leak or patch. And the 32-core number is not related to my posting, as the CERN talk was held on the 29th of January, while I published my posting (unaware of the talk) on the 1st of February, after first mentioning the patch back in December.

Now a new series of patches provides further information about the Zen core's IP blocks. They were posted on 16/02/16 to the Linux kernel mailing list by an AMD employee, after an earlier round of patches in January, which even mentions a "ZP" target, very likely the abbreviation for "Zeppelin". The more recent patches cover additions to AMD's implementation of a scalable Machine Check Architecture (MCA) and the handling of deferred errors. This is implemented in the Linux EDAC kernel module, which is responsible for hardware error detection and correction. The most interesting patch contains the following sections, with some details highlighted:

+/*
+ * Enumerating new IP types and HWID values
+ * in ScalableMCA enabled AMD processors
+ */
+#ifdef CONFIG_X86_MCE_AMD
+enum ip_types {
+ F17H_CORE = 0, /* Core errors */
+ DF,  /* Data Fabric */
+ UMC,  /* Unified Memory Controller */
+ FUSE,  /* FUSE subsystem */
+ PSP,  /* Platform Security Processor */
+ SMU,  /* System Management Unit */
+ N_IP_TYPES
+};

+enum core_mcatypes {
+ LS = 0,  /* Load Store */
+ IF,  /* Instruction Fetch */
+ L2_CACHE, /* L2 cache */
+ DE,  /* Decoder unit */
+ RES,  /* Reserved */
+ EX,  /* Execution unit */
+ FP,  /* Floating Point */
+ L3_CACHE /* L3 cache */
+};
+
+enum df_mcatypes {
+ CS = 0,  /* Coherent Slave */
+ PIE  /* Power management, Interrupts, etc */
+};
+#endif

The interconnect subsystem is called "Data Fabric", which, according to the last enumeration list, includes so-called coherent slaves. The name "FUSE subsystem" can be read as something like "Parameter block", as it just means a block managing the processor's configuration.

The second list of enumerations contains blocks found in the Zen core or close to it. I think the highlighted "RES" element might actually stand for a real IP block, as it doesn't make much sense to have it sitting amidst the other elements and not at the end. According to some other code in the patch, L2 cache errors are reported as core errors, while L3 cache errors get their own label. So the L2 cache is seen as part of the core, while the L3 cache is not (as expected):

+ case F17H_CORE:
+  pr_emerg(HW_ERR "%s Error: ",
+    (mca_type == L3_CACHE) ? "L3 Cache" : "F17h Core");
+  decode_f17hcore_errors(xec, mca_type);
+  break;
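
To make the effect of that ternary explicit, here is a minimal standalone sketch. The enum is copied from the patch, but the helper names `core_error_label()` and `reported_as_core()` are my own illustration and do not appear in the kernel code:

```c
#include <assert.h>

/* Copied from the quoted patch: MCA bank types within the core IP. */
enum core_mcatypes {
	LS = 0,		/* Load Store */
	IF,		/* Instruction Fetch */
	L2_CACHE,	/* L2 cache */
	DE,		/* Decoder unit */
	RES,		/* Reserved */
	EX,		/* Execution unit */
	FP,		/* Floating Point */
	L3_CACHE	/* L3 cache */
};

/*
 * Hypothetical helper (not in the patch): returns the label that the
 * quoted pr_emerg() call would print for a given core MCA type.
 */
static const char *core_error_label(unsigned int mca_type)
{
	assert(mca_type <= L3_CACHE);
	return (mca_type == L3_CACHE) ? "L3 Cache" : "F17h Core";
}

/* Hypothetical helper: true for every type that is labeled "F17h Core". */
static int reported_as_core(unsigned int mca_type)
{
	return mca_type != L3_CACHE;
}
```

With this, `core_error_label(L2_CACHE)` yields the core label, while only `L3_CACHE` is singled out, which is all the quoted hunk tells us about where the core boundary is drawn.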

Now let's go through some of the error string lists, beginning with those dedicated to the load/store unit:

+/* Scalable MCA error strings */
+
+static const char * const f17h_ls_mce_desc[] = {
+ "Load queue parity",
+ "Store queue parity",
+ "Miss address buffer payload parity",
+ "L1 TLB parity",
+ "",      /* reserved */
+ "DC tag error type 6",
+ "DC tag error type 1",

This is the first of many lists containing error strings, in this case for the load/store unit. Similar to the enumeration above, there is a reserved element, possibly hiding something, as this is a public mailing list. The strings I left out don't contain any surprises compared to the Bulldozer family. But overall I get the impression that AMD has significantly improved the RAS capabilities, which are very important for server processors. The following block contains error strings related to the instruction fetch block ("if"):

+static const char * const f17h_if_mce_desc[] = {
+ "microtag probe port parity error",
+ "IC microtag or full tag multi-hit error",
+ "IC full tag parity",
+ "IC data array parity",
+ "Decoupling queue phys addr parity error",
+ "L0 ITLB parity error",
+ "L1 ITLB parity error",
+ "L2 ITLB parity error",
+ "BPQ snoop parity on Thread 0",
+ "BPQ snoop parity on Thread 1",
+ "L1 BTB multi-match error",
+ "L2 BTB multi-match error",
+};

There is a new L0 ITLB, which is the only level 0 structure mentioned so far, while VR World mentioned level 0 caches (besides other somewhat strange rumoured facts like no L3 cache in the APU variant, although an L3 cache has been shown on the leaked Fudzilla slide). The only thing resembling such an L0 cache is a uOp cache, which is clearly named in the new patch in a section related to the decode/dispatch block (indicated by "de"):

+static const char * const f17h_de_mce_desc[] = {
+ "uop cache tag parity error",
+ "uop cache data parity error",
+ "Insn buffer parity error",
+ "Insn dispatch queue parity error",
+ "Fetch address FIFO parity",
+ "Patch RAM data parity",
+ "Patch RAM sequencer parity",
+ "uop buffer parity"
+};

There are strings for both a "uop cache" and a "uop buffer". So far I only knew about the uop buffer patent filed by AMD in 2012, which describes different related techniques aimed at saving power, e.g. when executing loops, or at keeping the buffer physically small by leaving immediate and displacement data of decoded instructions in an instruction byte buffer ("Insn buffer") sitting between instruction fetch and decode. The "uop cache" clearly seems to be a separate unit. Even without knowing how many uops per cycle can be provided by that cache, it will help to save power and remove an occasional fetch/decode bottleneck when running two threads. The next interesting block is about the execution units:

+static const char * const f17h_ex_mce_desc[] = {
+ "Watchdog timeout error",
+ "Phy register file parity",
+ "Flag register file parity",
+ "Immediate displacement register file parity",
+ "Address generator payload parity",
+ "EX payload parity",
+ "Checkpoint queue parity",
+ "Retire dispatch queue parity",
+};

Here is a first confirmation of a checkpoint mechanism. This has been described in several patents and might also be an enabler for hardware transactional memory, which was proposed in the form of ASF (Advanced Synchronization Facility) back in 2009. Another use case is quick recovery from branch mispredictions, where program flow can be redirected to a checkpoint created right before evaluating a difficult-to-predict branch condition.
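
To illustrate the general idea of checkpoint-based misprediction recovery, here is a toy model in C. It is a textbook-style sketch, not a description of what Zen actually does: the structure names, the queue depth and the use of a register rename map as the checkpointed state are all assumptions of mine.

```c
#include <assert.h>

#define NUM_ARCH_REGS	4	/* assumed toy size */
#define MAX_CHECKPOINTS	4	/* assumed queue depth */

/* Toy machine state: maps each architectural register to a physical one. */
struct rename_map {
	int phys[NUM_ARCH_REGS];
};

/* Checkpoints are kept in program order, oldest first. */
struct checkpoint_queue {
	struct rename_map saved[MAX_CHECKPOINTS];
	int count;
};

/*
 * Snapshot the current map right before a hard-to-predict branch.
 * Returns the slot index, or -1 if the queue is full.
 */
static int checkpoint_take(struct checkpoint_queue *q,
			   const struct rename_map *m)
{
	if (q->count == MAX_CHECKPOINTS)
		return -1;
	q->saved[q->count] = *m;
	return q->count++;
}

/*
 * On a misprediction, restore the snapshot taken at the branch and
 * drop it together with all younger checkpoints.
 */
static void checkpoint_restore(struct checkpoint_queue *q, int slot,
			       struct rename_map *m)
{
	assert(slot >= 0 && slot < q->count);
	*m = q->saved[slot];
	q->count = slot;
}
```

The point of the sketch is that recovery is a single copy from the saved snapshot instead of walking back through every speculatively executed instruction, which is why such a mechanism makes recovery quick.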

Let me continue with some random picks:

+ "L3 victim queue parity",
...
+ "Atomic request parity",
+ "ECC error on probe filter access",
...
+ "Error on GMI link",

There is a confirmation of the "GMI link" shown on an already leaked slide, which gave a bandwidth of 25 GB/s per link. The term "Data Fabric" was also used on that slide.
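
For readers who haven't looked at such decoder code before: string tables like the ones quoted above are presumably used as plain lookup tables indexed by the extended error code (the `xec` argument seen earlier). The following standalone sketch shows that pattern using the instruction fetch table from the patch. The helpers `lookup_desc()` and `if_error_desc()` and the fallback strings are hypothetical and only illustrate how reserved (empty) and out-of-range entries could be handled:

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Copied from the quoted patch. */
static const char * const f17h_if_mce_desc[] = {
	"microtag probe port parity error",
	"IC microtag or full tag multi-hit error",
	"IC full tag parity",
	"IC data array parity",
	"Decoupling queue phys addr parity error",
	"L0 ITLB parity error",
	"L1 ITLB parity error",
	"L2 ITLB parity error",
	"BPQ snoop parity on Thread 0",
	"BPQ snoop parity on Thread 1",
	"L1 BTB multi-match error",
	"L2 BTB multi-match error",
};

/*
 * Hypothetical helper: map an extended error code to a description.
 * Assumes xec is used directly as the table index.
 */
static const char *lookup_desc(const char * const *table, size_t n,
			       unsigned int xec)
{
	assert(table != NULL);
	if (xec >= n)
		return "Unknown error";		/* beyond the table */
	if (table[xec][0] == '\0')
		return "Reserved";		/* empty placeholder entry */
	return table[xec];
}

/* Hypothetical convenience wrapper for the instruction fetch table. */
static const char *if_error_desc(unsigned int xec)
{
	return lookup_desc(f17h_if_mce_desc, ARRAY_SIZE(f17h_if_mce_desc), xec);
}
```

Under that assumption, `if_error_desc(5)` returns "L0 ITLB parity error". It also shows why a reserved entry in the middle of a list has to stay in place: removing it would shift every following code by one.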

When reporting on the 32-core support, I wrote that some patents used the same wording. The term is actually "core processing complex" (CPC), which can contain multiple compute units (like Zen cores). So they are not the same. AMD patent filings using the term are US20150277521, US20150120978, and US20140331069.

Last but not least, I have updated the Zen core diagram based on this new information and some very likely related patents and papers:



Notable changes are:
  • uOp Cache has been added based on the new patch
  • FMUL/FADD for FMAC pairing removed, based on some corrections of the znver1 pipeline description.
  • 4x parallel Page Table Walkers added, based on US20150121046
  • 128b FP datapaths (also to/from the L1 D$) based on "direct" decode for 128b wide SIMD and "double" decode for 256b AVX/AVX2 instructions
  • 32kB L1 I$ has been mentioned in some patents. With enough ways, a fast L2$ and a uOp cache this should be enough, I think.
  • issue port descriptions and more data paths added
  • 2R1W and 4 cycle load-to-use latency added for the L1 D$ based on info found on a LinkedIn profile and the given cycle differences in the znver1 pipeline description
  • Stack Cache speculatively added based on patents and some interesting papers. This doesn't help so much with performance, but a lot with power efficiency.
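
As a side note on the 128b datapath item in the list above: a "double" decoded 256b instruction can be thought of as two 128b operations, one per half. The following C sketch only models that idea with a lane-wise integer add. The types and function names are made up for illustration and say nothing about how the hardware actually schedules the two halves:

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit value as four 32-bit lanes. */
struct vec128 {
	uint32_t lane[4];
};

/* One 256-bit value as a low and a high 128-bit half. */
struct vec256 {
	struct vec128 lo, hi;
};

static_assert(sizeof(struct vec256) == 2 * sizeof(struct vec128),
	      "a 256b value is exactly two 128b halves");

/* What a single 128b wide datapath can do in one operation. */
static struct vec128 add128(struct vec128 a, struct vec128 b)
{
	struct vec128 r;

	for (int i = 0; i < 4; i++)
		r.lane[i] = a.lane[i] + b.lane[i];
	return r;
}

/* A 256b add expressed as two 128b operations, as with "double" decode. */
static struct vec256 add256_as_two_ops(struct vec256 a, struct vec256 b)
{
	struct vec256 r;

	r.lo = add128(a.lo, b.lo);	/* first op: low half */
	r.hi = add128(a.hi, b.hi);	/* second op: high half */
	return r;
}
```

The result is the same as with a native 256b unit. The difference is that each AVX/AVX2 instruction occupies the 128b pipe twice.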

It's still an open question what the first mention of the fp3 port for FMAC operations was good for. I thought it was a typo, but more of the kind "fp3" instead of "fp2" in one case. It could still be related to register file port usage and/or bridged FMA, but that is probably not very useful for telling the compiler. Due to the correction patch I'm still looking further into the FPU topic, as promised earlier. I'll cover that in a follow-up posting.

Finally, there is a hint at good hardware prefetcher performance (or bad interference?), as AMD recommends switching off default software prefetching for the znver1 target in GCC.

BTW have you ever heard of a processor core having 2 front ends and one shared back end?

Update: An updated version of the aforementioned patches was posted on the same day as this blog entry. You can see it here. So far I haven't seen any significant additions other than cleanups and fixes.

5 comments:

Nintendo Maniac 64 said...

"16/02/16"

Hah, that's quite the date! I must admit though, it really did throw me for a loop for a moment, especially combined with the use of a two-digit year.

Though, if you use a 4-digit year (that being 2016/02/16) then you still end up with quite the funky date where it's still only using the numbers 0, 1, 2, and 6, and using each number exactly twice.

Nguyễn Viết Thành said...

https://patchwork.ozlabs.org/patch/599066/

(define_insn_reservation "znver1_sseimul_avx256" 4
(and (eq_attr "cpu" "znver1")
(and (eq_attr "mode" "OI")
(and (eq_attr "type" "sseimul")
(eq_attr "memory" "none"))))
"znver1-double,znver1-fp0*4")

(define_insn_reservation "znver1_sseimul_avx256_load" 8
(and (eq_attr "cpu" "znver1")
(and (eq_attr "mode" "OI")
(and (eq_attr "type" "sseimul")
(eq_attr "memory" "load"))))
"znver1-double,znver1-load,znver1-fp0*4")

(define_insn_reservation "znver1_sseimul_avx256_load" 11
(and (eq_attr "cpu" "znver1")
(and (eq_attr "mode" "OI")
(and (eq_attr "type" "sseimul")
(eq_attr "type" "sseimul"))))
"znver1-direct,znver1-fp0*3")

znver1_sseimul_avx256_load don't need load, from double to direct. L1 cache data: 7 clock?

Lo Absoluto said...

Just as i was saying, 128b high and 128b low operands at FPU.

Powerrush...
