Monday, November 23, 2015

AMD K12 looks to be at least a 4-wide design with SMT

An article about Zen and K12 by Yusuke Ohara gives a good overview of AMD's processor plans and includes a new and very interesting bit of information about AMD's high-performance ARM design. Machine translators still struggle to produce clearly understandable translations of Japanese texts, and trying several of them did not help. I therefore asked the author to make sure that I got it right. He confirmed that, according to ARM officials, who are aware of the work of their architectural licensees, AMD is using at least a 4-wide design for its K12 core.

Jim Keller already said that the smaller decoders for ARM instructions would leave room to add some performance-improving features compared to x86. He also mentioned "a bigger engine" than in Zen. Looking at the microarchitecture diagram, one might ask how AMD would utilize all this execution hardware, especially if there were even more units and maybe even more than four instructions fetched and decoded per cycle. Given the target market of servers and datacenters, the answer might include one important feature: SMT. Some have already speculated about that based on expectations, but there is also AMD patent application US20150121046, which mentions SMT and its application in an AArch64 design very clearly and with many implementation details. This can be seen as an indicator of work being done for real products.

If K12 is a 4-wide or even wider SMT design similar to Zen (which is "only" 4-wide), this would put some substance behind Keller's announcements, which suggested many similarities between both designs. This is supported by the fact that one of the inventors listed in the patents (Marius Evers) seemingly worked on both cores. Many other patents by him also cover both ARM and x86. He was also involved in one patent filed in 2007, which described a way to add SMT to the front end of a Bulldozer-like module. SMT is not only useful for utilizing a large number of execution units. It also helps to keep them busy during multi-cycle FP instructions, branch mispredictions, or cache misses.

Of course, there are more differences between those two architectures than the ISAs alone, but many typical CPU components are either ISA-agnostic and reusable or could be adapted with much less effort than creating them from scratch. If it was done this way, such a strategy would not only have permitted AMD to make efficient use of its limited R&D resources, but would also have created a chance to produce a powerful ARM core for servers at an acceptable overhead. This is like applying SMT to R&D.

Wednesday, November 18, 2015

AMD Hierofalcon/Seattle shown at ARM TechCon

AMD presented some boards at ARM TechCon and thanks to there are two videos covering that stuff.

One video shows Red Hat's Jon Masters' explanation of AMD's Huskyboard, where (even if only printed on cardboard) you can have a nice closeup view of the chip (video screenshot):

AMD Seattle closeup

The second video shows real hardware at work, including SoftIron Overdrive 3000, and the Huskyboard in 3D:

Thursday, October 15, 2015

AMD Zen and K12 (ARM) tapeouts confirmed by LinkedIn profile

According to a LinkedIn profile, both Zen and K12 should have been taped out already. So this can be taken as fact rather than speculation based on sparse information. Interestingly, the same person (you have to find him yourself, if you need to), who only talks about CPU cores, mentions working on 16nm and 14nm FinFET designs. So there will be one design made by TSMC and one by Globalfoundries: K12 by the former and Zen by the latter, I suppose. And here is the snippet:

AMD's ARM-based "Hierofalcon" SoC sighted

On the same OSADL site that once provided some first signs of life of a 2 GHz Jaguar-based APU, there is now an engineering sample of one of AMD's ARM-based embedded processors, called "Hierofalcon". The processor can be found in rack #a slot #3. According to the tables and logs, it also runs at 2 GHz and has 8 cores. If you haven't heard of that processor, just check these two slides. I actually included the first only because of the bird. ;)

I prepared some charts from the numbers given there. If you check the site, you'll find no directly linked latency chart. But their 1337 page allows comparing rack #a slot #3 to some other rack/slot and returns the missing latency chart of the "Hierofalcon", which looks like this (updated daily):
Latency plot of AMD "Hierofalcon" ES
The only available performance numbers are some daily updated Unixbench results. So I took them, combined rack names with CPU strings, sorted them, chose some CPUs for comparison, and normalized the results to the CPU in question.

The first chart already shows that, on a per-clock basis, AMD's other CPUs lag behind in the simple integer code of the old kDhrystone benchmark. The floating-point-based Whetstone benchmark paints a somewhat different picture, with more evenly distributed per-clock performance, except for the old K10-based Phenom II. The next three benchmarks, Execl (not Excel!), kCopy, and kPipe, test OS functions like spawning processes, copying files, or using pipes. The Index is a combined result.
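The per-clock normalization used for this chart can be sketched roughly as follows. This is only a minimal illustration: the function name, the CPU labels, and all numbers are placeholders I made up, not values from the OSADL tables.

```python
def normalize_per_clock(results, reference):
    """Return each CPU's score per GHz, relative to the reference CPU.

    results maps a CPU label to a (benchmark_score, clock_ghz) tuple.
    """
    ref_score, ref_ghz = results[reference]
    ref_per_ghz = ref_score / ref_ghz
    return {
        cpu: (score / ghz) / ref_per_ghz
        for cpu, (score, ghz) in results.items()
    }


# Placeholder values for illustration only (NOT measured data).
example_results = {
    "reference_cpu": (1000.0, 2.0),  # hypothetical score at 2 GHz
    "other_cpu": (1500.0, 4.0),      # hypothetical score at 4 GHz
}

relative = normalize_per_clock(example_results, "reference_cpu")
for cpu, value in relative.items():
    print(f"{cpu}: {value:.2f}")
```

The reference CPU always ends up at 1.0, so every other bar reads directly as "per-clock performance relative to the CPU in question".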

The next chart shows the raw performance of all cores, again normalized to Hierofalcon.

Even then, the 8 cores of the ARM-based processor have a good standing in the first two benchmarks, while in the OS benchmarks they roughly keep up with Kaveri and Bulldozer, both running at much higher clock speeds.

The ARM-based CPUs are meant to put many lower-power cores together. To get a first impression of that effect, I used the given TDP numbers as the only metric available for all CPUs. Here are the power efficiency numbers:

I think in this case, the Hierofalcon bars are really easy to spot, even though I used the max listed TDP of 30W. Only the already power optimized Sandy Bridge variants and the 9W Kabini are able to keep up in some of the tests. And of course, real power measurements would shift the numbers a bit.
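For reference, the efficiency metric is simply the benchmark score divided by the listed TDP, again normalized to one CPU. A minimal sketch with made-up placeholder numbers (the function name and values are assumptions for illustration, not the chart data):

```python
def efficiency_relative(scores, tdp_watts, reference):
    """Return score per watt of TDP, relative to the reference CPU."""
    per_watt = {cpu: scores[cpu] / tdp_watts[cpu] for cpu in scores}
    return {cpu: value / per_watt[reference] for cpu, value in per_watt.items()}


# Placeholder values for illustration only (NOT measured data).
example_scores = {"reference_cpu": 1000.0, "other_cpu": 2000.0}
example_tdp = {"reference_cpu": 30.0, "other_cpu": 120.0}

efficiency = efficiency_relative(example_scores, example_tdp, "reference_cpu")
for cpu, value in efficiency.items():
    print(f"{cpu}: {value:.2f}")
```

Since TDP is only an upper bound and not a measurement, the result should be read as a rough indicator.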

Of course, many (including me) would like to see more interesting benchmarks, but these are the first numbers we've got and they aren't bad at all.

Tuesday, October 6, 2015

How Many Days Until Zen?

Last week's headlines went back and forth about which foundry AMD will select to produce wafers containing Zen-based processors. After Jim Keller's departure, Mark Papermaster assured that Zen is on track for samples in 2016 and a full year of revenue in 2017. Before he pointed that out, there was a comment on a LinkedIn profile putting AMD's next-gen x86 desktop processor straight into 2017.

So there are several data points of a more or less official nature. Let me add another one, which is based on the GCC patch publication pattern. This assumes that there are work processes behind the patches and, of course, GCC-related deadlines for the inclusion of particular changes.

Days between GCC compiler patches and CPU launch of Bulldozer and Cat core family CPUs with speculation of Zen launch.

This chart shows the time delta in days between the publication of patches and the launch of a particular CPU containing a new core. For some launch dates only a month was given, so I took the last day of that month for the calculation.
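The calculation behind each bar can be sketched like this. The helper names and the two dates are arbitrary placeholders chosen for illustration, not the actual patch or launch dates used in the chart.

```python
import calendar
from datetime import date


def last_day_of_month(year, month):
    """Return the last calendar day of a month, used when only a launch month is known."""
    return date(year, month, calendar.monthrange(year, month)[1])


def days_between(patch_date, launch_date):
    """Return the number of days from patch publication to CPU launch."""
    return (launch_date - patch_date).days


# Arbitrary placeholder dates (NOT real patch or launch dates).
patch_published = date(2015, 1, 15)
launch = last_day_of_month(2016, 2)  # only the month is "known" here

delta_days = days_between(patch_published, launch)
print(launch, delta_days)
```

Taking the last day of the month gives the longest possible delta for such launches, so those bars are upper bounds.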

The Zen bars show the timeline in months starting with publication of the specific patch. With this at hand, anyone can draw their own conclusions. The scenario of first Zen based server or desktop CPUs hitting the markets in 4Q16 doesn't seem unlikely.

Saturday, October 3, 2015

AMD's Zen core (family 17h) to have ten pipelines per core

With writing about Zen I moved here since will close its service at the end of this year. That's it. Let's move on to the interesting stuff.

Whoever chose the name "Zen" for AMD's next-generation x86 core might have had the number four in mind, which plays an important role in this philosophy (e.g. Four Dharmadhātu). At least this is what a recent patch revealed about this long-awaited microarchitecture.

Andreas Stiller speculates that the term Zen as in "SuZen" might be related to Zen team leader Suzanne Plummer and possibly Lisa Su as well. An article on myStatesman, which appeared shortly after Jim Keller's departure, lists some more team member names if you magnify the photo:

Mike Clark, front left, and team leader Suzanne Plummer, and in background from left are Teja Singh, Lyndal Curry, Mike Tuuk, Farhan Rahman, Andy Halliday, Matt Crum, Mike Bates and Joshua Bell.

Mike Clark is a true AMD veteran, having been there since 1993. Some of the others developed the Cat cores, like Teja Singh and Joshua Bell, who presented the Jaguar microarchitecture at ISSCC 2013.

As heard earlier this year, Zen will use SMT and an improved cache subsystem while being designed from scratch, combining new ideas with reused existing components (to reduce the effort). This might even include existing and partly developed ideas that were not realized in previous designs. A lot of the new functionality has been filed for patenting. For example, there was a mention of checkpointing, which is good for quick recovery from mispredicted branches and from other events that require restarting the pipelines. Some patents suggest that Zen might use a slightly modified Excavator branch prediction.

And the new patch also suggests nicely low int/fp mul, fp add, int/fp div and fp square root latencies. Some of these lower latencies (div/sqrt) were introduced with Excavator, as an Aida64 instruction latency dump provided by Anandtech forum user monstercameron revealed. Due to an Aida problem with measured and reported clock frequencies (although it was fixed at 1.4GHz), you have to multiply the measured times by 1.4 to get the real number of cycles. Ok, back to Zen.

Here are some quotes of the patch file:

+;; Decoders unit has 4 decoders and all of them can decode fast path
+;; and vector type instructions.
+;; Integer unit 4 ALU pipes.
+;; 2 AGU pipes.
+;; Floating point unit 4 FP pipes.
+  32, /* size of l1 cache.  */
+  512, /* size of l2 cache.  */

  • 4 wide decoders
  • 4 integer ALUs
  • 2 AGUs (for 2R 1W L1 cache according to a LinkedIn profile)
  • 4 FP pipelines
That makes ten pipelines with a general four-wide design.
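As a quick cross-check, the pipe counts can be pulled straight out of the quoted patch comments. This is just a small sketch; the regular expression and the function name are my own and rely only on the wording of the lines quoted above.

```python
import re

# Comment lines as quoted from the patch file above.
patch_lines = [
    "+;; Decoders unit has 4 decoders and all of them can decode fast path",
    "+;; and vector type instructions.",
    "+;; Integer unit 4 ALU pipes.",
    "+;; 2 AGU pipes.",
    "+;; Floating point unit 4 FP pipes.",
]

# Matches phrases such as "4 ALU pipes" and captures the count and the unit name.
PIPE_PATTERN = re.compile(r"(\d+) (\w+) pipes")


def count_pipes(lines):
    """Return a dict mapping unit name to the number of pipes mentioned."""
    counts = {}
    for line in lines:
        match = PIPE_PATTERN.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


pipes = count_pipes(patch_lines)
total_pipes = sum(pipes.values())
print(pipes, total_pipes)
```

The decoders are deliberately not counted, as they are not execution pipes.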

There is a lot more information, which I will collect over the next few days. Some of it was copied from Excavator (bdver4) or Jaguar (btver2) and then modified. A careful comparison did show some clear differences, while in other places it's not clear whether there is new information or not (e.g. div latencies). But as btver2 has 2048 kB L2 and the rest of the block is more similar to bdver4 or btver2 than to btver1 (Bobcat), which has 512 kB L2, it looks like no btver1 files were used as a source. So I assume that this is a new entry for the L2 cache size, indicating fast per-core L2 caches. The L1 data cache still has the same size as that of Jaguar or Excavator. Some patents mention an 8-way 32 kB L1 D$.

Interestingly, as there are two 128b FP mul and two 128b FP add units (with only 3 cycles latency for these ops), the FMA instructions will be executed by combining one FP MUL and one FP ADD unit, resulting in 2 issues and 5 cycles latency (the same as in the Bulldozer family). This saves some register file ports, increases throughput, and reduces the latencies of the more common FP ops. It even reminds me of the bridged FMA unit.
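To get a feeling for what this means for peak throughput, here is a rough back-of-the-envelope sketch. It assumes that all four units are fully pipelined and can each start one 128b operation per cycle, and that an FMA occupies exactly one MUL and one ADD unit. These are my assumptions for illustration, not something the patch states.

```python
def peak_flops_per_cycle(mul_units=2, add_units=2, unit_bits=128, element_bits=32):
    """Return (separate, fused) peak FLOPs per cycle per core.

    separate: all MUL and ADD units busy with independent multiplies and adds.
    fused: FMAs only, each one pairing a MUL unit with an ADD unit.
    Assumes fully pipelined units with one issue per unit per cycle.
    """
    lanes = unit_bits // element_bits
    separate = (mul_units + add_units) * lanes
    fma_per_cycle = min(mul_units, add_units)
    fused = fma_per_cycle * lanes * 2  # each FMA counts as a multiply plus an add
    return separate, fused


single_precision = peak_flops_per_cycle(element_bits=32)
double_precision = peak_flops_per_cycle(element_bits=64)
print(single_precision, double_precision)
```

Under these assumptions the peak rate is the same with or without FMA, which fits the idea that this arrangement favors the more common separate mul and add ops.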

These latencies also clearly suggest that this is not a high-clock-frequency design. But at 14nm (or 16nm from TSMC, as some rumours suggest), clocks of 3.5 to 4 GHz should be reachable without stretching the thermal limits too much.

This should be enough for now. Here is a schematic, which should come close to what Zen might really look like:
AMD Zen Core Microarchitecture (with some speculated parts)