Some (wishful thinking) analysis of AMD Monet

REDDIT GAME TRANSLATION | September 30, 2021 PS5

1 ： Anonymous：2021/09/28 17:47 ID: pxbqtc: A few months ago there was a leak about Monet, an entry level 12nm Zen 3 APU, which I found very interesting. The idea of using GlobalFoundries for a new modern low-end chip seemed right to me, and in general I like low-end stuff. Unfortunately there was no follow-up on that rumour, which left me to stew in my own speculation. Here's some of that.

To recap, RedGamingTech said that Monet is a very low power APU with four Zen 3 cores and "a couple of RDNA 2 WGPs", produced on GlobalFoundries' 12LP+ process.

The rest of the specs on the above linked VideoCardz page are speculations by Olrak, not part of the leak, so there seems to be no reason to consider these extras. The GPU specs are even inconsistent with RGT's leak. I'd interpret "a couple of WGPs" as two or more ("a couple" could mean either two or "a few"). Since each WGP contains two CUs, this would mean 4 or more CUs, not 2-4 CUs.

I'd like to discuss whether such a product makes sense, and then what possible specs such a chip might have.

The 12LP+ process

12LP+ is a small size shrink and a big power improvement over 12LP. AnandTech's writeup lists a 40% reduction in power and a 15% area saving, although GloFo's current documentation lists 10% for the reduction in size, which I'd take as more realistic. 12LP itself is a 15% size shrink over 14LPP.

More details of the power saving (and some other details) are available in this interesting article (PDF). Here's a similar article which overlaps to a large extent with the first one but offers some more details in places.

Note that some of the power saving requires using particular structures in particular ways. The process is designed to make AI processors more efficient, and some of the power savings won't apply that well to CPUs. However, the process itself is still more power-efficient than 12LP.

12nm benefits

Looking at some figures, it seems to me that designing and producing chips on 12LP+ will be cheaper than on any TSMC process. The up-front cost of 12nm (tools, tapeout) is much lower (see the AnandTech article for GloFo's claim of a 50% saving), and the per-wafer cost, if we believe this estimate, is also much lower. (The PDF article linked in the previous section explains why that is.)

In terms of density, AMD's 14LPP products (Polaris, Vega, Raven Ridge) all have a transistor density of around 24-25 Mtr/mm2. Products on TSMC's N7 vary a lot more, with Navi 10 having 41 Mtr/mm2, Navi 22 at 51 Mtr/mm2, Renoir at 63 Mtr/mm2 and Cezanne at 60 Mtr/mm2. (I'd have guessed that having more cache translates to higher density, except that Renoir isn't consistent with this.)

12LP+ is denser than 14LPP. It's worth noting that even though AMD used 12LP it never took advantage of the extra density to reduce chip size, instead using the same die layout (floorplan) as the 14LPP versions. With the extra density, it looks to me like a 12LP+ die would be at most twice the size of N7, even in the case of the APUs, which happen to be particularly dense.
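
A rough check of the "at most twice the size" guess, sketched in Python. It compounds the two quoted area shrinks (15% for 12LP over 14LPP, 10% for 12LP+ over 12LP) and compares the result with the N7 densities listed above. Two things here are assumptions rather than anything GloFo states: that the shrink figures multiply together, and that they apply to whole-die transistor density. The function and constant names are made up for illustration.

```python
# Densities quoted in the post, in Mtr/mm2.
DENSITY_14LPP = 24.5  # midpoint of the 24-25 range for AMD's 14LPP chips
N7_DENSITY = {"Navi 10": 41, "Navi 22": 51, "Renoir": 63, "Cezanne": 60}

# Area shrink claims quoted in the post.
SHRINK_12LP_VS_14LPP = 0.15
SHRINK_12LPP_VS_12LP = 0.10  # GloFo's documentation; AnandTech lists 0.15


def density_12lpp(shrink_12lpp=SHRINK_12LPP_VS_12LP):
    """Assumed 12LP+ density if both shrinks compound over 14LPP."""
    area_factor = (1 - SHRINK_12LP_VS_14LPP) * (1 - shrink_12lpp)
    return DENSITY_14LPP / area_factor


def area_ratio_vs_n7(shrink_12lpp=SHRINK_12LPP_VS_12LP):
    """How many times larger each N7 chip would be at the assumed 12LP+ density."""
    d = density_12lpp(shrink_12lpp)
    return {chip: n7 / d for chip, n7 in N7_DENSITY.items()}


if __name__ == "__main__":
    print(f"assumed 12LP+ density: {density_12lpp():.1f} Mtr/mm2")
    for chip, ratio in area_ratio_vs_n7().items():
        print(f"{chip}: {ratio:.2f}x the N7 area")
```

Under these assumptions every chip in the list stays below 2x, with the dense APUs (Renoir, Cezanne) closest to that limit and the GPUs well under it.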

With roughly 2x the die size but a wafer cost estimated to be less than half, 12LP+ looks like a win on cost per die. Couple that with half the design costs, and it's definitely a process to look at. The focus on lower power means that the only apparent drawback of this process is die size. Larger dies prevent such a process from being a good one to create more advanced chips on, but it would still be a good choice for smaller products.
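
The cost argument boils down to one multiplication, sketched below. The 0.45 wafer cost ratio is a hypothetical placeholder that only stands in for "more than 2x cheaper"; it is not a figure from the linked estimate. The sketch also ignores yield and wafer-edge losses, both of which penalise the larger die.

```python
def relative_die_cost(area_ratio, wafer_cost_ratio):
    """Cost of a 12LP+ die relative to its N7 equivalent.

    area_ratio: 12LP+ die area divided by N7 die area.
    wafer_cost_ratio: 12LP+ wafer cost divided by N7 wafer cost.
    Yield and wafer-edge losses are ignored (simplifying assumption).
    """
    return area_ratio * wafer_cost_ratio


def break_even_wafer_cost_ratio(area_ratio):
    """Wafer cost ratio below which the larger 12LP+ die is still cheaper."""
    return 1.0 / area_ratio


if __name__ == "__main__":
    AREA_RATIO = 2.0          # the post's rough upper bound
    WAFER_COST_RATIO = 0.45   # hypothetical placeholder, not a sourced number
    print(relative_die_cost(AREA_RATIO, WAFER_COST_RATIO))
    print(break_even_wafer_cost_ratio(AREA_RATIO))
```

With a 2x area penalty the break-even point is a wafer that costs exactly half as much, so any estimate below that favours 12LP+ before yield is taken into account.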

Another advantage of GloFo is that it produces chips in the US. For a US company like AMD this could be politically important and should help at least a little against Intel, which already has most manufacturing in the US. For clients in the US it could also have a logistic benefit.

Finally, AMD has a contract with GloFo to produce chips there. With Zen 4 being rumoured to use a 6nm I/O die, AMD could probably use another product there. So manufacturing at GloFo not only reduces production pressure at TSMC, it may also be necessary contractually.

The hole in AMD's lineup

AMD hasn't been servicing the low end market well. Apart from Zen/Zen+ APUs (Raven Ridge, Picasso, Dali) all of AMD's chips or chiplets have 8 cores. The percentage of dies with enough defects to not reach the 6 core level would be small. AMD still supplies 4 core chips to mobile and OEMs, and even had a limited supply of 3300X/3100 for DIY users, but natural binning would not be enough to satisfy demand.

Picasso and Dali are okay chips, but they have an old architecture which is slow compared to modern Ryzen, and battery life on them isn't great either (unless clocks are turned way down, like on Dali's 6W versions, which makes performance even lower). They are still acceptable, but AMD has moved forward enough with its designs that newer versions would be significantly better on all fronts, even if produced at GloFo, and especially if AMD takes full advantage of 12LP+ density (unlike what it did with 12LP) and power reduction.

This missing low end will become even more of an issue in the future, as for AM5 (and equivalent mobile designs) there are no low end chips like Picasso and Dali. This means that AMD needs to either rely on binned chips for the low end (and possibly cut down full chips if yields are good) or produce a new low end chip. 12LP+ makes some sense for such a chip.

AMD can choose, as it does now, to focus on high margin chips, but if it can produce a lower end chip without impacting production of higher end ones, that's a sure win. The low end market is big, and should both up AMD's bottom line and increase its visibility.

It's worth noting that AMD already has Van Gogh, the Steam Deck chip, which could be an acceptable entry level chip. However, it's behind on the CPU front, has a larger GPU than would normally be needed at the low end, and still takes TSMC production space.

tl;dr AMD could use a low end offering that's better than Picasso, and producing it at GloFo can end up cheaper while leaving TSMC to produce higher end products.

Monet potential specs

Let's assume that the leak is right and there are 4 Zen 3 cores and some RDNA 2 CUs. The leak didn't go beyond this, and I'll ignore Olrak's speculation.

As mentioned before, the most glaring gap for AMD would be in the upcoming (LP)DDR5 lineups. It would make some sense if Monet filled that gap. So far AMD's upcoming RDNA 2 APUs, Van Gogh and Rembrandt, both use (LP)DDR5. Monet as a DDR5 chip would fit that trend, and could become the low end counterpart of Rembrandt.

In terms of performance, RDNA 2 with 4 CUs (2 WGPs) should be competitive with the 3400G and 5300G (which perform about the same on the GPU front). This would be enough for an entry level product. However, 6 CUs may be a better fit if Monet is expected to be a low-end (to mid-range) counterpart to Rembrandt. That'd make it exactly half of Rembrandt (which is rumoured to have 8 cores and 12 CUs), and although that's certainly not strictly necessary (Dali has only 3 CUs compared to Picasso's 11), Monet will be placed differently, more along the lines of Picasso in the Ryzen 3000 lineup (or even higher priced).

Speaking of half of Rembrandt, one option would be for Monet to have only one channel of RAM. That's a classic way to cut some die size, which I'm sure will be favoured by OEMs (who seem to find dual-channel a nuisance). If two channels are enough for 8 cores and 12 CUs, a single one should be enough for 4 cores and 6 CUs.

The last spec worth speculating about is the cache size. 8MB L3 would be the normal size for a 4 core Zen 3, similar to the 5300G or 5400U. Cutting that to 4MB is likely to lose quite a bit of performance. Zen 3 with 4MB L3 might still be faster than Zen 2, but perhaps not by much. Still, it's an option for saving die space.

So what will be the spec? Wishful thinking would dictate 8MB L3 cache, 6 CUs and dual channel DDR5, but AMD could decide to cut corners and release Monet with 4MB L3, 4 CUs and single channel DDR5 and it would still be a viable product. Of course, Olrak's speculation of LPDDR4 isn't completely out of the question, it just seems to me to fit less well with the timeline and my guess of positioning.

tl;dr Being the optimistic sort, I'd go for 8MB L3 cache, 6 CUs and dual channel DDR5/LPDDR5 RAM.

Die size

It'd be interesting to try to estimate the size of such a chip. I think that AMD will try to get to something that's below Picasso's size (210 mm2). It's also possible that it will try to go to Dali's size (149 mm2), so it's worth seeing if the minimal spec above could fit into that.

For the purpose of this calculation I will estimate 12LP+ sizes as twice the size of N7 structures in current chips. This is a very rough estimate, but I don't have anything that's really much better.

Measured on the Raven Ridge die image (Picasso has the same layout), its CCX is about 41 mm2. Cezanne's half CCX (4 cores) is 25 mm2, making it 50 mm2 when doubled. Cezanne's L3 cache is about 36% of the CCX. Cutting the cache size from 8MB to 4MB will result in a saving of 4.5 mm2, or 9 mm2 in the double version, making four Zen 3 cores + 4MB L3 cache the same size as Picasso's Zen+ cores + cache.

As a sanity check I compared the size of Raven Ridge's 4MB cache, which is about 12 mm2, to the proposed 12LP+ 4MB cache, at 9 mm2. This seems like reasonable scaling. Note that, as described in the PDF article mentioned, 12LP+ has some slower but 25% denser SRAM. I don't know enough to know if this fact is relevant.

So far it looks like four Zen 3 cores + 4MB L3 cache take the same space as four Zen/Zen+ cores with 4MB L3 on Raven Ridge / Picasso. For comparison, Dali's 2 cores + 4MB cache take about 25 mm2.
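
The CPU-side arithmetic above can be reproduced with a short sketch. All inputs are the measurements quoted in the post; the 2x N7-to-12LP+ scaling is the post's own rough assumption, and the names are made up for illustration.

```python
SCALE_N7_TO_12LPP = 2.0       # rough area scaling assumed in the post
CEZANNE_HALF_CCX_MM2 = 25.0   # 4 Zen 3 cores + 8MB L3 on N7
L3_SHARE_OF_CCX = 0.36        # fraction of the Cezanne CCX taken by L3
RAVEN_RIDGE_CCX_MM2 = 41.0    # 4 Zen cores + 4MB L3 on 14LPP
RAVEN_RIDGE_L3_MM2 = 12.0     # Raven Ridge's 4MB L3 on 14LPP


def l3_area_12lpp(l3_mb):
    """Estimated 12LP+ area of l3_mb of Zen 3 L3, assuming area scales linearly with size."""
    l3_8mb_n7 = CEZANNE_HALF_CCX_MM2 * L3_SHARE_OF_CCX  # about 9 mm2
    return l3_8mb_n7 * (l3_mb / 8) * SCALE_N7_TO_12LPP


def zen3_quad_area_12lpp(l3_mb):
    """Estimated 12LP+ area of 4 Zen 3 cores with l3_mb of L3 (4 or 8)."""
    cores_only = CEZANNE_HALF_CCX_MM2 * (1 - L3_SHARE_OF_CCX) * SCALE_N7_TO_12LPP
    return cores_only + l3_area_12lpp(l3_mb)


if __name__ == "__main__":
    print(f"4 cores + 8MB: {zen3_quad_area_12lpp(8):.0f} mm2")
    print(f"4 cores + 4MB: {zen3_quad_area_12lpp(4):.0f} mm2")
    print(f"4MB L3 alone:  {l3_area_12lpp(4):.0f} mm2 vs {RAVEN_RIDGE_L3_MM2:.0f} mm2 on Raven Ridge")
```

The 4MB configuration comes out at 41 mm2, the same as the Raven Ridge CCX, and the 4MB L3 at 9 mm2 against Raven Ridge's 12 mm2.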

Let's take a look at the GPU part. I'm looking only at the CU/WGP parts, not other GPU parts. The 11 CUs in Raven Ridge take about 45 mm2. The 3 CUs in Dali take about 12 mm2, which is consistent. The 8 CUs in Renoir take a little under 12 mm2, so are about 2.7x as dense as the Raven Ridge ones. Based on some RDNA 2 die images, a WGP is 4.2-4.3 mm2, so 3 WGPs (6 CUs) will take just a tad more space than 8 Vega CUs when both are at 7nm.

With the 2.7x scaling, about 4 WGPs will fit in the 45 mm2 space of Raven Ridge's 11 CUs. That's not bad, as 4 WGPs should be faster than 11 Vega CUs. We're aiming at 2-3 anyway. At 2x scaling (which again, seems reasonable for 12LP+), 2 WGPs will take about 17 mm2, while 3 will take about 26 mm2.
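
The GPU numbers follow the same pattern. In this sketch the per-CU areas come from the die measurements quoted above, the WGP area is the midpoint of the 4.2-4.3 mm2 range, and Renoir's CU block is rounded up to 12 mm2 (the post says "a little under"). The names are illustrative only.

```python
RAVEN_RIDGE_CUS = 11
RAVEN_RIDGE_CU_AREA_MM2 = 45.0   # Vega CUs on 14LPP
RENOIR_CUS = 8
RENOIR_CU_AREA_MM2 = 12.0        # Vega CUs on N7, "a little under 12"
WGP_AREA_N7_MM2 = 4.25           # RDNA 2 WGP on N7, midpoint of 4.2-4.3


def vega_cu_shrink():
    """Per-CU area on 14LPP divided by per-CU area on N7."""
    per_cu_14lpp = RAVEN_RIDGE_CU_AREA_MM2 / RAVEN_RIDGE_CUS
    per_cu_n7 = RENOIR_CU_AREA_MM2 / RENOIR_CUS
    return per_cu_14lpp / per_cu_n7


def wgp_area(wgps, scale):
    """Area of a number of RDNA 2 WGPs after scaling up from N7 by `scale`."""
    return wgps * WGP_AREA_N7_MM2 * scale


def wgps_fitting(area_mm2, scale):
    """How many scaled-up WGPs fit in a given area."""
    return area_mm2 / (WGP_AREA_N7_MM2 * scale)


if __name__ == "__main__":
    shrink = vega_cu_shrink()
    print(f"Vega CU shrink 14LPP -> N7: {shrink:.1f}x")
    print(f"WGPs in 45 mm2 at that scaling: {wgps_fitting(45.0, shrink):.1f}")
    print(f"2 WGPs at 2x: {wgp_area(2, 2.0):.1f} mm2, 3 WGPs at 2x: {wgp_area(3, 2.0):.1f} mm2")
```

At the pessimistic 2.7x scaling just under 4 WGPs fit in Raven Ridge's CU area, and at 2x scaling 2 and 3 WGPs need 17 and 25.5 mm2.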

There are a lot of other parts to the APU. I will assume that most of them remain the same size, with slightly higher complexity offset by a slight increase in density. One part I see in the Raven Ridge image I'm looking at is marked "InfinityFabric" and is about 20 mm2. It's hard to say if this is just mislabeling. Renoir doesn't seem to have this, so either AMD has improved this, N7 wiring makes it much smaller (I remember reading that 12LP+ has wiring taken from GloFo's failed 7nm effort, though can't find the reference now), or this is mislabeled in the Raven Ridge image.

In fact, these other areas are most of the chip. Cores + CUs are only about 41% of Raven Ridge and about 37% in Renoir. Still, these are the areas that are easiest to estimate. I'd still guess that the other parts could be reduced, but by no more than 20 mm2. This is a rough guesstimate not based on much.

4 cores with 8MB L3 cache and 3 WGPs will take 76 mm2 on the supposed 2x scaling, compared to 86 mm2 (for 4 cores with 4MB L3 cache and 11 CUs) on Raven Ridge. With 4MB L3 cache and 2 WGPs the combined size will be 58 mm2. Swapping these into Raven Ridge's 210 mm2 die in place of its 86 mm2 of cores and CUs puts Monet at about 180-200 mm2. Cutting to one channel of RAM could cut another 10 mm2. If we assume the ability to save up to 20 mm2 elsewhere, that takes us down to 150-170 mm2 for the case with 4MB L3, 2 WGPs and single-channel RAM. The full chip (8MB, 3 WGPs, dual-channel) will take 180-200 mm2.
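
Putting the pieces together, here is a sketch of how the totals work out. It assumes the estimate is built by taking Raven Ridge's 210 mm2 die, removing its 86 mm2 of cores and CUs, and adding the 12LP+ estimates in their place; that reading reproduces the quoted ranges but is an interpretation of the arithmetic. The 10 mm2 and 20 mm2 savings are the post's own guesses, and the names are illustrative.

```python
RAVEN_RIDGE_DIE_MM2 = 210.0
RAVEN_RIDGE_CORES_AND_CUS_MM2 = 41.0 + 45.0   # CCX + 11 Vega CUs
SINGLE_CHANNEL_SAVING_MM2 = 10.0              # guess from the post
PICASSO_MM2, DALI_MM2 = 210.0, 149.0          # target die sizes


def monet_die_mm2(cpu_mm2, gpu_mm2, single_channel=False, other_savings_mm2=0.0):
    """Estimated Monet die size, keeping Raven Ridge's non-core area as the baseline."""
    uncore = RAVEN_RIDGE_DIE_MM2 - RAVEN_RIDGE_CORES_AND_CUS_MM2  # 124 mm2
    die = uncore + cpu_mm2 + gpu_mm2
    if single_channel:
        die -= SINGLE_CHANNEL_SAVING_MM2
    return die - other_savings_mm2


if __name__ == "__main__":
    # Full spec: 8MB L3 (50 mm2), 3 WGPs (26 mm2), dual-channel.
    print(monet_die_mm2(50, 26), monet_die_mm2(50, 26, other_savings_mm2=20))
    # Minimal spec: 4MB L3 (41 mm2), 2 WGPs (17 mm2), single-channel.
    print(monet_die_mm2(41, 17, single_channel=True),
          monet_die_mm2(41, 17, single_channel=True, other_savings_mm2=20))
```

The full spec lands at 180-200 mm2, below Picasso's 210 mm2, while the minimal spec lands at roughly 150-170 mm2 and only approaches Dali's 149 mm2 if the full 20 mm2 of extra savings materialises.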

tl;dr AMD should be able to create a "full spec" chip with 8MB L3 cache, 3 WGPs and dual-channel RAM at a smaller size than Picasso, and could possibly (though that's less guaranteed) go down to about the size of Dali while still keeping 4 cores and 2 WGPs.

What about Cyan Skillfish?

That's an odd product in AMD's lineup, an upcoming APU with RDNA 1 and an older display block. There's Linux code for it, but no further details than the graphics architecture. So you know, there's an upcoming strange APU and a rumoured code name. What if they're the same?

Well, I think it's easier to just not go into that speculation. RDNA 1 for an APU is strange by itself, as AMD skipped that for pretty much everything, going for RDNA 2 for the consoles, Van Gogh, Samsung's ARM chips and of course Rembrandt. A yet unreleased APU with RDNA 1 is strange when all these RDNA 2 designs are either around or will shortly be.

So let's just leave it at that. I'm of course curious, but I'll just have to wait. Van Gogh ended up a pleasant surprise in the form of Steam Deck.

Other 12LP+ products

Assuming that 12LP+ is indeed cheaper to design and produce on, what other products would make sense there?

It's not really a forward-looking node. AMD is already using smaller processes, so designing flagship products on 12nm doesn't seem right. The size disadvantage compared to future nodes like 5nm and 3nm will be just too big. Still, it seems reasonable for low end products and ones similar to what N7 has (or N6 will soon have).

A low end GPU is something that makes sense to me in particular. If AMD has already ported RDNA 2 to 12LP+ for Monet, it feels natural to extend this to a discrete GPU.

AMD has recently retired the old GPUs it's been regurgitating for mobile for years. While more powerful iGPUs make such GPUs a little less necessary, NVIDIA's MX family is still successful and Intel will have its own entry level discrete GPUs coming up soon.

A low cost, low power GPU built on 12LP+ may be just the thing for this market. The process also seems a better match for GPUs than for CPUs. The SRAM which is designed to work better for serial access from a cache and the new MAC units may be more relevant for a GPU. GPU clocks are lower than CPU clocks, which also helps match this process.

Something like an 8 WGP (16 CUs) GPU, similar to the rumoured Navi 24, may not be tiny on 12LP+, but it also won't be huge (should be smaller than Navi 23 on N7), and should be more cost efficient than an N7 version, while providing good enough performance to complement the CPUs of more entry level laptops.

Conclusion

I conclude that I'm spending way too much time on speculation that I could have spent on better things. But since I likely wouldn't have actually spent this time on better things, and I enjoyed writing this and learned a little, it's not that bad.

There's no actual conclusion about Monet here. It's just that I felt that finishing with a conclusion section is the standard thing to do.

I hope that at least a few people have enjoyed reading this.

Disclaimer: I'm not an ASIC designer, so this is all based on my layman knowledge and analysis. I welcome comments from those who are more knowledgeable on the subject.