New AMD Patent Application Sheds Light on their Chiplet-GPU Implementation

1 : Anonymous2021/06/03 23:37 ID: nrr9gl

Here is a link to the patent app

When the first MGPU patent application was released in December of last year, and then updated in April, there were a few questions over how the chiplets would interact with each other, as well as with the CPU. There was also the question of whether or not there would be extra latency involved in this setup, and other questions such as how VRAM is handled.

But first of all, I want to make something abundantly clear that goes against what RGT, MLID, and a few other leakers are saying: Nowhere in this patent application, or in any other chiplet GPU patent application written by AMD, is there an IO Die required for chiplet GPUs. A lot of people may say 'well, patents aren't always what comes to market', but AMD is clearly taking a different approach, and all of the patent applications to date imply that the approach is to not use an IOD at all.

I also want to make it clear that, going against what Coreteks said in their latest 'AMD GPU chiplet' video, the active bridge chiplet is NOT on top of the dies. It sits underneath the GPU chiplets and is embedded in the substrate.

Now for some tasty (and long) bullet points:

Fig. 6 shows three dies in the configuration. And to me, it seems that a three-chiplet configuration is very likely. Not only because of this patent application, but also because 3 dies is the maximum that TSMC can do with their CoWoS-L (TSMC's version of EMIB) tech at the moment. In fact, 3-die CoWoS-L is entering mass production in Q2 2021, which is almost right on schedule, if not a bit early, to put it into use for Navi 31. It should be noted that up to 8x HBM2E can also be connected with the 3x logic dies via CoWoS-L, but I don't think it's likely that this will happen for gaming. Especially with the Infinity Cache, I doubt HBM2E will be used at all. I should also add that all patent applications point to using an active silicon bridge (EMIB / CoWoS-L) in their designs and not a full silicon interposer. (See paragraph [0021], "An active bridge chiplet couples the GPU chiplets to each other...includes an active silicon bridge that serves as a high-bandwidth die-to-die interconnect between GPU chiplet dies".) It is worth saying, though, that these patent applications don't specifically rule out more (or fewer) than 3 dies per package.

As was made clear in the first chiplet GPU patent app, the CPU only talks directly to one of the GPU chiplets, and the entire chiplet GPU appears as one singular GPU to the system. The first chiplet dispatches work across the other chiplets via the active bridge chiplet, which includes the L3 and also synchronizes the chiplets. (See paragraph [0022], "..the CPU is coupled to a single GPU chiplet through the bus, and the GPU chiplet is referred to as a 'primary' chiplet.")

VRAM access: Each chiplet has its own set of memory channels, as indicated in paragraph [0022]: "Subsequently, any inter-chiplet communications are routed through the active bridge chiplet as appropriate to access memory channels on other GPU chiplets". The question here: if the GPU chiplets have their own memory channels, are those channels exclusive to that chiplet, or are they shared somehow? This is semi-resolved in paragraph [0026]: "The memory-attached last level L3 cache that sits between the memory controller and the GPU chiplets avoids these issues by providing a consistent 'view' of the cache and memory to all attached cores". So where, physically, are the memory channels? A: They are on each chiplet. All memory access is controlled by the first chiplet, but the memory channels are attached to the L3. It should be noted that memory bandwidth scales directly with the number of chiplets on the package. So for example, if a 1-chiplet GPU was built to have a 128-bit memory bus, a 2-chiplet GPU would have 256-bit, a 3-chiplet GPU would have 384-bit, etc.
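
(For illustration, a minimal sketch of that scaling. The 128-bit bus per chiplet and the 16 Gbps GDDR6 pin rate are hypothetical example figures, not values from the patent application.)

```python
# Minimal sketch of the bus-width / bandwidth scaling described above.
# The 128-bit bus per chiplet and the 16 Gbps GDDR6 pin rate are hypothetical
# example figures, not values taken from the patent application.

BITS_PER_CHIPLET = 128      # hypothetical memory bus width per chiplet
GDDR6_GBPS_PER_PIN = 16     # hypothetical per-pin data rate (Gbps)

for chiplets in (1, 2, 3):
    bus_width = chiplets * BITS_PER_CHIPLET
    peak_bw = bus_width * GDDR6_GBPS_PER_PIN / 8   # GB/s
    print(f"{chiplets} chiplet(s): {bus_width}-bit bus, ~{peak_bw:.0f} GB/s peak")
```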

Checkerboarding: The end of paragraph [0023] states "...Mutually exclusive sets of adjacent pixels are processed by the different chiplets, which is referred to herein as 'checkerboarding' the pixels that are processed by the different GPU chiplets". This harkens back to the days of SFR (split frame rendering) vs. AFR (alternate frame rendering) when rendering over CrossFire / SLI. However, it seems these 'sets' of pixels will be much smaller, and not based on a simple line drawn across the screen. This should prevent screen tearing and other post-processing problems associated with SFR. Paragraph [0049], "Each pixel (/checkerboard) ... represents a work item that is processed in a graphics pipeline. Sets of pixels are grouped into tiles having dimensions of N pixels x M pixels. Mutually exclusive sets of the tiles are assigned to different process units", and, further on, [0050], "After the pixel work items are generated by rasterization, the processing units determine which subset two [sic] process based on a comparison of the unit identifier and the screen space location of the pixel.", seem to indicate that the delineation of work between the chiplets happens at the primitives level, and not at the screen-space level. This could potentially eliminate the problems that occur with post-processing effects while rendering in SFR mode, and would allow each chiplet to effectively work on different parts of the same frame.
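
(Here's a rough sketch of how that tile-to-chiplet assignment could work, based only on the quoted description of comparing a unit identifier against a tile's screen-space location. The 32x32 tile size and the simple interleaved mapping are my assumptions for illustration, not details from the patent.)

```python
# Rough sketch of 'checkerboarding': mutually exclusive N x M pixel tiles are
# assigned to chiplets by comparing each chiplet's unit identifier against the
# tile's screen-space position. The 32x32 tile size and the interleaved
# mapping below are assumptions for illustration, not taken from the patent.

TILE_W, TILE_H = 32, 32
NUM_CHIPLETS = 3

def owning_chiplet(pixel_x: int, pixel_y: int) -> int:
    """Return the unit identifier of the chiplet that processes this pixel."""
    tile_x = pixel_x // TILE_W
    tile_y = pixel_y // TILE_H
    # Interleave tiles across chiplets so adjacent tiles go to different units.
    return (tile_x + tile_y) % NUM_CHIPLETS

def chiplet_should_process(unit_id: int, pixel_x: int, pixel_y: int) -> bool:
    """After rasterization, each chiplet keeps only the pixels whose tile it owns."""
    return owning_chiplet(pixel_x, pixel_y) == unit_id

# Example: pixel (100, 40) falls in tile (3, 1) -> chiplet (3 + 1) % 3 == 1.
assert owning_chiplet(100, 40) == 1
```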

Chiplet Synchronization: From [0024], "...Command processors in the GPU chiplets detect synchronization points in the command stream and stop (or interrupt) processing of commands in the command buffer until all the GPU chiplets have completing [sic] processing commands prior to the synchronization point". Again, if a single chiplet is saturated, all of the chiplets have to wait for it to catch up. However, with checkerboarding, it's doubtful the workload will vary so greatly between chiplets that this will become an issue.
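
(A toy model of that behavior, purely to illustrate the semantics; the patent describes hardware command processors, not software threads.)

```python
# Toy model of the synchronization behavior quoted above: each chiplet's
# command processor walks its command buffer and stalls at a synchronization
# point until every chiplet has reached it. The threads and Barrier are purely
# illustrative of the semantics, not the actual hardware mechanism.

import threading

NUM_CHIPLETS = 3
sync_point = threading.Barrier(NUM_CHIPLETS)

def command_processor(unit_id: int, command_buffer: list) -> None:
    for cmd in command_buffer:
        if cmd == "SYNC":
            # Stop (or interrupt) processing until all chiplets have finished
            # the commands prior to this synchronization point.
            sync_point.wait()
        else:
            pass  # execute the command (draw, dispatch, copy, ...)

commands = ["draw", "draw", "SYNC", "post_process"]
threads = [threading.Thread(target=command_processor, args=(i, commands))
           for i in range(NUM_CHIPLETS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```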

Active Bridge Chiplet: From [0026], "...the active bridge chiplet includes a unified cache that is on a separate die than the GPU chiplets, and provides an external unified memory interface that communicable links two or more GPU chiplets together". The two key points here are that 1) the entire L3 cache sits on the active silicon bridge itself, and nowhere on the GPU chiplets, and 2) the memory channels are on each chiplet (as stated above) but are controlled only by the first chiplet.

Memory Controller: From Fig. 6, it's pretty clear that a memory controller exists on the first chiplet. However, this same memory controller would still physically exist on the other 'slave' chiplets, just disabled and unused. Anyone familiar with chip fabrication knows that making 2 different designs instead of one involves two costs: 1) the cost of making separate designs in the first place, a 2nd set of masks, etc., and 2) the cost associated with the loss of scalability of the design. So although the patent application doesn't explicitly mention that there won't be a separately-designed 'IO die', it is quite clear that each of the chiplets in this multi-chiplet GPU is identical to the 'primary chiplet', and that there will be some dark silicon on each of the 'slave' dies.

2 : Anonymous2021/06/04 06:53 ID: h0jfptu

Thank you for the nice analysis! It makes sense for each chiplet to have its own set of memory channels, for scaling. I think the active bridge doubling as L3 is brilliant. Definitely seems possible after their recent 3D cache reveal.

ID: h0jp29a

Simple and elegant. I really like that solution.

3 : Anonymous2021/06/04 17:25 ID: h0l93ms

It's not discussed here, but this has huge implications for APUs, doesn't it? If a GPU "chiplet" can handle memory requests from adjacent chips, what's to stop AMD from putting a Zen CCD onto the L3 bridge? CCDs already expect to hop to another die, so the latency would be manageable. Dual-channel DDR5 should be pushing north of 80 GB/s, and only improving from there. That's smaller than a 128-bit GDDR6 bus at 128 GB/s, but it's pretty close. You could use defective GPU chiplets for this APU and it wouldn't be too poorly matched. It would be highly efficient, since the CUs wouldn't need to run at full clocks to utilize the bandwidth... could be huge for mobile.
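
(As a rough back-of-the-envelope check on those numbers: a sketch only. The DDR5-5200 rate is an assumption, and the 8 Gbps GDDR6 rate is just what the 128 GB/s figure above implies; faster GDDR6 would widen the gap.)

```python
# Back-of-envelope peak bandwidth comparison for the figures in the comment
# above. Formula: peak GB/s = data rate (MT/s) * bus width (bits) / 8 / 1000.

def peak_gbps(mt_per_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s for a DRAM interface."""
    return mt_per_s * bus_width_bits / 8 / 1000

# Dual-channel DDR5 (128 bits total) at an assumed 5200 MT/s:
ddr5 = peak_gbps(5200, 128)          # ~83 GB/s -> "north of 80 GB/s"

# 128-bit GDDR6: the 128 GB/s figure corresponds to 8 Gbps pins;
# 14-16 Gbps parts would land closer to 224-256 GB/s.
gddr6_low = peak_gbps(8000, 128)     # 128 GB/s
gddr6_high = peak_gbps(16000, 128)   # 256 GB/s

print(f"Dual-channel DDR5: ~{ddr5:.0f} GB/s")
print(f"128-bit GDDR6:      {gddr6_low:.0f}-{gddr6_high:.0f} GB/s")
```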

I guess the big hurdle is whether AMD can design AM5 to somehow provide the proper pinout for both an IO die and a GPU "chiplet" to access memory...

Huge implications for console APUs though. Since those already run on gddr, we may have just seen the last monolithic console be released. Okay, maybe we'll see one more mid-cycle refresh...

4 : Anonymous2021/06/04 14:20 ID: h0kjaqp

A very good analysis, thank you OP.

P.S. I encourage you to keep this up with future patents. 🙂

5 : Anonymous2021/06/04 16:57 ID: h0l52jz

I wonder if the memory controller logic could be designed in such a way that it could be used for some other task, even if it wasn't super optimized for that task.

6 : Anonymous2021/06/04 19:31 ID: h0lqwc4

These so-called "leakers" are just shotgunning from their ass based on some "information" they got or made up.

7 : Anonymous2021/06/04 00:11 ID: h0i8125

Thank you for the write-up. Very interesting.

8 : Anonymous2021/06/04 00:01 ID: h0i6rlk

Active Bridge Chiplet: From [0026], "...the active bridge chiplet includes a unified cache that is on a separate die than the GPU chiplets, and provides an external unified memory interface that communicable links two or more GPU chiplets together". The two key points here are that 1) the entire L3 cache sits on the active silicon bridge itself, and nowhere on the GPU chiplets, and 2) the memory channels are on each chiplet (as stated above) but are controlled only by the first chiplet.

Uh, so how is the "Active Bridge Chiplet" not an IO die?

ID: h0i8vns

The active bridge doesn't actually have any IO on it

The chiplets each have their own memory PHYs. Active bridge has none.

The primary chiplet has the active memory controller. Active bridge doesn't.

The primary chiplet has the active SDF for talking to the CPU. Active bridge doesn't.

The active bridge's only uses are as a) an L3 cache and b) a communication link between the chiplets.

ID: h0iqfyh

I suppose this makes it easier to use a single die, without the active bridge and secondary GPU chiplet, for other products: say, an RX 7700XT with 80 CUs (single), while the 7800XT has 144 CUs (dual + bridge) and the 7900XT has 160 CUs (dual + bridge).

ROPs and L1/L2 are tightly coupled to the memory controllers, so it seems this hasn't changed. The secondary GPU chiplet should also have its own MCs and PHYs, which also means it has its own framebuffer. The 2 GPU chiplets are simply synced through the active bridge (much higher bandwidth than PCIe CrossFire) and can access each other's memory channels through its L3 cache and DMA. This would definitely work with checkerboarding, as each GPU can alternate tiles of work.

So: primary 256-bit GDDR6 memory bus -> bridge + L3 <- secondary 256-bit GDDR6 memory bus, or any variation thereof (192 + 192, for example). GDDR6 can be arranged in clamshell formation.

EDIT: There's also the possibility that the secondary chiplet simply shares the primary GPU die's framebuffer through the bridge + L3. This would cut down on the memory chips needed; the secondary GPU's memory PHYs would then be cut. The IF SDF would handle all of the data transfers to the bridge, as usual.

Navi 31 may also only be reporting 80 CUs because only one GPU die connects with the host CPU. The other may not be exposed to the host, as it connects only with the primary GPU die through the bridge chip. The secondary die could also power down when not needed outside of 3D or compute work.

I think Instinct will use the active interposer approach with HBM2e, as that makes sense when you're already using an interposer to link multiple dies.

ID: h0kj7rk

Probably the reason for this is the same reason V-Cache has double the cache density of Zen 3 in the same area... the active bridge is using a memory-density-optimized process, which would be bad at IO and logic.

That L3 is going to be HUGE... I won't be surprised if it's > or even much > 256MB.

ID: h0i9hmn

provides an external unified memory interface that communicable links two or more GPU chiplets together

... did you not read the bolded part of what I quoted?

A memory interface is literally IO, and communication links between chiplets are also IO. It does have IO; that seems to be all it's even doing: communication between the rest of the system and the other chiplets, AKA IO.

Subsequently, any inter-chiplet communications are routed through the active bridge chiplet as appropriate to access memory channels on other GPU chiplets

Another quote from your post showing it's doing IO. There is no direct communication (IO) between the GPU chiplets; it all goes through the Active Bridge Chiplet.

9 : Anonymous2021/06/04 03:38 ID: h0ixd6v

Impressive analysis, thanks. Could you ELI5 what the advantages of a chiplet design would be in terms of performance? I don’t see the benefit over a single larger die. Thx

ID: h0iygdr

In terms of straight performance, there isn't much difference between chiplet and monolithic. A monolithic chip would actually perform slightly better, and at lower cost, all else equal.

But chiplets allow lower cost via better yields (from smaller individual chip size), scalability of design via having only one or two GPU chip designs as opposed to 5 or 6 for an entire product stack, and scalability of the package surface area, as you can expand the total chip area beyond the reticle limit (a manufacturing limit on how large a monolithic chip can be made).
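
(To put rough numbers on the yield argument, here's a sketch using a simple Poisson defect model; the defect density and die areas are made-up example values, not TSMC or AMD figures.)

```python
import math

# Rough illustration of why smaller dies yield better, assuming a simple
# Poisson defect model: yield = exp(-defect_density * die_area).
# The defect density and die areas below are made-up example values.

DEFECT_DENSITY = 0.1   # defects per cm^2 (hypothetical)

def poisson_yield(die_area_cm2: float, d0: float = DEFECT_DENSITY) -> float:
    """Fraction of dies with zero defects under a Poisson model."""
    return math.exp(-d0 * die_area_cm2)

monolithic_area = 5.0                  # one big ~500 mm^2 die (hypothetical)
chiplet_area = monolithic_area / 2     # two ~250 mm^2 chiplets instead

print(f"Monolithic yield:  {poisson_yield(monolithic_area):.1%}")   # ~60.7%
print(f"Per-chiplet yield: {poisson_yield(chiplet_area):.1%}")      # ~77.9%
```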

ID: h0k336v

We also have to factor the V-Cache announcement into this. People will dismiss its usage for GPUs in the gaming space, but that would be a mistake. The value there, despite the potential cost uptick, exceeds the negatives of the cost by an order of magnitude.

ID: h0k3z6y

Thanks very much, guess I wasn’t too far off base then. I was thinking this is more interesting from a technology perspective than a consumer one as somehow I doubt any cost savings will be passed on to us 😉

ID: h0k0yfl

Well, it will make the performance cheaper to achieve, since two smaller dies have a better yield than one massive die.
So you can get a better-performing card for the same money; it's not really a direct advantage for performance.

ID: h0k39vo

Another huge advantage for AMD would be allowing them to take the high end unquestionably, since even when they had the advantage at the same die size, Nvidia remained the leader at the high end due to having much bigger chips.

10 : Anonymous2021/06/04 12:22 ID: h0k57ex

AMD basically went with my 2017 MCM design:

ID: h0knjea

Kind of, except you don't show that they have unified memory access via an external L3: once they start working on a task and it hits L3, they all have access to it at the same latency. This is a huge improvement compared to any chiplet design AMD has done so far, none of which have shared a cache like this to date.

Also, I think your design relied on an interposer, and I don't think they will use an interposer, but rather a longboi V-Cache on top of the 2-4 dies, plus dummy dies bonded on top of the rest to give them more rigidity.

Also, 96-bit GDDR per chiplet (the diagrams show 3 PHYs per chiplet, and a GDDR PHY is 32 bits wide), which is a bit of a letdown, but considering RDNA2 I bet it will work great.

In short I think it's gonna be a sick GPU 😀 and probably going to blow our minds when we find out what it actually is.

ID: h0m2sja

I didn't show it, but I am fairly certain I mentioned it in the conversation buried deep in my Reddit history :p Though I explored multiple memory models, so who knows.

11 : Anonymous2021/06/04 09:07 ID: h0jp3gf

lol I somehow missed that chiplet GPUs will be akin to Threadripper 2970/2990 with their weird memory access from previous patent. Hopefully not relying on the Windows scheduler will spare us all those problems Threadrippers had on Windows compared to Linux.

From the looks of it, it's the hardware Crossfire we all dreamed about.

Also I don't know how people can believe those "leaks" RGT, MLiD, Coreteks etc. post on a regular basis. They do have occasional real exclusive leaks in their videos, but most of the time it's just a bunch of BS they wrap around actual public info that surfaced the previous day. I mean, how many times do they have to end up being wrong before people stop trusting them?

ID: h0kgf2n

Hopefully not relying on the Windows scheduler will spare us all those problems Threadrippers had on Windows compared to Linux.

The Windows scheduler will have nothing to do with chiplet GPUs.

Also I don't know how people can believe those "leaks" RGT, MLiD, Coreteks etc. post on a regular basis. They do have occasional real exclusive leaks in their videos, but most of the time it's just a bunch of BS they wrap around actual public info that surfaced the previous day. I mean, how many times do they have to end up being wrong before people stop trusting them?

I should start selling bridges to Youtube subscribers.

ID: h0kgokz

The Windows scheduler will have nothing to do with chiplet GPUs.

Yes, that's why I said that. Should've probably worded it better.

ID: h0kilj9

lol I somehow missed that chiplet GPUs will be akin to Threadripper 2970/2990 with their weird memory access from previous patent.

I mean, those Threadrippers didn't share a unified cache though, I think? Just the main memory. And I think the links were on the substrate.

Part of how Ryzen has worked from the start is that every individual CPU core in the same CCX shares the same L3.

ID: h0kn12y

Actually it isn't like Threadripper, as Threadripper doesn't share L3 like this at all. Each chiplet has unified memory access! An L3 cache hit from any memory bank should appear the same to any chiplet.

12 : Anonymous2021/06/04 09:33 ID: h0jqtgq

While the checkerboard rendering of mutually exclusive pixels on independent chiplets makes total sense for traditional rendering, I think that ray tracing cannot be accelerated by this approach. Any clues from the patent as to how they are going to tackle this problem? Maybe dedicate one chiplet to RT while the other two calculate the "normal" rendering pipeline? That could cause scaling issues.

ID: h0kg3hp

RT can be handled within the pipeline itself

Source: https://www.reddit.com/r/Amd/comments/nrr9gl/new_amd_patent_application_sheds_light_on_their/
