Compute Ecosystem of AMD GPUs

1 : Anonymous2021/03/22 18:14 ID: mau4at

The following post is an anon rant about AMD's compute ecosystem:

I work at a cross-platform scientific software company that has invested significantly in developing products for GPU-accelerated computing. I joined just before the company adopted CUDA as its compute backend, more than 10 years ago.

At the time, NVIDIA was new to the game, but because of NVIDIA's willingness to respond to our feature requests, plus their technical help in developing code quickly, we had some semblance of a production-ready, minimalist CUDA backend within a year or so of starting.

Over the years, we were wary of becoming over-reliant on one GPU vendor, for both business and technical reasons, and so left room in our code in case we ever wanted to expand beyond CUDA. Every year, we would assign an engineer to look into alternatives. For many years, the most promising answer was OpenCL. We would try integrating OpenCL into our code base to run on both NVIDIA's and AMD's hardware, but either the libraries lacked features we needed (features that NVIDIA put into CUDA when we made feature requests), or they did not reliably pass our numerical tests. OpenCL on NVIDIA GPUs also yielded less performance than CUDA, so every year, the answer was "CUDA is the only viable HPC software ecosystem to invest in long-term".

CUDA has grown to become the standard for HPC, and this monopoly now shows in NVIDIA's behaviour with its CUDA libraries. Their monopoly lets them dictate the direction of their libraries (e.g. deprecations, new features, API breakages), and the scientific community must adapt to survive.

This year, like many before, we looked at AMD's available technologies, such as ROCm, HSA, and HIP/SYCL, but our development team voted for restraint because we thought the API was not stable enough for the scale of investment we wanted to make in a cross-platform product. AMD has changed software direction many times in the last few years, and that is a risk a company would need to absorb.

We asked AMD to provide us with a compute GPU and an official statement on what their recommended compute ecosystem is. What driver is recommended? What language should we write in? How do we write cross-platform code? AMD provided us with an RX 5700XT, so we tried ROCm with it and ... it didn't work. We later found out it was unsupported for use with ROCm. We asked them what use the compute GPU was if we couldn't run their first-party compute library on it, and it seems their vendor-relations employee was not as interested in compute as they were in gaming. We have yet to receive a decent answer from AMD on their compute stack, and until we do, our team cannot recommend that the company invest resources in AMD's ecosystem, even though customer requests for us to do so are plentiful.

It seems to me that while AMD's hardware may be decent, their compute software division is not reliable enough to bank long-term investment in. They have a confused approach to compute. Contrast this with another newcomer on the HPC GPU turf: Intel.

Last year, Intel approached us to ask if we had heard about oneAPI. We said we had, but hadn't looked at it in detail. Within a year of corresponding with Intel, we had a sample of our product up and running. They helped port our CPU code (which already made use of MKL) to oneMKL, and provided us with hardware and technical resources. The product is scheduled for release when Intel's HPC GPUs launch, but it goes to show the difference in how seriously each company takes its compute software division.

It hurts me that even after so many years of AMD producing compute GPUs, we are unable to develop code for them. By contrast, Intel arrived within the past year, sent an engineer to help us port code, and actively worked on our feature requests, resulting in us supporting their compute backend.

I just recently closed this year's examination of CUDA alternatives for the company, and with a heavy heart, I recommended not investing in AMD again for the next year.

2 : Anonymous2021/03/22 19:06 ID: grudoeu

I agree with this, and am in much the same situation myself. To me, there’s a very important metric for a compute ecosystem, much more important than number of compute units, memory size, and certainly FPS: Size and quality of documentation. It’s clear that NVIDIA has invested heavily, over many years, in a huge library of documentation and examples that make it easy for developers to get started with CUDA programming or learn new CUDA features. By contrast, AMD hardly has anything. Maybe their technology is nearly as feature-rich as CUDA, but I have no way of knowing because their documentation isn’t nearly as comprehensive.

It sounds like your company also has more time/resources than mine to invest in looking into new compute ecosystems, so we probably won't adopt ROCm or oneAPI until they are much more stable or better supported. Hopefully, with some new AMD-based supercomputers coming up, the U.S. Department of Energy will push AMD to improve their documentation, or at least add some themselves.

3 : Anonymous2021/03/22 21:23 ID: gruvnnf

It's laughable that they gave you a 5700XT when you asked for a compute-oriented GPU. That's ridiculous; the 5700XT is notorious for not having ROCm support.

Do you guys do HPC stuff, or stuff that needs to run on individual computers? Because if the former, there's now the MI100.

I've been pretty disappointed in the lack of solid compute support on consumer GPUs since RDNA1. Although it's always been insisted that support is coming at some point, it's clearly not a high priority.

ID: gruyjzf

We do both; think FEA, CFD stuff. The problem being that half our enterprise customer base is on Windows, where ROCm doesn't work.

ID: grwa6ax

Finally, two abbreviations that I understand. Otherwise I am always confused about what these discussions are about.

4 : Anonymous2021/03/23 01:31 ID: grvp9in

Yeah... My compute needs aren't quite as serious, but I feel crippled nevertheless with an AMD GPU.

I'd really like to be able to run DeepDream, neural-style, big-sleep, StyleGAN (with pre-trained networks), etc., but they are all built on CUDA or have dependencies that require CUDA.

Some people have said that you can run this stuff with ROCm, but I'm starting to feel that they're talking out of their asses; I have yet to see a working example. And it's dependency hell when ROCm needs one version of some libraries and the programs I want to run need another.

AMD needs to be able to run CUDA one way or another, I don't care how. At this point I don't even care about the performance, when the capability is missing completely.
I'm just so frustrated about this. I'm almost ready to buy an overpriced NVIDIA GPU... if there were any to buy anywhere, other than the 3k€ 3090s.

ID: grwih33

Well, apparently the last AMD GPU that was even good at compute and that you could just go and buy was the Radeon VII anyway. RDNA GPUs don't even get ROCm support, and it seems like their memory model makes them kind of bad at compute anyway.

ID: grwuffs

I don't think AMD is legally allowed to run CUDA, hence the lock in.

ID: grwxb9p

I believe so too, BUT there's this thing: European trade laws. If you are a market leader, certain rules apply to you. For example, you cannot have systems in place that block other parties from getting into your markets (which is what Nvidia is doing with CUDA). They'd have to license CUDA out, and for a price that doesn't go against the spirit of the law. They'd also have to help AMD add their cards to CUDA.

AMD has lawyers, yes? But going by the OP's story, I also believe AMD hasn't even bothered to try any legal avenues to deal with the situation; they've just been sitting on their hands.

ID: grxgo0g

ROCm's HIP is a mostly complete, source-compatible implementation of CUDA, implemented as a translation at compile time; it is not binary compatible, so you can't just run CUDA binaries out of the box. This was done to make it legally defensible: most likely there was some concern that headers would become copyrightable, and AMD probably wanted to avoid a battle.

Still you will run into things that it doesn't yet support.
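To illustrate what "translation at compile time" means here: AMD's hipify tools do (roughly) a source-level rename of the CUDA runtime API into HIP's. Below is a toy sketch of that idea in Python. The API names are real CUDA/HIP identifiers, but the mapping table is a tiny illustrative subset, and the real hipify-perl/hipify-clang tools handle far more than plain renames:

```python
# Toy sketch of the source-level renaming that hipify-style tools perform.
# (Simplified; the real tools also handle kernel launch syntax, headers, etc.)

CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
}

def hipify(source: str) -> str:
    """Rename CUDA runtime calls in a source string to their HIP equivalents."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source

cuda_snippet = "cudaMalloc(&d_x, n); cudaMemcpy(d_x, h_x, n, cudaMemcpyHostToDevice);"
print(hipify(cuda_snippet))
# → hipMalloc(&d_x, n); hipMemcpy(d_x, h_x, n, hipMemcpyHostToDevice);
```

The point being: because it's textual/source-level rather than binary, the same HIP source can target the CUDA runtime on NVIDIA or ROCm on AMD, but nothing already compiled for CUDA will just run.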

5 : Anonymous2021/03/23 01:30 ID: grvp6mh

I love reading posts from people who are smarter than me! Thank you for this rant. Very interesting.

6 : Anonymous2021/03/23 08:12 ID: grwo1n3

AMD provided us with an RX 5700XT, so we tried ROCm with it and ... it didn't work.

that's hilarious, seeing as even some AMD employees didn't realize ROCm isn't supported on consumer RDNA

Within a year of corresponding with Intel, we were able to have a sample of our product up and running.

that sounds interesting. Difficulty-wise, how hard is oneAPI to learn and adopt? Around 2018 I was thinking of learning ROCm for a university project but dropped the idea since the support for Vega 8 wasn't there and setting up ROCm is pretty tedious on my everyday machine. I don't plan to get an nvidia gpu anytime soon, but an intel laptop with an igp supporting oneAPI is pretty easy and cheap to get.

ID: grwpni4

There's a book they sent me directly to learn the language; I can link you to it:

Hope this helps you. But oneMKL (GPU BLAS, LAPACK) is very similar to MKL.

ID: grwtoq0

cool, thanks a lot!

7 : Anonymous2021/03/23 00:04 ID: grvfbe8

I had a similar experience with Intel's oneAPI. I'm not important enough to get free support from them, but their API at least simply "works" in most cases.

AMD's ROCm is unsupported on Windows to start with, still has a few limitations regarding supported technologies, does not have any bindings at all (I couldn't even find unofficial ones, but that may be a fluke), and isn't supported on my GPU (5700 XT) either.

HIP seems to be a good replacement for CUDA, since it uses CUDA under the hood for Nvidia GPUs. However, its relationship with ROCm is questionable (i.e. HIP uses ROCm on AMD GPUs, ergo it is not available for the 5700 XT), and support on Windows right now is "maybe". The compiler architecture also seems needlessly complex, and while it would add overhead, I'd love a vendor-agnostic solution in the form of a HIP JIT rather than a compiler.

ID: grwpeje

Some info from my side you may find useful: I tried hipSYCL on the 5700XT a couple months ago and I was getting segmentation faults when extracting device pointers to send to any ROCm library.

8 : Anonymous2021/03/22 18:41 ID: gruadqt

What segment of the hpc market are ya talking about, and when did you talk to amd? It looks like, atm, amd's focus is on hyperscalers and datacenter hpc, so recommending the 5700xt could mean that you're either not their focus market or their cdna work just wasn't ready when you talked to them

Amd had very little against nvidia in compute before cdna, and their focus looks to be starting small at the top to gain ground before expanding the market. Amd only hit big success last year with the zen2 processors; it'd be kinda unreal to expect thorough software support from them just 1-2 years after the major breakthrough. Their desktop-class architecture is more focused on gaming than compute atm

ID: grvdop6

recommending the 5700xt could mean that you're either not their focus market or their cdna work just ain't ready when you talked to them

The problem wasn't necessarily the GPU's performance; the problem was that the person recommending hardware at AMD did not realize RDNA1 didn't even have ROCm support, which is what OP was looking for. Working at a scientific software company would put him in the enterprise class AMD is targeting with Instinct, so he is their market. The problem is just that AMD's RTG division seems scattered, maybe due to all focus being on delivering Frontier on schedule, as its timeline and specs are pretty aggressive.

OP is also referring to GPU support, not zen 2 support. AMD's GCN was actually quite a bit better in throughput/$ than nvidia's GPUs, but software became the deciding factor that gave nvidia most of the market. It's been a long, long time since AMD had good compute GPUs, much longer than the success of zen 2, but the software has never matched the hardware.

ID: grwu7bp

OP is also referring to GPU support, not zen 2 support

I said zen2 only to emphasize where amd's priorities are (zen instead of rtg), not that zen and their compute gpu support are directly related. Resources for zen mean no resources for rtg, which was probably true before zen2's success

AMD's GCN was actually quite a bit better in throughput/$ than nvidia's GPUs, but software became the deciding factor that gave nvidia most of the market. It's been a long, long time since AMD had good compute GPUs, much longer than the success of zen 2, but the software has never matched the hardware.

Yea agreed, that's my point. Amd's never been ahead in software and ecosystem support compared to cuda

AMD's RTG division seems scattered

Afaik, enterprise and hpc accelerators ain't under rtg's responsibilities. The data center group works on accelerator support and sales. Op probably got to the wrong guy or was pointed in the wrong direction

ID: grxgvvm

OP was looking for cross-platform compute support... ROCm doesn't provide that. Which caused the engineer to default to the proprietary driver, which the 5700XT is supported by.

ID: grubu91

That's a big shift though; historically, amd/ATI has had a decent compute advantage. They did get competitive again in gaming this gen, but compute is a step backwards or a sidestep this gen

ID: grucwlf

That's either real far back or they didn't. If they had an advantage, cuda wouldn't be so common, would it? The gpu division was kinda dropped to secondary priority in the last 5 years; most resources went to cpu development

It made the gpu division fall off, but it was the right thing to do because it gave amd the success of zen. They had to choose one or the other, and it was zen for them. With success and more resources rn, amd's tryin to get back into the gpu game. The 1st steps are gonna be small, and time's needed for good software support. That ain't gonna happen in 1-2 years, so the place amd's in rn is kinda expected. A software ecosystem takes years to build, and to be real, they're gonna focus on the top hpc customers 1st before other markets

ID: grwawds

We are "datacenter HPC" and we don't consider AMD GPU* hardware; mostly because of reasons such as above. I don't pretend to know what AMDs strategy is for GPU compute, but it clearly doesn't include mid-tier HPC centers.

* edit: GPU only. We take all the AMD CPU nodes we can get.

ID: grwts0i

Probably ain't mid-tier; you can see the only things they're starting with are markets for very specific workloads and exascale, not mainstream at all. They're real selective about where they go; there are a few large cloud partners such as microsoft, and gigabyte offering mi100 servers. Amd probably knows that, according to their interviews. Don't think they're at the stage where they compete for mainstream compute customers. They're on the 1st step of getting back into the market after years of going missing

And tbf, idk why op got to the rep who pointed him to the 5700xt. Amd's hpc accelerators are under the control of its data center group (forrest norrod) instead of rtg. Probably some mistake there from the rep, and also odd enough that op didn't find anything wrong (?) and went with it all the way

ID: grux0y9

Enterprise HPC. CFD, FEA stuff. Equally divided between windows and linux. Sadly, there is no ROCm for windows.

ID: grvhhrh

No ROCm for Windows is a bummer. It would certainly help invigorate developers' interest in AMD for ML/DL applications, but AMD hasn't shown they're interested in this. Maybe in the far future, but as it stands and after many years of waiting, ROCm has remained on Linux. This makes sense of course, but Windows support would indeed be nice for interested consumers and developers.

9 : Anonymous2021/03/23 01:41 ID: grvqem1

Completely agree. I work in HPC and have been following GPU compute very closely for years and it’s a shame that no good competitor to CUDA has emerged in the last 10 years.

On the AMD side, it seems like there's a new API every few years: OpenCL, HIP, SYCL, and ROCm. I know these are inter-related, but it's confusing for developers, especially scientists doing HPC. CUDA is easy to set up and easy to use; there are lots of good libraries, good documentation, and support on all OSes on all their GPUs.

OpenCL has languished, and it's not as good to use (verbose code and lots of setup) in my experience. ROCm isn't well supported (hardware-wise or software-wise). Maybe SYCL will be able to do something now that Intel is pushing it through oneAPI.

10 : Anonymous2021/03/23 08:42 ID: grwpxpy

You cannot compare these companies. Nvidia and Intel are like a supermarket with assistance personnel close to each section of goods, while AMD is just a nice store with one cashier. It's already a miracle that AMD can do what they do performance-wise. But in stability and features, AMD is still nowhere near Intel or Nvidia, unfortunately.

11 : Anonymous2021/03/22 23:01 ID: grv7rg1

Whoever decided to send you that 5700XT was a total moron. It's not a compute card, RDNA is for gaming, and AMD has split support for ROCm, HIP, etc. off to their professional MI range. They should have sent you an MI50 as a bare minimum.

ID: grxfmzx

This is a bit near-sighted. ROCm doesn't work on Windows, which was a requirement they gave their guy; and if you want updates to OpenCL on Windows, you probably need a recent card. This leaves you using the proprietary driver on Windows and Linux to get OpenCL for compute... frankly, AMD just doesn't have it together; they need to make ROCm cross-platform ASAP.

So yes on paper to a marketing guy it does support compute. That support is quite shitty though.

12 : Anonymous2021/03/23 04:52 ID: grwa5kj

5700xt? now that's a Poor Volta moment.

13 : Anonymous2021/03/22 19:53 ID: grujx2o

Sounds about right...
It is a joke.

14 : Anonymous2021/03/22 21:30 ID: gruwisy

Why not use Vulkan as a compute library?

ID: gruyfb7

We depend on the breadth of BLAS & LAPACK for scientific compute, so we would also need an OpenBLAS or MKL written in Vulkan. Also, Vulkan compute's documentation is very complicated. Compare that to oneAPI's documentation, which was very easy for us as CUDA developers to pick up.
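To give a sense of that breadth: a minimal sketch in Python, using NumPy purely as an illustration, since NumPy dispatches these operations to whatever BLAS/LAPACK it was built against (e.g. OpenBLAS or MKL). A GPU backend has to supply an equivalent of each routine family (cuBLAS/cuSOLVER on CUDA, oneMKL on oneAPI), and pass numerical tests like the ones below:

```python
# Illustrative only: the routine families a scientific compute backend must
# cover. NumPy forwards these calls to its underlying BLAS/LAPACK.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
b = rng.standard_normal((4, 4))

c = a @ b                        # GEMM: dense matrix multiply (BLAS level 3)
x = np.linalg.solve(a, b[:, 0])  # LU factorize + solve (LAPACK getrf/getrs)
w = np.linalg.eigvalsh(a @ a.T)  # symmetric eigenvalues (LAPACK syevd family)

# The solve must satisfy a @ x = b[:, 0] to near machine precision --
# exactly the kind of numerical test the OpenCL libraries kept failing for us.
assert np.allclose(a @ x, b[:, 0])
assert np.all(w > -1e-10)        # a @ a.T is positive semi-definite
```

A Vulkan compute backend would need hand-written, validated kernels for each of these families before we could even start porting.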

15 : Anonymous2021/03/23 01:33 ID: grvpi8y

Seems to me the answer to this is "money" and having enough capable employees under contract (which also requires money). That's what starving a company for years does (guess which two anti-competitive companies...)

Intel has all the money they need and AMD would need to outsell them for years to come to even come close.

ID: gry5zkn

AMD has never shown any interest in being a "software company", and this shows again and again. Nvidia's DLSS optimization work came from their AI knowledge and cloud revenue. They used that income stream very well to push their gaming advantage.

Intel invested more in oneAPI because that is where they know the market is heading, potentially mixed x86/ARM environments. They plan ahead and want to be the "facilitator", even when not always being the one selling you hardware.

AMD seems to be spinning 24/7 with the products they have. No time to breathe. They should now have the money for more manpower, instead we got a rather "um, nice" 6000 gpu release. I hope the 7000 line shows a little bit of genius.

16 : Anonymous2021/03/23 04:03 ID: grw5sxt

Do not waste your time. If you require GPU compute and you’re targeting linux or windows, just go with NVIDIA.

CUDA has had a 10+ year head start, and if you require BLAS or LAPACK, you already have a highly optimized library ecosystem for NVIDIA. And AMD is not likely to change any of that in the near future; they simply don't have a good software strategy. It's their Achilles heel and why they won't overtake Intel in the data center or NVIDIA in the compute markets, even if AMD has better CPUs or their GPUs can theoretically perform more compute. Sadly, all they could do until recently was concentrate on the hardware to keep the lights on, since they were in rough financial shape.

ID: grxjl0x

Do not waste your time

Time is money and vice versa... by not "wasting their time", thousands of developers have supported vendor lock-in, which allows Nvidia to charge whatever they want.

ID: gry5e1d

By not wasting time waiting for AMD to have even a remotely decent competitor to CUDA, developers have been able to make money and keep their business open.

Neither AMD nor NVIDIA is a charity. NVIDIA understood the GPGPU market and has provided and supported a decent ecosystem for almost a decade and a half. It's not the developers' job to make AMD understand that. Over a decade in technology is basically a century; there's no point in wasting time with AMD when they themselves have pretty much conceded that market.

17 : Anonymous2021/03/23 22:18 ID: grzdi75

I never formally entered the ML space, but I do remember, almost a decade ago, being a student in an informatics lab where the lead scientist was a strong AMD enthusiast (mostly due to the superior Opterons back then). We were all eagerly watching the new developments in using GPUs for compute, and we were all waiting to hear about developments in AMD and OpenCL. Kept waiting and waiting... pretty sure everyone just gave up after a while. I looked into it again recently and was amazed to see that the compute landscape for AMD GPUs has seemingly not progressed at all since those days, almost a decade ago.

18 : Anonymous2021/03/22 20:31 ID: gruoyg7

AMD has changed software direction many times in the last few years, and that is a risk that a company would need to absorb.

ROCm has been around ~4-5 years, and AMD doesn't seem to be going to change/replace it. One of the focuses of ROCm has been compatibility with CUDA, which has recently paid dividends in that some popular machine learning projects that had previously been CUDA-only now finally support AMD.

With nvidia's grasp on ML broken, you may be interested to learn that nvidia has announced they'll jump their OpenCL support forward by more than a decade.

After more than a decade of not supporting OpenCL, nvidia has announced they'll support OpenCL 3.0...

Although it seems like compute is moving away from OpenCL, because of nvidia's refusal to support it, towards compute on Vulkan, which nvidia did support.

As I understand it (though I haven't followed it closely), like ROCm, oneAPI is aimed at doing what OpenCL would have done if nvidia had supported it: offering vendor-neutral support, i.e. oneAPI and ROCm support any vendor, including nvidia, amd, and intel. That is exactly what CUDA, which is what you've spent all your time supporting, doesn't do.


Ask your AMD rep about instinct cards? Although I think AMD has been improving 5700 support for ROCm? I'm not sure. Try to get them to get you a contact for ROCm?

I mean, it sounds like the rep isn't very good and AMD's outreach could be better, but it also kinda sounds like you want them to do all the work for you on free hardware they give you?

ID: gruwrcl

We don't need "free" AMD GPUs; we can afford them. What we can't afford is investing in technology that only works on one out of three AMD GPUs, on one out of two HPC OSes. Contrary to popular belief, a significant amount of HPC is done on Windows, something that Intel and NVIDIA understand. Yes, the top 100 supercomputers might not run Windows, but enterprise clients are worth more, and they work on Windows. And we are in regular contact with AMD, so getting hold of a rep isn't an issue either. I'm also interested to see other people's experiences, which it seems are similar to mine.

The 5700XT was the best consumer GPU at the time they recommended it. They weren't willing to work with us unless we were talking graphics. I use the word "willing" quite loosely: they didn't say they were uninterested, they just behaved in a way that comes across like that. Like ghosting for long intervals.

Recent releases of cuDNN are broken with bugs as per our testing, but cuDNN is the benchmark that our HPC clients measure against.

I just want to make it clear here that I am biased towards AMD and am just frustrated that it is so difficult to work with them, even when I am secretly on their side when I face my company.

ID: grv8f7w

AMD does say they are bringing ROCm support to RDNA2 along with RDNA1. RDNA1 got left in the lurch between CDNA and RDNA2.

And you can presumably do ROCm on windows with the linux subsystem which may be why AMD dropped windows support.

And I sincerely doubt that AMD has no HPC/ROCm representatives/support. Somebody has to support those supercomputer contracts at the very least, although they pay for their hardware and support. @bridgmanAMD

https://www.reddit.com//comments/jlmqsz/does_radeon_6000_series_have_support_fo/

ID: grvp630

I don't know where you got the idea that ROCm has "broken" nvidia's grasp on the ML market. NCNN, TensorFlow-DirectML, and, ironically, PlaidML from intel have had more success in expanding AMD's ML support, whereas ROCm alone only supports specific GPU architectures from AMD for now and doesn't work on Windows without something like PlaidML.

It would be great if ROCm did what you're suggesting, since AMD gpus have good compute performance and nvidia needs competition, but it doesn't, and AMD hasn't made a dent in nvidia's ML market. They're taking market share away from intel.

ID: grxdnxh

You don't understand what I'm saying.

I'm not saying AMD's Lisa Su did a Thanos snap and suddenly every cuda developer's codebase switched to ROCm, or that half of all cuda developers suddenly forgot all they knew about CUDA, gained years of ROCm experience, and started using ROCm or something.

What I meant was that AMD brought things like tensorflow and pytorch and so on to ROCm

That's what I meant.

My point was that AMD removed a big reason that nvidia had to not support OpenCL for a decade.

I didn't mean that suddenly billions of lines of code had changed and thousands or millions of people suddenly learned a new workflow. I meant that AMD, or intel if you prefer PlaidML over ROCm, removed the reason nvidia refused to support OpenCL for over a decade.

Now that people are abandoning OpenCL for Vulkan and things like that, and now that TensorFlow and PyTorch work on ROCm or PlaidML, nvidia doesn't see the need to boycott OpenCL.

It would be great if ROCm did what you're suggesting since AMD gpus have good compute performance

Well... doesn't it?

I mean, not the Thanos-snap thing. Just, you know, the thing you say you want; but when I tell you it's there, you're like "well, I would like it, but I want a miracle first".

ID: grur9lb

Maybe @AMDOfficial can connect you with a compute/ROCm representative?

Source: https://www.reddit.com/r/Amd/comments/mau4at/compute_ecosystem_of_amd_gpus/
