lukev 3 hours ago

So to make sure I understand, this would mean:

1. Programs built against MLX -> Can take advantage of CUDA-enabled chips

but not:

2. CUDA programs -> Can now run on Apple Silicon.

Because #2 would be a copyright violation (specifically with respect to NVidia's famous moat).

Is this correct?

  • quitit 3 hours ago

    It's 1.

    It means that a developer can use their relatively low-powered Apple device (with UMA) to develop for deployment on nvidia's relatively high-powered systems.

    That's nice to have for a range of reasons.

  • tho234j2344 17 minutes ago

    I don't think #2 is really true - AMD's HIP is doing this exact thing, after they gave up on OpenCL way back in ~'17/'18.

    • NekkoDroid 6 minutes ago

      I haven't looked into it, but doesn't HIP need everything to be recompiled against it? To my understanding it was effectively mostly source-to-source translation.
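
      To make that concrete, here's a rough sketch of my own (illustrative only, not from the thread): the hipify tools mostly rename CUDA runtime calls one-for-one, and you then recompile the result with hipcc for AMD.

        #include <cuda_runtime.h>

        __global__ void scale(float* x, float a, int n) {  // kernel syntax is unchanged under HIP
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= a;
        }

        int main() {
            const int n = 1 << 20;
            float* d_x = nullptr;
            cudaMalloc(&d_x, n * sizeof(float));            // hipify: hipMalloc
            scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);  // hipify: hipLaunchKernelGGL (or <<<>>> with hipcc)
            cudaDeviceSynchronize();                        // hipify: hipDeviceSynchronize
            cudaFree(d_x);                                  // hipify: hipFree
            return 0;
        }

      So it's a source-level port plus a recompile, not binary compatibility with existing CUDA builds.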

  • tekawade 2 hours ago

    I want #3: being able to connect an NVIDIA GPU to Apple Silicon and run CUDA. Take advantage of Apple Silicon + unified memory + GPU + CUDA with PyTorch, JAX or TensorFlow.

    Haven’t really explored MLX so can’t speak about it.

  • saagarjha 3 hours ago

    No, it's because doing 2 would be substantially harder.

    • lukev 3 hours ago

      There's a massive financial incentive (billions) to allow existing CUDA code to run on non-NVidia hardware. Not saying it's easy, but is implementation difficulty really the blocker?

      • fooker 6 minutes ago

        Existing high-performance CUDA code is almost all first-party libraries written by NVIDIA, and it uses weird internal flags and inline PTX.

        You can get 90% of the way there with a small team of compiler devs. The remaining 10% would take hundreds of people working ten years. The cost of this is suspiciously close to the billions in financial incentive you mentioned; funny how efficient markets work.
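
        To give a flavor of the inline PTX part, here's a tiny illustrative snippet of my own (not taken from any NVIDIA library): a translator has to understand this at the PTX/ISA level, not just at the CUDA C++ API level.

          // Read the warp lane ID via inline PTX (illustrative; tied to NVIDIA's ISA).
          __device__ unsigned lane_id() {
              unsigned lane;
              asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
              return lane;
          }

        Multiply that by thousands of hand-tuned kernels in cuBLAS/cuDNN-style libraries and you get the long tail described above.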

      • lmm 2 hours ago

        I think it's ultimately a project management problem, like all hard problems. Yes, it's a task that needs skilled programmers, but if an entity were willing to pay what programmers of that caliber cost and give them the conditions to succeed, they could get it done.

    • hangonhn 3 hours ago

      Is CUDA tied so closely to Nvidia hardware and architecture that its abstractions would not make sense on other platforms? I know very little about hardware and low-level software.

      Thanks

      • dagmx 2 hours ago

        As an API, CUDA isn’t really that hyper-specific to NVIDIA hardware.

        But a lot of the most useful libraries are closed source and available on NVIDIA hardware only.

        You could probably get most open-source CUDA to run on other vendors' hardware without crazy work. But you'd spend a ton more work getting to parity on the ecosystem, plus lawyer fees when NVIDIA comes at you.

      • saagarjha 3 hours ago

        Yes, also it's a moving target where people don't just want compatibility but also good performance.

  • dagmx 2 hours ago

    #2 would also further cement CUDA as the de facto API to target, and nobody would write MLX-targeted code instead.

    This way, you’re more incentivized to write MLX and have it run everywhere. It’s a situation where everyone wins, especially Apple, because they can optimize it further for their platforms.

  • ls612 3 hours ago

    #2 would be Google v. Oracle, wouldn’t it?

neurostimulant 23 minutes ago

> Being able to write/test code locally on a Mac and then deploy to super computers would make a good developer experience.

Does this mean you can use MLX on Linux now?

Edit:

Just tested it and it's working, but only the Python 3.12 version is available on PyPI right now: https://pypi.org/project/mlx-cuda/#files

zdw 7 hours ago

How does this work when one of the key features of MLX is using a unified memory architecture? (see bullets on repo readme: https://github.com/ml-explore/mlx )

I would think that bringing that to all UMA APUs (of any vendor) would be interesting, but discrete GPUs would definitely need a different approach?

edit: reading the PR comments, it appears that CUDA supports a UMA API directly, and will transparently copy as needed.

  • freeone3000 3 hours ago

    Eh, yes, but in my experience its lack of prefetching leads to significant memory stalls waiting for the copy. It might be suitable if your entire dataset fits in VRAM after doing a “manual prefetch”, but it killed performance for my application (ML training) so hard that we actually got time to move to streaming loads.
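
    For anyone wondering what "manual prefetch" looks like, a minimal sketch of my own (assuming CUDA managed memory; not the actual training code): you hint the migration with cudaMemPrefetchAsync before launching work, instead of letting each page fault in on first GPU touch.

      #include <cuda_runtime.h>

      int main() {
          const size_t n = 1 << 24;
          float* data = nullptr;
          cudaMallocManaged(&data, n * sizeof(float));    // unified (managed) allocation

          for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // touched on the CPU first

          int device = 0;
          cudaGetDevice(&device);
          // "Manual prefetch": migrate the pages to the GPU up front,
          // instead of taking a page fault per page inside the kernel.
          cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

          // ... launch kernels that read `data` here ...

          cudaDeviceSynchronize();
          cudaFree(data);
          return 0;
      }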

dnchdnd 2 hours ago

Random aside: A lot of the people working on MLX don't seem to be officially affiliated with Apple, at least on a superficial review. See for example: https://x.com/prince_canuma

Idly wondering: is Apple bankrolling this but wanting to keep it on the DL? There were also rumours the team was looking to move at one point?

numpad0 4 hours ago

> This PR is an ongoing effort to add a CUDA backend to MLX

looks like it allows MLX code to compile and run on x86 + GeForce hardware, not the other way around.

Abishek_Muthian 2 hours ago

I’ve been very impressed with MLX models; I can open up local models to everyone in the house, something I wouldn’t dare with my Nvidia computer for fear of burning down the house.

I’ve been hoping Apple Silicon becomes a serious contender for Nvidia chips; I wonder if the CUDA support is just Embrace, extend, and extinguish (EEE).

benreesman 5 hours ago

I wonder how much this is a result of Strix Halo. I had a fairly standard stipend for a work computer that I didn't end up using for a while, so I recently cashed it in on the EVO-X2 and, fuck me sideways, that thing is easily competitive with the mid-range znver5 EPYC machines I run substitors on. It mops the floor with any mere-mortal EC2 or GCE instance; maybe some r1337.xxxxlarge.metal.metal or something has an edge, but it blows away the z1d.metal and the c6.2xlarge or whatever type stuff (fast cores, good NIC, table stakes). And those things are 3-10K a month with heavy provisioned IOPS. This thing has real NVMe and it cost 1800.

I haven't done much local inference on it, but various YouTubers are starting to call the DGX Spark overkill / overpriced next to Strix Halo. The catch of course is that ROCm isn't there yet (they seem serious now though; matter of time).

Flawless CUDA on Apple gear would make it really tempting in a way that isn't true with Strix so cheap and good.

  • hamandcheese 5 hours ago

    For the uninitiated, Strix Halo is the same as the AMD Ryzen AI Max+ 395 which will be in the Framework Desktop and is starting to show up in some mini PCs as well.

    The memory bandwidth on that thing is 200GB/s. That's great compared to most other consumer-level x86 platforms, but quite far off of an Nvidia GPU (a 5090 has 1792GB/s, dunno about the pro level cards) or even Apple's best (M3 Ultra has 800GB/s).

    It certainly seems like a great value. But for memory bandwidth intensive applications like LLMs, it is just barely entering the realm of "good enough".
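
    Rough back-of-the-envelope (my numbers, just to put the bandwidth figures in context): token generation streams essentially all the weights once per token, so a ~35GB 4-bit 70B model caps out at roughly bandwidth / 35GB:

      200 GB/s (Strix Halo)  ->  ~6 tokens/s
      800 GB/s (M3 Ultra)    ->  ~23 tokens/s
      1792 GB/s (RTX 5090)   ->  ~51 tokens/s

    That ignores compute, batching, and KV-cache traffic, but it shows why bandwidth is the first-order number for local LLM inference.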

    • Rohansi 4 hours ago

      You're comparing theoretical maximum memory bandwidth. It's not enough to only look at memory bandwidth, because you're a lot more likely to be compute-limited when you have a lot of memory bandwidth available. For example, the M1 had more bandwidth available than it could make use of, even when fully loaded.

      • zargon 17 minutes ago

        GPUs have both the bandwidth and the compute. Token generation is memory-bandwidth bound and needs comparatively little compute. But both Apple silicon and Strix Halo fall on their faces during prompt ingestion, due to lack of compute.

    • yieldcrv 4 hours ago

      Apple is just being stupid, handicapping their own hardware so they can sell the fixed one next year or the year after

      This time-tested Apple strategy is now undermining their AI strategy and potential competitiveness.

      tl;dr they could have done 1600GB/s

      • Nevermark an hour ago

        So their products are so much better, in customer-demand terms, that they don’t need to rush tech out the door?

        Whatever story you want to create, if customers are happy year after year then Apple is serving them well.

        Maybe not with the same balance of feature dimensions you want, or other artificial/wishful balances you might make up for them.

        (When Apple drops the ball it is usually painful, painfully obvious, and most often a result of a deliberate and transparent priority tradeoff. No secret switcheroos or sneaky downgrading. See: Mac Pro for years…)

      • saagarjha 4 hours ago

        They could have shipped a B200 too. Obviously there are reasons they don't do that.

  • jitl 5 hours ago

    It’s pretty explicitly targeting cloud cluster training in the PR description.

    • ivape 3 hours ago

      If we believe that there’s not enough hardware to meet demand, then one could argue this helps Apple meet demand, even if it’s just by a few percentage points.

  • nl 5 hours ago

    > The catch of course is ROCm isn't there yet (they're seeming serious now though, matter of time).

    Competitive AMD GPU neural compute has been "any day now" for at least 10 years.

    • bigyabai 5 hours ago

      The inference side is fine nowadays. llama.cpp has had a GPU-agnostic Vulkan backend for a while; it's the training side that tends to be a sticking point for consumer GPUs.

  • attentive 4 hours ago

    how is it vs m4 mac mini?

sciencesama 3 hours ago

Apple is planning to build data centers with M-series chips, both for app development and testing and to host external services!

albertzeyer 7 hours ago

This is exciting. So this is using CUDA's unified memory? I wonder how well that works. Is the behavior of unified memory in CUDA actually the same as on Apple silicon? On Apple silicon, as I understand it, the memory is shared between GPU and CPU anyway. But for CUDA, this is not the case. So when you have some tensor on the CPU, how will it end up on the GPU? This needs a copy somehow. Or is this all hidden by CUDA?

  • zcbenz 6 hours ago

    In the absence of hardware unified memory, CUDA will automatically copy data between CPU/GPU when there are page faults.
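
    A minimal sketch of what that looks like (mine, not from the PR): with cudaMallocManaged you get a single pointer usable on both sides, and the driver migrates pages on demand when the CPU or the GPU faults on them, so no explicit cudaMemcpy is needed.

      #include <cuda_runtime.h>

      __global__ void add_one(float* x, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) x[i] += 1.0f;
      }

      int main() {
          const int n = 1 << 20;
          float* x = nullptr;
          cudaMallocManaged(&x, n * sizeof(float));  // one allocation, visible to CPU and GPU

          for (int i = 0; i < n; ++i) x[i] = 0.0f;   // CPU touch: pages resident on the host

          add_one<<<(n + 255) / 256, 256>>>(x, n);   // GPU touch: pages fault over to the device
          cudaDeviceSynchronize();

          float first = x[0];                        // CPU touch again: pages migrate back
          (void)first;
          cudaFree(x);
          return 0;
      }

    Whether that migration is lazy page faulting or a bulk copy depends on the GPU generation and OS, which is part of why performance varies so much.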

    • fenced_load 6 hours ago

      There is also NVLink c2c support between Nvidia's CPUs and GPUs that doesn't require any copy, CPUs and GPUs directly access each other's memory over a coherent bus. IIRC, they have 4 CPU + 4 GPU servers already available.

      • benreesman 5 hours ago

        Yeah, NCCL is a whole world and it's not even the only thing involved, but IIRC that's the difference between 8xH100 PCIe and 8xH100 SXM.

    • saagarjha 4 hours ago

      This seems like it would be slow…

      • freeone3000 3 hours ago

        Matches my experience. It’s memory stalls all over the place, aggravated by the fact that (on 12.3 at least) there wasn’t even a prefetcher.

  • MBCook 7 hours ago

    This is my guess, but does the higher-end hardware they sell, like the server-rack stuff for AI, perhaps have unified memory?

    I know standard GPUs don’t.

    The patch suggested one of the reasons for it was to make it easy to develop on a Mac and run on a super computer. So the hardware with the unified memory might be in that class.

    • ajuhasz 6 hours ago
      • tonyarkles 6 hours ago

        They sure do and it's pretty amazing. One iteration of a vision system I worked on got frames from a camera over a Mellanox NIC that supports RDMA (Rivermax), preprocessed the images using CUDA, did inference on them with TensorRT, and the first time a single byte of the inference pipeline hit the CPU itself was when we were consuming the output.

    • patrickkrusiec 6 hours ago

      The physical memory is not unified, but on modern rack-scale Nvidia systems, like Grace Hopper or NVL72, the CPU and the GPU(s) share the same virtual address space and have non-uniform memory access to each other's memory.

    • freeone3000 3 hours ago

      Standard GPUs absolutely do. Since CUDA 11, all CUDA cards expose the same featureset at differing speeds (based on backing capability). You can absolutely (try to) run CUDA UMA on your 2060, and it will complete the computation.

    • Y_Y 7 hours ago

      The servers don't, but the Jetsons do

m3kw9 30 minutes ago

I thought you either use MLX for Apple silicon or you compile it for CUDA.

MuffinFlavored 7 hours ago

Is this for Macs with NVIDIA cards in them, or Apple Metal/Apple Silicon speaking CUDA?... I can't really tell.

Edit: looks like it's "write once, use everywhere". Write MLX, run it on Linux CUDA, and Apple Silicon/Metal.

  • MBCook 7 hours ago

    Seems you already found the answer.

    I’ll note Apple hasn’t shipped an Nvidia card in a very very long time. Even on the Mac pros before Apple Silicon they only ever sold AMD cards.

    My understanding from rumors is that they had a falling out over the problems with the dual GPU MacBook Pros and the quality of drivers.

    I have no idea if sticking one in on the PCI bus lets you use it for AI stuff though.

    • xuki 6 hours ago

      That particular MBP model had a high rate of GPU failure because it ran too hot.

      I imagined the convo between Steve Jobs and Jensen Huang went like this:

      S: your GPU is shit

      J: your thermal design is shit

      S: f u

      J: f u too

      Apple is the kind of company that holds a grudge for a very long time; their relationships with suppliers are very one-way: their way or the highway.

      • narism 2 hours ago

        The MBPs didn’t run too hot; the Nvidia GPUs used an underfill that stopped providing structural support at a relatively normal temperature for GPUs (60-80 degrees C).

        GPU failures due to this also happened on Dell/HP/Sony laptops, some desktop models, as well as early models of the PS3.

        Some reading: https://www.badcaps.net/forum/troubleshooting-hardware-devic...

      • rcruzeiro 5 hours ago

        I think the ones that failed were the AMD ones, specifically in the old 17-inch MacBook Pro.

        • MBCook 3 hours ago

          I had a 15” MBP, maybe a 2010, that was dual-GPU with an Nvidia, and it was definitely a problem.

        • roboror 3 hours ago

          D700s dying in the trash can Mac Pros cost me (and many others) a lot of time and money.

      • sciencesama 3 hours ago

        And it's the same with Nvidia too.

    • kmeisthax 6 hours ago

      On Apple Silicon, writing to memory on a PCIe / Thunderbolt device will generate an exception. ARM spec says you're allowed to write to devices as if they were memory but Apple enforces that all writes to external devices go through a device memory mapping[0]. This makes using an external GPU on Apple Silicon[1] way more of a pain in the ass, if not impossible. AFAIK nobody's managed to write an eGPU driver for Apple Silicon, even with Asahi.

      [0] https://developer.arm.com/documentation/102376/0200/Device-m...

      [1] Raspberry Pi 4's PCIe has the same problem AFAIK

      • bobmcnamara 5 hours ago

        Ewww, that kills out-of-order CPU performance. If it's like ARMv7, it effectively turns each same-page access into its own ordering barrier.

      • saagarjha 3 hours ago

        Writing to device memory does not generate an exception.

  • hbcondo714 4 hours ago

    > "write once, use everywhere"

    So my MLX workloads can soon be offloaded to the cloud!?

  • dkga 7 hours ago

    This is the only strategy humble me can see working for CUDA in MLX

    • whatever1 3 hours ago

      This is the right answer. Local models will be accelerated by Apple private cloud.

  • cowsandmilk 7 hours ago

    Neither, it is for Linux computers with NVIDIA cards

orliesaurus 4 hours ago

Why is this a big deal, can anyone explain if they are familiar with the space?

  • elpakal 4 hours ago

    > NVIDIA hardware is widely used for academic and massive computations. Being able to write/test code locally on a Mac and then deploy to super computers would make a good developer experience.

    That one stands out to me as a mac user.

    • radicaldreamer 4 hours ago

      MacBooks used to use Nvidia GPUs, then Apple had a falling out with Nvidia and the beef stands to this day (Apple didn’t use Nvidia hardware when training its own LLMs for Apple Intelligence).

      I wouldn’t be surprised if within the next few years we see a return of Nvidia hardware to the Mac, probably starting with low-volume products like the Mac Pro, strictly for professional/high-end use cases.

Keyframe 7 hours ago

Now do linux support / drivers for Mac hardware!

  • bigyabai 5 hours ago

    I think we're seeing the twilight of those efforts. Asahi Linux was an absolute powerhouse of reverse-engineering prowess, and it took years to get decent Vulkan coverage and half of the modern lineup's GPUs supported. Meanwhile AMD and even Intel are shipping Vulkan 1.3 drivers day one on new hardware. It's a cool enthusiast effort to extend the longevity of the hardware, but it bears repeating: nobody is disrupting Nvidia's bottom line here. Apple doesn't sell hardware competitive with Nvidia's datacenter hardware, and even if they did, it's not supported by the community. It's doubtful that Apple would make any attempt to help them.

    There seems to be a pervading assumption that Apple is still making a VolksComputer in 2025, blithely supporting a freer status quo for computing. They laid out their priorities completely with Apple Silicon: you're either on Apple's side or falling behind. Just the way they want it.

  • lvl155 7 hours ago

    Seriously. Those Apple guys became delusional, especially after Jobs passed away. These guys just sat on their successes and did nothing for a decade plus. M1 was nice, but that was all Jobs' doing and planning. I don’t like this Apple. They forgot how to innovate.

    But I guess we have a VR device nobody wants.

    • marcellus23 5 hours ago

      > M1 was nice but that was all Jobs doing and planning

      M1 was launched 9 years after Jobs died. You're saying they had everything ready to go back then and just sat on their asses for a decade?

      • lvl155 5 hours ago

        Who bought PA Semi? Jobs knew they had to make their own. M1 is just a product of their iPhone chips, hence all the efficiency.

        • marcellus23 3 hours ago

          Jobs knew they had to make their own chips, and in your mind that constitutes "all the doing and planning"?

        • saagarjha 3 hours ago

          Ok, but did you ever think about PA Semi being the Alpha guys? Maybe the DEC leadership deserves credit for M1

    • jjtheblunt 6 hours ago

      It would be funny if you were typing out your response on an iPhone that has been running for 36 hours without recharging.

      • macinjosh 6 hours ago

        if only their batteries would last that long.

natas 3 hours ago

that means the next apple computer is going to use nvidia gpu(s).

  • MBCook 2 hours ago

    There’s no evidence of that. The post clearly identifies a far more probable reason: letting things be developed on Macs and then deployed on Nvidia supercomputers.

  • meepmorp 3 hours ago

    but it's not an apple-submitted pr

    • natas 2 hours ago

      they can't make it that obvious

teaearlgraycold 7 hours ago

I wonder if Jensen is scared. If this opens up the door to other implementations this could be a real threat to Nvidia. CUDA on AMD, CUDA on Intel, etc. Might we see actual competition?

  • jsight 7 hours ago

    I think this is the other way around. It won't be cuda on anything except for nvidia.

    However, this might make mlx into a much stronger competitor for Pytorch.

    • mayli 6 hours ago

      Yeah, nice to have MLX-opencl or MLX-amd-whatever

    • baby_souffle 7 hours ago

      If you implement compatible apis, are you prohibited from calling it cuda?

      • 15155 7 hours ago

        Considering 100% of the low-level CUDA API headers have the word "CUDA" in them, this would be interesting to know.

      • moralestapia 7 hours ago

        I'm sure I saw this lawsuit somewhere ...

        The gist is that the API specification in itself is copyrighted, so it would be copyright infringement then.

        • wyldfire 7 hours ago

          Too subtle - was this the Oracle vs Java one? Remind me: did Java win or lose that one?

          • mandevil 6 hours ago

            Oracle sued Google, and Google won, 6-2 (RBG was dead, Barrett had not yet been confirmed when the case was heard).

            Supreme Court ruled that by applying the Four Factors of Fair Use, Google stayed within Fair Use.

            An API specification ends up being a system of organizing things, like the Dewey Decimal System (and thus not really something that can be copyrighted), which in the end marks the first factor for Google. Because Google limited the Android version of the API to just things that were useful for smart phones it won on the second factor too. Because only 0.4% of the code was reused, and mostly was rewritten, Google won on the third factor. And on the market factor, if they held for Oracle, it would harm the public because then "Oracle alone would hold the key. The result could well prove highly profitable to Oracle (or other firms holding a copyright in computer interfaces) ... [but] the lock would interfere with, not further, copyright's basic creativity objectives." So therefore the fourth factor was also pointing in Google's favor.

            Whether "java" won or lost is a question of what is "java"? Android can continue to use the Java API- so it is going to see much more activity. But Oracle didn't get to demand license fees, so they are sad.

            • moralestapia 6 hours ago

              Oh man, thanks for this.

              I always thought it was resolved as infringement and they had to license the Java APIs or something ...

              Wow.

              • mandevil 6 hours ago

                The district court ruled for Google over patents and copyright - that it was not copyrightable at all. The Court of Appeals then reversed and demanded a second trial on whether Google was making fair use of Oracle's legitimate copyright, which the district court again held for Google; then the Court of Appeals reversed the second ruling and held for Oracle that it was not fair use of their copyright, and then Google appealed that to the Supreme Court ... and won in April 2021, putting an end to this case, which was filed in August 2010. But the appeals court rulings in between the district court and the Supreme Court meant that for a long while in the middle, Oracle was the winner.

                This is part of why patents and copyrights can't be the moat for your company. 11 years, with lots of uncertainty and back-and-forth, to get a final decision.

              • tough 4 hours ago

                Yeah, this case made me think that using LLMs to clean-room reverse engineer any API-exposing SaaS or private codebase would be fair game.

nerdsniper 7 hours ago

Edit: I had the details of the Google v. Oracle case wrong. SCOTUS found that Google's re-implementation of the Java API was fair use. I was remembering the first and second appellate rulings.

Also apparently this is not a re-implementation of CUDA.

  • liuliu 7 hours ago

    You misunderstood; this is not re-implementing the CUDA API.

    MLX is a PyTorch-like framework.

  • Uehreka 7 hours ago

    This is exactly the kind of thing I wouldn’t opine on until like, an actual lawyer weighs in after thoroughly researching it. There are just too many shades of meaning in this kind of case law for laymen to draw actionable conclusions directly from the opinions.

    Though I imagine that if Apple is doing this themselves, they likely know what they’re doing, whatever it is.

  • skyde 7 hours ago

    This is a CUDA backend for MLX, not an MLX backend for CUDA!