
We run our LLM workloads on an M2 Ultra because of this. 2x the VRAM; the one-time cost of $5,350 was the same as, at the time, one month of an 80GB-VRAM GPU in GCP. Works well for us.
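A back-of-envelope sketch of the break-even, using the figures in the comment above (the GCP price is the commenter's, not an official quote, and assumes the box stays busy):

  # Break-even sketch with the numbers cited above.
  mac_one_time = 5350   # USD, M2 Ultra purchase
  gpu_monthly = 5350    # USD/month, 80GB-VRAM GPU in GCP at the time
  for month in (1, 6, 12):
      print(month, gpu_monthly * month - mac_one_time)
  # 1 -> 0, 6 -> 26750, 12 -> 58850: break-even at month one,
  # everything after that is savings relative to renting.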


Can you elaborate: are those workflows queued, or can they serve multiple users in parallel?

I think it’s super interesting to know real-life workflows and the performance of different LLMs and hardware, in case you can direct me to other resources. Thanks!


Our use case is atypical, based on what others seem to require. While we serve multiple requests in parallel, our workloads are not 'chat'.


About 10-20% of my company's GPU usage is inference dev. Yes, a horribly inefficient use of resources. We could upgrade the 100-ish devs who do this work to M4 MacBook Pros and free up GPU resources.

Smart move by Apple


If the 2x multiplier holds up, the Ultra update should bring it up to 1080GB/s. Amazing.


There isn't even an M3 Ultra. Will there be an M4 Ultra?


At some point there should be an upgrade to the M2 Ultra. It might be an M4 Ultra; it might be this year or next year. It might even be after the M5 comes out. Or it could be skipped in favour of the M5 Ultra. If anyone here knows, they are definitely under NDA.


M3 was built on an expensive process node, I don’t think it was ever meant to be around long.


That would make the most sense for the next Mac Studio version.


There were rumors that the next Mac Studio will top out at 512GB of RAM, too.

Good news for anyone who wants to run 405B LMs locally...


And the week isn't over...


They announced earlier in the week that there would only be three days of announcements.


Comparing a laptop to an A100 (312 teraFLOPS) or H100 (~1 PFLOPS) server is a stretch, to say the least.

An M2 Ultra is, according to a Reddit post, around 27 TFLOPS.

So less than 1/10 of the raw compute, let alone the memory.
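A quick sanity check on that ratio; note the two figures mix precisions (the 27 TFLOPS Reddit figure is FP32, while A100's 312 TFLOPS is its FP16 tensor-core peak), which the reply below also points out:

  m2_ultra_tflops = 27   # FP32, per the Reddit figure above
  a100_tflops = 312      # FP16 tensor-core peak, per NVIDIA's datasheet
  print(m2_ultra_tflops / a100_tflops)   # ~0.087, i.e. under 1/10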

What workflow would use something like this?


They aren't going to be using fp32 for inferencing, so those FP numbers are meaningless.

Memory and memory bandwidth matter most for inference. 819.2GB/s for the M2 Ultra is less than half that of an A100, but having 192GB of RAM instead of 80GB means they can run inference on models that would require THREE of those A100s, and the only real cost is that it takes longer for the AI to respond.
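A rough weights-only sketch of why (KV cache and activations add more on top; the 180B size is an illustrative assumption, not the poster's actual model):

  def weights_gb(params_billion, bytes_per_param):
      # 1B parameters at 1 byte each is roughly 1GB of weights
      return params_billion * bytes_per_param

  need = weights_gb(180, 1)   # hypothetical ~180B model at 8-bit
  print(need / 80)            # 2.25 -> rounds up to three 80GB A100s
  print(need <= 192)          # True -> fits in a single 192GB M2 Ultra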

Three A100s at $5,300/mo each for the past two years is over $380,000. Considering it worked for them, I'd consider it a massive success.

From another perspective, though, they could have bought 72 of those Ultra machines for that much money and put most devs on their own private instance.
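The arithmetic behind both figures, using the prices as cited above:

  a100_monthly = 5300
  rental_total = 3 * a100_monthly * 24   # three cards, two years
  print(rental_total)                    # 381,600 USD, i.e. "over $380,000"
  print(rental_total / 5350)             # ~71.3 Mac Studios at $5,350 each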

The simple fact is that Nvidia GPUs are massively overpriced. Nvidia should worry a LOT that Apple's private AI cloud is going to eat their lunch.


> comparing a laptop

Small correction: the M2 Ultra isn't found in laptops; it's in the Mac Studio.


Right now, there are H100 80GB instances you can rent for $0.90 per hour.
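For context against the $5,350 M2 Ultra purchase upthread, assuming that $0.90/hr price holds and round-the-clock usage:

  mac_studio = 5350           # one-time cost cited upthread, USD
  h100_hourly = 0.90          # rental price from the comment above
  hours = mac_studio / h100_hourly
  print(hours)                # ~5,944 GPU-hours
  print(hours / (24 * 30))    # ~8.3 months of 24/7 rental to match the purchase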


You have another one with a network gateway to provide hot failover?

Right?


The high-availability story for AI workloads will be a problem for another decade. From what I can see, the current pressing problem is to get stuff working quickly and iterate quickly.



