
We run our LLM workloads on an M2 Ultra because of this. 2x the VRAM; the one-time cost of $5,350 was the same as, at the time, one month of an 80GB-VRAM GPU in GCP. Works well for us.
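A back-of-envelope sketch of the break-even, using the figures in the comment above (the GCP price is the commenter's, not an official quote, and assumes the box stays busy):

  # Break-even sketch with the numbers cited above.
  mac_one_time = 5350   # USD, M2 Ultra purchase
  gpu_monthly = 5350    # USD/month, 80GB-VRAM GPU in GCP at the time
  for month in (1, 6, 12):
      print(month, gpu_monthly * month - mac_one_time)
  # 1 -> 0, 6 -> 26750, 12 -> 58850: break-even at month one,
  # everything after that is savings relative to renting.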


Can you elaborate: are those workflows queued, or can they serve multiple users in parallel?

I think it’s super interesting to know real-life workflows and the performance of different LLMs and hardware, in case you can direct me to other resources. Thanks!


Our use case is atypical, based on what others seem to require. While we serve multiple requests in parallel, our workloads are not 'chat'.


About 10-20% of my company's GPU usage is inference dev. Yes, a horribly inefficient use of resources. We could upgrade the 100-ish devs who do this work to M4 MacBook Pros and free up GPU resources.

Smart move by Apple


If the 2x multiplier holds up, the Ultra update should bring it up to 1080GB/s. Amazing.


There isn't even an M3 Ultra. Will there be an M4 Ultra?


At some point there should be an upgrade to the M2 Ultra. It might be an M4 Ultra; it might be this year or next year. It might even be after the M5 comes out. Or it could be skipped in favour of the M5 Ultra. If anyone here knows, they are definitely under NDA.


M3 was built on an expensive process node, I don’t think it was ever meant to be around long.


That would make the most sense for the next Mac Studio version.


There were rumors that the next Mac Studio will top out at 512GB of RAM, too.

Good news for anyone who wants to run 405B LMs locally...


And the week isn't over...


They announced earlier in the week that there would only be three days of announcements.


Comparing a laptop to an A100 (312 teraFLOPS) or H100 (~1 PFLOPS) server is a stretch, to say the least.

An M2 Ultra is, according to a Reddit post, around 27 TFLOPS.

So less than 1/10 of the raw compute, let alone the memory.
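A quick sanity check on that ratio; note the two figures mix precisions (the 27 TFLOPS Reddit figure is FP32, while A100's 312 TFLOPS is its FP16 tensor-core peak), which the reply below also points out:

  m2_ultra_tflops = 27   # FP32, per the Reddit figure above
  a100_tflops = 312      # FP16 tensor-core peak, per NVIDIA's datasheet
  print(m2_ultra_tflops / a100_tflops)   # ~0.087, i.e. under 1/10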

What workflow would use something like this?


They aren't going to be using fp32 for inferencing, so those FP numbers are meaningless.

Memory and memory bandwidth matter most for inference. 819.2GB/s for the M2 Ultra is less than half that of an A100, but having 192GB of RAM instead of 80GB means they can run inference on models that would require THREE of those A100s, and the only real cost is that it takes longer for the AI to respond.
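A rough weights-only sketch of why (KV cache and activations add more on top; the 180B size is an illustrative assumption, not the poster's actual model):

  def weights_gb(params_billion, bytes_per_param):
      # 1B parameters at 1 byte each is roughly 1GB of weights
      return params_billion * bytes_per_param

  need = weights_gb(180, 1)   # hypothetical ~180B model at 8-bit
  print(need / 80)            # 2.25 -> rounds up to three 80GB A100s
  print(need <= 192)          # True -> fits in a single 192GB M2 Ultra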

Three A100s at $5,300/mo each for the past two years is over $380,000. Considering it worked for them, I'd consider it a massive success.

From another perspective, though, they could have bought 72 of those Ultra machines for that much money and put most devs on their own private instance.
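The arithmetic behind both figures, using the prices as cited above:

  a100_monthly = 5300
  rental_total = 3 * a100_monthly * 24   # three cards, two years
  print(rental_total)                    # 381,600 USD, i.e. "over $380,000"
  print(rental_total / 5350)             # ~71.3 Mac Studios at $5,350 each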

The simple fact is that Nvidia GPUs are massively overpriced. Nvidia should worry a LOT that Apple's private AI cloud is going to eat their lunch.


> comparing a laptop

Small correction: the M2 Ultra isn't found in laptops; it's in the Mac Studio.


Right now, there are H100 80GB instances you can rent for $0.90 per hour.
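For context against the $5,350 M2 Ultra purchase upthread, assuming that $0.90/hr price holds and round-the-clock usage:

  mac_studio = 5350           # one-time cost cited upthread, USD
  h100_hourly = 0.90          # rental price from the comment above
  hours = mac_studio / h100_hourly
  print(hours)                # ~5,944 GPU-hours
  print(hours / (24 * 30))    # ~8.3 months of 24/7 rental to match the purchase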


You have another one with a network gateway to provide hot failover?

Right?


The high-availability story for AI workloads will be a problem for another decade. From what I can see, the current pressing problem is to get stuff working quickly and iterate quickly.



