Hacker News

> So we decided to build both: two 24k clusters, one with RoCE and another with InfiniBand. Our intent was to build and learn from the operational experience.

I love how they built two completely insane clusters just to learn. That's badass.



It's not just to learn; a RoCE Ethernet cluster with Aristas is way cheaper to build and maintain than a fancy InfiniBand cluster with Mellanox/NVIDIA networking, so proving that the former is good enough at scale will eventually save Meta a huge amount of money. InfiniBand cards are much more expensive than Ethernet cards because there are few vendors, which hold a quasi-monopoly, and because far fewer of them are produced overall, so there's less economy of scale.


More like Mark gave them 100k GPUs, and they're not sure what exactly to do with them.



