Slurm is definitely still dominant, but OpenAI has been using k8s for training for many years now¹, and there are various ways to run slurm on top of Kubernetes, including the recent SUNK from coreweave²
at my company we use slurm "directly" for static compute we rent or own (i.e. not in a public cloud), but are considering using Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into being better at k8s than becoming good slurm admins.
at my company we use slurm "directly" for static compute we rent or own (i.e. not in a public cloud), but are considering using Kubernetes because that's how we run the rest of the company, and we'd rather invest more effort into being better at k8s than becoming good slurm admins.
¹: https://openai.com/index/scaling-kubernetes-to-2500-nodes/
²: https://www.coreweave.com/blog/sunk-slurm-on-kubernetes-impl...