
In my experience, over 10 years of building models with libraries that use CUDA under the hood, this problem has nearly gone away in the past few years. Setting up CUDA on new machines, and even getting multi-GPU/multi-node configurations working with NCCL and PyTorch DDP, for example, is pretty slick now. Have you experienced this recently?
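For context, here's a single-file sketch of what that setup looks like today. It falls back to a one-process CPU run with the gloo backend when no GPU is present, so the NCCL/multi-GPU path only kicks in on a real machine; the port, model, and tensor sizes are arbitrary:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Under torchrun these env vars are already set; the defaults below
    # make the sketch runnable as a plain single-process CPU script.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # NCCL for GPUs, gloo as the CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)

    model = torch.nn.Linear(8, 2)
    if torch.cuda.is_available():
        device = dist.get_rank() % torch.cuda.device_count()
        model = DDP(model.cuda(device), device_ids=[device])
    else:
        model = DDP(model)  # CPU DDP works with gloo

    out = model(torch.randn(4, 8, device=next(model.parameters()).device))
    dist.destroy_process_group()
    return out.shape

if __name__ == "__main__":
    print(main())
```

Launched with `torchrun --nproc_per_node=<gpus>`, the same script scales out without code changes.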


Yes, especially if you are trying to run various projects you don't control.

Some will need specific versions of CUDA.

Right now I've masked CUDA from upgrades on my system, and I'm stuck on an old version to support some projects.
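A quick way to sanity-check whether a pinned project will run against the system toolkit is to compare versions before launching anything. A minimal sketch, where the compatibility rule and the `nvcc --version` output line are illustrative assumptions, not exact semantics:

```python
import re

def parse_cuda_version(text):
    """Extract a CUDA release like (11, 8) from `nvcc --version` output."""
    m = re.search(r"release (\d+)\.(\d+)", text)
    if m is None:
        raise ValueError("no CUDA release string found")
    return int(m.group(1)), int(m.group(2))

def satisfies(installed, required):
    # Rough rule of thumb: same major version, installed minor >= required.
    return installed[0] == required[0] and installed[1] >= required[1]

# Example nvcc output line (illustrative, not from a real machine):
nvcc_output = "Cuda compilation tools, release 11.8, V11.8.89"
print(satisfies(parse_cuda_version(nvcc_output), (11, 6)))  # True
```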

I also had plenty of problems deploying gpu-operator on k8s: that Helm chart is so buggy (or maybe just not great at handling some corner cases? no clue) that I ended up swapping Kubernetes distributions a few times (no chance to make it work on microk8s; on k3s it almost works). Eventually I just installed the drivers and runtime locally and exposed them through the containerd config.
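For reference, that last step boils down to registering the NVIDIA runtime in containerd's config. A rough sketch of the relevant fragment, assuming the containerd 1.x CRI plugin keys and the standard nvidia-container-toolkit install path (note k3s keeps its config in a separate `config.toml.tmpl` rather than the file below):

```toml
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  # Binary installed by the nvidia-container-toolkit package
  BinaryName = "/usr/bin/nvidia-container-runtime"
```

Restart containerd after editing, and pods can request GPUs without gpu-operator in the loop.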



