We are working on a project where we plan to manage our new HPC system with OpenStack. It will replace three legacy HPC systems that were managed with proprietary management systems. Due to shifting requirements from our customers (scientists), we decided to move to a cloud framework: the majority of resources will probably still be dedicated to a batch scheduling system, but it also lets us offer more cloud-like services (JupyterHub, RStudio, databases, etc.)
It's quite an ambitious project, and we (4 engineers) have basically spent the last year understanding the ins and outs of OpenStack. We also went all in on integrating all kinds of datacenter components into OpenStack (NetApp, SDN, DNS, etc.)
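To give a flavor of what "mostly batch, plus cloud services" can look like in practice, here is a minimal sketch of how we think about partitioning a project between batch nodes and interactive services. All flavor, image, and network names are invented for illustration; in real life the returned dict would be passed to an openstacksdk `create_server` call.

```python
# Hypothetical sketch: one OpenStack project, two classes of instances.
# Flavor/image/network names below are invented, not from our deployment.

def server_request(name, role):
    """Build keyword arguments for an openstacksdk create_server call.

    'batch' nodes get a large compute flavor on the batch network;
    'service' nodes (JupyterHub, RStudio, databases, ...) get a smaller
    general-purpose flavor on the service network.
    """
    profiles = {
        "batch":   {"flavor": "hpc.c64m256", "network": "batch-net"},
        "service": {"flavor": "gp.c8m32",    "network": "service-net"},
    }
    if role not in profiles:
        raise ValueError(f"unknown role: {role}")
    profile = profiles[role]
    return {
        "name": name,
        "flavor": profile["flavor"],
        "image": "rocky-9",            # assumed shared base image
        "network": profile["network"],
    }

# With a real cloud config this would be roughly:
#   conn = openstack.connect(cloud="hpc")
#   conn.create_server(**server_request("node001", "batch"))
```

Keeping this mapping in code (rather than clicking through Horizon) is what makes the batch partition reproducible next to the more ad-hoc cloud services.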
Some lessons learned so far:
- OpenStack is very complex
- It's less of a product and more of a framework, and you need a dedicated engineering team with a cross-cutting skillset
- You definitely need a dev/staging environment to test upgrades and customizations
- Some of the reference implementations of OpenStack services (e.g. SDN) are fine for small deployments, but if you can replace them with dedicated hardware/appliances, you should do that.