Talking to midmarket and enterprise customers, nobody is taking Proxmox seriously quite yet, I think due to concerns around support availability and long-term viability. Hyper-V and Azure Local come up a lot in these conversations if you run a lot of Windows (Healthcare in the US is nearly entirely Windows based). Have some folks kicking the tires on OpenShift, which is a HEAVY lift and not much less expensive than modern Broadcom licenses.
My personal dark horse favorite right now is HPE VM Essentials. HPE has a terrible track record with enterprise software, but their support org is solid and the solution checks a heck of a lot of boxes, including broad support for non-HPE servers, storage, and networking. The solution is priced to move, and I expect HPE smells blood in these waters; they're clearly dumping a lot of development resources into the product this past year.
I've used them professionally back in the 0.9 days (2008), and it was already quite useful and very stable (all advertised features worked).
17 years looks pretty good to me; Proxmox will not go away (neither the product nor the company).
>(Healthcare in the US is nearly entirely Windows based).
This wasn't my experience in over a decade in the industry.
It's Windows dominant, but our environment was typically around a 70/30 split of Windows/Linux servers.
Cerner shops in particular are going to have a larger Linux footprint. Radiology, biomed, interface engines, and med records also tended to have quite a bit of *nix infrastructure.
One thing that can be said is that containerization has basically zero penetration with any vendors in the space. Pretty much everyone is still doing a pets over cattle model in the industry.
HPE VM Essentials and Proxmox are just UIs/wrappers (plus extras) on top of kvm/virsh/libvirt for the virtualization side.
You can grow out of either by just self-hosting the underlying stack, or you can skip both for the virtualization part entirely if you're an automation-focused company that doesn't care about the VMware-like GUI.
If we could do it 20 years ago, once VT-x arrived, for production Oracle EBS instances at a smaller but publicly traded company with an IT team of 4, almost any midmarket enterprise could do it today, especially with modern tools.
It is culture, web-UI requirements, and FUD that cause issues, not the underlying products, which are stable today but hidden from view.
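To make the "skip both for the virtualization part" point concrete, here's a minimal sketch of booting a guest straight off QEMU/KVM with no management layer at all (the image path, bridge name, and sizing below are made up for illustration):

    # Minimal sketch: booting a KVM guest straight from QEMU, no Proxmox/HPE/
    # libvirt layer at all. The image path, bridge name, and sizing are made up.
    import subprocess

    subprocess.run([
        "qemu-system-x86_64",
        "-machine", "q35,accel=kvm",        # in-kernel KVM acceleration
        "-cpu", "host",                     # pass through host CPU features
        "-smp", "4",
        "-m", "8192",
        "-drive", "file=/var/lib/images/app01.qcow2,if=virtio,format=qcow2",
        "-nic", "bridge,br=br0,model=virtio-net-pci",
        "-nographic",                       # serial console only, automation friendly
    ], check=True)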
Correction: in Proxmox VE we're not using virsh/libvirt at all; rather, we have our own stack for driving QEMU at a low level. Our in-depth integration, especially live local storage migration and our Backup Server's dirty-bitmap handling (known as Changed Block Tracking in the VMware world), would not be possible in the form we have it otherwise. Same w.r.t. our own stack for managing LXC containers.
The web UI part is actually one of our smaller code bases relative to the whole API and lower level backend code.
Correct, sorry. I don't use the web UIs and was confusing it with oVirt; I forgot that you are using Perl modules to call QEMU/LXC.
I would strongly suggest more work on your NUMA/cpuset limitations. I know people have been working on it slowly, but with the rise of E and P cores you can't stick to pinning for many use cases, and while I get that hyperconvergence has its costs and platforms have to choose simple, the kernel's cpuset system works pretty well there and dramatically reduces latency, especially for lakehouse-style DP.
I do have customers who would be better served by a Proxmox-type solution, but they need to isolate critical loads and/or avoid the problems with asymmetric cores and non-locality in the OLAP space.
IIRC lots of things that have worked for years in qemu-kvm are ignored when added to <VMID>.conf etc...
PVE itself is still made of a lot of Perl, but nowadays we actually do almost everything new in Rust.
We already support CPU sets and pinning for containers and VMs, but that can definitely be improved, especially if you mean something more automated/guided by the PVE stack.
If you have something more specific, ideally somewhat actionable, it would be great if you could create an enhancement request at https://bugzilla.proxmox.com/ so that we can actually keep track of these requests.
While the input for QEMU affinity is called a "pve-cpuset"[0], it explicitly uses the taskset[1][3] command.
This is different from cpusets[2], or from how libvirt allows the creation of partitions[3] using systemd slices in that case.
The huge advantage is that setting up basic slices can be done when provisioning the hypervisor, and you don't have to hard-code CPU pinning numbers as you would with taskset; plus, in theory, it could be dynamic.
As cpusets are hierarchical, one could use various namespace schemes that change per hypervisor, without exposing that implementation detail to the guest configuration. Think of migrating from an old 16-core CPU to something more modern, and how, with hard-coded pinning, all those guests would stay pinned to a fraction of the new cores without user interaction.
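As a rough sketch of that idea, assuming cgroup v2, root access, and a made-up partition layout (systemd's AllowedCPUs= on a slice gets you the same thing declaratively), the hierarchy could look something like this:

    # Sketch of the hierarchical cpuset idea under raw cgroup v2 (assumes root
    # access and that nothing else, e.g. systemd, owns these paths; the group
    # name "guests" and the CPU ranges are made up for illustration).
    from pathlib import Path

    CGROUP_ROOT = Path("/sys/fs/cgroup")

    # 1. Enable the cpuset controller for children of the root cgroup.
    (CGROUP_ROOT / "cgroup.subtree_control").write_text("+cpuset")

    # 2. One parent partition for all guests, pinned to a host-specific range.
    #    Only this line changes when the hypervisor hardware changes.
    guests = CGROUP_ROOT / "guests"
    guests.mkdir(exist_ok=True)
    (guests / "cpuset.cpus").write_text("8-31")          # keep 0-7 for host/Ceph
    (guests / "cgroup.subtree_control").write_text("+cpuset")

    # 3. Each guest gets a leaf cgroup; it inherits the parent's CPUs, so the
    #    guest config never hard-codes core IDs the way taskset-style pinning does.
    def place_guest(name, qemu_pid, cpus=None):
        leaf = guests / name
        leaf.mkdir(exist_ok=True)
        if cpus:                                 # optional tighter pin inside the partition
            (leaf / "cpuset.cpus").write_text(cpus)
        (leaf / "cgroup.procs").write_text(str(qemu_pid))

    # place_guest("vm-101", 12345)                 # inherits 8-31
    # place_guest("vm-102", 23456, cpus="8-15")    # e.g. pin to one CCD/NUMA node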
Unfortunately I am deep into Podman right now and don't have a Proxmox install at the moment, or I would try to submit a bug.
This page[5] covers how inter-CCD traffic, even on Ryzen, costs roughly 5x as much as local traffic. That is something that would break the normal affinity if you move to a chip with more cores per CCD, as an example. And you can't see CCD placement in the normal NUMA-ish tools.
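One crude workaround, since each CCD brings its own L3 on recent Zen parts, is to group logical CPUs by the last-level cache they share; this is only an approximation and assumes an x86 Linux sysfs layout:

    # Crude way to see L3 (CCX/CCD) groupings that the usual NUMA tools hide:
    # group logical CPUs by the last-level cache they share. On recent Zen parts
    # each CCD has its own L3, so these groups approximate CCD boundaries.
    # Assumes cache index3 is the L3 (check cache/index*/level to be strict).
    from collections import defaultdict
    from pathlib import Path

    groups = defaultdict(list)
    for cpu_dir in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
        l3 = cpu_dir / "cache" / "index3" / "shared_cpu_list"
        if l3.exists():
            groups[l3.read_text().strip()].append(cpu_dir.name)

    for shared, cpus in sorted(groups.items()):
        print(f"L3 domain {shared}: {len(cpus)} logical CPUs")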
To be honest, most of what I do wouldn't generalize, but you could use cpusets with a hierarchy and open up the option to improve latency without requiring each person launching a self-service VM to hard-code core IDs.
I do wish I had the time and resources to document this well, but hopefully that helps explain at least the cpuset part, before even getting into the hard partitioning you could do to ensure, say, Ceph is still running when you start to thrash, etc.
KVM is awesome enough that there isn’t a lot of room left to differentiate at the hypervisor level. Now the problem is dealing with thousands of the things, so it’s the management layer where the product space is competing.
That's why libvirt was added; it works with KVM, Xen, VMware ESXi, QEMU, etc. But yes, most of the tools like Ansible only support libvirt_lxc and libvirt_qemu today, though it isn't too hard to use for any modern admin with automation experience.
Libvirt is the abstraction API that mostly hides the concrete implementation details.
I haven't tried oVirt or the other UIs on top of libvirt, but it seems less painful to me than digging through the Proxmox Perl modules when I hit a limitation of their system, though most people may not feel that way.
All of those UIs have to make sacrifices to be usable; I just miss the full power of libvirt/QEMU/KVM for placement and reduced latency, especially in the era of P vs E cores, dozens of NUMA nodes, etc.
I would argue that for long-lived machines, automation is the trick for dealing with 1000s of things, but I get that it's not always true for others' use cases.
I think some people may be surprised by just targeting libvirt vs. looking for some web UI.
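For anyone curious what "just targeting libvirt" looks like in practice, here is a small sketch using the official libvirt-python bindings (the domain XML is a stripped-down, hypothetical example):

    # The official libvirt-python bindings talking to the local QEMU/KVM driver;
    # the domain XML is a stripped-down, hypothetical example.
    import libvirt

    conn = libvirt.open("qemu:///system")

    # Inventory: the same data the management UIs ultimately show you in a grid.
    for dom in conn.listAllDomains():
        state, _ = dom.state()
        print(dom.name(), "running" if state == libvirt.VIR_DOMAIN_RUNNING else "not running")

    # Define and start a new guest from XML; Ansible/Terraform providers end up
    # driving these same calls under the hood.
    domain_xml = """
    <domain type='kvm'>
      <name>demo-vm</name>
      <memory unit='GiB'>4</memory>
      <vcpu>2</vcpu>
      <os><type arch='x86_64' machine='q35'>hvm</type></os>
      <devices>
        <disk type='file' device='disk'>
          <driver name='qemu' type='qcow2'/>
          <source file='/var/lib/libvirt/images/demo-vm.qcow2'/>
          <target dev='vda' bus='virtio'/>
        </disk>
        <interface type='bridge'>
          <source bridge='br0'/>
          <model type='virtio'/>
        </interface>
      </devices>
    </domain>
    """
    dom = conn.defineXML(domain_xml)   # persistent definition
    dom.create()                       # boot it
    conn.close()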