
Virtual machines perform extremely poorly, so you must use bare-metal instances. After about three months of usage, these will cost you the same as buying the hardware outright.

And you're still stuck on a non-deterministic high-latency network you can't get rid of, and with very limited hardware configurations.

It's more like a grid than an HPC cluster.

There are only two possible advantages:

- you want a lot of hardware very quickly rather than wait for it to be delivered.

- you lack the desire or ability to be, or to hire, a network engineer.



When you say "virtual machines perform extremely poorly", on what do you base that?

(Note: I've worked in supercomputing and HPC for over two decades.)

The network I was talking about is called UltraCluster, which has extremely high bandwidth and low latency, designed to get great scaling on MPI jobs (as well as ML). Typical instances used with UltraCluster are p5, which have 8 NVIDIA H100 GPUs, 192 vCPUs, 2 TB RAM, 3.2 Tbps network bandwidth PER MACHINE, 900 GB/s between GPU peers, and 8 × 3.84 TB SSDs. They are not marketed as metal instances.

No, it's not like a grid. Your thinking is dated and not representative of how people do HPC on AWS, Azure, or Google.


Azure has RDMA, though with slightly high quoted latency (I don't know the message rate), and tightly-coupled stuff appears to scale: https://techcommunity.microsoft.com/t5/azure-compute-blog/hp...

It seems how people do HPC on AWS is limited by what AWS can do (and maybe costs). Our experience was that even the "elastic" feature wasn't elastic in practice, and we often couldn't get resources anyway.

Maybe dated, but for context, we had 2TB and 128 real cores a decade ago, and I currently work with Summit-type hardware; I'd rather not admit after how long.


> 3.2Tbps bandwidth PER MACHINE

Looking into the UltraClusters page you linked to in a sibling comment, it seems like the host machines pretty much fill out their PCIe connections with EFA network adapters to reach that figure:

    EFA is also coupled with NVIDIA GPUDirect RDMA (P5, P4d) and
    NeuronLink (Trn1) to enable low-latency accelerator-to-accelerator
    communication between servers with operating system bypass.
https://aws.amazon.com/ec2/ultraclusters/
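A back-of-envelope check of the "fills out the PCIe connections" claim (my own arithmetic, not an AWS-published figure; the PCIe Gen5 assumption is mine):

```python
# Rough sanity check: how many PCIe Gen5 x16 slots would 3.2 Tb/s of NIC
# bandwidth consume? All constants are approximate assumptions.
PCIE_GEN5_LANE_GBPS = 32              # ~32 Gb/s raw per Gen5 lane
LANES_PER_SLOT = 16

nic_total_gbps = 3200                 # 3.2 Tb/s quoted per machine
slot_gbps = PCIE_GEN5_LANE_GBPS * LANES_PER_SLOT   # 512 Gb/s per x16 slot
slots_needed = nic_total_gbps / slot_gbps

print(f"~{slots_needed:.2f} x16 Gen5 slots just for the NICs")  # ~6.25
```

Over six full x16 slots of raw lane bandwidth for networking alone (before accounting for protocol overhead), which is consistent with the NICs dominating the host's I/O budget.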


I base that on running code on m5-type instances.

If you care about correct NUMA and HyperThreading usage, and even more so if you care about CPU latency (for example, for real-time trading), the only things that perform well are either metal or full-machine-but-with-hypervisor instances.
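To make "correct HyperThreading usage" concrete: a common trick on Linux is to parse each CPU's `thread_siblings_list` from sysfs and pin one worker per physical core. A minimal sketch (the helper names are mine, not from any library; on a real box you'd read `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` and then call `os.sched_setaffinity`):

```python
def parse_cpu_list(s):
    """Parse a Linux cpulist string like '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def one_thread_per_core(sibling_lists):
    """Given each CPU's thread_siblings_list contents, keep only the
    lowest-numbered hyperthread of every physical core -- the usual
    'disable SMT in software' trick for latency-sensitive pinning."""
    return {min(parse_cpu_list(s)) for s in sibling_lists}

# Example: two physical cores, each with SMT siblings (0,64) and (1,65)
print(one_thread_per_core(["0,64", "1,65", "0,64", "1,65"]))  # {0, 1}
```

On a full VM you can at least see a topology; the problem the comment describes is that on ordinary shared instances the topology the guest sees may not match the physical placement, so this kind of pinning stops helping.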


Obviously nobody is going to run workloads that need to exploit these kinds of things on ECS instances. But these workloads are niche, not normal. Most code that's written and deployed to some notion of "production" is not CPU bound, it is I/O bound.



