Didn’t have to, prod benchmarked it for us, twice.
If you’re curious: the EBS volumes Aurora uses for temporary storage, when pushed to a queue depth of roughly 240, manage about 5,000 IOPS. This was on an r6i.32xlarge.
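(Back-of-the-envelope, assuming those were steady-state numbers: by Little’s law, average per-I/O latency ≈ QD / IOPS = 240 / 5,000 ≈ 48 ms.)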
My current hypothesis is that the massive context switching the CPU had to do to service the interrupts slowed its acceptance of new connections and its processing of other queries enough that everything piled up. I’ve no idea what kind of core pinning / isolation AWS does under the hood, but according to Enhanced Monitoring, CPU utilization from disk I/O alone was about 20%.
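You obviously can’t shell into Aurora to verify this, but on a self-managed box a quick sanity check for the theory is to watch the kernel’s context-switch and interrupt counters while the I/O storm is happening. A minimal sketch (samples /proc/stat once a second; `vmstat 1` shows the same counters):

```python
#!/usr/bin/env python3
"""Watch context-switch and interrupt rates via /proc/stat.

Illustrative only: not runnable on Aurora (managed), but on a
self-managed Postgres host a spike in these rates alongside heavy
disk I/O would support the interrupt-storm hypothesis.
"""
import time


def read_counters():
    ctxt = intr = 0
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                ctxt = int(line.split()[1])  # total context switches since boot
            elif line.startswith("intr "):
                intr = int(line.split()[1])  # first field is the interrupt total
    return ctxt, intr


prev = read_counters()
while True:
    time.sleep(1)
    cur = read_counters()
    print(f"ctxt/s: {cur[0] - prev[0]:>10}   intr/s: {cur[1] - prev[1]:>10}")
    prev = cur
```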