I think you misunderstood what I meant by "processor": the memory on a discrete GPU sits very close to the GPU's processor die, but very far from the CPU. The GPU may be able to read and write its own memory at 1 TB/s, but the CPU trying to read or write that same memory will be limited by the PCIe bus, which is glacially slow by comparison, usually somewhere around 16-32 GB/s.
A huge part of optimizing code for discrete GPUs is making sure that data is streamed into GPU memory before the GPU actually needs it, because pushing or pulling data over PCIe on-demand decimates performance.
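As a rough illustration of that streaming pattern, here is a minimal CUDA sketch using double buffering: while a kernel works on one chunk, `cudaMemcpyAsync` pulls the next chunk over PCIe into the other buffer. The kernel, chunk sizes, and names are all illustrative, not anything from this thread; pinned host memory is assumed, since async copies require it.

```cuda
// Double-buffered host-to-device streaming so PCIe transfers overlap compute.
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical element-wise kernel standing in for real work.
__global__ void scale_kernel(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int CHUNK = 1 << 20;   // elements per chunk (illustrative size)
    const int NCHUNKS = 8;

    float* host;                 // pinned memory: required for true async copies
    cudaMallocHost(&host, (size_t)NCHUNKS * CHUNK * sizeof(float));
    for (size_t i = 0; i < (size_t)NCHUNKS * CHUNK; ++i) host[i] = 1.0f;

    float* dev[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&dev[b], CHUNK * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    // While the kernel chews on chunk k in one stream, the GPU's copy engine
    // streams chunk k+1 into the other buffer: transfer and compute overlap,
    // instead of the kernel stalling on an on-demand PCIe fetch.
    for (int k = 0; k < NCHUNKS; ++k) {
        int b = k % 2;
        cudaMemcpyAsync(dev[b], host + (size_t)k * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream[b]);
        scale_kernel<<<(CHUNK + 255) / 256, 256, 0, stream[b]>>>(dev[b], CHUNK);
        cudaMemcpyAsync(host + (size_t)k * CHUNK, dev[b],
                        CHUNK * sizeof(float), cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();
    printf("host[0] = %f\n", host[0]);

    for (int b = 0; b < 2; ++b) {
        cudaFree(dev[b]);
        cudaStreamDestroy(stream[b]);
    }
    cudaFreeHost(host);
    return 0;
}
```

Real pipelines usually add events or more streams, but the core idea is the same: issue the copy for the next chunk before the current one finishes computing.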
> CPU trying to read or write that same memory will be limited by the PCIe bus, which is glacially slow by comparison, usually somewhere around 16-32GB/sec.
If you're forking out for H100s, you'll usually be putting them on a bus with much higher throughput, 200 GB/s or more.