
> I don't know JAX well enough to explain exactly why it's 3x faster than NumPy on the same matrix multiplications.

JAX is basically a frontend for the XLA compiler, as you note. The secret sauce is two insights: 1) if you have enough control, you can modify the layout of tensor computations and permute them so they don’t have to match the input program’s layout, yielding a more favorable memory access pattern; 2) most things are memory bound, so XLA creates fusion kernels that combine many computations together between memory accesses. I don’t know if the Apple BLAS library has fused kernels with GEMM + some output layer, but XLA is capable of writing GEMM fusions and might pick them if they autotune faster on the given input/output shapes.
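
As a toy CPU-side illustration of the fusion idea (plain Python, nothing XLA-specific; function names are mine): an unfused pipeline writes an intermediate buffer to memory and reads it back, while a fused kernel applies both ops per element in one pass.

```python
def unfused(a, b):
    t = [x * y for x, y in zip(a, b)]  # intermediate written to "memory"
    return [x + 1.0 for x in t]        # then read back for the next op

def fused(a, b):
    # one pass: the multiply feeds the add directly, no intermediate buffer
    return [x * y + 1.0 for x, y in zip(a, b)]

print(fused([1.0, 2.0], [3.0, 4.0]))  # [4.0, 9.0], same result as unfused
```

On a memory-bound workload, eliminating the intermediate round trip is where the speedup comes from.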

> But I haven't verified that in detail. Might be time to learn.

If you set the environment variable XLA_FLAGS=--xla_dump_to=$DIRECTORY then you’ll find out! There will be a “custom-call” op if it’s dispatching to BLAS, otherwise it will have a “dot” op in the post-optimization XLA HLO for the module. See the docs:

https://openxla.org/xla/hlo_dumps
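
For instance, per the OpenXLA docs the current flag is `--xla_dump_to`; it has to be in the environment before JAX is imported (the dump directory here is hypothetical):

```python
import os

# Must be set before importing jax for XLA to pick it up.
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump"

# import jax                      # then trace/jit something...
# jax.jit(lambda x: x * x)(1.0)   # ...and HLO text files land in /tmp/xla_dump

print(os.environ["XLA_FLAGS"])
```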


I wonder why H100 H2D and D2H unpinned memcpy bandwidth is *faster* on PCIe with vendor B than on SXM with vendor D. Is resizable BAR available on PCIe but not SXM?

Or, could it be a software configuration difference? The documentation for the driver API flag CU_MEMHOSTREGISTER_IOMEMORY notes that host memory being physically contiguous may matter to the driver, in this context for memory-mapped memory. If vendor B has THP enabled or configured differently than vendor D, small allocations up to 2 MiB could be physically contiguous, which may result in higher efficiency/more bytes transferred per request.

At a higher level: unpinned memcpy is a performance antipattern. Perhaps vendor D has fewer clients using unpinned memcpy in their workloads than vendor B, or they decided not to dedicate support to it for this reason. TensorFlow will go to great lengths to copy unpinned memory to a pinned staging buffer if you feed unpinned host memory tensors to a graph.


Are both using a PCIe switch? If one is and the other isn't, it could be about PCIe credit based flow control kicking in.


TCMalloc never munmaps; instead it mmap()s with MAP_FIXED within unpopulated PROT_NONE regions, and then madvise()s with MADV_FREE at page granularity to reduce RSS. Perhaps a similar approach for file I/O could help dodge the cost of munmap TLB shootdowns after a file has been read, but using MADV_DONTNEED instead of MADV_FREE. There will probably be a shootdown associated with the MADV_DONTNEED, but maybe it will be lower cost than munmap?

You might also just keep the file mapping around until memory/address space pressure requires reclaiming it, and at that point MAP_FIXED over it.
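
A minimal Linux-only sketch of that reserve/commit/discard pattern via ctypes (constants hardcoded from Linux headers; anonymous memory stands in for the file-backed case):

```python
import ctypes
import mmap as pymmap

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.madvise.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]

PROT_NONE = 0
PROT_RW = pymmap.PROT_READ | pymmap.PROT_WRITE
MAP_PRIVATE_ANON = pymmap.MAP_PRIVATE | pymmap.MAP_ANONYMOUS
MAP_FIXED = 0x10       # Linux value
MADV_DONTNEED = 4      # Linux value
MAP_FAILED = ctypes.c_void_p(-1).value
SIZE = 1 << 20         # 1 MiB

# 1) Reserve address space without committing memory.
base = libc.mmap(None, SIZE, PROT_NONE, MAP_PRIVATE_ANON, -1, 0)
assert base not in (None, MAP_FAILED)

# 2) Commit it in place, TCMalloc-style, with MAP_FIXED.
p = libc.mmap(base, SIZE, PROT_RW, MAP_PRIVATE_ANON | MAP_FIXED, -1, 0)
assert p == base

ctypes.memset(p, 0x5A, SIZE)  # fault the pages in (raises RSS)

# 3) Drop the pages but keep the mapping: no munmap call.
rc = libc.madvise(p, SIZE, MADV_DONTNEED)
assert rc == 0
print("pages dropped without munmap")
```

Whether the MADV_DONTNEED shootdown is actually cheaper than munmap’s would need measuring.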


Re: serpentine traversal, this has to do with the .reuse suffix applied to register operands as mentioned in your link. We don’t really have control over it because it’s happening inside of ptxas during SASS generation, but when CUTLASS does serpentine traversal they’re suggesting an order of MMA instruction issues that would result in at least one operand being reused from one instruction to the next. Clever stuff.

I’d be so happy if SASS were documented and ptxas were open source, sometimes I spend entire days going through whitepapers and various sources of online documentation to get more hardware details…


> My so-called CPU “active” time is actually an inferred value; CUDA spins the CPU 100% constantly, even when the CPU is just waiting for the GPU

The CUDA Runtime and Driver APIs allow you to use “blocking synchronization”, where the CPU will go to sleep while waiting for synchronization with the device. However, it seems that PyTorch doesn’t expose this functionality in any of its Python APIs:

https://github.com/pytorch/pytorch/issues/28224

What happens when you try using ctypes to call into libcudart.so to set the device flags as described in the above issue? You’ll have to call torch.cuda.init() for it to work, and unfortunately it won’t work if PyTorch is launching kernels from other threads.
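
A hedged sketch of what that ctypes call could look like (library lookup and error handling are my guesses; on a machine without CUDA it just reports failure, and the ordering constraints relative to context creation are as discussed in the issue above):

```python
import ctypes
import ctypes.util

cudaDeviceScheduleBlockingSync = 0x04  # value from cuda_runtime_api.h

def try_set_blocking_sync() -> int:
    """Call cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) if a CUDA
    runtime is present; return the cudaError_t, or -1 if none is found."""
    name = ctypes.util.find_library("cudart")
    if name is None:
        return -1
    return ctypes.CDLL(name).cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)

result = try_set_blocking_sync()
print("cudaSetDeviceFlags ->", result)
```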


Aha, I was hoping to learn about something like this, thanks for sharing. I'll try this some time. PyTorch does use different threads for the forward and backward pass, so as you suggest, setting that flag might only improve the forward pass.


The CUDA Runtime and Driver APIs have per-thread state, so using threads would unfortunately bypass our trick here to set the flag. Assuming you're on Linux, I might suggest creating a shared library to intercept calls to the Driver API, as all Runtime functions are implemented as wrappers around Driver functions. You'd have to intercept all calls to context creation and flag setting:

  * `cuCtxCreate`

  * `cuCtxCreate_v3`

  * `cuCtxSetFlags`

  * `cuDevicePrimaryCtxRetain`

  * `cuDevicePrimaryCtxSetFlags`

... and make sure that the three least significant bits of any `flags` variable are set to `CU_CTX_SCHED_BLOCKING_SYNC`.

cuDevicePrimaryCtxSetFlags: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PR...

dlsym(3): https://man.archlinux.org/man/dlsym.3.en

ld.so(8): https://man.archlinux.org/man/ld.so.8.en
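
The flag rewrite inside such an interposer boils down to a few bit operations (constant values copied from cuda.h; the interposed shared library itself is left as an exercise):

```python
# Scheduling-flag values from cuda.h (CUDA Driver API).
CU_CTX_SCHED_SPIN          = 0x01
CU_CTX_SCHED_YIELD         = 0x02
CU_CTX_SCHED_BLOCKING_SYNC = 0x04
CU_CTX_SCHED_MASK          = 0x07

def force_blocking_sync(flags: int) -> int:
    """Rewrite a context-creation `flags` word: clear the three
    scheduling bits, then set blocking sync, leaving other bits
    (e.g. CU_CTX_MAP_HOST = 0x08) untouched."""
    return (flags & ~CU_CTX_SCHED_MASK) | CU_CTX_SCHED_BLOCKING_SYNC

print(hex(force_blocking_sync(CU_CTX_SCHED_SPIN)))          # 0x4
print(hex(force_blocking_sync(0x08 | CU_CTX_SCHED_YIELD)))  # 0xc
```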


I’m somewhat confused as to what is exposed, as the description in the quote sounds like a blocking call, but with a busy wait, which seems like it couldn’t be the only or main thing that PyTorch exposes.


Not just that: you can perfectly happily poll a marker you inserted into the CUDA stream, interspersed with sched_yield() syscalls to let other processes get work done between your checks of whether the GPU has reached a point where you can retrieve results (as/if relevant) and submit new work. You would have to tune the scheduler time slice so that, after you yield, other processes don't keep running long enough for your queue of submitted work to run dry before you get to top it off. This isn't as critical when you can completely fill the scheduler queue (I remember ~1000 entries, but it's been years and I haven't checked whether I even remembered correctly; don't rely on this!), since in that case you may want to force a sleep of some millisecond(s) to keep the CPU core sleeping instead of merely allowing other processes to get work done.
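
A CPU-only sketch of that poll-plus-yield loop, with a background thread standing in for the GPU reaching the stream marker (names and timings are mine):

```python
import os
import threading
import time

done = threading.Event()  # stands in for a marker written by the GPU

def fake_gpu_work():
    time.sleep(0.05)      # pretend the GPU takes 50 ms to reach the marker
    done.set()

threading.Thread(target=fake_gpu_work, daemon=True).start()

# Poll the marker, yielding between checks so other processes can run
# instead of burning the core in a tight spin.
polls = 0
while not done.is_set():
    polls += 1
    os.sched_yield()

print("marker observed after", polls, "yields")
```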


That is indeed the only API that it exposes.


CUTLASS, which is NVIDIA’s C++ template library for writing matrix multiply and convolution kernels parametrized over input/output types, operators, and algorithm block sizes, theoretically supports this. But each element of the (k, n) shaped matrix B will be read from global memory ceil(m / block dimension) times in an algorithm that computes one (block dimension, block dimension) submatrix of the output matrix D per thread block. It will probably be more efficient to cast your B matrix to a lower precision like FP16 or INT8 in a preprocessing kernel to reduce memory traffic in the matrix multiply kernel.

On newer GPUs, though, we have this huge L2 cache which makes the calculus a little different if your working set fits into it. e.g. Ampere A100 has 40MB L2$.
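
The read-amplification argument can be sketched numerically (shapes and block size hypothetical): each element of B is re-read once per block-row of the output, so halving the element size halves total B traffic.

```python
import math

def b_traffic_bytes(m, k, n, block, bytes_per_elt):
    """Bytes of B read from global memory when each thread block computes
    one (block, block) output tile of D = A @ B: every element of B is
    read once per block-row of the output, i.e. ceil(m / block) times."""
    return k * n * bytes_per_elt * math.ceil(m / block)

m = k = n = 4096
fp32 = b_traffic_bytes(m, k, n, 128, 4)
fp16 = b_traffic_bytes(m, k, n, 128, 2)
print(fp32 // 2**30, "GiB vs", fp16 // 2**30, "GiB")  # 2 GiB vs 1 GiB
```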


Personally I would lean away from radix sort once the problem size gets too large; on Turing they made the per-SM L1 cache coherent and on Ampere they made L2 cache significantly larger so getting maximum throughput requires more actively designing for cache-friendly access patterns. The access pattern of radix sort will be very strided, so you get low cache utility for those longer distance swaps. If your problem fits into shared memory or even registers (for in-warp shuffles) then definitely, radix sort is super efficient!

Memory coalescing probably got a little more lenient than when you were programming. If any group of threads in a warp is accessing a 128B size-and-aligned region of memory, that perfectly matches the 128B line size between L1TEX cache and L2 cache and you get coalesced accesses. Even 32B size-and-aligned (e.g. each pair of threads accesses two halves of a 32B region) isn’t all that bad, since it aligns with the 32B L2 cache line size to global memory. You run the risk of getting poor load/store utility if you don’t access the other 96B in that L1TEX line somewhere in the same block, though.

Constant memory is less important than it used to be, since constant memory goes through the texture memory, and the L1 cache was unified and combined with the texture cache. There’s still an LDG.CONSTANT SASS instruction on Turing, though, so there’s still a difference in the hardware that escapes me right now. IIRC you can write a MOV instruction with the source operand in constant memory (rather than LDG.CONSTANT) which doesn’t put load on the load-store unit (LSU). Could be useful if you’re getting warp stalls due to memory pipe throttling.
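
The size-and-aligned coalescing rule above can be checked with a little address arithmetic (addresses hypothetical):

```python
def regions_touched(addresses, region=128):
    """Count how many `region`-byte size-and-aligned chunks a warp's
    per-thread addresses fall into; 1 means a fully coalesced access."""
    return len({addr // region for addr in addresses})

contiguous = [0x1000 + 4 * t for t in range(32)]    # 32 threads, 4B each
strided    = [0x1000 + 128 * t for t in range(32)]  # one region per thread
print(regions_touched(contiguous), regions_touched(strided))  # 1 32
```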


Interesting, love it!

> on Turing they made the per-SM L1 cache coherent and on Ampere they made L2 cache significantly larger

Yeah, Turing/Ampere are more than people give them credit for. They are quite a bit more powerful in compute in general, IIRC; the core is really Volta-descended and not Pascal.

> so getting maximum throughput requires more actively designing for cache-friendly access patterns. The access pattern of radix sort will be very strided, so you get low cache utility for those longer distance swaps. If your problem fits into shared memory or even registers (for in-warp shuffles) then definitely, radix sort is super efficient!

Does dumping warp-sorted data into the buffer help? Eg can each warp or each block sort their output so what they're dumping in can be "galloped", like a partially-sorted mergesort or something? Finding those boundaries between output cells is (ironically) another prefix-scan lol.

Is it just that you need a different sort, or does sorting just not work that well now?

Are prefix-scans still good at all?

Interesting, anything in particular to read about it from Anandtech or Chips+Cheese or what? Or should I just read the whitepapers? (/sigh, reading primary sources, my only weakness)

(I really wish there was an Agner Fog for GPUs!)

> IIRC you can write a MOV instruction with the source operand in constant memory (rather than LDG.CONSTANT) which doesn’t put load on the load-store unit (LSU).

hackerman.jpg


> Interesting, anything in particular to read about it from Anandtech or Chips+Cheese or what? Or should I just read the whitepapers? (/sigh, reading primary sources, my only weakness)

I haven’t had great luck with consulting non-primary-sources, most of my knowledge comes from reading NVIDIA blog posts and GTC presentations as they become relevant. Lately I’ve been working with CUTLASS and reading through that documentation — maybe start with their presentations and work back through their references? I’ve learned a lot by reading the architecture tuning guides from NVIDIA, too.

> Does dumping warp-sorted data into the buffer help? Eg can each warp or each block sort their output so what they're dumping in can be "galloped", like a partially-sorted mergesort or something? Finding those boundaries between output cells is (ironically) another prefix-scan lol.

> Is it just that you need a different sort, or does sorting just not work that well now?

I’m not sure, honestly! All I know is that recently I’ve been looking at radix sort kernels with lower-than-expected memory throughput and low cache hit rates :)

> Are prefix-scans still good at all?

The CUB linear-time prefix scan kernels seem to be fantastic still, they operate basically at the speed of DRAM with really high compute utilization. When I’ve seen lower-than-expected performance with these kernels, it’s because of an issue with an inefficient transform being made as part of the input/output iterator ranges, or because of some local memory usage due to an indexing operation in a local variable that couldn’t be fully inlined into registers.

