Hacker News | t-vi's comments

- I don't think it hurts to learn PyTorch (and having learned JAX is good, too). I don't know if JAX + Triton is as impossible as you make it out, but PyTorch's Triton integration seems to be quite good for many things.

- For Pallas, Triton and CUDA/C++, you probably want to know a bit about how GPUs work. There is the GPU-Mode Discord / lectures / resources if you are looking for material: https://github.com/gpu-mode/

- In my experience, how well Triton works varies depending on what you want to do (i.e. how well the programming model fits the task). When it fits, it is quite nice to get something reasonably fast, reasonably fast. PyTorch (in the inductor torch.compile backend) has made many things work well, so you could check that out if you run out of examples elsewhere.


Note that the NVIDIA container uses CUDA + cuBLAS 13.0.2, whose release notes cite "Improved performance on NVIDIA DGX Spark for FP16/BF16 and FP8 GEMMs", which seems to be your use case. In general, I would suspect that it mostly comes down to the versions of the libraries.

Interestingly, there is a cuBLAS 13.1 wheel on PyPI; I'm not sure what that changes.


I did a shallow check on PyTorch (which reports version 2.9.0): it is different from the 2.9.0 on the PyTorch index, and the differences come from code that changed months before 2.9.0 was out, which is why I am assuming that NVIDIA is using their own fork. As for cuBLAS, natively I see the same version (libcublas.so.13.1.0.3) as in the container.


It seems to me that in 2016 people did (have to) play a lot more tricks with backpropagation than today. Back then it was common to meddle with the gradients in the middle of gradient propagation.

For example, Alex Graves's great 2013 paper "Generating Sequences With Recurrent Neural Networks" (which already has attention) has this line:

One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.

with this footnote:

In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
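In modern PyTorch, this kind of clipping of an intermediate derivative can be done with a backward hook. A minimal sketch (assuming PyTorch is available; the tensor and the clipping range are made up for illustration):

```python
import torch

# Clamp the gradient flowing into a tensor to a fixed range, here
# [-10, 10], before it propagates further backward -- roughly the
# Graves-style trick of clipping derivatives at intermediate points.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x.register_hook(lambda g: g.clamp(-10.0, 10.0))

# A loss whose gradient w.r.t. x would be 100 * x, far outside the range.
loss = (50.0 * x * x).sum()
loss.backward()
print(x.grad)  # every entry clamped to 10.0
```

The hook's return value replaces the gradient before it is accumulated into `x.grad`, so everything downstream of `x` in the backward pass sees the clipped values.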

That said, backpropagation seems important enough to me that I once did a specialized video course just about PyTorch (1.x) autograd.


> It seems to me that in 2016 people did (have to) play a lot more tricks with the backpropagation than today

Perhaps, but maybe because there was more experimentation with different neural net architectures and nodes/layers back then?

Nowadays the training problems are better understood, clipping is supported by the frameworks, and it's easy to find training examples online with clipping enabled.

The problem itself didn't actually go away. ReLU (or GELU) is still the default activation for most networks, and training an LLM is apparently something of a black art. Hugging Face just released their "Smol Training Playbook: a distillation of hard earned knowledge to share exactly what it takes to train SOTA LLMs", so evidently even in 2025 training isn't exactly a turn-key affair.


I wonder if it's just that the important neural nets are now trained by large, secretive corporations that aren't interested in sharing their knowledge.


I'm sure that's part of it, which is why it's nice to see Hugging Face sharing this, but still it obviously reflects the reality that large LLMs are difficult to train for whatever reasons (maybe more than just gradient issues - I haven't read that HF doc yet).

For simpler nets, like ResNet, it may just be that modern initialization and training recipes avoid most gradient issues, even though they are otherwise potentially still there.


Note that PyTorch's kernels are somewhat generic with respect to shape. It has always been relatively easy to get speedups by specializing for the shape; e.g. Apache TVM had that (back before it was "Apache" even).
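As a toy illustration of the idea (plain Python, not TVM or PyTorch): specializing on the shape bakes the loop bounds into the generated code as constants, which a compiler can then unroll and optimize.

```python
# Generic dot product: the length is only known at runtime.
def dot_generic(a, b):
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

def specialize_dot(n):
    # Generate source with the length baked in, mimicking (crudely)
    # what a kernel compiler does when it specializes on shape.
    body = " + ".join(f"a[{i}] * b[{i}]" for i in range(n))
    src = f"def dot_{n}(a, b):\n    return {body}\n"
    ns = {}
    exec(src, ns)
    return ns[f"dot_{n}"]

dot3 = specialize_dot(3)
print(dot3([1, 2, 3], [4, 5, 6]))  # 32, same result as dot_generic
```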


If you like JIT wrappers and Python interpreters:

In Thunder[1], a PyTorch-to-Python JIT compiler for optimizing DL models, we are maintaining a bytecode interpreter covering Python 3.10-3.12 (and 3.13 soon) for our jit. It allows running Python code while redirecting arbitrary function calls and operations, but it is quite a bit slower than CPython.

While the bytecode changes between versions (and sometimes it is a back-and-forth, for example in the call handling), it is quite manageable once you embrace that there will be differences between Python versions.

What has been a large change is the new zero-cost (in the happy path) exception handling, but I can totally see why Python made that change away from setting up try-block frames.

I will say that I was happy not to have to support Python <= 3.9, as changes were a lot more involved there (the bytecode format itself, etc.).

Of course, working on this also means knowing otherwise useless Python trivia afterwards. One of my favorites is how this works:

  l = [1, 2, 3]
  l[-1] += l.pop()
  print(l)
1. https://github.com/Lightning-AI/lightning-thunder/
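One way to see what happens is to disassemble it. A sketch, assuming CPython (the exact opcode names in the `dis` output vary across the versions discussed above):

```python
import dis

def f():
    l = [1, 2, 3]
    l[-1] += l.pop()
    return l

# The disassembly shows that the subscript l[-1] is loaded *before*
# l.pop() runs: the += sees the old value 3, pop() shrinks the list
# to [1, 2] and returns 3, and the store then writes 3 + 3 = 6 back
# into the (new) last slot.
dis.dis(f)
print(f())  # -> [1, 6]
```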


What is your reason for a bytecode interpreter vs recompiling the AST?


When I did a similar thing (but with less LLM), I liked https://github.com/coqui-ai/TTS but back then I needed to cut out the conversion step from tensor to a list of numbers to make it work really nicely.


The subtraction is there because it "is an example of constructing the “Inner Product” distance" per the text above it. That ymmone might not actually contain ones could be because they only need the result up to a constant and so don't care, but it's probably not ideal to name the thing containing 0s ymmone.


Is there a context where it is usual to call the negative of the dot product a distance?


> Is avoiding CF potentially just a matter of sheer scale ?

My intuition would be that you get more directions orthogonal to the gradients (of previous samples) if you have a larger model.
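That intuition is easy to check for random directions, at least. A plain-Python sketch (not a forgetting experiment; the dimensions and trial count are arbitrary): random vectors in higher dimensions are increasingly close to orthogonal, so updates for new samples interfere less with old ones.

```python
import math
import random

def avg_abs_cos(dim, trials=200, seed=0):
    # Average |cosine similarity| between pairs of random Gaussian
    # vectors of the given dimension.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(dim)]
        b = [rng.gauss(0, 1) for _ in range(dim)]
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        total += abs(dot) / (na * nb)
    return total / trials

# The average |cos| shrinks roughly like 1/sqrt(dim), so the
# high-dimensional directions are much closer to orthogonal.
print(avg_abs_cos(10), avg_abs_cos(1000))
```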


After the first epoch, the average time since the present data item was last used during training is small at the beginning of an epoch and grows during the epoch. I'd expect that to correlate positively with the loss on the present iteration.


My former neighbors run https://justanotherfoundry.com/ and I like their work and bought some.

