Sure. Basically everything in https://github.com/tysam-code/hlb-CIFAR10 was built directly on concepts shared in the above paper, down to the coding, commenting, and layout styles (which is why I advocate so strongly for it as required reading for ML -- the empirical benefits are clear to me).
Before I sat down and wrote my first line, I spent a very long time thinking about how to optimize the repo. Not just in terms of information flow during training, but also in how the code is laid out (minimizing the expected size of a change over the space of likely future edits), and even in the explanatory comments (the ratio of space used to the mental effort needed to decode the repo, for experienced and inexperienced developers alike). I really want it to be a good exemplar of a different, more scalable, and more efficient way of conducting small-scale (and potentially resource-constrained) research. To do that, you have to maximize information efficiency at every stage of the pipeline, including temporally (!!!!).
It's not perfect, but I've used info theory as a strong guiding light for that repo. There's more to say here, but it's a long conversation about the expected utility of doing research in a few different ways.
Just a point of clarification -- are you looking to know why it is, or are you pointing out that it's important? I'm suspecting the latter but am happy to share some resources if it's the former.
(Also, the poster that you're replying to has a bit of a history of...stirring the pot in comment sections if you look. But hey, it let me show some of my open source work, so that's good, at least! ;PPPP)
I think it is useful to know what works well for foundational reasons, so you know what to tweak and what is less likely to work.
I would love to know more about why. I have hundreds of why questions about transformers! And deep learning in general. The main one being: why are we so lucky that gradient descent is stable and works? It feels like a miracle, and the opposite of the trickiness thrown at us by physics, maths and computer science!
I mean, there are plenty of reasons I intuitively (just gut feel, without mathematical intuition) think gradient descent should break down on big models. For example, it could just become chaotic: nudging any parameter has a butterfly effect, and the loss bounces around randomly and never converges.
W.r.t. the weight updates, ah, but it is extremely chaotic -- your intuition is correct, I feel! :) :D It took me a lot of years to figure that out though, so good job on that one. We've just adapted our architectures to clamp down on those extra degrees of freedom and break that chained dependency from layer to layer. For example, residuals softly 'force' a single, defined latent space on each group of layers that adds into the residual stream, and batchnorm re-centers and re-scales the activations to prevent a statistical explosion, decoupling that from the learned scale and bias.

Label smoothing can basically be viewed as an interpolation between raw cross entropy and predicting a uniform distribution every time -- what do you think that does to the internal sensitivity of the network? It's a great, label-free way to explicitly train for stability. If you look at the expected magnitude of the cross entropy for output distributions of different entropies, you'll quickly see why an entropy-regulated loss quiets the popcorn bag of constantly-changing kernels (heh) down to a dull mumble. Heavy-ball momentum prevents a lot of oscillation because it smooths our gradient updates over time -- surprisingly effective! And of course LR schedules like OneCycle, which do a kind of LR annealing, are very effective as well.
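To make the label smoothing point concrete, here's a tiny standalone sketch (plain PyTorch; every name here is mine for illustration, nothing is lifted from the repo) showing that the smoothed loss really is just a blend of ordinary cross entropy and a 'predict the uniform distribution' loss, with the momentum/OneCycle stabilizers spelled out at the end:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)

    # --- Label smoothing as an interpolation between two losses ---
    logits = torch.randn(4, 10)              # stand-in network outputs: batch of 4, 10 classes
    targets = torch.tensor([3, 1, 7, 0])     # hard integer labels
    eps = 0.2                                # smoothing strength

    hard_ce = F.cross_entropy(logits, targets)         # plain cross entropy against the labels
    log_probs = F.log_softmax(logits, dim=-1)
    uniform_ce = -log_probs.mean(dim=-1).mean()        # cross entropy against a uniform target

    blended = (1 - eps) * hard_ce + eps * uniform_ce   # the interpolation described above
    builtin = F.cross_entropy(logits, targets, label_smoothing=eps)
    print(blended.item(), builtin.item())              # these should agree up to float error

    # The smoothed target also caps useful confidence at (1 - eps + eps/K) on the true class,
    # so gradients fade once the network is "confident enough" rather than pushing forever.

    # --- The optimization-side stabilizers mentioned above, in PyTorch terms ---
    model = torch.nn.Linear(10, 10)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # heavy-ball momentum
    sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=1000)  # warmup + anneal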
There are many, many, many more little things that go into that, and I think it took a few decades for us to find a lot of them! If we could go back 20 years with the ideological innovations we've found since, we'd be leagues ahead of what was available at the time, even on the constrained compute, as a result.
Another cool effect is how the Lottery Ticket Hypothesis shows that, by rolling enough 'tickets', one can find weights suitably close to a decent representation from the start -- and much of training is then just tuning, and potentially suppressing the noise from the more poorly-initialized weights. That still somewhat goes over my head, but it is a cool hypothesis indeed.
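If anyone wants to poke at that hands-on, the core recipe (as I understand it from Frankle & Carbin) is: save the init, train, keep only the largest-magnitude weights, rewind the survivors to their initial values, and retrain. A toy single-round sketch -- all of the specifics (tiny MLP, ~20% keep ratio, step counts) are invented purely for illustration:

    import copy
    import torch
    import torch.nn as nn

    def train(model, data, targets, steps=200, lr=0.1):
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(data), targets).backward()
            opt.step()
        return model

    torch.manual_seed(0)
    data, targets = torch.randn(256, 20), torch.randint(0, 2, (256,))

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    init_state = copy.deepcopy(model.state_dict())    # 1. remember the initialization

    train(model, data, targets)                       # 2. train the dense network

    # 3. keep only the largest-magnitude ~20% of each weight matrix (biases left alone)
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:
            threshold = p.abs().flatten().kthvalue(int(0.8 * p.numel())).values
            masks[name] = (p.abs() > threshold).float()

    # 4. rewind surviving weights to their *initial* values -- this sparse init is the "ticket"
    model.load_state_dict(init_state)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

    # 5. retrain the sparse subnetwork (a faithful version re-applies the masks after every
    # step so pruned weights stay at zero; omitted here to keep the sketch short)
    train(model, data, targets)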
I'm not sure if that's a good teaser response or not, but I'm happy to talk at length about certain things. :)))) <3 <3 <3 :D
Thanks! That sounds like years of learning and experience distilled into a few paragraphs. Scary and interesting at the same time. Scary because learning math-intensive stuff is 10x slower than learning coding stuff. I may never understand all of that. Right now I am taking a break from learning deeply and trying to have some fun, hence I am doing the fastAI course and hope to categorise dog and cat pics etc., and find a nice use case to impress the kids. But I will probably swing back round to this stuff at some point, as it is fun (and unfun at the same time) to learn.
Yeah, they are great, and (up the causal chain) part of the reason for some of the work I've done! Seems really fun! <3 :))))
Facebook's Segment Anything Model, I think, has a lot of potentially really fun use cases. Plain-text description -> network segmentation (https://github.com/facebookresearch/segment-anything/blob/ma...). Not sure if that's what you're looking for or not, but I love that impressing your kids is where your heart is. That kind of parenting makes me very, very, very happy. :') <3
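If you do end up playing with it, the released Python interface prompts with points/boxes; here's a minimal sketch from memory of that repo's README (treat the checkpoint path, image filename, and click coordinates as placeholders):

    # Minimal point-prompt sketch for facebookresearch/segment-anything
    # (pip install segment-anything; you'd download a SAM checkpoint yourself)
    import numpy as np
    import cv2
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_b"](checkpoint="path/to/sam_vit_b_checkpoint.pth")
    predictor = SamPredictor(sam)

    image = cv2.cvtColor(cv2.imread("cat.jpg"), cv2.COLOR_BGR2RGB)  # SAM expects RGB
    predictor.set_image(image)

    # One foreground click roughly on the object; coordinates are made up for the example
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),    # 1 = foreground point, 0 = background point
        multimask_output=True,         # return a few candidate masks at different granularities
    )
    best_mask = masks[scores.argmax()]  # boolean HxW array you can overlay on the image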