
I'd love to believe it's true but I suspect they're overstating the result, or it's a fluke. Presumably teams at large firms like Meta would have put a lot of effort into checking whether not-synchronise-every-step training could match synchronise-every-step training before investing hundreds of millions of dollars into the low-latency, high-throughput network hardware necessary for the latter.
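For context, the synchronize-every-step status quo described above looks roughly like the sketch below. This is a minimal illustration assuming PyTorch with torch.distributed already initialized; production DDP buckets and overlaps these all-reduces rather than looping over parameters.

    # Minimal sketch of synchronize-every-step data-parallel training
    # (assumes torch.distributed is initialized and every rank holds a full model replica).
    import torch
    import torch.distributed as dist

    def train_step(model, optimizer, loss_fn, batch):
        optimizer.zero_grad()
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        # Every single step, every gradient is averaged across all replicas.
        # This all-reduce is what demands the low-latency, high-bandwidth fabric.
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
        optimizer.step()
        return loss.item()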


We're pretty confident it's not a fluke, and paper + code are the next step, within a couple months. It's not "synchronize every step", but it's "do something every step".

We double and triple and quadruple checked our results, to make sure that we are in fact getting results like this while only doing our thing every step, and it really keeps holding up.

Don't take our word for it, though; you'll see when the paper comes out :)
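To make the "not synchronize every step, but do something every step" contrast concrete, here is a purely hypothetical stand-in, not DisTrO's actual method (which is unpublished): a per-step exchange can be much cheaper than a full gradient all-reduce if each rank shares only a small summary, e.g. its top-k gradient entries. The function sparse_sync and the parameter k are illustrative names.

    # Hypothetical illustration only -- NOT DisTrO's method. It shows how a
    # per-step exchange can be much smaller than a full gradient all-reduce:
    # each rank shares just the k largest-magnitude gradient entries.
    import torch
    import torch.distributed as dist

    def sparse_sync(grad: torch.Tensor, k: int) -> torch.Tensor:
        flat = grad.flatten()
        _, idx = flat.abs().topk(k)           # k largest-magnitude positions
        vals = flat[idx]                      # their signed values
        world = dist.get_world_size()
        all_vals = [torch.empty_like(vals) for _ in range(world)]
        all_idx = [torch.empty_like(idx) for _ in range(world)]
        dist.all_gather(all_vals, vals)       # exchange ~2k numbers per rank
        dist.all_gather(all_idx, idx)         # instead of the full gradient
        merged = torch.zeros_like(flat)
        for v, i in zip(all_vals, all_idx):
            merged[i] += v                    # indices within one rank are unique
        return (merged / world).view_as(grad)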


Um, so why announce something before even a paper with replicable details is available? To put it bluntly, what are we supposed to do with the information?

I'd be less harsh if this were some grant requirement to release a report before a certain date, but I don't see any grant-funding declaration.


We're excited about the potential and want to find other folks who are also excited about it and interested in working for/with us to build things on the foundations of DisTrO! Plus, it's so cool and mind-boggling to us that we wanted to share the hype a little bit; it was hard not being able to tell anyone we were working on it.


I sent you guys an email yesterday to find a way I can help build this pretty, pretty cool idea.


I'm happy to have the project on my radar, and though they could be a bit clearer about the provisional nature of the research, I don't think it's wrong to want to hype its potential a bit.


Is synchronize-every-step training the status quo for training LLMs?

I've not kept up to date with training/optimizer research for quite some time, but during the deep learning craze there were papers like the ones about DistBelief/Downpour SGD [0] that showed how to scale up training by only doing occasional synchronization. Did that not transfer to transformer/LLM training?

[0]: https://proceedings.neurips.cc/paper_files/paper/2012/hash/6...
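For anyone who hasn't read [0], here is a toy single-process sketch of the Downpour SGD pattern: workers push accumulated gradients to a parameter server and pull fresh parameters only every few steps, so replicas drift slightly between synchronizations. The class names are illustrative; n_fetch and n_push loosely follow the paper's terminology.

    # Toy single-process sketch of Downpour SGD [0]: workers synchronize with a
    # parameter server only every n_fetch / n_push steps, not every step.
    import numpy as np

    class ParameterServer:
        def __init__(self, dim, lr=0.1):
            self.params = np.zeros(dim)
            self.lr = lr

        def push(self, grad):
            # Apply a possibly stale, accumulated gradient.
            self.params -= self.lr * grad

        def pull(self):
            return self.params.copy()

    class Worker:
        def __init__(self, server, dim, n_fetch=5, n_push=5, lr=0.1):
            self.server, self.n_fetch, self.n_push, self.lr = server, n_fetch, n_push, lr
            self.params = server.pull()
            self.acc_grad = np.zeros(dim)

        def step(self, t, grad_fn):
            if t % self.n_fetch == 0:
                self.params = self.server.pull()   # occasional fetch
            g = grad_fn(self.params)
            self.params -= self.lr * g             # local updates in between
            self.acc_grad += g
            if t % self.n_push == 0:
                self.server.push(self.acc_grad)    # occasional push
                self.acc_grad[:] = 0.0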


Yes, ultimately everyone is currently doing something which looks like synchronous data parallel training on the outside.

The linked PDF is very light on detail, but the results they do claim are for a 1.2B-parameter model. That is tiny; you don't need network-bound distributed training (i.e., anything beyond a single datacenter-class machine, or less if you're patient) to train a model of that size. The comms requirements also scale with the model size, so I strongly suspect people hoping for embarrassingly-parallel-style scaling properties are going to be disappointed.
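Rough numbers behind "the comms requirements scale with the model size", assuming naive data parallelism where full fp16 gradients are exchanged every step (the exact constants depend on precision and the all-reduce algorithm):

    # Back-of-the-envelope gradient traffic per optimizer step, assuming
    # fp16 gradients (2 bytes each) are exchanged in full every step.
    def per_step_gradient_gb(n_params, bytes_per_grad=2):
        return n_params * bytes_per_grad / 1e9

    for n in (1.2e9, 70e9, 405e9):
        print(f"{n/1e9:.1f}B params -> ~{per_step_gradient_gb(n):.1f} GB of gradients per step")
    # 1.2B -> ~2.4 GB, 70B -> ~140 GB, 405B -> ~810 GB, before all-reduce overheads.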

(They also appear to have, in part, reinvented parameter servers.)


In particular, it appears that they only implement data parallelism (DP). At 1.2B parameters you can fit a full copy of the model in memory, but larger models require splitting the weights across multiple machines, using techniques such as fully sharded data parallel (FSDP), tensor parallel (TP), pipeline parallel (PP), and so on.

Without more details, it's unclear whether the proposed technique keeps its speedups in that case.
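For a sense of the memory scale behind that distinction, here is a back-of-the-envelope calculation assuming the standard mixed-precision Adam bookkeeping of roughly 16 bytes of weight/optimizer state per parameter (activations excluded):

    # Rough per-replica training state for mixed-precision Adam:
    # ~2 bytes fp16 weights + 2 bytes fp16 grads + 12 bytes fp32 master
    # weights and two Adam moments = ~16 bytes per parameter.
    def training_state_gb(n_params, bytes_per_param=16):
        return n_params * bytes_per_param / 1e9

    for n in (1.2e9, 70e9):
        print(f"{n/1e9:.1f}B params -> ~{training_state_gb(n):.0f} GB per full replica")
    # ~19 GB fits on a single modern accelerator; ~1120 GB does not,
    # hence the TP/PP/FSDP-style sharding mentioned above.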


This is not true


I was into Nous at first, but it seems they mostly just do graphic design and vibes stuff so a16z gives them money. Which, whatever, nice work if you can get it, but don't use the same tactics for research projects.


Not if it cost them a month to do so.



