
I'd love to believe it's true but I suspect they're overstating the result, or it's a fluke. Presumably teams at large firms like Meta would have put a lot of effort into checking whether not-synchronise-every-step training could match synchronise-every-step training before investing hundreds of millions of dollars into the low-latency, high-throughput network hardware necessary for the latter.
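For context, the synchronize-every-step status quo described above looks roughly like the sketch below. This is a minimal illustration assuming PyTorch with torch.distributed already initialized; production DDP buckets and overlaps these all-reduces rather than looping over parameters.

    # Minimal sketch of synchronize-every-step data-parallel training
    # (assumes torch.distributed is initialized and every rank holds a full model replica).
    import torch
    import torch.distributed as dist

    def train_step(model, optimizer, loss_fn, batch):
        optimizer.zero_grad()
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        # Every single step, every gradient is averaged across all replicas.
        # This all-reduce is what demands the low-latency, high-bandwidth fabric.
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
        optimizer.step()
        return loss.item()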


We're pretty confident it's not a fluke, and paper + code are the next step, within a couple months. It's not "synchronize every step", but it's "do something every step".

We double and triple and quadruple checked our results, to make sure that we are in fact getting results like this while only doing our thing every step, and it really keeps holding up.

Don't take our word for it, though; you'll see when the paper comes out :)
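To make the "not synchronize every step, but do something every step" contrast concrete, here is a purely hypothetical stand-in, not DisTrO's actual method (which is unpublished): a per-step exchange can be much cheaper than a full gradient all-reduce if each rank shares only a small summary, e.g. its top-k gradient entries. The function sparse_sync and the parameter k are illustrative names.

    # Hypothetical illustration only -- NOT DisTrO's method. It shows how a
    # per-step exchange can be much smaller than a full gradient all-reduce:
    # each rank shares just the k largest-magnitude gradient entries.
    import torch
    import torch.distributed as dist

    def sparse_sync(grad: torch.Tensor, k: int) -> torch.Tensor:
        flat = grad.flatten()
        _, idx = flat.abs().topk(k)           # k largest-magnitude positions
        vals = flat[idx]                      # their signed values
        world = dist.get_world_size()
        all_vals = [torch.empty_like(vals) for _ in range(world)]
        all_idx = [torch.empty_like(idx) for _ in range(world)]
        dist.all_gather(all_vals, vals)       # exchange ~2k numbers per rank
        dist.all_gather(all_idx, idx)         # instead of the full gradient
        merged = torch.zeros_like(flat)
        for v, i in zip(all_vals, all_idx):
            merged[i] += v                    # indices within one rank are unique
        return (merged / world).view_as(grad)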


Um, so why announce something before even a paper with replicable details is available? To put it bluntly, what are we supposed to do with the information?

I'd be less harsh if this were some grant requirement to release a report before a certain date, but I don't see any grant-funding declaration.


We're excited about the potential and want to find other folks who are also excited about it and interested in working for/with us to build things on the foundations of DisTrO! Plus, it's so cool and mind-boggling to us that we wanted to share the hype a little bit; it was hard not being able to tell anyone we were working on it.


I sent you guys an email yesterday to find a way I can help build this pretty, pretty cool idea.


I'm happy to have the project on my radar, and though they could be a bit clearer about the provisional nature of the research, I don't think it's wrong to want to hype its potential a bit.


Is synchronize-every-step training the status quo for training LLMs?

I've not kept up to date with training/optimizer research for quite some time, but during the deep learning craze there were papers like the ones about DistBelief/Downpour SGD [0] that showed how to scale up training by only doing occasional synchronization. Did that not transfer to transformer/LLM training?

[0]: https://proceedings.neurips.cc/paper_files/paper/2012/hash/6...
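For anyone who hasn't read [0], here is a toy single-process sketch of the Downpour SGD pattern: workers push accumulated gradients to a parameter server and pull fresh parameters only every few steps, so replicas drift slightly between synchronizations. The class names are illustrative; n_fetch and n_push loosely follow the paper's terminology.

    # Toy single-process sketch of Downpour SGD [0]: workers synchronize with a
    # parameter server only every n_fetch / n_push steps, not every step.
    import numpy as np

    class ParameterServer:
        def __init__(self, dim, lr=0.1):
            self.params = np.zeros(dim)
            self.lr = lr

        def push(self, grad):
            # Apply a possibly stale, accumulated gradient.
            self.params -= self.lr * grad

        def pull(self):
            return self.params.copy()

    class Worker:
        def __init__(self, server, dim, n_fetch=5, n_push=5, lr=0.1):
            self.server, self.n_fetch, self.n_push, self.lr = server, n_fetch, n_push, lr
            self.params = server.pull()
            self.acc_grad = np.zeros(dim)

        def step(self, t, grad_fn):
            if t % self.n_fetch == 0:
                self.params = self.server.pull()   # occasional fetch
            g = grad_fn(self.params)
            self.params -= self.lr * g             # local updates in between
            self.acc_grad += g
            if t % self.n_push == 0:
                self.server.push(self.acc_grad)    # occasional push
                self.acc_grad[:] = 0.0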


Yes, ultimately everyone is currently doing something which looks like synchronous data parallel training on the outside.

The linked PDF is very light on detail, but the results they do claim are for a 1.2B-parameter model. That is tiny; you don't need network-bound distributed training (i.e., anything beyond a single datacenter-class machine, or less if you're patient) to train a model of that size. The comms requirements also scale with the model size, so I strongly suspect people hoping for embarrassingly-parallel-style scaling properties are going to be disappointed.
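Rough numbers behind "the comms requirements scale with the model size", assuming naive data parallelism where full fp16 gradients are exchanged every step (the exact constants depend on precision and the all-reduce algorithm):

    # Back-of-the-envelope gradient traffic per optimizer step, assuming
    # fp16 gradients (2 bytes each) are exchanged in full every step.
    def per_step_gradient_gb(n_params, bytes_per_grad=2):
        return n_params * bytes_per_grad / 1e9

    for n in (1.2e9, 70e9, 405e9):
        print(f"{n/1e9:.1f}B params -> ~{per_step_gradient_gb(n):.1f} GB of gradients per step")
    # 1.2B -> ~2.4 GB, 70B -> ~140 GB, 405B -> ~810 GB, before all-reduce overheads.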

(They also appear to have, in part, reinvented parameter servers.)


In particular, it appears that they only implement data parallelism (DP). At 1.2B parameters you can fit a full copy of the model in memory, but larger models require splitting the weights across multiple machines, using techniques such as fully sharded data parallel (FSDP), tensor parallel (TP), pipeline parallel (PP), and so on.

Without more details, it's unclear whether the proposed technique keeps its speedups in that case.
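For a sense of the memory scale behind that distinction, here is a back-of-the-envelope calculation assuming the standard mixed-precision Adam bookkeeping of roughly 16 bytes of weight/optimizer state per parameter (activations excluded):

    # Rough per-replica training state for mixed-precision Adam:
    # ~2 bytes fp16 weights + 2 bytes fp16 grads + 12 bytes fp32 master
    # weights and two Adam moments = ~16 bytes per parameter.
    def training_state_gb(n_params, bytes_per_param=16):
        return n_params * bytes_per_param / 1e9

    for n in (1.2e9, 70e9):
        print(f"{n/1e9:.1f}B params -> ~{training_state_gb(n):.0f} GB per full replica")
    # ~19 GB fits on a single modern accelerator; ~1120 GB does not,
    # hence the TP/PP/FSDP-style sharding mentioned above.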


This is not true


I was into Nous at first, but it seems they mostly just do graphic design and vibes stuff so a16z gives them money. Which, whatever, nice work if you can get it, but don't use the same tactics for research projects.


Not if it cost them a month to do so.



