
In the DPO paper linked from the OP page, DPO is described as "a simple RL-free algorithm for training language models from preferences." So as you say, "not technically RL."

Given that, shouldn't the first sentence on the linked page end with "...in a process known as DPO (...)" ? Ditto for the title.

It sounds like you're saying that the terms RL and RLHF should subsume DPO because they both solve the same problem, with similar results. But they're different techniques, and there are established terms for both of them.



I think the other comment thread discusses this well. They are different techniques, but the line between RL and supervised learning is quite fuzzy. The DPO authors advertise it as a "non-RL" technique precisely to get away from RL's reputation for unstable training, but they also treat the language model itself as an (implicit) reward model, much as PPO-based RLHF does. The point is well taken, though; I will update the page to clarify the differences and avoid confusion.
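To make the "implicit reward model" point concrete, here is a minimal sketch of the per-example DPO loss from the paper: the implicit reward of a response is the (scaled) log-probability ratio between the policy and the reference model, and the loss is a Bradley-Terry preference loss on the reward difference. The function and argument names are my own illustrative choices, not from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected))."""
    # Implicit reward of each response: beta * (log pi(y|x) - log pi_ref(y|x))
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry preference loss on the implicit reward difference
    diff = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

No separate reward network or RL rollout appears anywhere, which is why the authors call it RL-free; yet the log-ratio terms are exactly the reward that PPO-based RLHF would optimize against.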



