I don't know about strictly superior. It's certainly strictly easier for people on a budget who just need "good enough" results on the first try. I don't have any evidence whatsoever, but I'd expect that enough tuning and retries can squeeze a bit more performance out of RLHF than you can get out of DPO.

