Is there an intuitive reason why this ends up working this well compared to, say, applying some kind of thresholding to attention activations that are below average for a given head to filter that same attention noise out?
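
To make the comparison concrete, here is a minimal sketch of the kind of thresholding I have in mind. It assumes "below average" means below the mean weight of a single softmax row, and that the surviving weights are renormalized; the function names and example scores are made up for illustration and are not from any paper or library.

```python
import math

def softmax(scores):
    # Numerically stable softmax over one row of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_below_mean(weights):
    # Hypothetical filter: zero out weights below the row mean,
    # then renormalize the survivors so they sum to 1 again.
    # The largest weight is always >= the mean, so total > 0.
    mean = sum(weights) / len(weights)
    kept = [w if w >= mean else 0.0 for w in weights]
    total = sum(kept)
    return [w / total for w in kept]

# Made-up scores for one query: two relevant keys, three "noise" keys.
scores = [4.0, 0.5, 0.2, 3.5, 0.1]
weights = softmax(scores)
filtered = threshold_below_mean(weights)
```

With these scores, the three low-scoring keys end up with zero weight and their mass is redistributed to the two high-scoring ones. Note that this is a hard, fixed cutoff applied after the softmax, which is only one of several ways the idea could be implemented.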

