
If two attentions A and B are identical, wouldn't (A - lambda * B) be just (1 - lambda) * A? How does it "boost the signal value(s) over the noise"?


How embarrassing, I had one of those "autocorrect moments". I somehow put the lambda inside the softmax when thinking about it and trying it out, without realizing. So what I was playing with in a spreadsheet (where it's not as obvious as in plain code) was

    softmax(A) - softmax(lambda * A)
And as it happens, normalizing the output of that with my test vectors seems to really boost the largest component of the output if A and B are equal.
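
For anyone who wants to poke at this outside a spreadsheet, here is a rough sketch of the two variants side by side. The function names, the example vector, and especially the choice of L1 normalization are my assumptions; the comment above doesn't say which normalization was used. Note that softmax(A) - softmax(lambda * A) sums to zero, so dividing by the plain sum won't work.

```python
import math

def softmax(xs, scale=1.0):
    # Numerically stable softmax of scale * xs.
    m = max(x * scale for x in xs)
    exps = [math.exp(x * scale - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def diff_outside(a, lam):
    # Lambda outside the softmax, with A == B:
    # softmax(A) - lam * softmax(A) == (1 - lam) * softmax(A)
    p = softmax(a)
    return [pi - lam * pi for pi in p]

def diff_inside(a, lam):
    # The accidental variant: softmax(A) - softmax(lam * A)
    p = softmax(a)
    q = softmax(a, scale=lam)
    return [pi - qi for pi, qi in zip(p, q)]

def l1_normalize(v):
    # Assumed normalization: divide by the sum of absolute values.
    total = sum(abs(x) for x in v)
    return [x / total for x in v]

a = [2.0, 1.0, 0.5, 0.1]   # hypothetical test vector
lam = 0.5

print("softmax(A):         ", softmax(a))
print("lambda outside:     ", diff_outside(a, lam))
print("lambda inside:      ", diff_inside(a, lam))
print("inside, normalized: ", l1_normalize(diff_inside(a, lam)))
```

With lambda outside, the result is just a rescaled copy of softmax(A), so the relative weights don't change. With lambda inside and 0 < lambda < 1, softmax(lambda * A) is a flatter version of softmax(A), so the difference is positive at the largest entry and negative on the small ones, which is where the apparent "boost" comes from.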



