if two attentions A, B are identical, would (A - lambda * B) be just (1-lambda)...

magicalhippo · on Oct 9, 2024

How embarrassing, I had one of those "autocorrect moments". I somehow put the lambda inside the softmax while thinking about it and trying it out, without realizing. So what I was playing with in a spreadsheet (where it's not as obvious as in plain code) was

    softmax(A) - softmax(lambda * A)

And as it so happens, normalizing the output of that with my test vectors seems to really boost the largest component when A and B are equal.
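For anyone who wants to poke at this outside a spreadsheet, here is a rough Python sketch of the two variants. The function names and the test vector are made up for illustration, and the normalization step is only a guess at what "normalizing" means here (dividing by the sum of absolute values), since a plain divide-by-sum can't work: both softmaxes sum to 1, so their difference sums to 0.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def intended_diff(a, b, lam):
    # What was meant: softmax(A) - lambda * softmax(B).
    # With a == b this is just (1 - lambda) * softmax(A).
    return [p - lam * q for p, q in zip(softmax(a), softmax(b))]

def accidental_diff(a, lam):
    # The slip: lambda ends up inside the second softmax,
    # i.e. softmax(A) - softmax(lambda * A).
    scaled = [lam * x for x in a]
    return [p - q for p, q in zip(softmax(a), softmax(scaled))]

def l1_normalize(v):
    # Assumed normalization (not from the original comment):
    # divide by the sum of absolute values.
    total = sum(abs(x) for x in v)
    return [x / total for x in v]

if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]   # hypothetical test vector
    lam = 0.5             # hypothetical lambda
    print("intended  :", intended_diff(a, a, lam))
    print("accidental:", accidental_diff(a, lam))
    print("normalized:", l1_normalize(accidental_diff(a, lam)))
```

With this particular vector and a lambda between 0 and 1, softmax(lambda * A) is flatter than softmax(A), so the accidental difference comes out positive only for the largest component and negative for the others, which would be consistent with the boosting effect after normalizing.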