miven on Oct 8, 2024 | on: Differential Transformer
Is there an intuitive reason why this ends up working this well compared to, say, applying some kind of thresholding that zeroes out attention activations below the average for a given head, to filter out that same attention noise?
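For concreteness, here is a minimal sketch of the two options being contrasted. `differential_attention` follows the formula stated in the Differential Transformer paper, (softmax(Q1·K1ᵀ/√d) − λ·softmax(Q2·K2ᵀ/√d))·V with a learnable scalar λ; `thresholded_attention` is a hypothetical rendering of the alternative this question describes, not anything from the paper. All function names, shapes, and the λ value are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def differential_attention(q1, k1, q2, k2, v, lam):
        # Paper's differential attention: the difference of two
        # softmax attention maps, scaled by a learnable lambda.
        d = q1.shape[-1]
        a1 = F.softmax(q1 @ k1.T / d**0.5, dim=-1)
        a2 = F.softmax(q2 @ k2.T / d**0.5, dim=-1)
        # Noise that shows up in both maps cancels in the subtraction,
        # and distractor tokens can even receive negative weight.
        return (a1 - lam * a2) @ v

    def thresholded_attention(q, k, v):
        # Hypothetical alternative from the question: zero out weights
        # below the per-row (per-query) mean, then renormalize.
        d = q.shape[-1]
        a = F.softmax(q @ k.T / d**0.5, dim=-1)
        mask = a >= a.mean(dim=-1, keepdim=True)
        a = a * mask
        # Hard thresholding is non-differentiable at the cutoff and can
        # only remove weight, never flip its sign.
        return (a / a.sum(dim=-1, keepdim=True)) @ v

    # Toy usage with made-up sizes: sequence length 8, head dim 16.
    q1, k1, q2, k2 = (torch.randn(8, 16) for _ in range(4))
    v = torch.randn(8, 16)
    out = differential_attention(q1, k1, q2, k2, v, lam=0.8)

One difference the sketch makes visible, independent of the paper's claims: the subtraction is smooth and learnable end to end, whereas a hard per-head threshold introduces a non-differentiable cutoff.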