Using basic linear algebra to prove that "orthogonalizing" the gradient gives the optimal loss improvement under a norm constraint.
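A minimal numerical sketch of this claim, under assumptions of my own (a first-order model of the loss, a spectral-norm constraint of 1, and a random matrix standing in for a real gradient): the "orthogonalized" step U Vᵀ from the SVD of the gradient attains the largest alignment ⟨G, ΔW⟩ among all steps of unit spectral norm, so it gives the best first-order loss improvement under that constraint.

```python
# Hedged sketch, not from the original text: verify numerically that the
# orthogonalized gradient U V^T maximizes <G, dW> over ||dW||_2 <= 1.
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 5))            # hypothetical gradient matrix

U, S, Vt = np.linalg.svd(G, full_matrices=False)
dW_orth = U @ Vt                           # "orthogonalized" gradient; spectral norm 1

def alignment(dW):
    """First-order loss improvement <G, dW> for a step in direction dW."""
    return np.sum(G * dW)

best = alignment(dW_orth)                  # equals the sum of singular values of G
assert np.isclose(best, S.sum())

# Random competitors rescaled to unit spectral norm never beat it.
for _ in range(1000):
    X = rng.standard_normal(G.shape)
    X /= np.linalg.norm(X, ord=2)          # ord=2 on a matrix = spectral norm
    assert alignment(X) <= best + 1e-9

print("orthogonalized step attains the maximal alignment:", best)
```

The check mirrors the linear-algebra argument: ⟨G, X⟩ = tr(S UᵀX V), and when X has spectral norm at most 1 every diagonal entry of UᵀX V is at most 1 in magnitude, so the inner product is capped by the sum of singular values, which U Vᵀ achieves exactly.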

Framing self-attention in terms of convex combinations and similarity matrices, aiming to precisely ground common intuitive explanations.
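A short sketch, again under assumptions of mine (a single head, scaled dot-product scores, random Q/K/V as placeholders), making the convex-combination reading concrete: the row-wise softmax of the similarity matrix yields nonnegative rows that sum to 1, so each output row is a convex combination of the value rows.

```python
# Hedged sketch, not the original's notation: single-head scaled dot-product
# attention written to expose the similarity matrix and the convex combination.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3                                # hypothetical sequence length, head dim
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

S = Q @ K.T / np.sqrt(d)                   # similarity matrix: S[i, j] ~ <q_i, k_j>
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)          # row-wise softmax: each row is a prob. vector

# Rows of A are nonnegative and sum to 1, so each output row is a
# convex combination of the rows of V.
assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)
out = A @ V                                # out[i] = sum_j A[i, j] * V[j]

# One consequence of convexity: every output coordinate stays within the
# coordinate-wise range spanned by the value vectors.
assert np.all(out <= V.max(axis=0) + 1e-9)
assert np.all(out >= V.min(axis=0) - 1e-9)
```

The final assertions illustrate what the convex-combination framing buys: attention outputs cannot leave the convex hull of the values, which is one way to make the usual "weighted averaging" intuition precise.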