An optimality proof of the Muon weight update

2025-07-19

Using basic linear algebra to prove that "orthogonalizing" the gradient gives the optimal loss improvement under a norm constraint.

This article expects comfort with undergraduate linear algebra. It aims to be explicit and self-contained, with proofs of useful lemmas included in the appendix.

In Jeremy Bernstein’s great post Deriving Muon, he motivates the weight update for the Muon optimizer using a constrained optimization problem. He gives the solution for the optimal weight update, but I wasn’t sure how he arrived at it.

In this post, I provide an optimality proof of the solution. If you’re already familiar with Muon and the constrained optimization problem used to derive its weight update, click here to jump to the proof.

Addendum 2025-07-28: Laker Newhouse pointed out that he and Jeremy also provide a proof of this result in a Dec 2024 paper. See Proposition 5 and its proof in Appendix B. Thanks Laker!

A quick recap of Muon

Define a loss function $\mathcal{L}$. Consider it as a function of the weights of one linear layer, $W$. For a given change in the weights $\Delta W$, the new loss value at $W + \Delta W$ is approximately

$$ \mathcal{L}(W + \Delta W) \approx \mathcal{L}(W) + \langle \nabla_{W} \mathcal{L}, \Delta W \rangle, $$

using the first-order Taylor approximation. Here, the inner product $\langle \cdot, \cdot \rangle$ is the Frobenius inner product on matrices,

$$ \langle A, B \rangle := \mathrm{Tr}(A^T B). $$

Using this Taylor approximation, the change in loss is therefore roughly the inner product

$$ \mathcal{L}(W + \Delta W) - \mathcal{L}(W) \approx \langle \nabla_{W} \mathcal{L}, \Delta W \rangle. $$

Muon aims to maximize the improvement in loss, subject to a norm constraint on $\Delta W$.
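
To make this concrete, here is a small numerical sketch with a toy least-squares loss (my own example, not from the post; it assumes NumPy). It checks that the Frobenius inner product is an element-wise sum of products, and that it predicts the loss change for a small $\Delta W$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy quadratic loss L(W) = 0.5 * ||W X - Y||_F^2, whose gradient is (W X - Y) X^T.
m, n, batch = 4, 6, 32
X, Y = rng.normal(size=(n, batch)), rng.normal(size=(m, batch))
W = rng.normal(size=(m, n))

def loss(W):
    return 0.5 * np.sum((W @ X - Y) ** 2)

grad = (W @ X - Y) @ X.T                 # nabla_W L for this toy loss
dW = 1e-4 * rng.normal(size=(m, n))      # a small weight update

# Frobenius inner product: Tr(A^T B) equals the element-wise sum of A * B.
inner = np.trace(grad.T @ dW)
assert np.isclose(inner, np.sum(grad * dW))

# First-order Taylor approximation of the change in loss.
print(loss(W + dW) - loss(W), inner)     # the two numbers nearly match
```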

Specifically, it employs the root-mean-square operator norm, $\|\cdot\|_\text{RMS}$. We’ll properly derive this norm below. But briefly, $\|A\|_{\text{RMS}}$ is the largest possible factor by which a matrix $A \in \mathbb{R}^{m \times n}$ can scale the size of its input, normalized by dimension:

$$ \|A\|_{\text{RMS}} := \sqrt{\frac{n}{m}} \sup_{x \neq 0} \frac{\|Ax\|_{2}}{\|x\|_{2}}. \tag{1} $$

So we arrive at the constrained optimization problem $(\dagger)$ that motivated Muon:

$$ \min_{\Delta W} \ \langle \nabla_{W} \mathcal{L}, \Delta W \rangle \quad \text{s.t.} \quad \|\Delta W\|_\text{RMS} \leq \eta. \tag{\dagger} $$

For intuition, the requirement that $\|\Delta W\|_\text{RMS} < \eta$ bounds the change in the layer’s outputs by the size of its inputs. From $(1)$,

$$ \|(W + \Delta W)x - Wx\|_{\text{RMS}} = \|(\Delta W)x\|_{\text{RMS}} < \eta \sqrt{\frac{m}{n}} \|x\|_{2}, $$

for all $x$. So the constraint $\|\Delta W\|_{\text{RMS}} < \eta$ directly ensures that the layer’s outputs don’t change too much within an optimization step.
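
Here is a quick sanity check of the displayed bound (a sketch assuming NumPy; the `rms_op_norm` helper evaluates definition (1), using the fact, shown in equation (2) below, that the supremum equals the largest singular value):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 5, 8, 0.1

def rms(x):
    return np.linalg.norm(x) / np.sqrt(len(x))

def rms_op_norm(A):
    # Definition (1); the supremum is the spectral norm (see equation (2) below).
    return np.sqrt(A.shape[1] / A.shape[0]) * np.linalg.norm(A, 2)

# A random update, rescaled so that ||dW||_RMS = eta.
dW = rng.normal(size=(m, n))
dW *= eta / rms_op_norm(dW)

# The displayed bound on the change in the layer's outputs holds for every input x.
checks = []
for _ in range(1000):
    x = rng.normal(size=n)
    checks.append(rms(dW @ x) <= eta * np.sqrt(m / n) * np.linalg.norm(x) + 1e-12)
print(all(checks))  # True
```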

If we take the singular value decomposition of the gradient as $\nabla_{W} \mathcal{L} = U \Sigma V^T$, then the solution to $(\dagger)$ is given as the “orthogonalization,”

$$ \Delta W^* = -\eta \sqrt{\frac{m}{n}}\, U V^T. \tag{\S} $$

In this post, we’ll derive this solution and prove that it is optimal for $(\dagger)$. First, some background on the RMS operator norm.
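
In code, the update $(\S)$ is a one-liner once you have the SVD. A minimal sketch (assuming NumPy; as I understand it, practical Muon implementations approximate this orthogonalization iteratively rather than with an explicit SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 5, 8, 0.1
grad = rng.normal(size=(m, n))              # stand-in for nabla_W L

U, S, Vt = np.linalg.svd(grad, full_matrices=False)
dW_star = -eta * np.sqrt(m / n) * U @ Vt    # the update (§)

# Every singular value of dW_star equals eta * sqrt(m/n), so ||dW_star||_RMS = eta.
sing = np.linalg.svd(dW_star, compute_uv=False)
print(np.allclose(sing, eta * np.sqrt(m / n)))        # True
print(np.isclose(np.sqrt(n / m) * sing.max(), eta))   # True
```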

Preliminaries: RMS norm

For an $n$-dimensional vector $x \in \mathbb{R}^{n}$, the RMS norm is defined as

$$ \|x\|_{\text{RMS}} := \frac{1}{\sqrt{n}} \|x\|_{2} = \sqrt{\frac{1}{n} \sum_{i=1}^n x_{i}^{2}}. $$

A useful fact for intuition is that the RMS norm is just the $\ell_2$ norm, rescaled such that the ones vector $\mathbf{1} \in \mathbb{R}^{n}$ has norm $1$ instead of $\sqrt{n}$:

$$ \|\mathbf{1}\|_{\text{RMS}} = \frac{1}{\sqrt{n}} \|\mathbf{1}\|_{2} = \frac{1}{\sqrt{n}} \sqrt{n} = 1. $$

This rescaling allows for a “dimension-invariant” notion of $\ell_2$ size.
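
A two-line illustration of the dimension invariance (assuming NumPy):

```python
import numpy as np

def rms(x):
    # RMS norm: the l2 norm rescaled by 1/sqrt(dimension).
    return np.linalg.norm(x) / np.sqrt(len(x))

print(rms(np.ones(10)), rms(np.ones(10_000)))  # both 1.0, regardless of dimension
```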

We can use this vector norm to define a norm on matrices.

Preliminaries: RMS operator norm

The RMS operator norm is defined to be the largest factor by which a matrix $A$ can increase the RMS norm of its input:

$$ \|A\|_{\text{RMS}} := \sup_{x \neq 0} \frac{\|Ax\|_{\text{RMS}}}{\|x\|_{\text{RMS}}}. $$

This is trivially equivalent to the definition from $(1)$. Expanding the RMS operator norm using the RMS vector norm definition,

$$ \|A\|_{\text{RMS}} = \sup_{x \neq 0} \frac{(1/\sqrt{m})\|Ax\|_{2}}{(1/\sqrt{n})\|x\|_{2}} = \sqrt{\frac{n}{m}} \sup_{x \neq 0} \frac{\|Ax\|_{2}}{\|x\|_2}. $$

The RMS operator norm can also be expressed in terms of the singular values of $A$. For those familiar, the right-hand supremum in the equation above is known as the spectral norm,

$$ \|A\|_{*} := \sup_{x \neq 0} \frac{\|Ax\|_{2}}{\|x\|_{2}}. $$

The spectral norm of a matrix, the most $A$ can stretch a vector in the $\ell_2$ sense, is precisely its largest singular value:

$$ \|A\|_{*} = \sup_{x \neq 0} \frac{\|Ax\|_{2}}{\|x\|_{2}} = \sigma_{\text{max}}(A). $$

See Fact 1 for a formal proof. Taking the equality at face value, we now observe

$$ \|A\|_{\text{RMS}} = \sup_{x \neq 0} \frac{\|Ax\|_{\text{RMS}}}{\|x\|_{\text{RMS}}} = \sqrt{\frac{n}{m}} \sup_{x \neq 0} \frac{\|Ax\|_{2}}{\|x\|_{2}} = \sqrt{\frac{n}{m}} \sigma_\text{max}(A). \tag{2} $$

We now have all the machinery we need to solve the Muon problem $(\dagger)$. Before we do so, let’s warm up with a simpler problem.
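
As a quick numerical check of equation (2) (a sketch assuming NumPy; the `rms` helper and the sampling loop are mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.normal(size=(m, n))

def rms(x):
    return np.linalg.norm(x) / np.sqrt(len(x))

# Equation (2): ||A||_RMS = sqrt(n/m) * sigma_max(A).
rms_norm = np.sqrt(n / m) * np.linalg.svd(A, compute_uv=False).max()

# Random directions only ever give a lower bound on the supremum in the definition...
sampled = max(rms(A @ x) / rms(x) for x in rng.normal(size=(10_000, n)))
# ...while the top right singular vector attains it exactly.
v_top = np.linalg.svd(A)[2][0]

print(sampled <= rms_norm + 1e-12)                         # True
print(np.isclose(rms(A @ v_top) / rms(v_top), rms_norm))   # True
```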

Warmup: Maximizing inner products

Consider the optimization problem

$$ \max_{b} \ a^T b \quad \text{s.t.} \quad \|b\|_{2} \leq \alpha. $$

The answer is likely immediately obvious: pick $b$ in the direction of $a$, scaled to the appropriate norm size. Specifically, let $b$ be

$$ b = \frac{\alpha}{\|a\|_{2}} a. $$

[Figure: the constrained vector inner product optimization problem.]

Picking $b$ to be “aligned with” $a$ to maximize the inner product is pretty intuitive. But how might we prove this?

One option is to assume that we’ve found a better candidate, say cc, and find a logical contradiction. If we find a contradiction, the better candidate cc cannot actually exist—so we win.

Proof: let’s assume that we have $c$ such that

$$ a^T c > a^T b = \alpha \|a\|_{2}. $$

We want to show that any such $c$ is not a valid solution to our optimization problem. To do so, we’ll show that it violates the norm condition, meaning $\|c\|_{2} > \alpha$.

We’ll do this using the Cauchy-Schwarz inequality¹. The inequality tells us that

$$ |a^T c| \leq \|a\|_{2} \|c\|_{2}. $$

Using our assumption that $a^T c > a^T b$,

$$ \|c\|_{2} \geq \frac{1}{\|a\|_{2}} |a^T c| > \frac{1}{\|a\|_{2}} |a^T b| = \frac{1}{\|a\|_{2}} \alpha \|a\|_{2} = \alpha. $$

So $c$ doesn’t satisfy the norm constraint, meaning it is not a valid solution to the problem.

But the only thing we asked of $c$ is that it has a better objective value than $b$. Therefore the same argument applies to all vectors $x$ with $a^T x > a^T b$. This means that all vectors $x$ with a strictly larger objective value are invalid solutions. In other words, $b$ is optimal.

End proof.
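
A quick brute-force check of the warmup (a sketch assuming NumPy; the random sampling is mine, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
a, alpha = rng.normal(size=16), 0.5

b = alpha / np.linalg.norm(a) * a        # the claimed maximizer
best = a @ b                             # = alpha * ||a||_2
assert np.isclose(best, alpha * np.linalg.norm(a))

# No randomly sampled feasible candidate (||c||_2 <= alpha) beats it.
ok = True
for _ in range(10_000):
    c = rng.normal(size=16)
    c *= alpha * rng.uniform() / np.linalg.norm(c)   # rescale into the ball
    ok &= (a @ c <= best + 1e-12)
print(ok)  # True
```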

With this, we’re now equipped to analyze the Muon problem.

Back to Muon

The parallels between the warmup problem,

$$ \max_{b} \ a^T b \quad \text{s.t.} \quad \|b\|_{2} \leq \alpha, $$

and the Muon problem $(\dagger)$,

$$ \min_{\Delta W} \ \langle \nabla_{W} \mathcal{L}, \Delta W \rangle \quad \text{s.t.} \quad \|\Delta W\|_\text{RMS} \leq \eta, $$

should be clear. In both cases, we’re optimizing an inner product over a norm ball.

For ease of exposition, we’ll frame the Muon problem $(\dagger)$ as a maximization problem in this article. So we’ll instead aim to solve:

$$ \max_{M} \ \langle -\nabla_{W} \mathcal{L}, M \rangle \quad \text{s.t.} \quad \|M\|_{\text{RMS}} \leq \eta. \tag{\ddagger} $$

To build intuition for the solution $(\S)$, we’ll first solve a slightly simplified version of the problem in which $\Delta W^*$ will arise naturally.

Then, like the warmup, we’ll proceed by contradiction and show that any candidate better than $\Delta W^*$ does not satisfy the RMS constraint.

Why orthogonalize? Deriving a solution a priori

First, the simplified problem.

Based on the warmup, to optimize our inner product objective, we might expect a solution that feels like a matrix “in the direction” of $-\nabla_{W}\mathcal{L}$. So let’s restrict our attention to matrices $M$ with the same singular basis as $-\nabla_W \mathcal{L}$.

This simplification will allow us to analytically derive the optimal weight update $\Delta W^*$.

Reducing to the same singular basis

Since $\nabla_{W} \mathcal{L} = U\Sigma V^T$, a valid SVD of the negative gradient is simply

$$ -\nabla_{W} \mathcal{L} = -(U\Sigma V^T) = (-U)\Sigma V^T. $$

So take $M$ to be a matrix

$$ M = (-U) D V^T, $$

for some arbitrary nonnegative diagonal matrix $D$.

Writing out the objective value, we have

$$ \langle -\nabla_{W} \mathcal{L}, M \rangle = \Big\langle (-U)\Sigma V^T, (-U)DV^T \Big\rangle. $$

It turns out that the Frobenius inner product does not change when you pre- or post-multiply by orthogonal matrices (Fact 2). So

$$ \langle -\nabla_{W} \mathcal{L}, M \rangle = \Big\langle (-U)\Sigma V^T, (-U)DV^T \Big\rangle = \langle \Sigma, D \rangle. $$

Since $\Sigma$ is a diagonal matrix, we can write the matrix inner product as a function only of the diagonal entries of each matrix:

$$ \langle \Sigma, D \rangle = \mathrm{Tr}(\Sigma^T D) = \sum_{i} \Sigma_{ii} D_{ii}. $$

We can simplify this expression even further. If we pull out the diagonal elements into vectors, defining $\sigma := \mathrm{diag}(\Sigma)$ and $d := \mathrm{diag}(D)$, we can rewrite the sum as

$$ \langle -\nabla_{W} \mathcal{L}, M \rangle = \sum_{i} \Sigma_{ii} D_{ii} = \sum_{i} \sigma_{i} d_{i} = \sigma^T d. $$

We’ve now reduced the matrix inner product objective to a vector inner product.
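
Here is a numerical check of this reduction (a sketch assuming NumPy; the random $d$ is just an arbitrary choice of nonnegative singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
grad = rng.normal(size=(m, n))
U, sigma, Vt = np.linalg.svd(grad, full_matrices=False)

d = rng.uniform(size=len(sigma))          # arbitrary nonnegative singular values
M = (-U) @ np.diag(d) @ Vt                # same singular basis as -grad

lhs = np.trace((-grad).T @ M)             # <-grad, M>_F
rhs = sigma @ d                           # sigma^T d
print(np.isclose(lhs, rhs))               # True
```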

We can also express the RMS norm of $M$ in terms of $d$. Since the entries of $d$ are the singular values of $M$ by construction, we have that

$$ \|M\|_{\text{RMS}} = \sqrt{\frac{n}{m}} \max_{i} d_{i}, $$

by equation $(2)$.

In our restricted setting, then, we’ve now successfully reduced the matrix-domain problem $(\ddagger)$ to the vector-domain problem,

$$ \max_{d} \ \sigma^T d \quad \text{s.t.} \quad \max_{i} \ d_{i} \leq \eta\sqrt{\frac{m}{n}}. $$

This reduced version is quite easy to solve.

Deriving a solution

Since $\sigma$ represents the singular values of $-\nabla_W \mathcal{L}$, it must have nonnegative elements. So maximizing the inner product $\sigma^T d$ is equivalent to maximizing each of the component products $\sigma_i d_i$.

Under our constraint on $d_i$, then, the best solution to this reduced problem is simply the constant vector

$$ d_{i} := \eta \sqrt{\frac{m}{n}}. $$

If this isn’t immediately obvious, try to prove it yourself.

[Figure: the vectorized optimization problem over $\sigma$ and $d$.]

Translating from the vector parameter $d$ to the matrix parameter $M$, the best solution to our restricted matrix-domain problem is therefore

$$ M = (-U)DV^T = (-U)\left(\eta \sqrt{\frac{m}{n}}\, I\right)V^T = -\eta \sqrt{\frac{m}{n}}\, UV^T. $$

This is precisely the optimal value $(\S)$ given by Bernstein!
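
As a sanity check that this restricted solution really does sit at the top of the feasible set, here is a small random search (a sketch assuming NumPy; the helpers are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 5, 8, 0.1
grad = rng.normal(size=(m, n))
U, sigma, Vt = np.linalg.svd(grad, full_matrices=False)

def rms_op_norm(A):
    return np.sqrt(A.shape[1] / A.shape[0]) * np.linalg.norm(A, 2)

def objective(M):
    return np.sum(-grad * M)              # <-grad, M>_F

M_star = -eta * np.sqrt(m / n) * U @ Vt
best = objective(M_star)
assert np.isclose(best, eta * np.sqrt(m / n) * sigma.sum())

# Random feasible candidates, rescaled so that ||M||_RMS <= eta, never do better.
ok = all(
    objective(eta * rng.uniform() / rms_op_norm(M) * M) <= best + 1e-12
    for M in rng.normal(size=(10_000, m, n))
)
print(ok)  # True
```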

It’s now clearer why it’s a good idea to define our weight update as the negative gradient $-\nabla_{W} \mathcal{L}$ with its singular values “clamped” to $\eta \sqrt{m/n}$.

For a weight update $M$ in the same singular basis as $-\nabla_{W}\mathcal{L}$, the inner product objective and the RMS norm are expressible entirely in terms of the singular values of $-\nabla_{W} \mathcal{L}$ and $M$. The inner product objective turns into a dot product between two nonnegative singular value vectors $\sigma, d$. And the RMS norm constraint turns into an element-wise maximum constraint on the singular values $d$ of $M$. The optimal solution is therefore just to clamp the parameters $d$ to their maximum allowable values.

Just like the warmup problem, then, the solution is a “rescaled” (singular value clamped) matrix $M$ “in the direction of” (in the same singular basis as) the negative gradient $-\nabla_{W} \mathcal{L}$.

With this intuition, we can dive into the proof.

A proof of optimality

The proof is not as elegant as the warmup, since we can’t use Cauchy-Schwarz². But it relies on the same contradiction technique.

Assume we have some matrix $A$ that does better than $\Delta W^*$, meaning

$$ \langle -\nabla_{W} \mathcal{L}, A \rangle > \langle -\nabla_{W}\mathcal{L}, \Delta W^* \rangle. \tag{3} $$

We’ll show that $\|A\|_\text{RMS} > \eta$ and therefore that $A$ cannot be a solution to $(\dagger)$.

Proof: First, we’ll rewrite $A$ in terms of the singular basis $-U, V$ as

$$ A = (-U)BV^T, $$

defining $B := (-U)^T A V$. With this, we can simplify the problem expression slightly.

Using the fact that the Frobenius inner product is invariant under orthogonal transformations of its inputs (Fact 2), we can rewrite the objective as:

$$ \langle -\nabla_{W} \mathcal{L}, A \rangle = \big\langle (-U)\Sigma V^T, (-U)BV^T \big\rangle = \langle \Sigma, B \rangle. $$

Since $\Sigma$ is diagonal, this reduces to

$$ \langle \Sigma, B \rangle = \mathrm{Tr}(\Sigma^T B) = \sum_{i} \Sigma_{ii} B_{ii}. $$

Applying the same argument to $\Delta W^*$, we have

$$ \langle -\nabla_{W} \mathcal{L}, \Delta W^* \rangle = \Big\langle \Sigma, \left(\eta \sqrt{m/n}\right) I \Big\rangle = \sum_{i} \Sigma_{ii} \left(\eta \sqrt{\frac{m}{n}}\right). $$

Our assumption in $(3)$ means that

$$ \langle -\nabla_{W} \mathcal{L}, A \rangle - \langle -\nabla_{W}\mathcal{L}, \Delta W^* \rangle > 0. $$

So

$$ \sum_{i} \Sigma_{ii} \left(B_{ii} - \eta \sqrt{m/n}\right) > 0. $$

For this sum to be strictly positive, at least one term must be strictly positive. Call this term $i$. All singular values are nonnegative, so $\Sigma_{ii} \geq 0$. For the $i$th term in the sum to be positive, then, we must have

$$ B_{ii} > \eta \sqrt{m/n}. \tag{4} $$

This fact will be the key to showing $\|A\|_\text{RMS} > \eta$.

By Fact 3, the RMS norm is invariant under orthogonal transformations, so $\|A\|_\text{RMS} = \|B\|_{\text{RMS}}$. It therefore suffices to show $\|B\|_\text{RMS} > \eta$. To do this, all we need to do is find one vector $v$ such that

$$ \|Bv\|_{\text{RMS}} > \eta\|v\|_{\text{RMS}}. \tag{5} $$

Pick $v$ to be the $i$th coordinate vector, meaning

$$ v_{j} = \begin{cases} 1 & j = i \\ 0 & \text{otherwise.} \end{cases} $$

Note that $v$ has unit $\ell_2$ norm, so

$$ \|v\|_{\text{RMS}} = \frac{1}{\sqrt{n}} \|v\|_{2} = \frac{1}{\sqrt{n}}. $$

Then $Bv$ is just the $i$th column of $B$, meaning

$$ (Bv)_{j} = B_{ji}. $$

In particular, $(Bv)_i = B_{ii}$. Now, we can calculate $\|Bv\|_2$ and attempt to show $(5)$:

$$ \|Bv\|_{2}^{2} = v^T B^T Bv = \sum_{j} B_{ji}^{2} \geq B_{ii}^{2}, $$

since all other terms in the sum are nonnegative: $B_{ji}^{2} \geq 0$. But from equation $(4)$, we therefore have that

$$ \|Bv\|_{2} \geq B_{ii} > \eta \sqrt{m/n}. $$

Applying this inequality to the RMS norm, we find

$$ \|Bv\|_{\text{RMS}} = \frac{1}{\sqrt{m}} \|Bv\|_{2} > \frac{1}{\sqrt{m}} \eta \sqrt{\frac{m}{n}} = \eta \frac{1}{\sqrt{n}} = \eta\|v\|_{\text{RMS}}. $$

Therefore $\|B\|_\text{RMS} = \|A\|_{\text{RMS}} > \eta$. So $A$ cannot be a valid solution to the problem, and $\Delta W^*$ as given in equation $(\S)$ is optimal.

End proof.
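
To see the mechanics of the argument numerically, we can manufacture a matrix $A$ with a strictly better objective and watch both conclusions appear (a sketch assuming NumPy; the particular $A$ is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 5, 8, 0.1
grad = rng.normal(size=(m, n))
U, sigma, Vt = np.linalg.svd(grad, full_matrices=False)
dW_star = -eta * np.sqrt(m / n) * U @ Vt

def rms_op_norm(A):
    return np.sqrt(A.shape[1] / A.shape[0]) * np.linalg.norm(A, 2)

# A matrix with a strictly better objective value than dW_star...
A = 1.5 * dW_star + 1e-3 * rng.normal(size=(m, n))
assert np.sum(-grad * A) > np.sum(-grad * dW_star)

# ...has some B_ii > eta * sqrt(m/n), and therefore violates the constraint.
B = (-U).T @ A @ Vt.T                     # B = (-U)^T A V
print(np.max(np.diag(B)) > eta * np.sqrt(m / n))   # True
print(rms_op_norm(A) > eta)                        # True
```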

Conclusion

In this article, we explored and proved the optimality of the solution $\Delta W^*$ to the Muon constrained optimization problem $(\dagger)$. We situated it within the framework of maximizing an inner product over a norm ball, and used this analogy to motivate the otherwise potentially unintuitive formula.


Appendix: Useful Facts

Fact 1

The spectral norm of a matrix is its largest singular value.

For any matrix $A \in \mathbb{R}^{m \times n}$,

$$ \|A\|_{*} = \sigma_{\text{max}}(A), $$

where $\sigma_\text{max}(A)$ is the largest singular value of $A$.

Proof: Let $U \Sigma V^T$ be the SVD of $A$. Without loss of generality, assume the singular values are ordered in descending magnitude, so $\Sigma_{11} \geq \Sigma_{22} \geq \dots$

To prove the equality, it suffices to prove the inequalities

$$ \sigma_{\text{max}}(A) \leq \|A\|_{*} \quad \text{and} \quad \|A\|_{*} \leq \sigma_{\text{max}}(A). $$

We’ll start with the left-hand inequality.

Step 1: Maximum Singular Value $\leq$ Spectral Norm

To show this, it suffices to find some vector $v$ such that

$$ \|Av\|_{2} / \|v\|_{2} = \sigma_{\text{max}}(A). $$

But this is trivial. Let $v := Ve_1$, where $e_1$ is the coordinate vector with first component $1$ and all other components $0$. Then

$$ Av = U\Sigma V^T (Ve_{1}) = U \Sigma e_{1} = U(\sigma_{\text{max}}(A)e_{1}) = \sigma_{\text{max}}(A)\, Ue_{1}. $$

Taking norms, I claim that $\|v\|_2 = \|Ve_1\|_2 = \|e_1\|_2 = 1$, and likewise $\|Ue_1\|_2 = 1$, because orthogonal matrices preserve $\ell_2$ norms. The proof of this claim is simple. Let $O$ be an orthogonal transformation and $w$ an arbitrary vector. Then

$$ \|Ow\|_{2}^{2} = w^T O^T O w = w^T I w = w^T w = \|w\|_{2}^{2}. $$

And so $\|Ve_1\|_2 = \|Ue_1\|_2 = \|e_1\|_2 = 1$. Therefore, we’ve found a vector $v$ such that

$$ \frac{\|Av\|_{2}}{\|v\|_{2}} = \|Av\|_{2} = \sigma_\text{max}(A)\|Ue_{1}\|_2 = \sigma_{\text{max}}(A). $$

So the spectral norm must be at least the maximum singular value,

$$ \|A\|_{*} \geq \sigma_{\text{max}}(A). $$

Step 2: Spectral Norm $\leq$ Maximum Singular Value

For this inequality, we need to show that

$$ \|Av\|_{2} \leq \sigma_{\text{max}}(A) \|v\|_{2} $$

for all vectors $v \in \mathbb{R}^{n}$.

Again relying on the fact that orthogonal matrices preserve the $\ell_2$ norm, we have $\|Av\|_2 = \|U\Sigma V^T v\|_2 = \|\Sigma V^T v\|_2$. Define $w := V^T v$, so

$$ \|Av\|_{2} = \|\Sigma w\|_{2}. $$

Each element of $\Sigma w$ is just $(\Sigma w)_i = \Sigma_{ii} w_i$. And each diagonal element $\Sigma_{ii}$ is bounded by the maximum singular value, $\sigma_\text{max}(A)$. So

$$ (\Sigma w)_{i}^{2} = \Sigma_{ii}^{2} w_{i}^{2} \leq \sigma_{\text{max}}(A)^{2} w_{i}^{2}, $$

and therefore

$$ \|\Sigma w\|_{2} \leq \sigma_{\text{max}}(A) \|w\|_{2}. $$

But $w = V^T v$ and $V$ is orthogonal, so $\|w\|_{2} = \|v\|_{2}$. We conclude,

$$ \|Av\|_{2} = \|\Sigma w\|_{2} \leq \sigma_{\text{max}}(A) \|w\|_{2} = \sigma_{\text{max}}(A)\|v\|_{2}. $$

End proof.
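
A numerical illustration of both steps (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))
U, S, Vt = np.linalg.svd(A)

# Step 1: the ratio ||Av|| / ||v|| attains sigma_max at the top right singular vector.
v = Vt[0]
print(np.isclose(np.linalg.norm(A @ v) / np.linalg.norm(v), S.max()))   # True

# Step 2: no sampled direction exceeds sigma_max.
ratios = [np.linalg.norm(A @ x) / np.linalg.norm(x)
          for x in rng.normal(size=(10_000, 8))]
print(max(ratios) <= S.max() + 1e-12)                                   # True
```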

Fact 2

The Frobenius inner product is invariant under orthogonal transformations.

Let $A$ and $B$ be arbitrary $m \times n$ matrices. And let $U \in \mathbb{R}^{m \times m}, V \in \mathbb{R}^{n \times n}$ be orthogonal matrices, meaning

$$ U^T U = U U^T = I_{m} \quad \text{and} \quad V^T V = V V^T = I_{n}. $$

Then

$$ \langle UAV^T, UBV^T \rangle = \langle A, B \rangle. $$

Proof: Expand out the definition of the Frobenius inner product,

$$
\begin{align*}
\langle UAV^T, UBV^T \rangle &= \mathrm{Tr} \Big( (UAV^T)^T (UBV^T) \Big) \\
&= \mathrm{Tr} \big( V A^T (U^T U) B V^T \big) \\
&= \mathrm{Tr} \big( V A^T B V^T \big).
\end{align*}
$$

Using the cyclic trace property³, this reduces to

$$ \mathrm{Tr}(V A^T B V^T) = \mathrm{Tr}(A^T B V^T V) = \mathrm{Tr}(A^T B) = \langle A, B \rangle. $$

End proof.
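
A quick numerical check of Fact 2 with random orthogonal matrices (a sketch assuming NumPy; the orthogonal factors come from QR decompositions of Gaussian matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
A, B = rng.normal(size=(m, n)), rng.normal(size=(m, n))

# Random orthogonal matrices via QR factorizations.
U = np.linalg.qr(rng.normal(size=(m, m)))[0]
V = np.linalg.qr(rng.normal(size=(n, n)))[0]

lhs = np.trace((U @ A @ V.T).T @ (U @ B @ V.T))
rhs = np.trace(A.T @ B)
print(np.isclose(lhs, rhs))  # True
```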

Fact 3

The RMS norm is invariant under orthogonal transformations.

Let $A$ be an $m \times n$ matrix, and let $U \in \mathbb{R}^{m \times m}, V \in \mathbb{R}^{n \times n}$ be orthogonal. Then

$$ \|A\|_{\text{RMS}} = \|UAV^T\|_{\text{RMS}}. $$

Proof: For brevity, call $B := UAV^T$. Since

$$ \|M\|_{\text{RMS}} = \sqrt{\frac{n}{m}} \sigma_{\text{max}}(M), $$

it suffices to show that $B$ and $A$ have the same singular values.

Say $A = LSR^T$ is the SVD of $A$. Without loss of generality, order the singular values in descending magnitude, so

$$ S_{11} \geq S_{22} \geq \dots $$

So we can write $B$ as

$$ B = U A V^T = (UL)\, S\, (VR)^T. $$

But since $U, L, R, V$ are all orthogonal matrices, the above is an SVD of $B$. So the two matrices have the same maximum singular value,

$$ \sigma_{\text{max}}(A) = S_{11} = \sigma_{\text{max}}(B). $$

Therefore

$$ \|A\|_{\text{RMS}} = \sqrt{\frac{n}{m}} \sigma_{\text{max}}(A) = \sqrt{\frac{n}{m}} \sigma_{\text{max}}(B) = \|B\|_{\text{RMS}}. $$

End proof.
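
And the corresponding check for Fact 3 (same assumptions as the check for Fact 2):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.normal(size=(m, n))
U = np.linalg.qr(rng.normal(size=(m, m)))[0]
V = np.linalg.qr(rng.normal(size=(n, n)))[0]

s_A = np.linalg.svd(A, compute_uv=False)
s_B = np.linalg.svd(U @ A @ V.T, compute_uv=False)
print(np.allclose(s_A, s_B))  # True: same singular values, hence the same RMS norm
```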


  1. In short, the Cauchy-Schwarz inequality states that for any two vectors $x, y \in \mathbb{R}^{d}$, the magnitude of their inner product is at most the product of their norms: $|x^T y| \leq \|x\|_{2} \|y\|_{2}$. If you haven’t seen it before, it’s worth trying to prove. Here’s a sketch: restrict your focus to the case where both $x$ and $y$ have unit norm, $\|x\|_2 = \|y\|_2 = 1$. Assume you have some pair $x, y$ with $x^T y > 1$. Then what must be true about $\|x - y\|_2^2$?

  2. Cauchy-Schwarz type inequalities only hold for norms induced by inner products. Specifically, say $V$ is a vector space equipped with an inner product $\langle \cdot, \cdot \rangle_{V}$. This inner product defines a norm on $V$ as well, given by $\|x\|_{V} := \sqrt{\langle x, x \rangle_{V}}$. Using this norm, we have a Cauchy-Schwarz inequality: $\langle x, y \rangle_{V} \leq \|x\|_{V}\|y\|_{V}$. These inequalities do not necessarily hold for other norms. For example, consider the vector $x \in \mathbb{R}^{2}$ given by $x = (1, 1)$, and consider the $\ell_{\infty}$ norm, $\|y\|_\infty := \max_i |y_i|$. The Cauchy-Schwarz inequality does not hold for the pairing between the standard inner product and the $\ell_{\infty}$ norm: $\langle x, x \rangle_{\mathbb{R}^{2}} = 2 > \|x\|_{\infty}^{2} = 1$. In the Muon setting, we cannot use Cauchy-Schwarz because the Frobenius inner product does not induce the RMS norm: $\sqrt{\langle X, X \rangle_{F}} = \sqrt{\mathrm{Tr}(X^T X)} = \sqrt{\sum_{i} \sigma_{i}^{2}(X)} \geq \sigma_{\text{max}}(X) = \|X\|_{*}$.

  3. The cyclic trace property states that $\mathrm{Tr}(ABC) = \mathrm{Tr}(BCA) = \mathrm{Tr}(CAB)$. It is quite easy to prove, and worth doing as an exercise. Start by showing that the trace is commutative, meaning $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$.