michael orlitzky

A non-proof of the Lagrange multiplier theorem

It is a melancholy experience for a professional mathematician to find himself writing about mathematics. The function of a mathematician is to do something, to prove new theorems, to add to mathematics, and not to talk about what he or other mathematicians have done… Exposition, criticism, appreciation, is work for second-rate minds.

G. H. Hardy, in A Mathematician’s Apology

Introduction

Lagrange multipliers are a way to solve optimization problems (to minimize or maximize some function) subject to equality constraints. For example, if $f$ is a real-valued function defined on the $n$-dimensional real space $\Rn$, we might want to $$ \begin{equation} \begin{aligned} & \text{maximize} & f\of{\vector{x}} & \\ & \text{subject to} & g_{1}\of{\vector{x}} & = 0,\\ &   & g_{2}\of{\vector{x}} & = 0,\\ &   & \vdots\\ &   & g_{m}\of{\vector{x}} & = 0,\\ & \text{and} & \vector{x} & \in \Rn. \end{aligned} \label{eq:optimization-problem} \end{equation} $$

Here, the function $f$ is the thing we want to optimize, and the collection of functions $g_{1},g_{2},\ldots,g_{m}$ is like a list of properties that we want the solution—the optimal point—to have. The functions $g_{1}$ through $g_{m}$ are called constraints, and the whole problem is referred to as a constrained optimization problem.
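To make problem $\eqref{eq:optimization-problem}$ concrete, here is a minimal numeric sketch. The objective $f\of{x,y} = x + y$ and the single constraint $g_{1}\of{x,y} = x^{2} + y^{2} - 1$ are hypothetical choices of mine, not from the text; every feasible point lies on the unit circle, so brute-force search over a parameterization of the circle finds the constrained maximizer.

```python
import math

# Hypothetical instance of the problem above: maximize f(x, y) = x + y
# subject to the single equality constraint g1(x, y) = x^2 + y^2 - 1 = 0.
def f(x, y):
    return x + y

def g1(x, y):
    return x**2 + y**2 - 1

# Every feasible point is x = cos(t), y = sin(t) for some t in [0, 2*pi),
# so we can search the feasible set directly by brute force.
best_t = max((k * 2 * math.pi / 100000 for k in range(100000)),
             key=lambda t: f(math.cos(t), math.sin(t)))

# The maximizer should be (1/sqrt(2), 1/sqrt(2)), i.e. t = pi/4.
x_opt, y_opt = math.cos(best_t), math.sin(best_t)
```

Note that the search never looks outside the feasible set; the constraint is baked into the parameterization.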

The method of Lagrange multipliers solves problem $\eqref{eq:optimization-problem}$ and, like most dark magic, is often presented as mathematics. You invent $m$ new variables $\lambda_{1},\lambda_{2},\ldots,\lambda_{m}$ and then define a function $\mathcal{L} : \Rnplusm \rightarrow \realline$ of all $n+m$ variables, called the Lagrange function, and somehow the gradient of the Lagrange function being zero is necessary at a solution to the original problem. Usually this hand waving is accompanied by some pictures of ellipses that only obscure the main idea:

  1. There is a necessary condition for optimality in $\eqref{eq:optimization-problem}$ that looks easy to prove.
  2. But it isn't.
  3. The function $\gradient{\mathcal{L}}$ is an obnoxious way of packing that condition onto one line.

Most people give you either the intuition (looks easy to prove), or the formal proof that uses some powerful calculus stuff for no apparent reason. I won't name names (WIKIPEDIA), but a few people try to pass off the intuition as a proof. This article is about the gap between the two. Specifically, what is missing from the intuitive explanation that we need for a formal proof? To that end, I'm going to give an erroneous proof of the following theorem:

Theorem 1

If $\vector{x_{0}} = \colvec{x_{1},x_{2},\ldots,x_{n}}$ is a solution to problem $\eqref{eq:optimization-problem}$, and if each $g_{i}$ and $f$ is continuously-differentiable at $\vector{x_{0}}$, and if the set $\setc{ \gradient{g_{i}}\of{\vector{x_{0}}} }{ i \in \set{1,2,\ldots,m} }$ is linearly-independent, then there exist real numbers $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{m}$ such that $$ \gradient{\mathcal{L}}\of{ \colvec{x_{1}, x_{2}, \ldots, x_{n}, \lambda_{1}, \lambda_{2}, \ldots, \lambda_{m} } } = \vector{0}, $$ where the function $\mathcal{L} : \Rnplusm \rightarrow \realline$ is defined by $$ \begin{align*} \mathcal{L}\of{ \colvec{x_{1}, x_{2}, \ldots, x_{n}, \lambda_{1}, \lambda_{2}, \ldots, \lambda_{m} } } & = f\of{ \colvec{x_{1},x_{2},\ldots,x_{n}} }\\ &- \sum_{i=1}^{m} \lambda_{i}g_{i}\of{\colvec{x_{1},x_{2},\ldots,x_{n}}}. \end{align*} $$
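Theorem 1 can be sanity-checked numerically on a hypothetical instance (my choice, not the article's): maximize $f\of{x_{1},x_{2}} = x_{1} + x_{2}$ subject to $g_{1}\of{x_{1},x_{2}} = x_{1}^{2} + x_{2}^{2} - 1 = 0$. The solution is $x_{1} = x_{2} = 1/\sqrt{2}$, and the multiplier $\lambda_{1} = 1/\sqrt{2}$ kills every component of $\gradient{\mathcal{L}}$.

```python
import math

# Check Theorem 1's conclusion at the known solution of the hypothetical
# problem: maximize x1 + x2 subject to x1^2 + x2^2 - 1 = 0.
x1 = x2 = 1 / math.sqrt(2)
lam = 1 / math.sqrt(2)

# The components of grad L, written out from L = f - lam * g1:
grad_L = (
    1 - lam * 2 * x1,       # d/dx1:      f_x1 = 1, (g1)_x1 = 2*x1
    1 - lam * 2 * x2,       # d/dx2:      f_x2 = 1, (g1)_x2 = 2*x2
    -(x1**2 + x2**2 - 1),   # d/dlambda1: -g1
)
```

Every component comes out as zero (up to floating-point rounding), exactly as the theorem demands.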

I said it was obnoxious. The idea is that the shortcomings in the non-proof will illuminate the need for more powerful tools.

Background

G. H. Hardy, brilliant mathematician and occasional jackass

Ben Orlin

From linear algebra, you will need to know what an inner product $\ip{\vector{v}}{\vector{w}}$ between two vectors $\vector{v}$ and $\vector{w}$ is. The “dot” operation in $\alpha \cdot \vector{x}$ means scalar multiplication; the vector dot product is used only by women and children. The norm of a vector $\vector{x}$ is $\sqrt{ \ip{\vector{x}}{\vector{x}} }$ and is written $\norm{\vector{x}}$. If $V$ is a vector space and if $W$ is a subspace of $V$, then $V$ is a direct sum of $W$ and its orthogonal complement $W^{\perp}$, written $$ V = \directsum{W}{W^{\perp}}. $$ This means that any $\vector{v} \in V$ can be written uniquely as a sum $\vector{v} = \vector{w_{1}} + \vector{w_{2}}$ where $\vector{w_{1}} \in W$ and $\vector{w_{2}} \in W^{\perp}$ are orthogonal. The span of a subset $X \subseteq V$ is the smallest subspace of $V$ containing $X$, and is denoted by $\spanof{X}$. Equivalently, $\spanof{X}$ is the set of all linear combinations in $V$ of the elements of $X$. The book Linear Algebra Done Right by Sheldon Axler is an outstanding reference for all of this.

Now, calculus. So sorry. First, all vectors are column vectors, so $\colvec{x_{1},x_{2},\ldots,x_{n}}$ is what an element of $\Rn$ looks like.

Notation warning: for certain things to make sense we should really be using $\Rnbyone$ instead of $\Rn$, but they're isomorphic and it's standard practice in mathematics to not make any sense, so we'll stick with $\Rn$. In that same spirit, we'll use $\realline$ in most places where $\Ronebyone$ would be more appropriate.

The gradient of a function $f : \Rn \rightarrow \realline$ is another function $$ \gradient{f} : \Rn \rightarrow \Rn $$ defined by $$ \gradient{f}\of{ \colvec{x_{1}, x_{2}, \ldots, x_{n}} } = \begin{bmatrix} f_{x_{1}}\of{\colvec{x_{1}, x_{2}, \ldots, x_{n}}}\\ f_{x_{2}}\of{\colvec{x_{1}, x_{2}, \ldots, x_{n}}}\\ \vdots\\ f_{x_{n}}\of{\colvec{x_{1}, x_{2}, \ldots, x_{n}}} \end{bmatrix}. $$ The symbol $f_{x_{i}}$ denotes the partial derivative of $f$ with respect to the variable $x_{i}$, and is somewhat standard. More precisely, if $f$ is a function of $\colvec{x_{1}, x_{2}, \ldots, x_{n}}$, then $f_{x_{i}}$ is the directional derivative of $f$ in the direction of the $i$th standard basis vector in $\Rn$. To save space, we will write a single $\vector{x}$ to represent the entire column vector $\colvec{x_{1}, x_{2}, \ldots, x_{n}}$. With this new notation, the formula above can be abbreviated, $$ \gradient{f}\of{ \vector{x} } = \begin{bmatrix} f_{x_{1}}\of{\vector{x}}\\ f_{x_{2}}\of{\vector{x}}\\ \vdots\\ f_{x_{n}}\of{\vector{x}} \end{bmatrix}. $$
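If you'd like to compute gradients without doing any calculus by hand, each partial derivative $f_{x_{i}}$ can be approximated by a central difference. This is a generic numerical sketch, not anything from the text; the example function $f\of{\vector{x}} = x_{1}^{2} + 3x_{2}$ is hypothetical.

```python
def gradient(f, x, h=1e-6):
    # Central-difference approximation of the gradient of f: R^n -> R,
    # one partial derivative f_{x_i} per coordinate of x.
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# For f(x) = x1^2 + 3*x2, the true gradient is (2*x1, 3).
f = lambda x: x[0]**2 + 3 * x[1]
approx = gradient(f, [2.0, 5.0])
```

At the point $\colvec{2, 5}$ the approximation lands on $\colvec{4, 3}$ to several digits.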

At any point $\vector{x_{0}} \in \Rn$, we can express $f$ in terms of its Taylor series expansion, $$ \begin{equation} f\of{ \vector{x_{0}} + \vector{p} } = f\of{ \vector{x_{0}} } + \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\vector{p}} + \text{ other stuff}. \label{eq:taylor-series-f} \end{equation} $$ By choosing $\vector{p}$ wisely in equation $\eqref{eq:taylor-series-f}$, one can find the value of $f$ at any point in $\Rn$. The “other stuff” in equation $\eqref{eq:taylor-series-f}$ is some remainder term that goes to zero as the norm of $\vector{p}$ goes to zero. Since the remainder goes to zero, we can simply drop it from equation $\eqref{eq:taylor-series-f}$ to obtain what is called a first-order approximation of $f$ that is valid whenever $\norm{\vector{p}}$ is small:

$$ \begin{equation} f\of{ \vector{x_{0}} + \vector{p} } \approx f\of{ \vector{x_{0}} } + \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\vector{p}}. \label{eq:linear-approx-f} \end{equation} $$ The “approximately equals” relationship here means that by choosing a $\vector{p}$ with a small enough norm on the left-hand side, we can get as close as we want to the value on the right-hand side. In particular, if the value on the right-hand side is positive, then we can make the left-hand side positive.

The approximation $\eqref{eq:linear-approx-f}$ tells us what happens when we move a small distance (corresponding to the vector $\vector{p}$) away from the starting point $\vector{x_{0}}$. In particular, it tells us that the change in $f$ near $\vector{x_{0}}$ depends only on the inner product $\ip{ \gradient{f}\of{ \vector{x_{0}} }}{\vector{p}}$ which is zero if and only if $\vector{p}$ and $\gradient{f}\of{ \vector{x_{0}}}$ are orthogonal. In other words, $$ f\of{ \vector{x_{0}} + \vector{p} } - f\of{ \vector{x_{0}} } \approx 0 \iff \vector{p} \in \spanof{\set{\gradient{f}\of{ \vector{x_{0}} }}}^{\perp}. $$
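Both claims, that the error in the first-order approximation dies off as $\norm{\vector{p}}$ shrinks, and that a $\vector{p}$ orthogonal to the gradient changes $f$ only at second order, are easy to observe numerically. The function $f\of{x,y} = x^{2} + y^{2}$ below is a hypothetical example of mine.

```python
# f(x, y) = x^2 + y^2 has grad f = (2x, 2y). At x0 = (1, 2) the gradient
# is (2, 4), and any multiple of p = (-4, 2) is orthogonal to it.
f = lambda x, y: x**2 + y**2
x0, y0 = 1.0, 2.0
gx, gy = 2 * x0, 2 * y0

def first_order_error(px, py):
    # | f(x0 + p) - ( f(x0) + <grad f(x0), p> ) |
    return abs(f(x0 + px, y0 + py) - (f(x0, y0) + gx * px + gy * py))

# Halving the norm of p should cut the error by about a factor of 4,
# because the "other stuff" is quadratic in p.
e1 = first_order_error(-4e-3, 2e-3)
e2 = first_order_error(-2e-3, 1e-3)
```

Since the chosen $\vector{p}$ is orthogonal to the gradient, the entire change in $f$ is that tiny second-order error; to first order, $f$ doesn't move.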

This material can be found in any book on multivariable calculus. Khan Academy has an introductory article about the gradient, which is a special case of the Jacobian matrix. A thorough treatment is given in Advanced Calculus of Several Variables by C.H. Edwards, Jr. You are also welcome to try the Principles of Mathematical Analysis by Walter Rudin, but wear a helmet.

Intuition

Theorem 1 states precise differentiability requirements, but for simplicity, in this section, you can just assume that all functions are smooth. Recall from kindergarten that if $\vector{x_{0}}$ is a global maximum of $f$, then we must have $\gradient{f}\of{ \vector{x_{0}} } = \vector{0}$. In fact, this follows from our first-order approximation of $f$. To see why, suppose that $\gradient{f}\of{ \vector{x_{0}} }$ is not zero. In that case, we can let $\vector{p} = \alpha \cdot \gradient{f}\of{ \vector{x_{0}} }$ for some small positive real number $\alpha$ in $\eqref{eq:linear-approx-f}$ to find $$ \begin{align*} f\of{ \vector{x_{0}} + \alpha \cdot \gradient{f}\of{ \vector{x_{0}} } } & \approx f\of{ \vector{x_{0}} } + \ip{ \gradient{f}\of{ \vector{x_{0}} } } { \alpha \cdot \gradient{f}\of{ \vector{x_{0}} }} \\ & = f\of{ \vector{x_{0}} } + \alpha \norm{\gradient{f}\of{ \vector{x_{0}} }}^{2}\\ & > f\of{ \vector{x_{0}} }. \end{align*} $$ The small positive scalar multiple is necessary because $\eqref{eq:linear-approx-f}$ is only valid close to $\vector{x_{0}}$. The inequality above would contradict the fact that $\vector{x_{0}}$ is a global maximum of $f$, so it must be the case that $\gradient{f}\of{ \vector{x_{0}} } = \vector{0}$ at such a point. Viewed the other way around, having $\gradient{f}\of{ \vector{x_{0}} } = \vector{0}$ ensures that there is no direction $\vector{p}$ that will increase the value of $f$.
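The ascent argument above is easy to watch happen. Here's a sketch with a hypothetical function $f\of{x,y} = -\qty{x-3}^{2} - y^{2}$ (my example, not the article's), starting from a point where the gradient is nonzero.

```python
# Sketch of the ascent argument: if grad f(x0) != 0, then a small step
# p = alpha * grad f(x0) increases f. Hypothetical f(x, y) = -(x-3)^2 - y^2.
f = lambda x, y: -(x - 3)**2 - y**2
x0, y0 = 1.0, 1.0
gx, gy = -2 * (x0 - 3), -2 * y0   # grad f(x0) = (4, -2), which is nonzero

alpha = 1e-3
f_before = f(x0, y0)
f_after = f(x0 + alpha * gx, y0 + alpha * gy)
# The increase f_after - f_before is approximately
# alpha * ||grad f(x0)||^2, which is strictly positive.
```

So a point with a nonzero gradient cannot be a global maximum: the gradient step just beat it.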

Ok. That's all well and good, but what does finding a global maximum of $f$ have to do with the constrained problem $\eqref{eq:optimization-problem}$? Well, we would like to find a necessary condition for a point $\vector{x_{0}}$ to be optimal in the constrained problem. And, after staring at $\eqref{eq:optimization-problem}$ long enough, one necessary condition reveals itself: we must have $g_{i}\of{\vector{x_{0}}} = 0$ for each constraint function $g_{i}$, because the problem says so. Points satisfying all of the constraints in an optimization problem are called feasible points. The set of feasible points in problem $\eqref{eq:optimization-problem}$ is $\setc{ \vector{x} \in \Rn }{ g_{1}\of{\vector{x}} = g_{2}\of{\vector{x}} = \cdots = g_{m}\of{\vector{x}} = 0 }$. Thus we can restate: if a point $\vector{x_{0}}$ is optimal in $\eqref{eq:optimization-problem}$, then it must be feasible.

That's one necessary condition, but we really want one involving $f$. If we already know that $\vector{x_{0}}$ is feasible, then in a sense, the necessary condition involving $f$ should be “easier” than the condition for the unconstrained problem. There we needed $\gradient{f}\of{ \vector{x_{0}} } = \vector{0}$ to ensure that no direction would increase the value of $f$. But in the constrained problem, if we start at a feasible point $\vector{x_{0}}$, then we only need to ensure that $f$ doesn't increase in certain directions—namely those that will move us to another feasible point. This is the same sense in which it's easier to pick all winning lottery numbers than it is to pick all losing ones.

More concretely, at a feasible point $\vector{x_{0}}$, we only have to worry about the directions $\vector{p}$ with $$ \begin{equation} g_{i}\of{\vector{x_{0}} + \vector{p}} = 0 \text{ for all } i \in \set{1,2,\ldots,m}, \label{eq:remain-feasible-g} \end{equation} $$ because otherwise, $\vector{x_{0}} + \vector{p}$ will be an infeasible point: for some $g_{i}$, we would have $g_{i}\of{\vector{x_{0}} + \vector{p}} \ne 0$. Recall our first-order approximation, this time using $g_{i}$ instead of $f$: $$ g_{i}\of{ \vector{x_{0}} + \vector{p} } \approx g_{i}\of{ \vector{x_{0}} } + \ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p}} \text{ for all } i \in \set{1,2,\ldots,m}. $$ Since each $g_{i}\of{ \vector{x_{0}} }$ is zero by the assumption that $\vector{x_{0}}$ is feasible, we conclude that $$ \begin{equation} g_{i}\of{ \vector{x_{0}} + \vector{p} } \approx \ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p}} \text{ for all } i \in \set{1,2,\ldots,m}. \label{eq:linear-approx-g} \end{equation} $$ After combining $\eqref{eq:remain-feasible-g}$ and $\eqref{eq:linear-approx-g}$, it should become clear that we only have to worry about those $\vector{p}$ such that, for all $i \in \set{1,2,\ldots,m}$, we have $\ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p}} = 0$. Or to rephrase, we are only worried about the vectors $\vector{p}$ that are orthogonal to every element of the set $\setc{ \gradient{g_{i}}\of{\vector{x_{0}}} }{ i \in \set{1,2,\ldots,m} }$, and thus to its span. Because, to really hammer this home, $$ \begin{equation} \begin{aligned} g_{i}\of{\vector{x_{0}} + \vector{p}} \approx & \ \ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p}} \approx \text{ nonzero, for some } i \in \set{1,2,\ldots,m}\\ & \ \text{ if } \vector{p} \notin \spanof{ \setc{ \gradient{g_{i}}\of{ \vector{x_{0}} } } { i \in \set{1,2,\ldots,m} } }^{\perp}. \end{aligned} \label{eq:first-span-appearance} \end{equation} $$
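Here's what that orthogonality test looks like in a hypothetical example of mine with $m = 2$ constraints in $\Rn[3]$: $g_{1}\of{x,y,z} = x^{2} + y^{2} + z^{2} - 1$ and $g_{2}\of{x,y,z} = z$, whose feasible set is the unit circle in the plane $z = 0$.

```python
# At the feasible point x0 = (1, 0, 0) of the hypothetical constraints
# g1 = x^2 + y^2 + z^2 - 1 and g2 = z, the directions p we have to worry
# about are exactly those orthogonal to both constraint gradients.
grad_g1 = (2.0, 0.0, 0.0)   # grad g1 at (1, 0, 0)
grad_g2 = (0.0, 0.0, 1.0)   # grad g2 at (1, 0, 0)

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

def is_worrisome(p):
    # p lies in span{grad g1, grad g2}^perp iff both inner products vanish.
    return dot(grad_g1, p) == 0 and dot(grad_g2, p) == 0

# (0, 1, 0) keeps both linearized constraints at zero; (1, 0, 0) does not.
```

Geometrically, $\colvec{0, 1, 0}$ is tangent to the feasible circle at $\colvec{1, 0, 0}$, while $\colvec{1, 0, 0}$ points straight off the sphere.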

The “span” in $\eqref{eq:first-span-appearance}$ is going to appear a lot from now on, so let's give it a name. Whenever problem $\eqref{eq:optimization-problem}$ is in scope we will let $$ \begin{equation} S = \spanof{ \setc{ \gradient{g_{i}}\of{ \vector{x_{0}} } } { i \in \set{1,2,\ldots,m} } }, \label{eq:definition-of-S} \end{equation} $$ and we can immediately put this notation to good use. The approximation of $f$ in $\eqref{eq:linear-approx-f}$ held for all directions $\vector{p}$, so it holds in particular for the directions that we just decided we care about: $$ \begin{equation*} f\of{ \vector{x_{0}} + \vector{p} } \approx f\of{ \vector{x_{0}} } + \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\vector{p}} \text{ for all } \vector{p} \in S^{\perp}. \end{equation*} $$ (I'm not going to keep saying it, but in all of these approximations, the norm of $\vector{p}$ must be small. It doesn't really matter. If you have a vector in mind, scale it by a small positive amount to make its norm small.) By analogy with what we did earlier, we need to ensure that this is not greater than $f\of{ \vector{x_{0}} }$, and to do that, we'll need $$ \begin{equation*} \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\vector{p}} = 0 \text{ for all } \vector{p} \in S^{\perp} \iff \gradient{f}\of{ \vector{x_{0}} } \in \qty{S^{\perp}}^{\perp} = S. \end{equation*} $$ Since the span of a set of vectors is the set of all linear combinations of those vectors, this last condition is saying that there exist real numbers $\lambda_{1},\lambda_{2},\ldots,\lambda_{m}$ such that $\gradient{f}\of{ \vector{x_{0}} }$ can be written as the linear combination $\sum_{i=1}^{m} \lambda_{i} \cdot \gradient{g_{i}}\of{ \vector{x_{0}} }$. And that is our necessary condition. Note that this is, in some sense, an easier job than we had before when we wanted a global maximum.
Here, $\lambda_{1} = \lambda_{2} = \cdots = \lambda_{m} = 0$ corresponds to the $\gradient{f}\of{\vector{x_{0}}} = \vector{0}$ condition that we had for a global maximum. But now, some other non-zero values of the $\lambda_{i}$ might work too. So, there are (in general) more of them, and they should be easier to find.
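To see the necessary condition in action, take the same hypothetical pair of constraints in $\Rn[3]$ as before (my example, not from the text): maximize $f\of{x,y,z} = x$ subject to $g_{1} = x^{2} + y^{2} + z^{2} - 1 = 0$ and $g_{2} = z = 0$. The feasible set is the unit circle in the plane $z = 0$, so the maximizer is $\colvec{1, 0, 0}$, and the gradient of $f$ there really is a linear combination of the constraint gradients.

```python
# Necessary condition with m = 2 constraints in R^3. At the maximizer
# x0 = (1, 0, 0) of the hypothetical problem "maximize x subject to
# x^2 + y^2 + z^2 = 1 and z = 0":
grad_f  = (1.0, 0.0, 0.0)
grad_g1 = (2.0, 0.0, 0.0)   # grad g1 at x0
grad_g2 = (0.0, 0.0, 1.0)   # grad g2 at x0

# grad f(x0) lies in S = span{grad g1(x0), grad g2(x0)}, witnessed by
# the multipliers lambda_1 = 1/2 and lambda_2 = 0:
lam1, lam2 = 0.5, 0.0
residual = tuple(gf - lam1 * g1 - lam2 * g2
                 for gf, g1, g2 in zip(grad_f, grad_g1, grad_g2))
```

A zero residual is the witness: those two multipliers express $\gradient{f}\of{\vector{x_{0}}}$ as $\sum_{i} \lambda_{i} \cdot \gradient{g_{i}}\of{\vector{x_{0}}}$.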

The unproof

In the introduction, I made three claims about problem $\eqref{eq:optimization-problem}$ and the Lagrange function $\mathcal{L}$:

  1. There is a necessary condition for optimality in $\eqref{eq:optimization-problem}$ that looks easy to prove.
  2. But it isn't.
  3. The function $\gradient{\mathcal{L}}$ is an obnoxious way of packing that condition onto one line.

Keeping in mind the definition of the subspace $S$ from $\eqref{eq:definition-of-S}$, let's get to the point, and prove Theorem 1 based on our intuition.

Proposition 2

If $\vector{x_{0}} \in \Rn$ is a feasible point in problem $\eqref{eq:optimization-problem}$, and if each $g_{i}$ and $f$ is continuously-differentiable at $\vector{x_{0}}$, and if $\dim\of{S} = m$, and if $\gradient{f}\of{ \vector{x_{0}} } \notin S$, then $\vector{x_{0}}$ is not optimal—that is, it does not maximize $f$—in problem $\eqref{eq:optimization-problem}$.

Proof. Assume that $\gradient{f}\of{ \vector{x_{0}} } \notin S$, and decompose the ambient space $\Rn$ into two orthogonal subspaces, $$ \Rn = \directsum{S}{S^{\perp}}. $$

Having done so, we can express $\gradient{f}\of{ \vector{x_{0}} }$ as the sum of two vectors, one from each component in the direct sum above:

$$ \begin{equation*} \gradient{f}\of{ \vector{x_{0}} } = \vector{p_{1}} + \vector{p_{2}}, \text{ where } \vector{p_{1}} \in S \text{ and } \vector{p_{2}} \in S^{\perp}. \end{equation*} $$

And since we have assumed that $\gradient{f}\of{ \vector{x_{0}} } \notin S$, we can conclude that $\vector{p_{2}} \ne \vector{0}$, because, otherwise, $\gradient{f}\of{ \vector{x_{0}} } = \vector{p_{1}}$, and $\vector{p_{1}}$ lies in $S$ contrary to that assumption.

Now we're going to substitute $\vector{p_{2}}$ into our linear approximations $\eqref{eq:linear-approx-f}$ and $\eqref{eq:linear-approx-g}$ for $f$ and $g$, after scaling it by some small positive $\alpha$ to ensure that the approximations are valid: $$ \begin{equation} \begin{aligned} f\of{ \vector{x_{0}} + \alpha\vector{p_{2}} } & \approx f\of{ \vector{x_{0}} } + \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\alpha\vector{p_{2}}}\\ & = f\of{ \vector{x_{0}} } + \ip{ \vector{p_{1}} + \vector{p_{2}} }{\alpha\vector{p_{2}}}\\ & = f\of{ \vector{x_{0}} } + \alpha \norm{\vector{p_{2}}}^{2}\\ & > f\of{ \vector{x_{0}} }, \end{aligned} \label{eq:f contradiction} \end{equation} $$ $$ \begin{equation} g_{i}\of{ \vector{x_{0}} + \alpha\vector{p_{2}} } \approx \alpha\ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p_{2}}} = 0, \text{ for all } i \in \set{1,2,\ldots,m}. \label{eq:g-contradiction} \end{equation} $$ The inequality in $\eqref{eq:f contradiction}$ shows that if we move in the direction $\vector{p_{2}}$ from the feasible starting point $\vector{x_{0}}$, then we increase the value of $f$. And $\eqref{eq:g-contradiction}$ shows that the resulting point will remain feasible. Thus, the existence of $\vector{p_{2}}$ shows that $\vector{x_{0}}$ does not maximize $f$ in $\eqref{eq:optimization-problem}$.

I won't keep you in suspense: the error is in that proof. The proposition itself is true, but the proof is no good. If you were paying attention, you might have noticed that we didn't use the assumption that our functions are continuously-differentiable anywhere; nor did we use the fact that $\dim\of{S} = m$. That should set off some alarms. The next two results are also true, and their proofs… are not wrong on purpose. The following is merely a logical consequence of Proposition 2, after taking into consideration that all solutions are feasible.

Corollary 3

If $\vector{x_{0}}$ is a solution to problem $\eqref{eq:optimization-problem}$, and if each $g_{i}$ and $f$ is continuously-differentiable at $\vector{x_{0}}$, and if $\dim\of{S} = m$, then $\gradient{f}\of{ \vector{x_{0}} } \in S$ and $g_{i}\of{\vector{x_{0}}} = 0$ for all $i \in \set{1,2,\ldots,m}$.

All that remains is to show that the Lagrange function $\mathcal{L}$ is but gift wrapping. Below we restate Theorem 1, having replaced the condition that $\setc{ \gradient{g_{i}}\of{\vector{x_{0}}} }{ i \in \set{1,2,\ldots,m} }$ be linearly-independent with the equivalent statement that $\dim\of{S} = m$.

Theorem 1

If $\vector{x_{0}} = \colvec{x_{1},x_{2},\ldots,x_{n}}$ is a solution to problem $\eqref{eq:optimization-problem}$, and if each $g_{i}$ and $f$ is continuously-differentiable at $\vector{x_{0}}$, and if $\dim\of{S} = m$, then there exist real numbers $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{m}$ such that $$ \gradient{\mathcal{L}}\of{ \colvec{x_{1}, x_{2}, \ldots, x_{n}, \lambda_{1}, \lambda_{2}, \ldots, \lambda_{m} } } = \vector{0}, $$ where the function $\mathcal{L} : \Rnplusm \rightarrow \realline$ is defined by $$ \begin{align*} \mathcal{L}\of{ \colvec{x_{1}, x_{2}, \ldots, x_{n}, \lambda_{1}, \lambda_{2}, \ldots, \lambda_{m} } } & = f\of{ \colvec{x_{1},x_{2},\ldots,x_{n}} }\\ &- \sum_{i=1}^{m} \lambda_{i}g_{i}\of{\colvec{x_{1},x_{2},\ldots,x_{n}}}. \end{align*} $$

Proof. Assuming that $\mathcal{L}$ is defined as above, we will proceed by proving that the following are equivalent:

  1. There exist $\lambda_{1},\lambda_{2},\ldots,\lambda_{m} \in \realline$ such that $\gradient{\mathcal{L}}\of{ \colvec{x_{1}, x_{2}, \ldots, x_{n}, \lambda_{1}, \lambda_{2}, \ldots, \lambda_{m} } } = \vector{0}$.
  2. $\gradient{f}\of{ \vector{x_{0}} } \in S$ and $g_{i}\of{ \vector{x_{0}} } = 0$ for all $i \in \set{1,2,\ldots,m}$.
In other words, we will show that the condition in the theorem that involves $\gradient{\mathcal{L}}$ is the same as the condition in Corollary 3. The result then follows from that corollary, which we already believe. The equivalence is not hard to see; simply write out what the first condition means: $$ \begin{gather*} \exists \lambda_{1},\lambda_{2},\ldots,\lambda_{m} \in \realline : \gradient{\mathcal{L}}\of{ \colvec{x_{1}, x_{2}, \ldots,x_{n}, \lambda_{1}, \lambda_{2}, \ldots, \lambda_{m} } } = \vector{0}\\ \Updownarrow\\ \exists \lambda_{1},\lambda_{2},\ldots,\lambda_{m} \in \realline : \left\lbrace \begin{aligned} f_{x_{1}}\of{ \vector{x_{0}} } - \sum_{i=1}^{m} \lambda_{i}\qty{g_{i}}_{x_{1}}\of{ \vector{x_{0}} } & = 0\\ f_{x_{2}}\of{ \vector{x_{0}} } - \sum_{i=1}^{m} \lambda_{i}\qty{g_{i}}_{x_{2}}\of{ \vector{x_{0}} } & = 0\\ & \vdots\\ f_{x_{n}}\of{ \vector{x_{0}} } - \sum_{i=1}^{m} \lambda_{i}\qty{g_{i}}_{x_{n}}\of{ \vector{x_{0}} } & = 0\\ - g_{1}\of{\vector{x_{0}}} & = 0\\ - g_{2}\of{\vector{x_{0}}} & = 0\\ & \vdots\\ - g_{m}\of{\vector{x_{0}}} & = 0 \end{aligned}\right.\\ \Updownarrow\\ \exists \lambda_{1},\lambda_{2},\ldots,\lambda_{m} \in \realline : \left\lbrace \begin{aligned} \gradient{f}\of{ \vector{x_{0}} } & = \sum_{i=1}^{m} \lambda_{i} \cdot \gradient{g_{i}}\of{ \vector{x_{0}}}\\ g_{1}\of{ \vector{x_{0}} } & = 0\\ g_{2}\of{ \vector{x_{0}} } & = 0\\ & \vdots\\ g_{m}\of{ \vector{x_{0}} } & = 0\\ \end{aligned}\right.\\ \Updownarrow\\ \gradient{f}\of{ \vector{x_{0}} } \in \spanof{ \setc{\gradient{g_{i}}\of{ \vector{x_{0}} }}{i \in \set{1,2,\ldots,m}} }\\ \text{ and }\\ g_{i}\of{ \vector{x_{0}} } = 0 \text{ for all } i \in \set{1,2,\ldots,m}\\ \Updownarrow\\ \gradient{f}\of{ \vector{x_{0}} } \in S \text { and } g_{i}\of{ \vector{x_{0}} } = 0 \text{ for all } i \in \set{1,2,\ldots,m}. \end{gather*} $$ This shows that the two conditions are equivalent. All done.

What went wrong

In the proof of Proposition 2, we derived approximation $\eqref{eq:g-contradiction}$ which bears repeating: $$ \begin{equation*} g_{i}\of{ \vector{x_{0}} + \alpha\vector{p_{2}} } \approx \alpha\ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p_{2}}} = 0, \text{ for all } i \in \set{1,2,\ldots,m}. \end{equation*} $$ This showed that some small scalar multiple of $\vector{p_{2}}$ would take us to another feasible point. Weeeeeeeeeeeeeelllllllllllllllll not quite:

The “approximately equals” relationship here means that by choosing a $\vector{p_{2}}$ with a small enough norm on the left-hand side, we can get as close as we want to the value on the right-hand side.

me, like ten minutes ago

If $g_{i}\of{ \vector{x_{0}} + \alpha\vector{p_{2}} }$ is “approximately equal” to a positive number, then we can make it close to that positive number. Things close to a positive number are positive; so, we can make $g_{i}\of{ \vector{x_{0}} + \alpha\vector{p_{2}} }$ positive in that case. Likewise, if it's approximately equal to a negative number, then we can make it negative by choosing $\alpha$ small enough. But, here's the catch: being able to make something approximately zero is not the same as being able to make it actually zero!

We needed to find a counterexample that was truly feasible, but instead found one that was only close to feasible.
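The flaw is easy to exhibit numerically. In this sketch (a hypothetical instance of mine), the constraint is $g\of{x,y} = x^{2} + y^{2} - 1$, the feasible point is $\vector{x_{0}} = \colvec{1, 0}$, and the direction $\vector{p_{2}} = \colvec{0, 1}$ is orthogonal to $\gradient{g}\of{\vector{x_{0}}} = \colvec{2, 0}$, so the first-order approximation predicts that $g\of{\vector{x_{0}} + \alpha\vector{p_{2}}} \approx 0$.

```python
# The catch, concretely: moving along the tangent direction p2 = (0, 1)
# from x0 = (1, 0) keeps g(x, y) = x^2 + y^2 - 1 *approximately* zero,
# but the true value is alpha^2, which is never zero for alpha != 0.
g = lambda x, y: x**2 + y**2 - 1

alphas = [10**-k for k in range(1, 6)]
values = [g(1.0, a) for a in alphas]
# values tends to zero as alpha shrinks, but every entry is positive:
# each point x0 + alpha*p2 sits slightly outside the feasible circle.
```

Every perturbed point is approximately feasible and exactly infeasible, which is precisely the gap the non-proof fell into.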

How not to have it go wrong

Death is the solution to all problems.

Joseph Stalin

Keep in mind that we're coming up with necessary conditions for optimality, and they're not unique. Some are better, some are worse; some are better and worse. What I'm trying to say is, there's more than one way to proceed. One way would be to find a better counterexample: we need a $\vector{p_{2}}$ in $\eqref{eq:g-contradiction}$ that moves us to a truly feasible point $\vector{x_{0}} + \vector{p_{2}}$; that is, one that has $g_{i}\of{\vector{x_{0}} + \vector{p_{2}}} = 0$ for all $i$. But we also want that same point to increase the value of $f$ relative to the starting point $\vector{x_{0}}$. Taken together, these aren't such easy goals.

Fortunately, calculus can do the heavy lifting for us, ensuring that such a counterexample would exist in all of the directions we care about. This approach was hinted at by the extra assumptions that I tacked on to all of the results, namely that the functions all be continuously-differentiable at $\vector{x_{0}}$, and that the set $\setc{ \gradient{g_{i}}\of{\vector{x_{0}}} }{ i \in \set{1,2,\ldots,m} }$ be linearly-independent.

If we define the function $G : \Rn \rightarrow \Rm$ by $$ \begin{equation} G\of{\vector{x}} = \begin{bmatrix} g_{1}\of{ \vector{x} }\\ g_{2}\of{ \vector{x} }\\ \vdots\\ g_{m}\of{ \vector{x} } \end{bmatrix}, \label{eq:definition-of-G} \end{equation} $$ then the derivative of $G$ at $\vector{x_{0}}$, where it is continuous, is $$ \begin{equation} G'\of{\vector{x_{0}}} = \begin{bmatrix} \transpose{\gradient{g_{1}}\of{ \vector{x_{0}} }}\\ \transpose{\gradient{g_{2}}\of{ \vector{x_{0}} }}\\ \vdots\\ \transpose{\gradient{g_{m}}\of{ \vector{x_{0}} }} \end{bmatrix}. \label{eq:G-prime-at-x0} \end{equation} $$ The linear-independence of its rows means that $m \le n$ and that $G'\of{\vector{x_{0}}}$ has full rank $m$; in particular, it maps $\Rn$ onto $\Rm$.
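A quick sketch of that Jacobian, using the same hypothetical constraints $g_{1} = x^{2} + y^{2} + z^{2} - 1$ and $g_{2} = z$ at the feasible point $\colvec{1, 0, 0}$ (my example). Full row rank can be checked through the Gram matrix $G'\of{\vector{x_{0}}} \transpose{G'\of{\vector{x_{0}}}}$, which is invertible exactly when the rows are linearly independent.

```python
# The Jacobian G'(x0): one row per transposed constraint gradient.
x, y, z = 1.0, 0.0, 0.0
G_prime = [
    [2 * x, 2 * y, 2 * z],  # grad g1(x0) transposed, i.e. (2, 0, 0)
    [0.0, 0.0, 1.0],        # grad g2(x0) transposed
]

# Full row rank (m = 2 <= n = 3) via the 2x2 Gram matrix G' * G'^T:
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
gram = [[dot(u, v) for v in G_prime] for u in G_prime]
det = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0]
# det is nonzero, so the rows are linearly independent and dim(S) = 2.
```

Here the Gram matrix is diagonal because the two gradients happen to be orthogonal; independence only needs its determinant to be nonzero.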

So now we decompose the space into $\Rn = \directsum{S}{S^{\perp}}$ again, and by cleverly choosing what is what, we can apply some powerful theorem like

  1. the implicit function theorem, or
  2. the rank theorem.
Wielded skillfully, these two theorems tell you exactly what you want to know. The feasible set is a nice smooth surface on which we can draw a bunch of continuous paths, all passing through the point $\vector{x_{0}}$. The derivative of $f$ at $\vector{x_{0}}$ along any of those paths (by the chain rule) must be zero, because $f$, restricted to any such path, attains a maximum at $\vector{x_{0}}$. It turns out that said paths all pass through $\vector{x_{0}}$ in a direction contained in $S^{\perp}$, and so we wind up concluding that $\gradient{f}\of{\vector{x_{0}}}$ must be orthogonal to everything in $S^{\perp}$ to avoid deriving a contradiction. This proves Proposition 2, and the rest follows. See the appendix for the details.
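The path argument can be watched numerically on the hypothetical circle problem from earlier (maximize $f\of{x,y} = x + y$ on the unit circle, my example): parameterize the feasible set by a path through the optimum and differentiate $f$ along it.

```python
import math

# gamma(t) = (cos(pi/4 + t), sin(pi/4 + t)) is a differentiable path
# inside the feasible set (the unit circle), passing through the optimum
# x0 = (1/sqrt(2), 1/sqrt(2)) at t = 0. Since f is maximized there,
# the derivative of f along the path must vanish at t = 0.
f = lambda x, y: x + y
gamma = lambda t: (math.cos(math.pi / 4 + t), math.sin(math.pi / 4 + t))

h = 1e-6
d_along_path = (f(*gamma(h)) - f(*gamma(-h))) / (2 * h)
# d_along_path is a central-difference estimate of (f o gamma)'(0).
```

The derivative along the path comes out as zero (to numerical precision), even though $\gradient{f}$ itself is nowhere zero; the gradient is simply orthogonal to the tangent direction of the path.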

The paper A short elementary proof of the Lagrange multiplier theorem by Brezhneva, Tret'yakov, and Wright tries to get by without any of the powerful theorems that I just said you needed. But, to motivate doing so, they cite other authors who have used…

Chapter 5 of Convex Analysis and Optimization uses a penalty function to prove slightly stronger necessary conditions than the usual ones, and (based on the bibliography) the technique probably goes back to the McShane paper The Lagrange multiplier rule cited by those other authors.

(Please, use Library Genesis and Sci-Hub to find these references.)

In any case, you do have to tack on some extra assumptions to make things work. Doing so has become a cottage industry: these assumptions are called constraint qualifications, and they bought a zoo.

Appendix: a non-non-proof

If the new process paused because it was swapped out, set the stack level to the last call to savu(u_ssav). This means that the return actually returns from the last routine which did the savu.

You are not expected to understand this.

Dennis Ritchie, source code to the V6 Unix kernel

Warning: this section is not nearly as friendly as the others. I'll make no condescending claims as to what sort of background you should have to understand this. In particular, using parentheses for function application is stupid, and you're about to find out why.

The Rank Theorem

Suppose $m,n,r$ are nonnegative integers, $m \ge r$, $n \ge r$, $\vector{F}$ is a $\mathcal{C}'$-mapping of an open set $E \subseteq \Rn$ into $\Rm$, and $\vector{F}'\of{\vector{x}}$ has rank $r$ for every $\vector{x} \in E$. Fix $\vector{a} \in E$, put $A = \vector{F}'\of{\vector{a}}$, let $Y_{1}$ be the range of $A$, and let $P$ be a projection in $\Rm$ whose range is $Y_{1}$. Let $Y_{2}$ be the null space of $P$. Then there are open sets $U$ and $V$ in $\Rn$, with $\vector{a} \in U$, $U \subseteq E$, and there is a one-to-one $\mathcal{C}'$-mapping $\vector{H}$ of $V$ onto $U$ (whose inverse is also of class $\mathcal{C}'$) such that $$ \begin{equation} \vector{F}\of{\vector{H}\of{\vector{x}}} = A\vector{x} + \phi\of{A\vector{x}} \ \ \ \ \ \ \text{(} \vector{x} \in V \text{)} \label{eq:rank-theorem} \end{equation} $$ where $\phi$ is a $\mathcal{C}'$-mapping of the open set $A\of{V} \subseteq Y_{1}$ into $Y_{2}$.

Well, that's Rudin. Let's use it to prove…

Proposition 2

If $\vector{x_{0}} \in \Rn$ is a feasible point in problem $\eqref{eq:optimization-problem}$, and if each $g_{i}$ and $f$ is continuously-differentiable at $\vector{x_{0}}$, and if $\dim\of{S} = m$, and if $\gradient{f}\of{ \vector{x_{0}} } \notin S$, then $\vector{x_{0}}$ is not optimal—that is, it does not maximize $f$—in problem $\eqref{eq:optimization-problem}$.

Proof. Let's map our notation to that of the rank theorem. First, the $\mathcal{C}'$ thing means “continuously differentiable.” We'll be substituting the following into the statement of the theorem:

  1. $\vector{F} = G$, as defined in $\eqref{eq:definition-of-G}$, so that $m$ and $n$ keep their meanings, and $r = m$, since $\dim\of{S} = m$;
  2. $E$ is some open set containing $\vector{x_{0}}$ on which each $g_{i}$ is continuously-differentiable and on which $G'$ has rank $m$;
  3. $\vector{a} = \vector{x_{0}}$, so that $A = G'\of{\vector{x_{0}}}$ from $\eqref{eq:G-prime-at-x0}$.
We're going to assume that $\vector{x_{0}}$ is optimal this time around. Then, also assuming that each $g_{i}$ and $f$ is continuously-differentiable at $\vector{x_{0}}$ and that $\dim\of{S} = m$, we will show that $\gradient{f}\of{ \vector{x_{0}} } \in S$. The idea here is that we're going to construct $n-m$ differentiable paths through $\vector{x_{0}}$ in the feasible set, one for each “dimension” in $S^{\perp}$. Since $f$ is supposed to be maximized (within the feasible set, anyway) at that point, its gradient should be orthogonal to the paths there. This will lead us to the conclusion that $\gradient{f}\of{\vector{x_{0}}}$ is orthogonal to all of $S^{\perp}$.

First off, since $\vector{H}$ is invertible, we can let $\vector{v_{0}} \in V$ be such that $\vector{H}\of{\vector{v_{0}}} = \vector{x_{0}} \in U$. Then, let $\set{\vector{q_{1}},\vector{q_{2}},\ldots,\vector{q_{n-m}}}$ be a basis for $S^{\perp}$. Since $V$ is an open set, we can pick tiny enough intervals $\intervaloo{-t_{j}}{t_{j}} \subseteq \realline$ and construct $n-m$ differentiable paths in $V$, $$ \left. \begin{align*} & \gamma_{j} : \intervaloo{-t_{j}}{t_{j}} \rightarrow V\\ & \gamma_{j}\of{t} = \vector{v_{0}} + t\vector{q_{j}} \end{align*} \right\rbrace \text{ for } j \in \set{1,2,\ldots,n-m}. $$ Now for each $j$, the composition $\compose{\vector{H}}{\gamma_{j}}$ is a path in $U$ through $\vector{x_{0}}$, since at $t=0$ we will have $\vector{H}\of{\gamma_{j}\of{0}} = \vector{H}\of{\vector{v_{0}}} = \vector{x_{0}}$. As a result, we know that the derivative of $\compose{f}{\compose{\vector{H}}{\gamma_{j}}}$ must be zero at $t=0$; the path is contained entirely in the feasible set, and $\vector{x_{0}}$, which corresponds to $t=0$ along the path, maximizes $f$ within that set. Heed our earlier notation warning, and then let's write that all out: $$ \begin{equation} \forall j \in \set{1,2,\ldots,n-m} : \left\lbrace \begin{aligned} &\ \qty{ \compose{f}{\compose{\vector{H}}{\gamma_{j}}} }'\of{0}\\ = &\ \compose{ f'\of{ \vector{H}\of{\gamma_{j}\of{0}} } } {\qty{ \compose{ \vector{H}'\of{\gamma_{j}\of{0}} } {\gamma_{j}'\of{0}} }}\\ = &\ \compose{ f'\of{ \vector{x_{0}} } } { \compose{ \vector{H}'\of{\vector{v_{0}}} } {\vector{q_{j}} } }\\ = &\ \ip{\gradient{f}\of{\vector{x_{0}}}} {\vector{H}'\of{\vector{v_{0}}}\of{\vector{q_{j}}}}\\ = &\ 0. \end{aligned} \right. \label{eq:f-gradient-inner-products} \end{equation} $$

What we need to do now is figure out what $\vector{H}'\of{\vector{v_{0}}}$ does, because if it were missing from the inner product above, we'd be done! And that's not too hard to figure out, using equation $\eqref{eq:rank-theorem}$ from the rank theorem. First, note that both $$ \begin{equation} \begin{aligned} G'\of{ \vector{x_{0}}} \of{ \vector{v_{0}} } & = \vector{0}\\ \text{ and } & \ \\ \phi\of{ G'\of{ \vector{x_{0}}} \of{ \vector{v_{0}} } } & = \vector{0}, \end{aligned} \label{eq:g-prime-zeros} \end{equation} $$ because we have the following direct sum decomposition of zero itself: $$ \vector{0} = G\of{ \vector{x_{0}} } = G\of{ \vector{H}\of{ \vector{v_{0}} } } = \underset{\in\ \range{G'\of{\vector{x_{0}}}}} {\underbrace{G'\of{ \vector{x_{0}}} \of{ \vector{v_{0}} }}} + \underset{\in\ \range{G'\of{\vector{x_{0}}}}^{\perp}} {\underbrace{\phi\of{ G'\of{ \vector{x_{0}}} \of{ \vector{v_{0}} } }}}. $$ The two summands are orthogonal to one another, and the only way two orthogonal vectors can add up to zero is if each of them is zero on its own; that's exactly what $\eqref{eq:g-prime-zeros}$ says.

Now, using $\eqref{eq:g-prime-zeros}$ and the linearity of $G'\of{ \vector{x_{0}} }$, we see that each $\compose{G}{\compose{\vector{H}}{\gamma_{j}}}$ is identically zero on its domain. (Keep in mind that the rows of $G'\of{ \vector{x_{0}} }$ are the transposed gradients $\gradient{g_{1}}\of{\vector{x_{0}}},\ldots,\gradient{g_{m}}\of{\vector{x_{0}}}$ that span $S$, so $G'\of{ \vector{x_{0}} }\of{\vector{q_{j}}} = \vector{0}$ whenever $\vector{q_{j}} \in S^{\perp}$.) $$ \forall j \in \set{1,2,\ldots,n-m} : \left\lbrace \begin{align*} &\ G\of{\vector{H}\of{\gamma_{j}\of{t}}}\\ = &\ G'\of{\vector{x_{0}}}\of{\gamma_{j}\of{t}} + \phi\of{G'\of{\vector{x_{0}}}\of{\gamma_{j}\of{t}}}\\ = &\ G'\of{\vector{x_{0}}}\of{ \vector{v_{0}} + t\vector{q_{j}} } + \phi\of{G'\of{\vector{x_{0}}}\of{ \vector{v_{0}} + t\vector{q_{j}} }}\\ = &\ t G'\of{\vector{x_{0}}}\of{\vector{q_{j}} } + \phi\of{t G'\of{\vector{x_{0}}}\of{ \vector{q_{j}} }}\\ = &\ t \vector{0} + \phi\of{t \vector{0} }\\ = &\ \vector{0}. \end{align*} \right. $$
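The key step here, $G'\of{\vector{x_{0}}}\of{\vector{q_{j}}} = \vector{0}$, holds because the rows of $G'\of{\vector{x_{0}}}$ are the constraint gradients that span $S$, while $\vector{q_{j}}$ lives in $S^{\perp}$. A quick numerical sketch in Python, with made-up numbers ($n = 3$, $m = 2$, and two invented gradients at $\vector{x_{0}}$):

```python
import numpy as np

# Hypothetical numbers for illustration: n = 3 variables, m = 2
# constraints, so dim(S-perp) = n - m = 1.  Pretend that at x0 the
# constraint gradients (the rows of G'(x0)) are:
G_prime = np.array([[2.0,  2.0, 2.0],   # gradient of g1 at x0
                    [1.0, -1.0, 0.0]])  # gradient of g2 at x0

# A basis vector q1 for S-perp: orthogonal to both rows.
q1 = np.cross(G_prime[0], G_prime[1])

# Component i of G'(x0)(q1) is the inner product of grad g_i(x0) with
# q1, which vanishes because q1 lies in S-perp; so q1 is in the kernel
# of G'(x0).
print(G_prime @ q1)  # ≈ [0, 0]
```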

And finally, since we now know that each $\compose{G}{\compose{\vector{H}}{\gamma_{j}}}$ is identically zero on its domain, we know that their derivatives are zero there too, in particular at $t=0$. Reheed the notation warning, and then let's see what that tells us: $$ \forall j \in \set{1,2,\ldots,n-m} : \left\lbrace \begin{align*} &\ \qty{ \compose{G}{\compose{\vector{H}}{\gamma_{j}}} }'\of{0}\\ = &\ \compose{ G'\of{ \vector{H}\of{\gamma_{j}\of{0}} } } {\qty{ \compose{ \vector{H}'\of{\gamma_{j}\of{0}} } {\gamma_{j}'\of{0}} }}\\ = &\ \compose{ G'\of{ \vector{x_{0}} } } { \compose{ \vector{H}'\of{\vector{v_{0}}} }{ \vector{q_{j}} } }\\ = &\ G'\of{ \vector{x_{0}} } \of{ \vector{H}'\of{ \vector{v_{0}} }\of{ \vector{q_{j}} } }\\ = &\ \vector{0}. \end{align*} \right. $$ By linearity, the fact that this last equality holds for all $\vector{q_{j}}$ means that $$ G'\of{ \vector{x_{0}} } \of{ \vector{H}'\of{\vector{v_{0}}} \of{S^{\perp}} } = \set{ \vector{0} }. $$ The rank theorem tells us that $\vector{H}$ is invertible with a continuously-differentiable inverse, and applying the chain rule to $\compose{\vector{H}^{-1}}{\vector{H}}$ shows that $\vector{H}'\of{\vector{v_{0}}}$ must itself be invertible. Recall that an invertible matrix applied to a subspace gives you back a subspace of the same dimension. The preceding equality tells us that $\vector{H}'\of{\vector{v_{0}}} \of{S^{\perp}}$ is contained in the kernel of $G'\of{ \vector{x_{0}} }$; and, since it also has the same dimension as the kernel of $G'\of{ \vector{x_{0}} }$, it must be the case that $$ \ker\of{G'\of{ \vector{x_{0}} }} = \vector{H}'\of{\vector{v_{0}}}\of{S^{\perp}}. $$ Moreover, visual inspection should convince you that $$ \ker\of{G'\of{ \vector{x_{0}} }} = S^{\perp}. $$ Now look back at the equations $\eqref{eq:f-gradient-inner-products}$, which in hindsight say that $$ \ip{ \gradient{f}\of{\vector{x_{0}}} } { \vector{H}'\of{\vector{v_{0}}}\of{ S^{\perp} } } = \ip{ \gradient{f}\of{\vector{x_{0}}} } { S^{\perp} } = \set{0}. 
$$ At long last, we've shown that $\gradient{f}\of{\vector{x_{0}}} \in \qty{S^{\perp}}^{\perp} = S$.
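The membership $\gradient{f}\of{\vector{x_{0}}} \in S$ is exactly the Lagrange multiplier condition: $\gradient{f}\of{\vector{x_{0}}}$ is some linear combination $\lambda_{1}\gradient{g_{1}}\of{\vector{x_{0}}} + \cdots + \lambda_{m}\gradient{g_{m}}\of{\vector{x_{0}}}$ of the constraint gradients. Here's a Python sketch on a made-up example (maximize $f\of{x,y} = x + y$ on the circle $x^{2} + y^{2} = 2$, whose maximizer is $(1,1)$) that recovers the multiplier numerically:

```python
import numpy as np

# Toy problem (an assumption for illustration): maximize f(x, y) = x + y
# on the circle x^2 + y^2 = 2, with maximizer x0 = (1, 1).
grad_f = np.array([1.0, 1.0])  # gradient of f at x0
grad_g = np.array([2.0, 2.0])  # gradient of the one constraint at x0

# Express grad_f as a multiple of grad_g via least squares; a
# (numerically) zero residual means grad_f really does lie in
# S = span{grad_g}.
lam, _, _, _ = np.linalg.lstsq(grad_g.reshape(-1, 1), grad_f, rcond=None)
print(lam[0])  # ≈ 0.5, the Lagrange multiplier
```

With a single constraint, the single multiplier $\lambda = 1/2$ matches what you'd get by solving the Lagrange system by hand.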