It is a melancholy experience for a professional mathematician to find himself writing about mathematics. The function of a mathematician is to do something, to prove new theorems, to add to mathematics, and not to talk about what he or other mathematicians have done… Exposition, criticism, appreciation, is work for second-rate minds.

G. H. Hardy, in A Mathematician’s Apology

Lagrange multipliers are a way to solve optimization problems (to minimize or maximize some function) subject to equality constraints. For example, if $f$ is a real-valued function defined on the $n$-dimensional real space $\Rn$, we might want to $$ \begin{equation} \begin{aligned} & \text{maximize} & f\of{\vector{x}} & \\ & \text{subject to} & g_{1}\of{\vector{x}} & = 0,\\ & & g_{2}\of{\vector{x}} & = 0,\\ & & \vdots\\ & & g_{m}\of{\vector{x}} & = 0,\\ & \text{and} & \vector{x} & \in \Rn. \end{aligned} \label{eq:optimization-problem} \end{equation} $$

Here, the function $f$ is the thing we want to optimize, and the
collection of functions $g_{1},g_{2},\ldots,g_{m}$ is like a list of
properties that we want the solution—the optimal
point—to have. The functions $g_{1}$ through $g_{m}$ are
called *constraints*, and the whole problem is referred to as
a *constrained optimization problem*.
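To make this concrete, here is a tiny instance of such a problem, checked numerically. The functions $f$ and $g$ below are my own toy choices (maximize $x + y$ on the unit circle), not anything special:

```python
import numpy as np

# A tiny concrete instance of the constrained problem:
#   maximize    f(x, y) = x + y
#   subject to  g(x, y) = x^2 + y^2 - 1 = 0.
f = lambda v: v[0] + v[1]
g = lambda v: v[0]**2 + v[1]**2 - 1.0

# Every feasible point lies on the unit circle, so we can sample the
# circle directly and take the best candidate.
thetas = np.linspace(0.0, 2.0 * np.pi, 100001)
candidates = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
best = candidates[np.argmax(candidates.sum(axis=1))]

# The maximizer should be (1/sqrt(2), 1/sqrt(2)), and it is feasible.
assert np.allclose(best, [np.sqrt(0.5), np.sqrt(0.5)], atol=1e-4)
assert abs(g(best)) < 1e-9
```

Brute-force sampling only works because this feasible set is one-dimensional and compact; the whole point of what follows is to do better.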

The method of
Lagrange multipliers solves
problem $\eqref{eq:optimization-problem}$ and, like most dark
magic, is often presented as mathematics. You invent $m$ new
variables $\lambda_{1},\lambda_{2},\ldots,\lambda_{m}$ and then
define a function $\mathcal{L} : \Rnplusm \rightarrow \realline$ of
all $n+m$ variables, called the *Lagrange function*, and
somehow the gradient of the Lagrange function being zero is
necessary at a solution to the original problem. Usually this hand
waving is accompanied by some pictures of ellipses that only
obscure the main idea:

- There is a necessary condition for optimality in $\eqref{eq:optimization-problem}$ that looks easy to prove.
- But it isn't.
- The function $\gradient{\mathcal{L}}$ is an obnoxious way of packing that condition onto one line.

Most people give you either the intuition (*looks* easy to
prove), or the formal proof that uses some powerful calculus stuff
for no apparent reason. I won't name names (WIKIPEDIA), but a few
people try to pass off the intuition *as* a proof. This
article is about the gap between the two. Specifically, what is
missing from the intuitive explanation that we need for a formal
proof? To that end, I'm going to give an erroneous proof of the
following theorem:

I *said* it was obnoxious. The idea is that the shortcomings
in the non-proof will illuminate the need for more powerful tools.

*G. H. Hardy, brilliant mathematician and occasional jackass* — Ben Orlin

From linear algebra, you will need to know what an inner product
$\ip{\vector{v}}{\vector{w}}$ between two vectors $\vector{v}$ and
$\vector{w}$ is. The “dot” operation in $\alpha \cdot
\vector{x}$ means *scalar* multiplication; the vector
dot product is used only by women and children. The norm of
a vector $\vector{x}$ is $\sqrt{ \ip{\vector{x}}{\vector{x}} }$ and
is written $\norm{\vector{x}}$. If $V$ is a vector space
and if $W$ is a subspace of $V$, then $V$ is a direct sum of $W$ and
its orthogonal
complement $W^{\perp}$, written
$$
V = \directsum{W}{W^{\perp}}.
$$
This means that any $\vector{v} \in V$ can be written uniquely as a
sum $\vector{v} = \vector{w_{1}} + \vector{w_{2}}$ where
$\vector{w_{1}} \in W$ and $\vector{w_{2}} \in W^{\perp}$ are
orthogonal. The span of a subset $X \subseteq V$ is the smallest
subspace of $V$ containing $X$, and is denoted by
$\spanof{X}$. Equivalently, $\spanof{X}$ is the set of all linear
combinations in $V$ of the elements of $X$. The book *Linear
Algebra Done Right* by Sheldon Axler is an outstanding reference
for all of this.
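The direct sum decomposition is easy to compute in coordinates; here is a minimal numerical sketch (the vectors are arbitrary examples of my own) that splits a vector into its $W$ and $W^{\perp}$ components by orthogonal projection:

```python
import numpy as np

# Decompose v in R^3 against W = span{u1, u2}: v = w1 + w2 with
# w1 in W and w2 in W-perp.
u1 = np.array([1.0, 0.0, 1.0])
u2 = np.array([0.0, 1.0, 1.0])
v = np.array([3.0, 2.0, 1.0])

B = np.column_stack([u1, u2])        # basis of W as matrix columns
coeffs, *_ = np.linalg.lstsq(B, v, rcond=None)
w1 = B @ coeffs                      # orthogonal projection onto W
w2 = v - w1                          # the W-perp component

assert np.allclose(w1 + w2, v)       # v = w1 + w2, uniquely
assert np.allclose(B.T @ w2, 0.0)    # w2 is orthogonal to all of W
```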

Now, calculus. So sorry. First, all vectors are *column*
vectors, so $\colvec{x_{1},x_{2},\ldots,x_{n}}$ is what an element
of $\Rn$ looks like.

**Notation warning**: for certain things to make sense
we should really be using $\Rnbyone$ instead of $\Rn$, but they're
isomorphic and it's standard practice in mathematics to not make any
sense, so we'll stick with $\Rn$. In that same spirit, we'll use
$\realline$ in most places where $\Ronebyone$ would be more
appropriate.

The gradient of a function $f : \Rn \rightarrow \realline$ is another function $$ \gradient{f} : \Rn \rightarrow \Rn $$ defined by $$ \gradient{f}\of{ \colvec{x_{1}, x_{2}, \ldots, x_{n}} } = \begin{bmatrix} f_{x_{1}}\of{\colvec{x_{1}, x_{2}, \ldots, x_{n}}}\\ f_{x_{2}}\of{\colvec{x_{1}, x_{2}, \ldots, x_{n}}}\\ \vdots\\ f_{x_{n}}\of{\colvec{x_{1}, x_{2}, \ldots, x_{n}}} \end{bmatrix}. $$ The symbol $f_{x_{i}}$ denotes the partial derivative of $f$ with respect to the variable $x_{i}$, and is somewhat standard. More precisely, if $f$ is a function of $\colvec{x_{1}, x_{2}, \ldots, x_{n}}$, then $f_{x_{i}}$ is the directional derivative of $f$ in the direction of the $i$th standard basis vector in $\Rn$. To save space, we will write a single $\vector{x}$ to represent the entire column vector $\colvec{x_{1}, x_{2}, \ldots, x_{n}}$. With this new notation, the formula above can be abbreviated, $$ \gradient{f}\of{ \vector{x} } = \begin{bmatrix} f_{x_{1}}\of{\vector{x}}\\ f_{x_{2}}\of{\vector{x}}\\ \vdots\\ f_{x_{n}}\of{\vector{x}} \end{bmatrix}. $$
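If you ever doubt a gradient computation, you can approximate each partial derivative by a difference quotient in the corresponding standard basis direction. A minimal sketch, with a toy $f$ of my own:

```python
import numpy as np

def gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences,
    one partial derivative per standard basis direction."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

# Example: f(x) = x1^2 + 3*x1*x2 has gradient (2*x1 + 3*x2, 3*x1).
f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]
assert np.allclose(gradient(f, [1.0, 2.0]), [8.0, 3.0], atol=1e-5)
```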

At any point $\vector{x_{0}} \in \Rn$, we can express $f$ in terms of
its Taylor
series expansion,
$$
\begin{equation}
f\of{ \vector{x_{0}} + \vector{p} }
=
f\of{ \vector{x_{0}} }
+ \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\vector{p}}
+ \text{ other stuff}.
\label{eq:taylor-series-f}
\end{equation}
$$
By choosing $\vector{p}$ wisely in
equation $\eqref{eq:taylor-series-f}$, one can find the value
of $f$ at any point in $\Rn$. The “other stuff” in
equation $\eqref{eq:taylor-series-f}$ is some remainder term
that goes to zero as the norm of $\vector{p}$ goes to zero. Since
the remainder goes to zero, we can simply drop it from
equation $\eqref{eq:taylor-series-f}$ to obtain what is called
a *first-order approximation* of $f$ that is valid whenever
$\norm{\vector{p}}$ is small:

$$ \begin{equation} f\of{ \vector{x_{0}} + \vector{p} } \approx f\of{ \vector{x_{0}} } + \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\vector{p}}. \label{eq:linear-approx-f} \end{equation} $$ The “approximately equals” relationship here means that by choosing a $\vector{p}$ with a small enough norm on the left-hand side, we can get as close as we want to the value on the right-hand side. In particular, if the value on the right-hand side is positive, then we can make the left-hand side positive.

The approximation $\eqref{eq:linear-approx-f}$ tells us what happens when we move a small distance (corresponding to the vector $\vector{p}$) away from the starting point $\vector{x_{0}}$. In particular, it tells us that the change in $f$ near $\vector{x_{0}}$ depends only on the inner product $\ip{ \gradient{f}\of{ \vector{x_{0}} }}{\vector{p}}$ which is zero if and only if $\vector{p}$ and $\gradient{f}\of{ \vector{x_{0}}}$ are orthogonal. In other words, $$ f\of{ \vector{x_{0}} + \vector{p} } - f\of{ \vector{x_{0}} } \approx 0 \iff \vector{p} \in \spanof{\set{\gradient{f}\of{ \vector{x_{0}} }}}^{\perp}. $$
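Both claims are easy to watch happen numerically. In this sketch (again, $f$ and $\vector{x_{0}}$ are my own toy choices), the first-order term predicts the change in $f$ up to a remainder of order $\norm{\vector{p}}^{2}$, and a direction orthogonal to the gradient produces no first-order change at all:

```python
import numpy as np

# f(x) = x1^2 + x2^2 at x0 = (1, 2), where grad f(x0) = (2, 4).
f = lambda x: x[0]**2 + x[1]**2
x0 = np.array([1.0, 2.0])
grad = np.array([2.0, 4.0])

p = 1e-4 * np.array([1.0, 1.0])        # a generic small direction
lhs = f(x0 + p) - f(x0)
rhs = grad @ p
assert abs(lhs - rhs) < 1e-7           # remainder is O(norm(p)^2)

# A direction orthogonal to the gradient: no first-order change.
q = 1e-4 * np.array([-4.0, 2.0]) / np.linalg.norm([-4.0, 2.0])
assert abs(grad @ q) < 1e-12
assert abs(f(x0 + q) - f(x0)) < 1e-7   # only second-order leftovers
```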

This material can be found in any book on multivariable
calculus. Khan Academy has an
introductory article about the gradient, which is a special case
of the
Jacobian matrix. A thorough treatment is given in *Advanced
Calculus of Several Variables* by C.H. Edwards, Jr. You are also
welcome to try the *Principles of Mathematical Analysis* by
Walter Rudin, but wear a helmet.

Theorem 1 states precise differentiability
requirements, but for simplicity, in this section, you can just
assume that all functions are smooth. Recall from kindergarten that
if $\vector{x_{0}}$ is a global maximum of $f$, then we must have
$\gradient{f}\of{ \vector{x_{0}} } = \vector{0}$. In fact, this
follows from our first-order approximation of $f$. To see why,
suppose that $\gradient{f}\of{ \vector{x_{0}} }$ is *not*
zero. In that case, we can let $\vector{p} = \alpha \cdot
\gradient{f}\of{ \vector{x_{0}} }$ for some small positive real
number $\alpha$ in $\eqref{eq:linear-approx-f}$ to find
$$
\begin{align*}
f\of{ \vector{x_{0}} + \alpha \cdot \gradient{f}\of{ \vector{x_{0}} } }
& \approx
f\of{ \vector{x_{0}} }
+ \ip{ \gradient{f}\of{ \vector{x_{0}} } }
{ \alpha \cdot \gradient{f}\of{ \vector{x_{0}} }} \\
& =
f\of{ \vector{x_{0}} }
+ \alpha \norm{\gradient{f}\of{ \vector{x_{0}} }}^{2}\\
& > f\of{ \vector{x_{0}} }.
\end{align*}
$$
The small positive scalar multiple is necessary
because $\eqref{eq:linear-approx-f}$ is only valid close to
$\vector{x_{0}}$. The inequality above would contradict the fact
that $\vector{x_{0}}$ is a global maximum of $f$, so it must be the
case that $\gradient{f}\of{ \vector{x_{0}} } = \vector{0}$ at such a
point. Viewed the other way around, having $\gradient{f}\of{
\vector{x_{0}} } = \vector{0}$ ensures that there is no direction
$\vector{p}$ that will increase the value of $f$.
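The gradient-step argument is concrete enough to run. With a toy function of my own that has a known maximizer, a small step along a nonzero gradient increases $f$, while at the true maximizer no small coordinate step does:

```python
import numpy as np

# f(x) = -x1^2 - x2^2 + x1 has its global maximum at (1/2, 0).
f = lambda x: -x[0]**2 - x[1]**2 + x[0]
x0 = np.array([1.0, 1.0])
grad = np.array([-2.0 * x0[0] + 1.0, -2.0 * x0[1]])   # = (-1, -2) != 0

alpha = 1e-3
assert f(x0 + alpha * grad) > f(x0)   # so x0 cannot be a global max

# At the maximizer the gradient vanishes and no small step helps.
xstar = np.array([0.5, 0.0])
for p in np.eye(2):
    assert f(xstar + alpha * p) <= f(xstar)
    assert f(xstar - alpha * p) <= f(xstar)
```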

Ok. That's all well and good, but what does finding a global maximum
of $f$ have to do with the constrained
problem $\eqref{eq:optimization-problem}$? Well, we would like
to find a necessary condition for a point $\vector{x_{0}}$ to be
optimal in the constrained problem. And, after staring
at $\eqref{eq:optimization-problem}$ long enough, one necessary
condition reveals itself: we must have $g_{i}\of{\vector{x_{0}}} =
0$ for each constraint function $g_{i}$, because the problem says
so. Points satisfying all of the constraints in an optimization
problem are called *feasible* points. The set of feasible
points in problem $\eqref{eq:optimization-problem}$ is $\setc{
\vector{x} \in \Rn }{ g_{1}\of{\vector{x}} = g_{2}\of{\vector{x}} =
\cdots = g_{m}\of{\vector{x}} = 0 }$. Thus we can restate: if a
point $\vector{x_{0}}$ is optimal
in $\eqref{eq:optimization-problem}$, then it must be feasible.

That's one necessary condition, but we really want one involving
$f$. If we already know that $\vector{x_{0}}$ is feasible, then in a
sense, the necessary condition involving $f$ should be
“easier” than the condition for the unconstrained
problem. There we needed $\gradient{f}\of{ \vector{x_{0}} } =
\vector{0}$ to ensure that no direction would increase the value of
$f$. But in the constrained problem, if we start at a feasible point
$\vector{x_{0}}$, then we only need to ensure that $f$ doesn't
increase in *certain* directions—namely those that will
move us to another feasible point. This is the same sense in which
it's easier to pick all winning lottery numbers than it is to pick
all losing ones.

More concretely, at a feasible point $\vector{x_{0}}$, we only have to worry about the directions $\vector{p}$ with $$ \begin{equation} g_{i}\of{\vector{x_{0}} + \vector{p}} = 0 \text{ for all } i \in \set{1,2,\ldots,m}, \label{eq:remain-feasible-g} \end{equation} $$ because otherwise, $\vector{x_{0}} + \vector{p}$ will be an infeasible point: for some $g_{i}$, we would have $g_{i}\of{\vector{x_{0}} + \vector{p}} \ne 0$. Recall our first-order approximation, this time using $g_{i}$ instead of $f$: $$ g_{i}\of{ \vector{x_{0}} + \vector{p} } \approx g_{i}\of{ \vector{x_{0}} } + \ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p}} \text{ for all } i \in \set{1,2,\ldots,m}. $$ Since each $g_{i}\of{ \vector{x_{0}} }$ is zero by the assumption that $\vector{x_{0}}$ is feasible, we conclude that $$ \begin{equation} g_{i}\of{ \vector{x_{0}} + \vector{p} } \approx \ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p}} \text{ for all } i \in \set{1,2,\ldots,m}. \label{eq:linear-approx-g} \end{equation} $$ After combining $\eqref{eq:remain-feasible-g}$ and $\eqref{eq:linear-approx-g}$, it should become clear that we only have to worry about those $\vector{p}$ such that, for all $i \in \set{1,2,\ldots,m}$, we have $\ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p}} = 0$. Or to rephrase, we are only worried about the vectors $\vector{p}$ that are orthogonal to every element of the set $\setc{ \gradient{g_{i}}\of{\vector{x_{0}}} }{ i \in \set{1,2,\ldots,m} }$, and thus to its span. Because, to really hammer this home, $$ \begin{equation} \begin{aligned} g_{i}\of{\vector{x_{0}} + \vector{p}} \approx & \ \ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p}} \approx \text{ nonzero, for some } i \in \set{1,2,\ldots,m}\\ & \ \text{ if } \vector{p} \notin \spanof{ \setc{ \gradient{g_{i}}\of{ \vector{x_{0}} } } { i \in \set{1,2,\ldots,m} } }^{\perp}. \end{aligned} \label{eq:first-span-appearance} \end{equation} $$

The “span” in $\eqref{eq:first-span-appearance}$ is going to appear a lot from now on, so let's give it a name. Whenever problem $\eqref{eq:optimization-problem}$ is in scope we will let $$ \begin{equation} S = \spanof{ \setc{ \gradient{g_{i}}\of{ \vector{x_{0}} } } { i \in \set{1,2,\ldots,m} } }, \label{eq:definition-of-S} \end{equation} $$ and we can immediately put this notation to good use. The approximation of $f$ in $\eqref{eq:linear-approx-f}$ held for all directions $\vector{p}$, so it holds in particular for the directions that we just decided we care about: $$ \begin{equation*} f\of{ \vector{x_{0}} + \vector{p} } \approx f\of{ \vector{x_{0}} } + \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\vector{p}} \text{ for all } \vector{p} \in S^{\perp}. \end{equation*} $$ (I'm not going to keep saying it, but in all of these approximations, the norm of $\vector{p}$ must be small. It doesn't really matter. If you have a vector in mind, scale it by a small positive amount to make its norm small.) By analogy with what we did earlier, we need to ensure that this is not greater than $f\of{ \vector{x_{0}} }$, and to do that, we'll need $$ \begin{equation*} \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\vector{p}} = 0 \text{ for all } \vector{p} \in S^{\perp} \iff \gradient{f}\of{ \vector{x_{0}} } \in \qty{S^{\perp}}^{\perp} = S. \end{equation*} $$ Since the span of a set of vectors is the set of all linear combinations of those vectors, this last condition is saying that there exist real numbers $\lambda_{1},\lambda_{2},\ldots,\lambda_{m}$ such that $\gradient{f}\of{ \vector{x_{0}} }$ can be written as the linear combination $\sum_{i=1}^{m} \lambda_{i} \cdot \gradient{g_{i}}\of{ \vector{x_{0}} }$. And that is our necessary condition. Note that this is, in some sense, an easier job than we had before when we wanted a global maximum.
Here, $\lambda_{1} = \lambda_{2} = \cdots = \lambda_{m} = 0$ will work whenever $\gradient{f}\of{\vector{x_{0}}} = \vector{0}$, recovering the condition that we had for a global maximum. But now, some other non-zero values of the $\lambda_{i}$ might work too. So, there are (in general) more of them, and they should be easier to find.
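Finding the $\lambda_{i}$ is just linear algebra: stack the constraint gradients as columns and solve. A sketch on the circle toy problem from before (maximize $x+y$ on $x^2 + y^2 = 1$, my own example with a single constraint, $m = 1$):

```python
import numpy as np

# At the maximizer x0 = (1/sqrt(2), 1/sqrt(2)), the necessary
# condition says grad f(x0) is a linear combination of the
# constraint gradients. Solve for the lambdas by least squares.
x0 = np.array([np.sqrt(0.5), np.sqrt(0.5)])
grad_f = np.array([1.0, 1.0])           # gradient of f(x, y) = x + y
grad_g = 2.0 * x0                       # gradient of x^2 + y^2 - 1

A = grad_g.reshape(-1, 1)               # columns = constraint gradients
lam, *_ = np.linalg.lstsq(A, grad_f, rcond=None)

assert np.allclose(A @ lam, grad_f)     # grad f(x0) really lies in S
assert np.isclose(lam[0], 1.0 / np.sqrt(2.0))
```

At a non-optimal feasible point the least-squares residual would be nonzero: $\gradient{f}$ would stick out of $S$.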

In the introduction, I made three claims about problem $\eqref{eq:optimization-problem}$ and the Lagrange function $\mathcal{L}$:

- There is a necessary condition for optimality in $\eqref{eq:optimization-problem}$ that looks easy to prove.
- But it isn't.
- The function $\gradient{\mathcal{L}}$ is an obnoxious way of packing that condition onto one line.

Keeping in mind the definition of the subspace $S$ from $\eqref{eq:definition-of-S}$, let's get to the point, and prove Theorem 1 based on our intuition.

*Proof.*
Assume that $\gradient{f}\of{ \vector{x_{0}} } \notin S$, and
decompose the ambient space $\Rn$ into two orthogonal subspaces,
$$
\Rn
=
\directsum{S}{S^{\perp}}.
$$

Having done so, we can express $\gradient{f}\of{ \vector{x_{0}} }$ as the sum of two vectors, one from each component in the direct sum above:

$$ \begin{equation*} \gradient{f}\of{ \vector{x_{0}} } = \vector{p_{1}} + \vector{p_{2}}, \text{ where } \vector{p_{1}} \in S \text{ and } \vector{p_{2}} \in S^{\perp}. \end{equation*} $$

And since we have assumed that $\gradient{f}\of{ \vector{x_{0}} } \notin S$, we can conclude that $\vector{p_{2}} \ne \vector{0}$, because, otherwise, $\gradient{f}\of{ \vector{x_{0}} } = \vector{p_{1}}$, and $\vector{p_{1}}$ lies in $S$ contrary to that assumption.

Now we're going to substitute $\vector{p_{2}}$ into our linear approximations $\eqref{eq:linear-approx-f}$ and $\eqref{eq:linear-approx-g}$ for $f$ and $g$, after scaling it by some small positive $\alpha$ to ensure that the approximations are valid: $$ \begin{equation} \begin{aligned} f\of{ \vector{x_{0}} + \alpha\vector{p_{2}} } & \approx f\of{ \vector{x_{0}} } + \ip{ \gradient{f}\of{ \vector{x_{0}} } }{\alpha\vector{p_{2}}}\\ & = f\of{ \vector{x_{0}} } + \ip{ \vector{p_{1}} + \vector{p_{2}} }{\alpha\vector{p_{2}}}\\ & = f\of{ \vector{x_{0}} } + \alpha \norm{\vector{p_{2}}}^{2}\\ & > f\of{ \vector{x_{0}} }, \end{aligned} \label{eq:f-contradiction} \end{equation} $$ $$ \begin{equation} g_{i}\of{ \vector{x_{0}} + \alpha\vector{p_{2}} } \approx \alpha\ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p_{2}}} = 0, \text{ for all } i \in \set{1,2,\ldots,m}. \label{eq:g-contradiction} \end{equation} $$ The inequality in $\eqref{eq:f-contradiction}$ shows that if we move in the direction $\vector{p_{2}}$ from the feasible starting point $\vector{x_{0}}$, then we increase the value of $f$. And $\eqref{eq:g-contradiction}$ shows that the resulting point will remain feasible. Thus, the existence of $\vector{p_{2}}$ shows that $\vector{x_{0}}$ does not maximize $f$ in $\eqref{eq:optimization-problem}$. ∎

I won't keep you in suspense: the error is in that proof. The proposition itself is true, but the proof is no good. If you were paying attention, you might have noticed that we didn't use the assumption that our functions are continuously-differentiable anywhere; nor did we use the fact that $\dim\of{S} = m$. That should set off some alarms. The next two results are also true, and their proofs… are not wrong on purpose. The following is merely a logical consequence of Proposition 2, after taking into consideration that all solutions are feasible.

All that remains is to show that the Lagrange function $\mathcal{L}$ is but gift wrapping. Below we restate Theorem 1, having replaced the condition that $\setc{ \gradient{g_{i}}\of{\vector{x_{0}}} }{ i \in \set{1,2,\ldots,m} }$ be linearly-independent with the equivalent statement that $\dim\of{S} = m$.

*Proof.*
Assuming that $\mathcal{L}$ is defined as above, we will proceed by
proving that the following are equivalent:

- There exist $\lambda_{1},\ldots,\lambda_{m} \in \realline$ such that $$ \gradient{\mathcal{L}}\of{ \colvec{ x_{1},x_{2},\ldots,x_{n},\lambda_{1},\lambda_{2},\ldots,\lambda_{m} } } = \vector{0}. $$
- $\gradient{f}\of{ \vector{x_{0}} } \in S$ and $g_{i}\of{ \vector{x_{0}} } = 0$ for all $i \in \set{1,2,\ldots,m}$.

In other words, we will show that the condition in the theorem that involves $\gradient{\mathcal{L}}$ is the same as the condition in Corollary 3. The result then follows from that corollary, which we already believe. The equivalence is not hard to see; simply write out what the first condition means: $$ \begin{gather*} \exists \lambda_{1},\lambda_{2},\ldots,\lambda_{m} \in \realline : \gradient{\mathcal{L}}\of{ \colvec{x_{1}, x_{2}, \ldots,x_{n}, \lambda_{1}, \lambda_{2}, \ldots, \lambda_{m} } } = \vector{0}\\ \Updownarrow\\ \exists \lambda_{1},\lambda_{2},\ldots,\lambda_{m} \in \realline : \left\lbrace \begin{aligned} f_{x_{1}}\of{ \vector{x_{0}} } - \sum_{i=1}^{m} \lambda_{i}\qty{g_{i}}_{x_{1}}\of{ \vector{x_{0}} } & = 0\\ f_{x_{2}}\of{ \vector{x_{0}} } - \sum_{i=1}^{m} \lambda_{i}\qty{g_{i}}_{x_{2}}\of{ \vector{x_{0}} } & = 0\\ & \vdots\\ f_{x_{n}}\of{ \vector{x_{0}} } - \sum_{i=1}^{m} \lambda_{i}\qty{g_{i}}_{x_{n}}\of{ \vector{x_{0}} } & = 0\\ - g_{1}\of{\vector{x_{0}}} & = 0\\ - g_{2}\of{\vector{x_{0}}} & = 0\\ & \vdots\\ - g_{m}\of{\vector{x_{0}}} & = 0 \end{aligned}\right.\\ \Updownarrow\\ \exists \lambda_{1},\lambda_{2},\ldots,\lambda_{m} \in \realline : \left\lbrace \begin{aligned} \gradient{f}\of{ \vector{x_{0}} } & = \sum_{i=1}^{m} \lambda_{i} \cdot \gradient{g_{i}}\of{ \vector{x_{0}}}\\ g_{1}\of{ \vector{x_{0}} } & = 0\\ g_{2}\of{ \vector{x_{0}} } & = 0\\ & \vdots\\ g_{m}\of{ \vector{x_{0}} } & = 0\\ \end{aligned}\right.\\ \Updownarrow\\ \gradient{f}\of{ \vector{x_{0}} } \in \spanof{ \setc{\gradient{g_{i}}\of{ \vector{x_{0}} }}{i \in \set{1,2,\ldots,m}} }\\ \text{ and }\\ g_{i}\of{ \vector{x_{0}} } = 0 \text{ for all } i \in \set{1,2,\ldots,m}\\ \Updownarrow\\ \gradient{f}\of{ \vector{x_{0}} } \in S \text { and } g_{i}\of{ \vector{x_{0}} } = 0 \text{ for all } i \in \set{1,2,\ldots,m}. \end{gather*} $$ This shows that the two conditions are equivalent. All done. ∎
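The equivalence is also easy to check numerically on the circle toy problem (my own example): build $\mathcal{L}\of{x, \lambda} = f\of{x} - \lambda g\of{x}$ and verify that its full gradient vanishes at the optimal $(x_{0}, \lambda)$ pair.

```python
import numpy as np

f = lambda x: x[0] + x[1]
g = lambda x: x[0]**2 + x[1]**2 - 1.0
L = lambda z: f(z[:2]) - z[2] * g(z[:2])   # z = (x1, x2, lambda)

# The optimal point and multiplier for this instance.
z0 = np.array([np.sqrt(0.5), np.sqrt(0.5), 1.0 / np.sqrt(2.0)])

def num_grad_L(h=1e-6):
    grad = np.zeros(3)
    for i in range(3):
        e = np.zeros(3)
        e[i] = h
        grad[i] = (L(z0 + e) - L(z0 - e)) / (2.0 * h)
    return grad

# All n + m = 3 components vanish: the first two encode
# grad f = lambda * grad g, and the last encodes g(x0) = 0.
assert np.allclose(num_grad_L(), 0.0, atol=1e-6)
```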

In the proof of Proposition 2, we
derived approximation $\eqref{eq:g-contradiction}$ which bears
repeating:
$$
\begin{equation*}
g_{i}\of{ \vector{x_{0}} + \alpha\vector{p_{2}} }
\approx
\alpha\ip{ \gradient{g_{i}}\of{ \vector{x_{0}} } }{\vector{p_{2}}} = 0,
\text{ for all }
i \in \set{1,2,\ldots,m}.
\end{equation*}
$$
This showed that some small scalar multiple of $\vector{p_{2}}$
would take us to another feasible
point. *Weeeeeeeeeeeeeelllllllllllllllll* not quite:

The “approximately equals” relationship here means that by choosing a $\vector{p_{2}}$ with a small enough norm on the left-hand side, we can get as close as we want to the value on the right-hand side.

me, like ten minutes ago

If $g_{i}\of{ \vector{x_{0}} + \alpha\vector{p_{2}} }$ is “approximately equal” to a positive number, then we can make it close to that positive number. Things close to a positive number are positive; so, we can make $g_{i}\of{ \vector{x_{0}} + \alpha\vector{p_{2}} }$ positive in that case. Likewise, if it's approximately equal to a negative number, then we can make it negative by choosing $\alpha$ small enough. But, here's the catch: being able to make something approximately zero is not the same as being able to make it actually zero!

We needed to find a counterexample that was truly feasible, but instead found one that was only close to feasible.
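Here is the failure on a concrete constraint of my own choosing: $g\of{x, y} = x^2 + y^2 - 1$ at the feasible point $\vector{x_{0}} = (1, 0)$, with the tangent direction $\vector{p_{2}} = (0, 1)$ orthogonal to $\gradient{g}\of{\vector{x_{0}}} = (2, 0)$. The first-order approximation says $g$ stays approximately zero along $\vector{p_{2}}$, but it is never actually zero:

```python
import numpy as np

g = lambda v: v[0]**2 + v[1]**2 - 1.0
x0 = np.array([1.0, 0.0])
p2 = np.array([0.0, 1.0])     # orthogonal to grad g(x0) = (2, 0)

for alpha in [1e-1, 1e-2, 1e-3]:
    value = g(x0 + alpha * p2)
    # The constraint violation is exactly alpha^2: it shrinks to
    # zero as alpha does, but the point is always strictly infeasible.
    assert np.isclose(value, alpha**2, rtol=1e-6)
    assert value > 0.0
```

The tangent line grazes the circle at $\vector{x_{0}}$ and then leaves it; "approximately feasible" is the best the linearization can promise.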

Death is the solution to all problems.

Joseph Stalin

Keep in mind that we're coming up with *necessary* conditions
for optimality, and they're not unique. Some are better, some are
worse; some are better *and* worse. What I'm trying to say
is, there's more than one way to proceed. *One* way would be
to find a better counterexample: we need a $\vector{p_{2}}$
in $\eqref{eq:g-contradiction}$ that moves us to a truly
feasible point $\vector{x_{0}} + \vector{p_{2}}$; that is, one that
has $g_{i}\of{\vector{x_{0}} + \vector{p_{2}}} = 0$ for all $i$. But
we also want that same point to increase the value of $f$ relative
to the starting point $\vector{x_{0}}$. Taken together, these aren't
such easy goals.

Fortunately, calculus can do the heavy lifting for us, ensuring that such a counterexample would exist in all of the directions we care about. This approach was hinted at by the extra assumptions that I tacked on to all of the results, namely that the functions all be continuously-differentiable at $\vector{x_{0}}$, and that the set $\setc{ \gradient{g_{i}}\of{\vector{x_{0}}} }{ i \in \set{1,2,\ldots,m} }$ be linearly-independent.

If we define the function $G : \Rn \rightarrow \Rm$ by $$ \begin{equation} G\of{\vector{x}} = \begin{bmatrix} g_{1}\of{ \vector{x} }\\ g_{2}\of{ \vector{x} }\\ \vdots\\ g_{m}\of{ \vector{x} } \end{bmatrix}, \label{eq:definition-of-G} \end{equation} $$ then the derivative of $G$ at $\vector{x_{0}}$, where it is continuous, is $$ \begin{equation} G'\of{\vector{x_{0}}} = \begin{bmatrix} \transpose{\gradient{g_{1}}\of{ \vector{x_{0}} }}\\ \transpose{\gradient{g_{2}}\of{ \vector{x_{0}} }}\\ \vdots\\ \transpose{\gradient{g_{m}}\of{ \vector{x_{0}} }} \end{bmatrix}. \label{eq:G-prime-at-x0} \end{equation} $$ The linear-independence of its rows means that $m \le n$ and that $G'\of{\vector{x_{0}}}$ has full row rank $m$; in particular, it maps $\Rn$ onto $\Rm$, though it is not invertible unless $m = n$.
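Assembling $G'\of{\vector{x_{0}}}$ and checking the rank condition is mechanical. A sketch with two constraints of my own ($g_{1} = x^2 + y^2 + z^2 - 1$ and $g_{2} = z$, at the feasible point $(1, 0, 0)$):

```python
import numpy as np

x0 = np.array([1.0, 0.0, 0.0])
grad_g1 = 2.0 * x0                       # gradient of x^2+y^2+z^2-1
grad_g2 = np.array([0.0, 0.0, 1.0])      # gradient of z

# Rows of G'(x0) are the transposed constraint gradients.
G_prime = np.vstack([grad_g1, grad_g2])
m, n = G_prime.shape

assert m <= n                                 # 2 constraints in R^3
assert np.linalg.matrix_rank(G_prime) == m    # full row rank, not square
```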

So now we decompose the space into $\Rn = \directsum{S}{S^{\perp}}$ again, and by cleverly choosing what is what, we can apply some powerful theorem like

Wielded skillfully, these two theorems tell you exactly what you want to know. The feasible set is a nice smooth surface on which we can draw a bunch of continuous paths, all passing through the point $\vector{x_{0}}$. The derivative of $f$ at $\vector{x_{0}}$ along any of those paths (by the chain rule) must be zero, because the value of $f$ can't change on the surface near $\vector{x_{0}}$. It turns out that said paths all pass through $\vector{x_{0}}$ in a direction contained in $S^{\perp}$, and so we wind up concluding that $\gradient{f}\of{\vector{x_{0}}}$ must be orthogonal to everything in $S^{\perp}$ to avoid deriving a contradiction. This proves Proposition 2, and the rest follows. See the appendix for the details.
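The path idea is easiest to see on the single-constraint toy problem from earlier (my own example): the feasible set of $g = x^2 + y^2 - 1$ is the circle $\gamma\of{t} = (\cos t, \sin t)$, and $f = x + y$ is maximized on it at $t = \pi/4$.

```python
import numpy as np

t_star = np.pi / 4.0
gamma = lambda t: np.array([np.cos(t), np.sin(t)])
gamma_dot = lambda t: np.array([-np.sin(t), np.cos(t)])
grad_f = np.array([1.0, 1.0])

# By the chain rule, (f o gamma)'(t) = <grad f, gamma'(t)>, and it
# must vanish at the maximizer along the feasible path.
assert np.isclose(grad_f @ gamma_dot(t_star), 0.0)

# The path's direction at the maximizer lies in S-perp: it is
# orthogonal to grad g(x0) = 2 * x0.
grad_g = 2.0 * gamma(t_star)
assert np.isclose(grad_g @ gamma_dot(t_star), 0.0)
```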

The paper A short elementary proof of the Lagrange multiplier theorem by Brezhneva, Tret'yakov, and Wright tries to get by without any of the powerful theorems that I just said you needed. But, to motivate doing so, they cite other authors who have used…

- the Brouwer fixed-point theorem,
- the convex interior mapping theorem (based on the Brouwer fixed-point theorem),
- the Ekeland variational principle,
- a penalty function and repeated applications of the Bolzano-Weierstrass theorem.

Chapter 5 of Convex Analysis and Optimization uses a penalty function to prove slightly stronger necessary conditions than the usual ones, and (based on the bibliography) the technique probably goes back to the McShane paper The Lagrange multiplier rule cited by those other authors.

(Please, use Library Genesis and Sci-Hub to find these references.)

In any case, you do have to tack on some extra assumptions to make
things work. Doing so has become a cottage industry: these
assumptions are called *constraint qualifications*, and they
bought a zoo.

If the new process paused because it was swapped out, set the stack level to the last call to savu(u_ssav). This means that the return actually returns from the last routine which did the savu.

You are not expected to understand this.

Dennis Ritchie, source code to the V6 Unix kernel

**Warning: this section is not nearly as friendly as the others. I'll make no condescending claims as to what sort of background you should have to understand this. In particular, using parentheses for function application is stupid, and you're about to find out why.**

Well, that's Rudin. Let's use it to prove…

*Proof.*
Let's map our notation to that of the rank theorem. First, the
$\mathcal{C}'$ thing means “continuously differentiable.”
We'll be substituting the following into the statement of the
theorem:

- His $m$ and $n$ are the same as ours, and $r = m$ for us because $r$ should be the rank of the matrix $G'\of{\vector{x_{0}}}$, whose $m$ rows we've assumed are linearly-independent.
- His $\vector{a}$ is our $\vector{x_{0}}$.
- His $\vector{F}$ is our $G$ from $\eqref{eq:definition-of-G}$.
- As a result, his $A$ is our $G'\of{\vector{x_{0}}}$ from $\eqref{eq:G-prime-at-x0}$.
- His $Y_{1}$ is our $\range{G'\of{\vector{x_{0}}}}$, and his $Y_{2}$ is our $\range{G'\of{\vector{x_{0}}}}^{\perp}$.

We're going to assume that $\vector{x_{0}}$ *is* optimal this
time around. Then, also assuming that each $g_{i}$ and $f$ is
continuously-differentiable at $\vector{x_{0}}$ and that $\dim\of{S}
= m$, we will show that $\gradient{f}\of{ \vector{x_{0}} } \in S$.
The idea here is that we're going to construct $n-m$ differentiable
paths through $\vector{x_{0}}$ in the feasible set, one for
each “dimension” in $S^{\perp}$. Since $f$ is supposed
to be maximized (within the feasible set, anyway) at that point, its
gradient should be orthogonal to the paths there. This will lead us
to the conclusion that $\gradient{f}\of{\vector{x_{0}}}$ is
orthogonal to all of $S^{\perp}$.

First off, since $\vector{H}$ is invertible, we can let $\vector{v_{0}} \in V$ be such that $\vector{H}\of{\vector{v_{0}}} = \vector{x_{0}} \in U$. Then, let $\set{\vector{q_{1}},\vector{q_{2}},\ldots,\vector{q_{n-m}}}$ be a basis for $S^{\perp}$. Since $V$ is an open set, we can pick tiny enough intervals $\intervaloo{-t_{j}}{t_{j}} \subseteq \realline$ and construct $n-m$ differentiable paths in $V$, $$ \left. \begin{align*} & \gamma_{j} : \intervaloo{-t_{j}}{t_{j}} \rightarrow V\\ & \gamma_{j}\of{t} = \vector{v_{0}} + t\vector{q_{j}} \end{align*} \right\rbrace \text{ for } j \in \set{1,2,\ldots,n-m}. $$ Now for each $j$, the composition $\compose{\vector{H}}{\gamma_{j}}$ is a path in $U$ through $\vector{x_{0}}$, since at $t=0$ we will have $\vector{H}\of{\gamma_{j}\of{0}} = \vector{H}\of{\vector{v_{0}}} = \vector{x_{0}}$. As a result, we know that the derivative of $\compose{f}{\compose{\vector{H}}{\gamma_{j}}}$ must be zero at $t=0$; the path is contained entirely in the feasible set, and $\vector{x_{0}}$, which corresponds to $t=0$ along the path, maximizes $f$ within that set. Heed our earlier notation warning, and then let's write that all out: $$ \begin{equation} \forall j \in \set{1,2,\ldots,n-m} : \left\lbrace \begin{aligned} &\ \qty{ \compose{f}{\compose{\vector{H}}{\gamma_{j}}} }'\of{0}\\ = &\ \compose{ f'\of{ \vector{H}\of{\gamma_{j}\of{0}} } } {\qty{ \compose{ \vector{H}'\of{\gamma_{j}\of{0}} } {\gamma_{j}'\of{0}} }}\\ = &\ \compose{ f'\of{ \vector{x_{0}} } } { \compose{ \vector{H}'\of{\vector{v_{0}}} } {\vector{q_{j}} } }\\ = &\ \ip{\gradient{f}\of{\vector{x_{0}}}} {\vector{H}'\of{\vector{v_{0}}}\of{\vector{q_{j}}}}\\ = &\ 0. \end{aligned} \right. \label{eq:f-gradient-inner-products} \end{equation} $$

So really, now what we need to do is figure out what $\vector{H}'\of{\vector{v_{0}}}$ does. Because, if it were missing from the inner product above, we'd be done! And that's actually not too hard to figure out either, using equation $\eqref{eq:rank-theorem}$ from the rank theorem. First, note that both $$ \begin{equation} \begin{aligned} G'\of{ \vector{x_{0}}} \of{ \vector{v_{0}} } & = \vector{0}\\ \text{ and } & \ \\ \phi\of{ G'\of{ \vector{x_{0}}} \of{ \vector{v_{0}} } } & = \vector{0}, \end{aligned} \label{eq:g-prime-zeros} \end{equation} $$ because we have the following direct sum decomposition of zero itself: $$ \vector{0} = G\of{ \vector{x_{0}} } = G\of{ \vector{H}\of{ \vector{v_{0}} } } = \underset{\in\ \range{G'\of{\vector{x_{0}}}}} {\underbrace{G'\of{ \vector{x_{0}}} \of{ \vector{v_{0}} }}} + \underset{\in\ \range{G'\of{\vector{x_{0}}}}^{\perp}} {\underbrace{\phi\of{ G'\of{ \vector{x_{0}}} \of{ \vector{v_{0}} } }}}. $$

Now, using $\eqref{eq:g-prime-zeros}$ and the linearity of $G'\of{ \vector{x_{0}} }$, we see that each $\compose{G}{\compose{\vector{H}}{\gamma_{j}}}$ is identically zero on its domain: $$ \forall j \in \set{1,2,\ldots,n-m} : \left\lbrace \begin{align*} &\ G\of{\vector{H}\of{\gamma_{j}\of{t}}}\\ = &\ G'\of{\vector{x_{0}}}\of{\gamma_{j}\of{t}} + \phi\of{G'\of{\vector{x_{0}}}\of{\gamma_{j}\of{t}}}\\ = &\ G'\of{\vector{x_{0}}}\of{ \vector{v_{0}} + t\vector{q_{j}} } + \phi\of{G'\of{\vector{x_{0}}}\of{ \vector{v_{0}} + t\vector{q_{j}} }}\\ = &\ t G'\of{\vector{x_{0}}}\of{\vector{q_{j}} } + \phi\of{t G'\of{\vector{x_{0}}}\of{ \vector{q_{j}} }}\\ = &\ t \vector{0} + \phi\of{t \vector{0} }\\ = &\ \vector{0}. \end{align*} \right. $$

And finally, since we now know that each $\compose{G}{\compose{\vector{H}}{\gamma_{j}}}$ is identically zero on its domain, we know that their derivatives are zero there, too. In particular, at $t=0$. Reheed the notation warning, and then let's see what that tells us: $$ \forall j \in \set{1,2,\ldots,n-m} : \left\lbrace \begin{align*} &\ \qty{ \compose{G}{\compose{\vector{H}}{\gamma_{j}}} }'\of{0}\\ = &\ \compose{ G'\of{ \vector{H}\of{\gamma_{j}\of{0}} } } {\qty{ \compose{ \vector{H}'\of{\gamma_{j}\of{0}} } {\gamma_{j}'\of{0}} }}\\ = &\ \compose{ G'\of{ \vector{x_{0}} } } { \compose{ \vector{H}'\of{\vector{v_{0}}} }{ \vector{q_{j}} } }\\ = &\ G'\of{ \vector{x_{0}} } \of{ \vector{H}'\of{ \vector{v_{0}} }\of{ \vector{q_{j}} } }\\ = &\ \vector{0}. \end{align*} \right. $$ By linearity, the fact that this last equality holds for all $\vector{q_{j}}$ means that $$ G'\of{ \vector{x_{0}} } \of{ \vector{H}'\of{\vector{v_{0}}} \of{S^{\perp}} } = \set{ \vector{0} }. $$ The rank theorem tells us that $\vector{H}'\of{\vector{v_{0}}}$ is invertible, by way of telling us that $\vector{H}$ is continuously-differentiable. Recall that an invertible matrix applied to a subspace gives you back a subspace of the same dimension. The preceding equality tells us that $\vector{H}'\of{\vector{v_{0}}} \of{S^{\perp}}$ is contained in the kernel of $G'\of{ \vector{x_{0}} }$; and, since it also has the same dimension as the kernel of $G'\of{ \vector{x_{0}} }$, it must be the case that $$ \ker\of{G'\of{ \vector{x_{0}} }} = \vector{H}'\of{\vector{v_{0}}}\of{S^{\perp}}. $$ Moreover, visual inspection should convince you that $$ \ker\of{G'\of{ \vector{x_{0}} }} = S^{\perp}. $$ Now look back at the equations $\eqref{eq:f-gradient-inner-products}$, which in hindsight say that $$ \ip{ \gradient{f}\of{\vector{x_{0}}} } { \vector{H}'\of{\vector{v_{0}}}\of{ S^{\perp} } } = \ip{ \gradient{f}\of{\vector{x_{0}}} } { S^{\perp} } = \set{0}.
$$ At long last, we've shown that $\gradient{f}\of{\vector{x_{0}}} \in \qty{S^{\perp}}^{\perp} = S$. ∎
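As a numerical sanity check on the kernel claim (using the same two-constraint toy example I used earlier, with $\gradient{g_{1}} = (2,0,0)$ and $\gradient{g_{2}} = (0,0,1)$), the nullspace of $G'\of{\vector{x_{0}}}$ computed via the SVD really is $S^{\perp}$:

```python
import numpy as np

G_prime = np.array([[2.0, 0.0, 0.0],    # rows are the constraint
                    [0.0, 0.0, 1.0]])   # gradients, so row space = S

# Nullspace basis: right singular vectors beyond the rank.
_, s, Vt = np.linalg.svd(G_prime)
rank = int(np.sum(s > 1e-12))
null_basis = Vt[rank:]                  # basis of ker(G'(x0))

assert null_basis.shape == (1, 3)       # dim = n - m = 3 - 2 = 1
assert np.allclose(np.abs(null_basis[0]), [0.0, 1.0, 0.0])
# Every kernel vector is orthogonal to both rows, i.e. to all of S.
assert np.allclose(G_prime @ null_basis.T, 0.0)
```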