Continuous Optimisation

Newton-Raphson method

When you search this on YouTube, you get explanations of how it works for approximating the solutions to an equation, say $y(x)=0$. This can also be rephrased as minimising $f(x) = (y(x))^2$, which is the setting in which the method is applied in its original form. The whole idea behind the method is that for a scalar function $f(\bold{x})$, you can approximate it as a quadratic in $\bold{t=x-x_n}$ by Taylor approximation around $\bold{x_n}$, like this:

$f(\bold{x})=f(\bold{x_n+t})\approx f(\bold{x_n})+ (\nabla f(\bold{x_n}))\bold{t} + \frac{1}{2}\bold{t^TH\,t}$

and then minimise that approximation to get the next iterate, $\bold{x_{n+1}}$, by setting the gradient with respect to $\bold{t}$ to 0, like this:

$\nabla_\bold{t}f(\bold{x}) = \nabla f(\bold{x_n+t}) = 0 + \nabla f(\bold{x_n}) + \frac{1}{2}\bold{Ht} + \frac{1}{2}\bold{t^TH} = \nabla f(\bold{x_n})+\bold{Ht} = 0$

Note: The Hessian $\bold{H}$ is a symmetric matrix, and thus $\bold{t^TH=Ht}$ (don't try to prove this by symbolic manipulation, but by writing down the matrix as a bunch of column vectors).

Thus, we get $\bold{t=-H^{-1}}(\nabla f(\bold{x_n}))$. Putting it all together, we get the final equation:

$\bold{x_{n+1}=x_n-H^{-1}}\nabla f(\bold{x_n})$
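As a quick illustration, here is a minimal numpy sketch of this update on a toy quadratic (the function, gradient, Hessian, and starting point are my own choices, just for illustration); it solves $\bold{Ht}=-\nabla f(\bold{x_n})$ rather than forming $\bold{H}^{-1}$ explicitly:

```python
import numpy as np

# Toy example: f(x, y) = x**2 + x*y + 2*y**2 - 4*x (chosen only for illustration)
def grad(x):
    return np.array([2 * x[0] + x[1] - 4, x[0] + 4 * x[1]])

def hessian(x):
    return np.array([[2.0, 1.0],
                     [1.0, 4.0]])

x = np.array([5.0, 5.0])
for _ in range(10):
    # Newton step: solve H t = -grad f(x_n) instead of inverting H
    t = np.linalg.solve(hessian(x), -grad(x))
    x = x + t
print(x)  # reaches the minimiser in one step here, since f is exactly quadratic
```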

Applying this general method to $f(x) = (y(x))^2$ gives us $x_{n+1}=x_n-[2y'^2 + 2yy'']^{-1}[2yy'] = x_n - \left[\frac{yy'}{y'^2+yy''}\right]_{x_n}$. Since $y(x_n) \approx 0$, we just have $x_{n+1} = x_n - \frac{y(x_n)}{y'(x_n)}$.

This is the equation that you usually see all around.
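A minimal sketch of that scalar update, with a toy function $y(x)=x^2-2$ and starting point chosen just for illustration:

```python
def newton_raphson(y, dy, x0, tol=1e-10, max_iter=50):
    """Scalar Newton-Raphson: x_{n+1} = x_n - y(x_n) / y'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = y(x) / dy(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: approximate the root of y(x) = x**2 - 2, i.e. sqrt(2)
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```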

To see the general method in action, read my friend's blog:

Blogs - Optimizing bivariate functions
Understanding optimization of bivariate functions using Newton’s method and L-BFGS.
https://kishan-ved.github.io/Blogs/posts/secondorder/

Condition number

Suppose you are trying to solve an equation $f(\bold{x}) = \bold{c}$ by some numerical method, and after a lot of ($n$) iterations you end up at the approximate value $\bold{x}_n$, which gives the value of the function as $f(\bold{x}_n)$. You know the relative error in the value of $f$, which is $\frac{||f(\bold{x}_n)-\bold{c}||}{||\bold{c}||}$, but you don't know the relative error in $\bold{x}$, namely $\frac{||\bold{x_n-x_\infty}||}{||\bold{x}_\infty||}$. To get an upper bound on this error, we define the condition number as

$K = \max_n\dfrac{\left(\frac{||\bold{x_n-x_\infty}||}{||\bold{x}_\infty||}\right)}{\left(\frac{||f(\bold{x}_n)-\bold{c}||}{||\bold{c}||}\right)}$

For example, for a matrix equation $\bold{f(x) = Ax=c}$, if the error in $\bold{x}$ is $\delta\bold{x}$, then, as

$\dfrac{\left(\frac{||\delta \bold{x}||}{||\bold{x}||}\right)}{\left(\frac{||\delta \bold{f}||}{||\bold{f}||}\right)} = \frac{||\bold{Ax}||}{||\bold{x}||}\left(\frac{||\bold{A}\,\delta \bold{x}||}{||\delta \bold{x}||}\right)^{-1} \leq {\sigma_{max}}\;{\sigma_{min}}^{-1}$

where $\sigma_{max}$ and $\sigma_{min}$ are the maximum and minimum singular values of $\bold{A}$, we have the condition number as $\dfrac{\sigma_{max}}{\sigma_{min}}$.
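As a quick numerical check, here is a small numpy sketch (the matrix is an arbitrary example); `np.linalg.cond` computes the same $\sigma_{max}/\sigma_{min}$ ratio for the 2-norm:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, largest first
print(sigma[0] / sigma[-1])                  # sigma_max / sigma_min
print(np.linalg.cond(A))                     # same value (2-norm condition number)
```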

Gradient Descent

Given a function $f(\bold{x})$ to be minimised, step along the negative gradient by doing:

$\bold{x}_{i+1} = \bold{x}_i -\gamma_i (\nabla f(\bold{x_i}))^T$

The transpose is because we usually consider the gradient to be a row vector.

The step size $\gamma_i$ depends on $\bold{x}_i$.
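A minimal sketch of the update loop, with a constant step size and a toy objective chosen only for illustration:

```python
import numpy as np

def gradient_descent(grad, x0, gamma=0.1, max_iter=1000, tol=1e-8):
    """Plain gradient descent: x_{i+1} = x_i - gamma * grad f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (almost) zero
            break
        x = x - gamma * g
    return x

# Example: minimise f(x, y) = (x - 3)**2 + (y + 1)**2
print(gradient_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0]))
```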

Gradient Descent with Momentum

Rather than viewing the gradient as the velocity (or momentum), if we view it as acceleration (or force), then we get gradient descent with momentum.

It is able to escape local minima, unlike normal gradient descent, and it takes bigger steps when it knows it's going in the right direction, thus converging faster.

We also use the change in $\bold{x}$ from the last update, denoted as $\Delta\bold{x}_i = \bold{x}_i-\bold{x}_{i-1}$, to calculate the next step, like this:

$\Delta\bold{x}_{i+1} = \bold{x}_{i+1}-\bold{x}_{i}=-\gamma_i(\nabla f(\bold{x}_i))^T + \alpha\Delta \bold{x}_i$

The reason for taking $\alpha < 1$ is to include "friction", because otherwise we can end up in an infinite loop (like a frictionless ball moving on a bowl-like surface, always reaching the same height whenever its speed is 0, and thus never stopping, since its velocity and acceleration are never 0 at the same time).
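A minimal sketch of the momentum update (the step size, $\alpha$, and the toy objective are assumptions for illustration):

```python
import numpy as np

def momentum_descent(grad, x0, gamma=0.1, alpha=0.9, max_iter=1000, tol=1e-8):
    """Gradient descent with momentum: dx_{i+1} = -gamma * grad f(x_i) + alpha * dx_i."""
    x = np.asarray(x0, dtype=float)
    dx = np.zeros_like(x)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = -gamma * g + alpha * dx   # previous step acts as "velocity", gradient as "force"
        x = x + dx
    return x

# Same toy objective as before: f(x, y) = (x - 3)**2 + (y + 1)**2
print(momentum_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0]))
```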

Here's a nice article explaining this (and more variations) visually:

A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam)
Why can AdaGrad escape saddle point? Why is Adam usually better? In a race down different terrains, which will win?
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c

Step Size

Sure, you can take steps of constant size, but you can do more. Say you want to minimise $f(\bold{x})$ and you have reached a point $\bold{x_n}$. Now, you want to descend along the gradient again. The question is, till when? The answer is: till we are descending, that is, till $f(\bold{x})$ is decreasing. Basically, till $\frac{d}{d\gamma}f(\bold{x_n}-\gamma \nabla f(\bold{x_n})) = 0 \iff (\nabla f(\bold{x_n}-\gamma \nabla f(\bold{x_n})))(\nabla f(\bold{x_n}))^T = 0$. The solution to this equation would be the exact optimum step size, but most of the time you don't have an analytical solution. Expanding the gradient to first order, this can be approximated as $(\nabla f(\bold{x_n}) - \gamma (\nabla f(\bold{x_n}))\bold{H}_f)(\nabla f(\bold{x_n}))^T = 0 \implies \gamma = \frac{||\nabla f(\bold{x_n})||^2}{\nabla f(\bold{x_n})\bold{H}_f(\nabla f(\bold{x_n}))^T}$. But a lot of the time you don't know the Hessian, and this formula is then more or less useless, because if you are calculating the Hessian anyway, just use Newton's method, as it will give a better result.

What we really need is a cheap way to find when $\gamma$ should be made smaller and when not. This is where backtracking line search comes in.

You start with $\gamma = 1$ and reduce $\gamma$ by a fixed scaling factor $\beta < 1$ while $f(\bold{x_n}-\gamma\nabla f(\bold{x_n})) \geq f(\bold{x_n}) - \frac{\gamma}{2}||\nabla f(\bold{x_n})||^2$ is true. You go to the next step when $\gamma$ no longer satisfies this inequality. One way to speed this up is to start the iteration with the value of $\gamma$ chosen in the last step. If it is already small enough to not satisfy the inequality, you keep increasing it (dividing by $\beta$) until you reach the last $\gamma$ that doesn't satisfy the condition, or you can just start from $\gamma = 1$ again.
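A minimal sketch of one backtracking step (the test function and $\beta$ are just example choices):

```python
import numpy as np

def backtracking_step(f, grad_f, x, beta=0.5, gamma=1.0):
    """Shrink gamma by the factor beta while the sufficient-decrease test fails,
    then take the gradient step with the accepted gamma."""
    g = grad_f(x)
    gg = np.dot(g, g)
    if gg == 0:                 # already at a stationary point, nothing to do
        return x, gamma
    while f(x - gamma * g) >= f(x) - (gamma / 2) * gg:
        gamma *= beta
    return x - gamma * g, gamma

# Example usage on f(x, y) = x**2 + 10*y**2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x_next, gamma = backtracking_step(f, grad_f, np.array([1.0, 1.0]))
```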

Lagrange Multipliers for Equations

This is a rather nice method where the whole idea is that, to minimise (locally) a function $f(\bold{x})$ under the constraints $g_i(\bold{x}) = 0$, which can be written compactly as $\bold{g(x)=0}$, you consider the Lagrangian $L(\bold{x},\bold{a}) = f(\bold{x})+\bold{a^Tg(x)}$ and find the critical points of this new function under no constraints by setting its gradient to 0.

Basically, our new problem becomes $\nabla f (\bold{x_0}) + a_1\nabla g_1(\bold{x_0}) + a_2\nabla g_2(\bold{x_0}) + \dots = 0$, which can be read as "$\nabla f(\bold{x_0}),\nabla g_1(\bold{x_0}),\nabla g_2(\bold{x_0}), \dots$ are linearly dependent row vectors", while still following the old constraints, $\bold{g(x_0)=0}$.

Proof:

Consider any point in the locality of the minimum $\bold{x_0}$ of $f$, say $\bold{x_1=x_0+}t\bold{b}$ with $\bold{g(x_1)=0}$, where $\bold{b}$ is a unit vector and $t$ is a scalar. Both $t$ and $\bold{x_1}$ are in fact functions of $\bold{b}$. Now, suppose $\bold{x_1 \to x_0}$; then we have $\lim_{\bold{x_1 \to x_0}}\frac{g_i(\bold{x_1})-g_i(\bold{x_0})}{t} = \nabla g_i(\bold{x_0}) \;\bold{b} = 0$.

Now, since $\bold{x_0}$ is the solution to the minimisation problem (minimise $f$ under the constraints), the function $f$ must not change on going in any direction (by an infinitesimal distance) along which the constraints are followed, say $\bold{b}$ for example. What this means is that $\lim_{t \to 0}\frac{f(\bold{x_0+b}t)-f(\bold{x_0})}{t} = \nabla f(\bold{x_0}) \;\bold{b} = 0$. Reiterating, we are saying there should be no way to move such that $f$ would change while $\bold{g=0}$ is still followed, as otherwise we could just move in that direction or its opposite to decrease $f$.

So, what we have just shown is that any unit vector $\bold{b}$ perpendicular to all of the $\nabla g_i(\bold{x_0})$ is also perpendicular to $\nabla f(\bold{x_0})$. But that means $\nabla f(\bold{x_0})$ has no component in the null space formed by all of the $\nabla g_i(\bold{x_0})$, and thus it lies in the vector space spanned by the set of $\nabla g_i(\bold{x_0})$; that is to say, $\nabla f(\bold{x_0}),\nabla g_1(\bold{x_0}),\nabla g_2(\bold{x_0}), \dots$ are linearly dependent, which is what we wanted to prove.
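To make this concrete, here is a minimal sympy sketch on a toy problem (the objective and constraint are my own example choices): it solves $\nabla_{\bold{x}} L = 0$ together with $\bold{g(x)=0}$.

```python
import sympy as sp

# Toy problem: minimise f = x**2 + y**2 under the single constraint g = x + y - 1 = 0
x, y, a = sp.symbols('x y a', real=True)
f = x**2 + y**2
g = x + y - 1

L = f + a * g                               # Lagrangian L(x, a) = f + a*g
eqs = [sp.diff(L, x), sp.diff(L, y), g]     # grad_x L = 0 together with the constraint
print(sp.solve(eqs, [x, y, a]))             # {x: 1/2, y: 1/2, a: -1}
```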

This method even works for inequality constraints $h(\bold{x})\leq 0$, since such a constraint can be converted to an equality by introducing a new variable $x_{n+1}$ into the vector $\bold{x}$ and writing $h(\bold{x})+ x_{n+1}^2 = 0$. The resultant problem can be simplified a lot, and that simplified version has a much more intuitive derivation than the one you would do after applying this trick. We'll discuss it later.

Gradient descent under equality constraints

Notice that we talked about moving under a bunch of constraints $\bold{g(x)=0}$ in order to decrease the value of $f(\bold{x})$. This is a lot like gradient descent, except that here we aren't moving along the gradient $\nabla f(\bold{x})$ directly, but along its projection onto the null space formed by the $\nabla g_i(\bold{x})$.
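A minimal sketch of one such projected step, assuming the constraints are given through their (here constant) gradients; the toy objective and constraint are example choices:

```python
import numpy as np

def projected_gradient_step(grad_f, constraint_grads, x, gamma=0.1):
    """Descend along grad f projected onto the null space of the constraint
    gradients (rows of A), so g(x) = 0 keeps holding to first order."""
    A = np.vstack(constraint_grads)
    P = np.eye(len(x)) - A.T @ np.linalg.pinv(A @ A.T) @ A   # null-space projector
    return x - gamma * (P @ grad_f(x))

# Example: f = x**2 + y**2 with the constraint x + y - 1 = 0 (gradient (1, 1))
x = np.array([1.0, 0.0])                      # a feasible starting point
for _ in range(100):
    x = projected_gradient_step(lambda z: 2 * z, [np.array([1.0, 1.0])], x)
print(x)                                      # approaches (0.5, 0.5)
```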

Lagrange multiplier method for inequality constraints

Consider the function $f(\bold{x})$ to be locally minimised under the constraints $g_i(\bold{x}) \leq 0$. For a solution $\bold{x_0}$, first of all, at least one of the inequalities must become an equality, which we usually describe as the constraint being tight. This is because if $g_i(\bold{x_0}) < 0\;\forall i$, then $g_i(\bold{x}) < 0$ in all directions in the locality of $\bold{x_0}$, and thus we can descend along the gradient to reach a point in this locality that follows all the constraints and has a smaller value of $f$. So now, WLOG, suppose that $g_i(\bold{x_0}) = 0\; \forall i \leq k$ and $g_i(\bold{x_0}) < 0 \;\forall i > k$. Then moving a little around $\bold{x_0}$ doesn't affect the inequalities $g_i(\bold{x_0}) < 0 \;\forall i > k$, so we can move following the constraints $g_i(\bold{x_0}) = 0\; \forall i \leq k$ as if the other constraints didn't exist. Clearly, for $\bold{x_0}$ to be a local minimum, the projection of the gradient $\nabla f(\bold{x_0})$ onto the null space formed by $G = \{\nabla g_i(\bold{x_0})\;|\;1\leq i\leq k\}$ should be $\bold{0}$, and thus $\nabla f(\bold{x_0})$ is linearly dependent on these vectors, which means $\nabla f(\bold{x_0})+a_1\nabla g_1(\bold{x_0})+a_2\nabla g_2(\bold{x_0})+\dots + a_k\nabla g_k(\bold{x_0}) = 0$. Also, suppose we moved a little along a unit vector $\bold{b}$ such that $\nabla g_i(\bold{x_0})\bold{b} = 0 \;\forall \;i \leq k,\; i\neq j$, and $\nabla g_j(\bold{x_0})\bold{b} <0$ (yes, such a unit vector always exists for linearly independent vectors (a subset of $G$ in this case) due to the existence of a reciprocal system of vectors). Then we have $\nabla f(\bold{x_0})\bold{b} = - a_j\nabla g_j(\bold{x_0})\bold{b}$. So if $a_j < 0$, then $\nabla f(\bold{x_0})\bold{b} < 0$ and thus we can step along $\bold{b}$ to reduce $f$. This should not happen, and thus we have the additional restriction that $a_i \geq 0 \;\forall i$.

Thus any critical point $\bold{x_0}$ of the old problem is also part of a critical point $(\bold{x_0,a_0})$ of the function $L(\bold{x},\bold{a}) = f(\bold{x})+\sum_{i=1}^n a_ig_i(\bold{x})$ constrained by $a_i \geq 0,\,g_i \leq 0 \;\forall i$. Here we can use the fact that if $g_j(\bold{x_0}) < 0$, then $a_j$ necessarily has to be 0 (or else there are points in the locality where $L$ is bigger and points where it is smaller, found by decreasing or increasing $a_j$, and thus we are not at the minimum). So, in effect, in the equations obtained by setting the partial derivatives to 0, we are still considering the gradients of only those constraint functions which evaluate to 0 at $\bold{x_0}$, that is, whose constraint is tight.

Of course, the fact that $g_i(\bold{x_0})<0 \implies a_i = 0$ also tells us that $a_i > 0 \implies a_i \neq 0 \implies g_i(\bold{x_0}) \nless 0 \implies g_i(\bold{x_0}) = 0$. Basically, at least one of $a_i$ and $g_i(\bold{x_0})$ must be 0. Thus you can divide this problem into $2^n-1$ cases based on whether $a_i=0$ or $a_i > 0$, and solve each case by setting the gradient to 0. Remember that the case where $g_i(\bold{x_0}) < 0 \;\forall i \implies a_i = 0 \;\forall i$ is not allowed, as was established at the very start.
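A small sympy sketch of this case split on a toy problem with one constraint (so just the two cases $a=0$ and $a>0$); the objective and constraint are my own example choices:

```python
import sympy as sp

# Toy problem: minimise f = (x - 2)**2 + (y - 2)**2 under g = x + y - 1 <= 0
x, y, a = sp.symbols('x y a', real=True)
f = (x - 2)**2 + (y - 2)**2
g = x + y - 1
L = f + a * g

# Case a = 0 (constraint slack): unconstrained stationary point of f
case_slack = sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y])      # {x: 2, y: 2}
print(case_slack, "feasible:", g.subs(case_slack) <= 0)            # infeasible, discard

# Case a > 0 (constraint tight, g = 0): solve the full Lagrangian system
case_tight = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, a])
print(case_tight)   # {x: 1/2, y: 1/2, a: 3}; a >= 0, so this is the minimiser
```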

Gradient descent under general constraints

If you have reached a point $\bold{x_1}$ somewhere in the process, then to decide the direction in which you should step, you consider the set $G_0$, which contains the gradients of the constraint functions of all the equality constraints. To this set you add the gradient of any constraint function $g_1$ of an inequality constraint $g_1(\bold{x})\leq 0$ that is tight at $\bold{x_1}$ and for which the projection of the gradient $\nabla f$ onto the null space of $G_0$ has a negative dot product with $\nabla g_1$, to get the updated set $G_1$. Then you add another such tight inequality constraint function's gradient, one which has a negative dot product with the projection of $\nabla f$ onto the null space of $G_1$, and thus get the updated set $G_2$. You keep repeating this process until you can't find another constraint function that satisfies this rule. The projection of $\nabla f$ onto the null space of this final set, say $G_k$, gives us the direction to move in.

Notice that here we are quite literally constraining the gradient by not letting its projection, which would give us the optimum direction to move in under only the constraints involved in the set $G_i$, have a negative dot product with the gradient of any other tight constraint function. If it were negative, we would break that constraint by descending along this projection, as $g_i$ would then increase on such a step and become positive. It's basically like updating our optimal direction of movement as more constraints are put on us.
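A rough numpy sketch of this active-set growth (my own encoding: the equality-constraint gradients and the tight-inequality gradients are passed in as lists of vectors); it returns the negative of the final projection as the direction to step in:

```python
import numpy as np

def null_space_projection(v, grads):
    """Project v onto the null space of the rows in grads (empty set -> v itself)."""
    if not grads:
        return v
    A = np.vstack(grads)
    P = np.eye(len(v)) - A.T @ np.linalg.pinv(A @ A.T) @ A
    return P @ v

def descent_direction(grad_f, eq_grads, tight_ineq_grads):
    """Grow the active set as described above, then return the direction to step in."""
    active, remaining = list(eq_grads), list(tight_ineq_grads)
    while True:
        p = null_space_projection(grad_f, active)
        # a tight inequality whose gradient has a negative dot product with the projection
        blocking = next((g for g in remaining if np.dot(p, g) < 0), None)
        if blocking is None:
            return -p            # step opposite the projected gradient to descend
        active.append(blocking)
        remaining.remove(blocking)
```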

Convex function

A function $f:D \to C$ is convex iff $f(\alpha\bold{x_2}+(1-\alpha)\bold{x_1}) \leq \alpha f(\bold{x_2}) + (1-\alpha)f(\bold{x_1}) \;\;\forall\;\alpha\in [0,1],\ \forall \bold{x_1,x_2}\in D$.

For example, an upward-opening quadratic is a convex function.

This inequality is called Jensen's inequality, and it is the definition of a convex function, not a property of it. A concave function is one where the inequality is switched to $\geq$.

Consider $\bold{x_2=x_1+x}$. Then, for a convex function, we have $f(\bold{x_1 + \alpha x}) \leq \alpha (f(\bold{x_1 + x}) - f(\bold{x_1})) +f(\bold{x_1})$. This can be rewritten as $\frac{f(\bold{x_1+\alpha x})-f(\bold{x_1})}{\alpha} \leq f(\bold{x_1 + x}) - f(\bold{x_1})$. From here, it's easy to see that as $\alpha \to 0$, we get $\nabla f(\bold{x_1}) \bold{x} \leq f(\bold{x_1 + x}) - f(\bold{x_1})$. Since this is a general property, we can also write $\nabla f(\bold{x_1+x}) (-\bold{x}) \leq f(\bold{(x_1 + x)+(-x)}) - f(\bold{x_1+x})$, or simply $\nabla f(\bold{x_1+x})\bold{x} \geq f(\bold{x_1+x}) - f(\bold{x_1})$. What this means is that $\nabla f(\bold{x_1+x})\bold{x} \geq \nabla f(\bold{x_1})\bold{x} \;\;\forall \bold{x_1,x}\in D$, or, rewritten, $(\nabla f(\bold{x_1+x})- \nabla f(\bold{x_1}))\bold{x} \geq 0 \;\;\forall \bold{x_1,x}\in D$. Thus it's also true for $\bold{x}=t\bold{b}$, where $\bold{b}$ is some vector. Thus $\left(\frac{\nabla f(\bold{x_1}+t\bold{b})-\nabla f(\bold{x_1})}{t}\right)\bold{b} \geq 0$. It's easy to see that as $t \to 0$, the thing inside the bracket is just $\bold{b^TH}_f$. Thus you have $\bold{b^TH_fb} \geq 0 \;\forall \;\bold{b}$, and thus the Hessian is positive semi-definite.
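A tiny numpy check of this criterion, on the Hessian of an assumed toy quadratic (for a quadratic the Hessian is constant, which keeps the example short):

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """Positive semi-definiteness check via the eigenvalues of the symmetric matrix H."""
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of f(x, y) = x**2 + x*y + y**2
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(is_psd(H))   # True, so this f is convex
```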

A constrained problem of minimising $f(\bold{x})$ under the constraints $\bold{g(x)=0}$ is called a convex problem if $f(\bold{x})$ and the $g_i(\bold{x})$ are convex functions.

Linear Programming

Consider the function $f(\bold{x})=\bold{c^Tx}$. In order to minimise $f$ under the constraint $\bold{Ax\leq b}$, we consider the Lagrangian $L(\bold{x,a}) = \bold{c^Tx+a^T(Ax-b)} = \bold{x^T(c+A^Ta)-a^Tb}$ and maximise it under the constraints $a_i \geq 0 \geq g_i$. Taking the gradient with respect to $\bold{x}$ and setting it to 0, we get $\bold{c+A^Ta=0}$.

Thus we just have to maximise $L(\bold{x,a})=\bold{-a^Tb}$ as a function of $\bold{a}$ under the restrictions that $a_i \geq 0$ and $\bold{c+A^Ta=0}$.
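A small scipy sketch (the data $\bold{c}$, $\bold{A}$, $\bold{b}$ are arbitrary toy values) solving both the original problem and this derived maximisation; by duality the two optimal values agree:

```python
import numpy as np
from scipy.optimize import linprog

# Primal: minimise c^T x subject to A x <= b (x unrestricted in sign)
c = np.array([-1.0, -2.0])
A = np.array([[-1.0,  0.0],
              [ 0.0, -1.0],
              [ 1.0,  1.0]])
b = np.array([0.0, 0.0, 4.0])
primal = linprog(c, A_ub=A, b_ub=b, bounds=(None, None), method="highs")

# Derived problem: maximise -a^T b, i.e. minimise b^T a, with c + A^T a = 0 and a >= 0
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=(0, None), method="highs")

print(primal.fun)   # optimal c^T x
print(-dual.fun)    # optimal -a^T b; matches primal.fun
```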