Continuous Optimisation

Newton-Raphson method

When you search this on YouTube, you get explanations of how it works for approximating the solutions to an equation, say $y(x)=0$. This can also be rephrased as minimising $f(x) = (y(x))^2$, which is the setting in which the method is applied in its original form. The whole idea behind the method is that for a scalar function $f(\bold{x})$, you can approximate it as a quadratic in $\bold{t=x-x_n}$ by Taylor approximation around $\bold{x_n}$, like this:

$f(\bold{x})=f(\bold{x_n+t})\approx f(\bold{x_n})+ (\nabla f(\bold{x_n}))\bold{t} + \frac{1}{2}\bold{t^TH\,t}$

and then minimise that approximation to get the next iterate, $\bold{x_{n+1}}$, by setting the gradient with respect to $\bold{t}$ to 0, like this:

$\nabla_\bold{t}f(\bold{x}) = \nabla f(\bold{x_n+t}) = 0 + \nabla f(\bold{x_n}) + \frac{1}{2}\bold{Ht} + \frac{1}{2}\bold{t^TH} = \nabla f(\bold{x_n})+\bold{Ht} = 0$

Note: The Hessian $\bold{H}$ is a symmetric matrix, and thus $\bold{t^TH=Ht}$ (don't try to prove this by symbolic manipulation, but by writing down the matrix as a bunch of column vectors).

Thus, we get $\bold{t=-H^{-1}}(\nabla f(\bold{x_n}))$. Putting it all together, we get the final equation:

$\bold{x_{n+1}=x_n-H^{-1}}\nabla f(\bold{x_n})$
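As a quick illustration, here is a minimal numpy sketch of this update on a toy quadratic (the function, gradient, Hessian, and starting point are my own choices, just for illustration); it solves $\bold{Ht}=-\nabla f(\bold{x_n})$ rather than forming $\bold{H}^{-1}$ explicitly:

```python
import numpy as np

# Toy example: f(x, y) = x**2 + x*y + 2*y**2 - 4*x (chosen only for illustration)
def grad(x):
    return np.array([2 * x[0] + x[1] - 4, x[0] + 4 * x[1]])

def hessian(x):
    return np.array([[2.0, 1.0],
                     [1.0, 4.0]])

x = np.array([5.0, 5.0])
for _ in range(10):
    # Newton step: solve H t = -grad f(x_n) instead of inverting H
    t = np.linalg.solve(hessian(x), -grad(x))
    x = x + t
print(x)  # reaches the minimiser in one step here, since f is exactly quadratic
```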

Applying this general method to $f(x) = (y(x))^2$ gives us $x_{n+1}=x_n-[2y'^2 + 2yy'']^{-1}[2yy'] = x_n - \left[\frac{yy'}{y'^2+yy''}\right]_{x_n}$. Since $y(x_n) \approx 0$, we just have $x_{n+1} = x_n - \frac{y(x_n)}{y'(x_n)}$.

This is the equation that you usually see all around.
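A minimal sketch of that scalar update, with a toy function $y(x)=x^2-2$ and starting point chosen just for illustration:

```python
def newton_raphson(y, dy, x0, tol=1e-10, max_iter=50):
    """Scalar Newton-Raphson: x_{n+1} = x_n - y(x_n) / y'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = y(x) / dy(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: approximate the root of y(x) = x**2 - 2, i.e. sqrt(2)
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```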

To see the general method in action, read my friend's blog:

Blogs - Optimizing bivariate functions
Understanding optimization of bivariate functions using Newton’s method and L-BFGS.
https://kishan-ved.github.io/Blogs/posts/secondorder/

Condition number

Suppose you are trying to solve an equation $f(\bold{x}) = \bold{c}$ by some numerical method, and after a lot of ($n$) iterations you end up at the approximate value $\bold{x}_n$, which gives the value of the function as $f(\bold{x}_n)$. You know the relative error in the value of $f$, which is $\frac{||f(\bold{x}_n)-\bold{c}||}{||\bold{c}||}$, but you don't know the relative error in $\bold{x}$, namely $\frac{||\bold{x_n-x_\infty}||}{||\bold{x}_\infty||}$. To get an upper bound on this error, we define the condition number as

$K = \max_n\dfrac{\left(\frac{||\bold{x_n-x_\infty}||}{||\bold{x}_\infty||}\right)}{\left(\frac{||f(\bold{x}_n)-\bold{c}||}{||\bold{c}||}\right)}$

For example, for a matrix equation $\bold{f(x) = Ax=c}$, if the error in $\bold{x}$ is $\delta\bold{x}$, then, as

$\dfrac{\left(\frac{||\delta \bold{x}||}{||\bold{x}||}\right)}{\left(\frac{||\delta \bold{f}||}{||\bold{f}||}\right)} = \frac{||\bold{Ax}||}{||\bold{x}||}\left(\frac{||\bold{A}\,\delta \bold{x}||}{||\delta \bold{x}||}\right)^{-1} \leq {\sigma_{max}}\;{\sigma_{min}}^{-1}$

where $\sigma_{max}$ and $\sigma_{min}$ are the maximum and minimum singular values of $\bold{A}$, we have the condition number as $\dfrac{\sigma_{max}}{\sigma_{min}}$.
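As a quick numerical check, here is a small numpy sketch (the matrix is an arbitrary example); `np.linalg.cond` computes the same $\sigma_{max}/\sigma_{min}$ ratio for the 2-norm:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, largest first
print(sigma[0] / sigma[-1])                  # sigma_max / sigma_min
print(np.linalg.cond(A))                     # same value (2-norm condition number)
```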

Gradient Descent

Given a function $f(\bold{x})$ to be minimised, step along the negative gradient by doing:

$\bold{x}_{i+1} = \bold{x}_i -\gamma_i (\nabla f(\bold{x_i}))^T$

The transpose is because we usually consider the gradient to be a row vector.

The step size $\gamma_i$ depends on $\bold{x}_i$.
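A minimal sketch of the update loop, with a constant step size and a toy objective chosen only for illustration:

```python
import numpy as np

def gradient_descent(grad, x0, gamma=0.1, max_iter=1000, tol=1e-8):
    """Plain gradient descent: x_{i+1} = x_i - gamma * grad f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (almost) zero
            break
        x = x - gamma * g
    return x

# Example: minimise f(x, y) = (x - 3)**2 + (y + 1)**2
print(gradient_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0]))
```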

Gradient Descent with Momentum

Rather than viewing the gradient as the velocity (or momentum), if we view it as acceleration (or force), then we get gradient descent with momentum.

It is able to escape local minima, unlike normal gradient descent, and it takes bigger steps when it knows it's going in the right direction, thus converging faster.

We also use the change in $\bold{x}$ from the last update, denoted as $\Delta\bold{x}_i = \bold{x}_i-\bold{x}_{i-1}$, to calculate the next step, like this:

$\Delta\bold{x}_{i+1} = \bold{x}_{i+1}-\bold{x}_{i}=-\gamma_i(\nabla f(\bold{x}_i))^T + \alpha\Delta \bold{x}_i$

The reason for taking $\alpha < 1$ is to include "friction", because otherwise we can end up in an infinite loop (like a frictionless ball moving on a bowl-like surface, always reaching the same height whenever its speed is 0, and thus never stopping, since its velocity and acceleration are never 0 at the same time).
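A minimal sketch of the momentum update (the step size, $\alpha$, and the toy objective are assumptions for illustration):

```python
import numpy as np

def momentum_descent(grad, x0, gamma=0.1, alpha=0.9, max_iter=1000, tol=1e-8):
    """Gradient descent with momentum: dx_{i+1} = -gamma * grad f(x_i) + alpha * dx_i."""
    x = np.asarray(x0, dtype=float)
    dx = np.zeros_like(x)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = -gamma * g + alpha * dx   # previous step acts as "velocity", gradient as "force"
        x = x + dx
    return x

# Same toy objective as before: f(x, y) = (x - 3)**2 + (y + 1)**2
print(momentum_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0]))
```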

Here's a nice article explaining this (and more variations) visually:

A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam)
Why can AdaGrad escape saddle point? Why is Adam usually better? In a race down different terrains, which will win?
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c

Step Size

Sure, you can take steps of constant size, but you can do more. Say you want to minimise $f(\bold{x})$ and you have reached a point $\bold{x_n}$. Now, you want to descend along the gradient again. The question is, till when? The answer is: till we are descending, that is, till $f(\bold{x})$ is decreasing. Basically, till $\frac{d}{d\gamma}f(\bold{x_n}-\gamma \nabla f(\bold{x_n})) = 0 \iff (\nabla f(\bold{x_n}-\gamma \nabla f(\bold{x_n})))(\nabla f(\bold{x_n}))^T = 0$. The solution to this equation would be the exact optimum step size, but most of the time you don't have an analytical solution. Expanding the gradient to first order, this can be approximated as $(\nabla f(\bold{x_n}) - \gamma (\nabla f(\bold{x_n}))\bold{H}_f)(\nabla f(\bold{x_n}))^T = 0 \implies \gamma = \frac{||\nabla f(\bold{x_n})||^2}{\nabla f(\bold{x_n})\bold{H}_f(\nabla f(\bold{x_n}))^T}$. But a lot of the time you don't know the Hessian, and this formula is then more or less useless, because if you are calculating the Hessian anyway, just use Newton's method, as it will give a better result.

What we really need is a cheap way to find when $\gamma$ should be made smaller and when not. This is where backtracking line search comes in.

You start with $\gamma = 1$ and reduce $\gamma$ by a fixed scaling factor $\beta < 1$ while $f(\bold{x_n}-\gamma\nabla f(\bold{x_n})) \geq f(\bold{x_n}) - \frac{\gamma}{2}||\nabla f(\bold{x_n})||^2$ is true. You go to the next step when $\gamma$ no longer satisfies this inequality. One way to speed this up is to start the iteration with the value of $\gamma$ chosen in the last step. If it is already small enough to not satisfy the inequality, you keep increasing it (dividing by $\beta$) until you reach the last $\gamma$ that doesn't satisfy the condition, or you can just start from $\gamma = 1$ again.
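A minimal sketch of one backtracking step (the test function and $\beta$ are just example choices):

```python
import numpy as np

def backtracking_step(f, grad_f, x, beta=0.5, gamma=1.0):
    """Shrink gamma by the factor beta while the sufficient-decrease test fails,
    then take the gradient step with the accepted gamma."""
    g = grad_f(x)
    gg = np.dot(g, g)
    if gg == 0:                 # already at a stationary point, nothing to do
        return x, gamma
    while f(x - gamma * g) >= f(x) - (gamma / 2) * gg:
        gamma *= beta
    return x - gamma * g, gamma

# Example usage on f(x, y) = x**2 + 10*y**2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x_next, gamma = backtracking_step(f, grad_f, np.array([1.0, 1.0]))
```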

Lagrange Multipliers for Equations

This is a rather nice method where the whole idea is that, to minimise (locally) a function $f(\bold{x})$ under the constraints $g_i(\bold{x}) = 0$, which can be written compactly as $\bold{g(x)=0}$, you consider the Lagrangian $L(\bold{x},\bold{a}) = f(\bold{x})+\bold{a^Tg(x)}$ and find the critical points of this new function under no constraints by setting its gradient to 0.

Basically, our new problem becomes $\nabla f (\bold{x_0}) + a_1\nabla g_1(\bold{x_0}) + a_2\nabla g_2(\bold{x_0}) + \dots = 0$, which can be read as "$\nabla f(\bold{x_0}),\nabla g_1(\bold{x_0}),\nabla g_2(\bold{x_0}), \dots$ are linearly dependent row vectors", while still following the old constraints, $\bold{g(x_0)=0}$.

Proof:

Consider any point in the locality of the minimum $\bold{x_0}$ of $f$, say $\bold{x_1=x_0+}t\bold{b}$ with $\bold{g(x_1)=0}$, where $\bold{b}$ is a unit vector and $t$ is a scalar. Both $t$ and $\bold{x_1}$ are in fact functions of $\bold{b}$. Now, suppose $\bold{x_1 \to x_0}$; then we have $\lim_{\bold{x_1 \to x_0}}\frac{g_i(\bold{x_1})-g_i(\bold{x_0})}{t} = \nabla g_i(\bold{x_0}) \;\bold{b} = 0$.

Now, since $\bold{x_0}$ is the solution to the minimisation problem (minimise $f$ under the constraints), the function $f$ must not change on going in any direction (by an infinitesimal distance) along which the constraints are followed, say $\bold{b}$ for example. What this means is that $\lim_{t \to 0}\frac{f(\bold{x_0+b}t)-f(\bold{x_0})}{t} = \nabla f(\bold{x_0}) \;\bold{b} = 0$. Reiterating, we are saying there should be no way to move such that $f$ would change while $\bold{g=0}$ is still followed, as otherwise we could just move in that direction or its opposite to decrease $f$.

So, what we have just shown is that any unit vector $\bold{b}$ perpendicular to all of the $\nabla g_i(\bold{x_0})$ is also perpendicular to $\nabla f(\bold{x_0})$. But that means $\nabla f(\bold{x_0})$ has no component in the null space formed by all of the $\nabla g_i(\bold{x_0})$, and thus it lies in the vector space spanned by the set of $\nabla g_i(\bold{x_0})$; that is to say, $\nabla f(\bold{x_0}),\nabla g_1(\bold{x_0}),\nabla g_2(\bold{x_0}), \dots$ are linearly dependent, which is what we wanted to prove.
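To make this concrete, here is a minimal sympy sketch on a toy problem (the objective and constraint are my own example choices): it solves $\nabla_{\bold{x}} L = 0$ together with $\bold{g(x)=0}$.

```python
import sympy as sp

# Toy problem: minimise f = x**2 + y**2 under the single constraint g = x + y - 1 = 0
x, y, a = sp.symbols('x y a', real=True)
f = x**2 + y**2
g = x + y - 1

L = f + a * g                               # Lagrangian L(x, a) = f + a*g
eqs = [sp.diff(L, x), sp.diff(L, y), g]     # grad_x L = 0 together with the constraint
print(sp.solve(eqs, [x, y, a]))             # {x: 1/2, y: 1/2, a: -1}
```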

This method even works for inequality constraints $h(\bold{x})\leq 0$, since such a constraint can be converted to an equality by introducing a new variable $x_{n+1}$ into the vector $\bold{x}$ and writing $h(\bold{x})+ x_{n+1}^2 = 0$. The resultant problem can be simplified a lot, and that simplified version has a much more intuitive derivation than the one you would do after applying this trick. We'll discuss it later.

Gradient descent under equality constraints

Notice that we talked about moving under a bunch of constraints $\bold{g(x)=0}$ in order to decrease the value of $f(\bold{x})$. This is a lot like gradient descent, except that here we aren't moving along the gradient $\nabla f(\bold{x})$ directly, but along its projection onto the null space formed by the $\nabla g_i(\bold{x})$.
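A minimal sketch of one such projected step, assuming the constraints are given through their (here constant) gradients; the toy objective and constraint are example choices:

```python
import numpy as np

def projected_gradient_step(grad_f, constraint_grads, x, gamma=0.1):
    """Descend along grad f projected onto the null space of the constraint
    gradients (rows of A), so g(x) = 0 keeps holding to first order."""
    A = np.vstack(constraint_grads)
    P = np.eye(len(x)) - A.T @ np.linalg.pinv(A @ A.T) @ A   # null-space projector
    return x - gamma * (P @ grad_f(x))

# Example: f = x**2 + y**2 with the constraint x + y - 1 = 0 (gradient (1, 1))
x = np.array([1.0, 0.0])                      # a feasible starting point
for _ in range(100):
    x = projected_gradient_step(lambda z: 2 * z, [np.array([1.0, 1.0])], x)
print(x)                                      # approaches (0.5, 0.5)
```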

Lagrange multiplier method for inequality constraints

Consider the function $f(\bold{x})$ to be locally minimised under the constraints $g_i(\bold{x}) \leq 0$. For a solution $\bold{x_0}$, first of all, at least one of the inequalities must become an equality, which we usually describe as the constraint being tight. This is because if $g_i(\bold{x_0}) < 0\;\forall i$, then $g_i(\bold{x}) < 0$ in all directions in the locality of $\bold{x_0}$, and thus we can descend along the gradient to reach a point in this locality that follows all the constraints and has a smaller value of $f$. So now, WLOG, suppose that $g_i(\bold{x_0}) = 0\; \forall i \leq k$ and $g_i(\bold{x_0}) < 0 \;\forall i > k$. Then moving a little around $\bold{x_0}$ doesn't affect the inequalities $g_i(\bold{x_0}) < 0 \;\forall i > k$, so we can move following the constraints $g_i(\bold{x_0}) = 0\; \forall i \leq k$ as if the other constraints didn't exist. Clearly, for $\bold{x_0}$ to be a local minimum, the projection of the gradient $\nabla f(\bold{x_0})$ onto the null space formed by $G = \{\nabla g_i(\bold{x_0})\;|\;1\leq i\leq k\}$ should be $\bold{0}$, and thus $\nabla f(\bold{x_0})$ is linearly dependent on these vectors, which means $\nabla f(\bold{x_0})+a_1\nabla g_1(\bold{x_0})+a_2\nabla g_2(\bold{x_0})+\dots + a_k\nabla g_k(\bold{x_0}) = 0$. Also, suppose we moved a little along a unit vector $\bold{b}$ such that $\nabla g_i(\bold{x_0})\bold{b} = 0 \;\forall \;i \leq k,\; i\neq j$, and $\nabla g_j(\bold{x_0})\bold{b} <0$ (yes, such a unit vector always exists for linearly independent vectors (a subset of $G$ in this case) due to the existence of a reciprocal system of vectors). Then we have $\nabla f(\bold{x_0})\bold{b} = - a_j\nabla g_j(\bold{x_0})\bold{b}$. So if $a_j < 0$, then $\nabla f(\bold{x_0})\bold{b} < 0$ and thus we can step along $\bold{b}$ to reduce $f$. This should not happen, and thus we have the additional restriction that $a_i \geq 0 \;\forall i$.

Thus any critical point $\bold{x_0}$ of the old problem is also part of a critical point $(\bold{x_0,a_0})$ of the function $L(\bold{x},\bold{a}) = f(\bold{x})+\sum_{i=1}^n a_ig_i(\bold{x})$ constrained by $a_i \geq 0,\,g_i \leq 0 \;\forall i$. Here we can use the fact that if $g_j(\bold{x_0}) < 0$, then $a_j$ necessarily has to be 0 (or else there are points in the locality where $L$ is bigger and points where it is smaller, found by decreasing or increasing $a_j$, and thus we are not at the minimum). So, in effect, in the equations obtained by setting the partial derivatives to 0, we are still considering the gradients of only those constraint functions which evaluate to 0 at $\bold{x_0}$, that is, whose constraint is tight.

Of course, the fact that $g_i(\bold{x_0})<0 \implies a_i = 0$ also tells us that $a_i > 0 \implies a_i \neq 0 \implies g_i(\bold{x_0}) \nless 0 \implies g_i(\bold{x_0}) = 0$. Basically, at least one of $a_i$ and $g_i(\bold{x_0})$ must be 0. Thus you can divide this problem into $2^n-1$ cases based on whether $a_i=0$ or $a_i > 0$, and solve each case by setting the gradient to 0. Remember that the case where $g_i(\bold{x_0}) < 0 \;\forall i \implies a_i = 0 \;\forall i$ is not allowed, as was established at the very start.
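A small sympy sketch of this case split on a toy problem with one constraint (so just the two cases $a=0$ and $a>0$); the objective and constraint are my own example choices:

```python
import sympy as sp

# Toy problem: minimise f = (x - 2)**2 + (y - 2)**2 under g = x + y - 1 <= 0
x, y, a = sp.symbols('x y a', real=True)
f = (x - 2)**2 + (y - 2)**2
g = x + y - 1
L = f + a * g

# Case a = 0 (constraint slack): unconstrained stationary point of f
case_slack = sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y])      # {x: 2, y: 2}
print(case_slack, "feasible:", g.subs(case_slack) <= 0)            # infeasible, discard

# Case a > 0 (constraint tight, g = 0): solve the full Lagrangian system
case_tight = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, a])
print(case_tight)   # {x: 1/2, y: 1/2, a: 3}; a >= 0, so this is the minimiser
```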

Gradient descent under general constraints

If you have reached a point $\bold{x_1}$ somewhere in the process, then to decide the direction in which you should step, you consider the set $G_0$, which contains the gradients of the constraint functions of all the equality constraints. To this set you add the gradient of any constraint function $g_1$ of an inequality constraint $g_1(\bold{x})\leq 0$ that is tight at $\bold{x_1}$ and for which the projection of the gradient $\nabla f$ onto the null space of $G_0$ has a negative dot product with $\nabla g_1$, to get the updated set $G_1$. Then you add another such tight inequality constraint function's gradient, one which has a negative dot product with the projection of $\nabla f$ onto the null space of $G_1$, and thus get the updated set $G_2$. You keep repeating this process until you can't find another constraint function that satisfies this rule. The projection of $\nabla f$ onto the null space of this final set, say $G_k$, gives us the direction to move in.

Notice that here we are quite literally constraining the gradient by not letting its projection, which would give us the optimum direction to move in under only the constraints involved in the set $G_i$, have a negative dot product with the gradient of any other tight constraint function. If it were negative, we would break that constraint by descending along this projection, as $g_i$ would then increase on such a step and become positive. It's basically like updating our optimal direction of movement as more constraints are put on us.
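A rough numpy sketch of this active-set growth (my own encoding: the equality-constraint gradients and the tight-inequality gradients are passed in as lists of vectors); it returns the negative of the final projection as the direction to step in:

```python
import numpy as np

def null_space_projection(v, grads):
    """Project v onto the null space of the rows in grads (empty set -> v itself)."""
    if not grads:
        return v
    A = np.vstack(grads)
    P = np.eye(len(v)) - A.T @ np.linalg.pinv(A @ A.T) @ A
    return P @ v

def descent_direction(grad_f, eq_grads, tight_ineq_grads):
    """Grow the active set as described above, then return the direction to step in."""
    active, remaining = list(eq_grads), list(tight_ineq_grads)
    while True:
        p = null_space_projection(grad_f, active)
        # a tight inequality whose gradient has a negative dot product with the projection
        blocking = next((g for g in remaining if np.dot(p, g) < 0), None)
        if blocking is None:
            return -p            # step opposite the projected gradient to descend
        active.append(blocking)
        remaining.remove(blocking)
```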

Convex function

A function $f:D \to C$ is convex iff $f(\alpha\bold{x_2}+(1-\alpha)\bold{x_1}) \leq \alpha f(\bold{x_2}) + (1-\alpha)f(\bold{x_1}) \;\;\forall\;\alpha\in [0,1],\ \forall \bold{x_1,x_2}\in D$.

For example, an upward-opening quadratic is a convex function.

This inequality is called Jensen's inequality, and it is the definition of a convex function, not a property of it. A concave function is one where the inequality is switched to $\geq$.

Consider $\bold{x_2=x_1+x}$. Then, for a convex function, we have $f(\bold{x_1 + \alpha x}) \leq \alpha (f(\bold{x_1 + x}) - f(\bold{x_1})) +f(\bold{x_1})$. This can be rewritten as $\frac{f(\bold{x_1+\alpha x})-f(\bold{x_1})}{\alpha} \leq f(\bold{x_1 + x}) - f(\bold{x_1})$. From here, it's easy to see that as $\alpha \to 0$, we get $\nabla f(\bold{x_1}) \bold{x} \leq f(\bold{x_1 + x}) - f(\bold{x_1})$. Since this is a general property, we can also write $\nabla f(\bold{x_1+x}) (-\bold{x}) \leq f(\bold{(x_1 + x)+(-x)}) - f(\bold{x_1+x})$, or simply $\nabla f(\bold{x_1+x})\bold{x} \geq f(\bold{x_1+x}) - f(\bold{x_1})$. What this means is that $\nabla f(\bold{x_1+x})\bold{x} \geq \nabla f(\bold{x_1})\bold{x} \;\;\forall \bold{x_1,x}\in D$, or, rewritten, $(\nabla f(\bold{x_1+x})- \nabla f(\bold{x_1}))\bold{x} \geq 0 \;\;\forall \bold{x_1,x}\in D$. Thus it's also true for $\bold{x}=t\bold{b}$, where $\bold{b}$ is some vector. Thus $\left(\frac{\nabla f(\bold{x_1}+t\bold{b})-\nabla f(\bold{x_1})}{t}\right)\bold{b} \geq 0$. It's easy to see that as $t \to 0$, the thing inside the bracket is just $\bold{b^TH}_f$. Thus you have $\bold{b^TH_fb} \geq 0 \;\forall \;\bold{b}$, and thus the Hessian is positive semi-definite.
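A tiny numpy check of this criterion, on the Hessian of an assumed toy quadratic (for a quadratic the Hessian is constant, which keeps the example short):

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """Positive semi-definiteness check via the eigenvalues of the symmetric matrix H."""
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of f(x, y) = x**2 + x*y + y**2
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(is_psd(H))   # True, so this f is convex
```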

A constrained problem of minimising $f(\bold{x})$ under the constraints $\bold{g(x)=0}$ is called a convex problem if $f(\bold{x})$ and the $g_i(\bold{x})$ are convex functions.

Linear Programming

Consider the function $f(\bold{x})=\bold{c^Tx}$. In order to minimise $f$ under the constraint $\bold{Ax\leq b}$, we consider the Lagrangian $L(\bold{x,a}) = \bold{c^Tx+a^T(Ax-b)} = \bold{x^T(c+A^Ta)-a^Tb}$ and maximise it under the constraints $a_i \geq 0 \geq g_i$. Taking the gradient with respect to $\bold{x}$ and setting it to 0, we get $\bold{c+A^Ta=0}$.

Thus we just have to maximise $L(\bold{x,a})=\bold{-a^Tb}$ as a function of $\bold{a}$ under the restrictions that $a_i \geq 0$ and $\bold{c+A^Ta=0}$.
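A small scipy sketch (the data $\bold{c}$, $\bold{A}$, $\bold{b}$ are arbitrary toy values) solving both the original problem and this derived maximisation; by duality the two optimal values agree:

```python
import numpy as np
from scipy.optimize import linprog

# Primal: minimise c^T x subject to A x <= b (x unrestricted in sign)
c = np.array([-1.0, -2.0])
A = np.array([[-1.0,  0.0],
              [ 0.0, -1.0],
              [ 1.0,  1.0]])
b = np.array([0.0, 0.0, 4.0])
primal = linprog(c, A_ub=A, b_ub=b, bounds=(None, None), method="highs")

# Derived problem: maximise -a^T b, i.e. minimise b^T a, with c + A^T a = 0 and a >= 0
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=(0, None), method="highs")

print(primal.fun)   # optimal c^T x
print(-dual.fun)    # optimal -a^T b; matches primal.fun
```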