Continuous Optimisation
Newton-Raphson method
When you search this on YouTube, you get explanations of how it works for approximating the solutions to an equation, say $f(x) = 0$. This can also be rephrased as minimising $f(x)^2$. This is when the method is applied in its original form. The whole idea behind the method is that for a scalar function $f(\mathbf{x})$ of a vector $\mathbf{x}$, you can approximate $f(\mathbf{x} + \Delta\mathbf{x})$ as a quadratic in $\Delta\mathbf{x}$ by Taylor approximation around $\mathbf{x}$, like this :

$$f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})\,\Delta\mathbf{x} + \tfrac{1}{2}\,\Delta\mathbf{x}^T H(\mathbf{x})\,\Delta\mathbf{x}$$
and then minimise that approximation to get the next iterate, by setting the gradient with respect to $\Delta\mathbf{x}$ to 0, like this :

$$\nabla_{\Delta\mathbf{x}}\left(f(\mathbf{x}) + \nabla f(\mathbf{x})\,\Delta\mathbf{x} + \tfrac{1}{2}\,\Delta\mathbf{x}^T H(\mathbf{x})\,\Delta\mathbf{x}\right) = \nabla f(\mathbf{x}) + \Delta\mathbf{x}^T H(\mathbf{x}) = \mathbf{0}$$
Note : The Hessian is a symmetric matrix and thus $\nabla_{\Delta\mathbf{x}}\left(\tfrac{1}{2}\,\Delta\mathbf{x}^T H\,\Delta\mathbf{x}\right) = \Delta\mathbf{x}^T H$ (don’t try to prove this by symbolic manipulation, but by writing down the matrix as a bunch of column vectors).
Thus, we get $\Delta\mathbf{x} = -H(\mathbf{x})^{-1}\,\nabla f(\mathbf{x})^T$. Putting it all together, we get the final equation :

$$\mathbf{x}_{n+1} = \mathbf{x}_n - H(\mathbf{x}_n)^{-1}\,\nabla f(\mathbf{x}_n)^T$$
Applying this general method to minimising $f(x)^2$ for a scalar $x$ gives us $x_{n+1} = x_n - \frac{2f(x_n)f'(x_n)}{2f'(x_n)^2 + 2f(x_n)f''(x_n)}$. Since $f(x_n) \approx 0$ near a root, we just have

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

This is the equation that you usually see all around.
To see the general method in action, read my friend’s blog :
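Here’s a minimal Python sketch of both forms (the example functions and starting points are my own, just for illustration) :

```python
import numpy as np

def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    """1-D Newton-Raphson for solving f(x) = 0: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

def newton_minimise(grad, hess, x0, tol=1e-10, max_iter=50):
    """General form for minimising f: x_{n+1} = x_n - H(x_n)^{-1} grad f(x_n)^T."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))   # solve H step = grad instead of inverting H
        x -= step
        if np.linalg.norm(step) < tol:
            break
    return x

# Toy example: sqrt(2) as the root of f(x) = x^2 - 2.
print(newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))

# Toy example: minimise f(x, y) = (x - 1)^2 + 10*(y + 2)^2.
grad = lambda v: np.array([2*(v[0] - 1), 20*(v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_minimise(grad, hess, x0=[0.0, 0.0]))   # converges to (1, -2) in one step
```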
Condition number
Suppose you are trying to solve an equation $f(\mathbf{x}) = \mathbf{b}$ by some numerical method, and after a lot of ($n$) iterations, you end up at the approximate value $\mathbf{x}_n$, which gives the value of the function as $\mathbf{b}_n = f(\mathbf{x}_n)$. You know the relative error in the value of $f$, which is $\frac{\lVert\mathbf{b}_n - \mathbf{b}\rVert}{\lVert\mathbf{b}\rVert}$, but you don’t know the relative error in $\mathbf{x}$, namely $\frac{\lVert\mathbf{x}_n - \mathbf{x}\rVert}{\lVert\mathbf{x}\rVert}$. To get an upper bound on this error, we define the condition number $\kappa$ as the smallest number such that

$$\frac{\lVert\mathbf{x}_n - \mathbf{x}\rVert}{\lVert\mathbf{x}\rVert} \le \kappa\,\frac{\lVert\mathbf{b}_n - \mathbf{b}\rVert}{\lVert\mathbf{b}\rVert}$$
For example, for a matrix equation $A\mathbf{x} = \mathbf{b}$, if the error in $\mathbf{b}$ is $\delta\mathbf{b}$, then since

$$\frac{\lVert\delta\mathbf{x}\rVert}{\lVert\mathbf{x}\rVert} \le \frac{\sigma_{\max}}{\sigma_{\min}}\,\frac{\lVert\delta\mathbf{b}\rVert}{\lVert\mathbf{b}\rVert}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum singular values of $A$, we have the condition number as $\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$.
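A quick numerical check with numpy (the random matrix and the perturbation are my own, just to see the bound in action) :

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x_true = rng.normal(size=4)
b = A @ x_true

# Condition number as the ratio of extreme singular values.
sigma = np.linalg.svd(A, compute_uv=False)
kappa = sigma.max() / sigma.min()
print(kappa, np.linalg.cond(A))            # same number

# Perturb b a little and check the bound on the relative error in x.
db = 1e-8 * rng.normal(size=4)
x_pert = np.linalg.solve(A, b + db)
lhs = np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true)
rhs = kappa * np.linalg.norm(db) / np.linalg.norm(b)
print(lhs, rhs)                            # lhs <= rhs, as the bound promises
```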
Gradient Descent
Given a function $f(\mathbf{x})$ to be minimised, step along the negative gradient by doing:

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma\,\nabla f(\mathbf{x}_n)^T$$
The transpose is because we usually consider the gradient to be a row vector.
The step size $\gamma$ depends on the problem; how to choose it is discussed in the Step Size section below.
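A minimal sketch in Python (the test function and the constant step size are my own choices) :

```python
import numpy as np

def gradient_descent(grad, x0, gamma=0.1, max_iter=1000, tol=1e-8):
    """Plain gradient descent: x_{n+1} = x_n - gamma * grad f(x_n)^T."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        x = x - gamma * g
        if np.linalg.norm(g) < tol:
            break
    return x

# Toy example: minimise f(x, y) = (x - 3)^2 + 2*(y + 1)^2.
grad = lambda v: np.array([2*(v[0] - 3), 4*(v[1] + 1)])
print(gradient_descent(grad, x0=[0.0, 0.0]))   # close to (3, -1)
```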
Gradient Descent with Momentum
Rather than viewing the gradient as the velocity (or momentum), if we view it as acceleration (or force), then we get gradient descent with momentum.
It is able to escape local minima, unlike plain gradient descent, and takes bigger steps when it knows it’s going in the right direction, thus converging faster.
We use the change in $\mathbf{x}$ from the last update, denoted $\Delta\mathbf{x}_n$, as well to calculate the next step, like this :

$$\Delta\mathbf{x}_{n+1} = \alpha\,\Delta\mathbf{x}_n - \gamma\,\nabla f(\mathbf{x}_n)^T, \qquad \mathbf{x}_{n+1} = \mathbf{x}_n + \Delta\mathbf{x}_{n+1}$$
The reason for adding the factor $\alpha < 1$ is to include “friction”, because otherwise we can end up in an infinite loop (like a frictionless ball moving on a bowl-like surface, which always reaches the same height when its speed is 0, and thus never stops, since its velocity and acceleration are never both 0).
Here’s a nice article explaining this (and more variations) visually :
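Here’s a rough sketch of the momentum update in Python, assuming the update rule written above (the values of alpha and gamma are arbitrary choices of mine) :

```python
import numpy as np

def gd_momentum(grad, x0, gamma=0.05, alpha=0.9, max_iter=2000, tol=1e-8):
    """Gradient descent with momentum: the gradient acts like a force,
    delta_{n+1} = alpha * delta_n - gamma * grad f(x_n)^T, with alpha < 1 as friction."""
    x = np.asarray(x0, dtype=float)
    delta = np.zeros_like(x)
    for _ in range(max_iter):
        g = grad(x)
        delta = alpha * delta - gamma * g
        x = x + delta
        if np.linalg.norm(g) < tol and np.linalg.norm(delta) < tol:
            break
    return x

# Same toy function as before.
grad = lambda v: np.array([2*(v[0] - 3), 4*(v[1] + 1)])
print(gd_momentum(grad, x0=[0.0, 0.0]))   # close to (3, -1)
```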
Step Size
Sure, you can take steps of constant size, but you can do more. Say you want to minimise $f$ and you have reached a point $\mathbf{x}_n$. Now, you want to descend along the gradient again. The question is, till when ? The answer is: till we are descending, that is, till $f(\mathbf{x}_n - \gamma\,\nabla f(\mathbf{x}_n)^T)$ is decreasing as a function of $\gamma$. Basically, till $\frac{d}{d\gamma}f(\mathbf{x}_n - \gamma\,\nabla f(\mathbf{x}_n)^T) = 0$. The solution to this equation would be the exact value of the optimum step size. But most of the time you don’t have an analytical solution. Using the quadratic Taylor approximation, it can be approximated as $\gamma \approx \frac{\nabla f\,\nabla f^T}{\nabla f\,H\,\nabla f^T}$. But a lot of the time you don’t know the Hessian, and this is more or less useless anyway, because if you are calculating the Hessian, just use Newton’s method, as it will give a better result.
What we really need is a cheap way to decide when $\gamma$ is acceptable and when it needs to be changed. This is where backtracking line search comes in.
You start with $\gamma = \gamma_0$ and reduce $\gamma$ by a fixed scaling factor $\beta < 1$ while $f(\mathbf{x}_n - \gamma\,\nabla f(\mathbf{x}_n)^T) > f(\mathbf{x}_n) - c\,\gamma\,\lVert\nabla f(\mathbf{x}_n)\rVert^2$ is true. You go to the next step when $\gamma$ doesn’t satisfy this inequality. One way to speed this up is to start the iteration with the value of $\gamma$ chosen in the last step. If it is already small enough to not satisfy the inequality, you keep increasing it by the factor $1/\beta$ until you reach the last $\gamma$ that doesn’t satisfy the condition, or you can just start from $\gamma_0$ again.
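A small sketch of backtracking in Python; the shrink factor beta and the sufficient-decrease constant c are placeholder values I picked, not anything from these notes :

```python
import numpy as np

def backtracking_step(f, grad, x, gamma0=1.0, beta=0.5, c=1e-4):
    """Shrink gamma by beta while the step does not decrease f
    by at least c * gamma * ||grad f||^2, then take the step."""
    g = grad(x)
    gamma = gamma0
    while f(x - gamma * g) > f(x) - c * gamma * np.dot(g, g):
        gamma *= beta
    return x - gamma * g, gamma

# One step on the toy function f(x, y) = (x - 3)^2 + 2*(y + 1)^2, starting at the origin.
f = lambda v: (v[0] - 3)**2 + 2*(v[1] + 1)**2
grad = lambda v: np.array([2*(v[0] - 3), 4*(v[1] + 1)])
x_next, gamma = backtracking_step(f, grad, np.array([0.0, 0.0]))
print(x_next, gamma)
```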
Lagrange Multipliers for Equality Constraints
This is a rather nice method where the whole idea is that to minimise (locally) a function $f(\mathbf{x})$ under the constraints $g_i(\mathbf{x}) = 0$ for $i = 1, \dots, m$, which can be written compactly as $\mathbf{g}(\mathbf{x}) = \mathbf{0}$, you consider the Lagrangian $L(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T\mathbf{g}(\mathbf{x})$ and minimise (locally) this new function under no constraints by setting the gradient to 0.
Basically, our new problem becomes $\nabla f(\mathbf{x}) + \boldsymbol{\lambda}^T\nabla\mathbf{g}(\mathbf{x}) = \mathbf{0}$ together with $\mathbf{g}(\mathbf{x}) = \mathbf{0}$, which can be read as “$\nabla f, \nabla g_1, \dots, \nabla g_m$ are linearly dependent row vectors, while still following the old constraints $\mathbf{g}(\mathbf{x}) = \mathbf{0}$”.
Proof :
Consider any point in the locality of the minimum $\mathbf{x}^*$ of $f$, say $\mathbf{x}^* + \epsilon\,\hat{\mathbf{u}}$, where $\hat{\mathbf{u}}$ is a unit vector and $\epsilon$ is a small scalar (both are in fact functions of the point chosen). Now, suppose $\mathbf{g}(\mathbf{x}^* + \epsilon\,\hat{\mathbf{u}}) = \mathbf{0}$; then to first order we have $\nabla g_i(\mathbf{x}^*)\,\hat{\mathbf{u}} = 0$ for every $i$.
Now, since $\mathbf{x}^*$ is the solution to the minimisation problem (minimise $f$ under the constraints), the function $f$ must not change on going in any direction (by an infinitesimal distance) where the constraint is being followed, say $\hat{\mathbf{u}}$ for example. What this means is that $\nabla f(\mathbf{x}^*)\,\hat{\mathbf{u}} = 0$ whenever $\nabla g_i(\mathbf{x}^*)\,\hat{\mathbf{u}} = 0$ for all $i$. Reiterating : we are saying there should be no way to move such that $f$ would change while $\mathbf{g}(\mathbf{x}) = \mathbf{0}$ is still followed, as otherwise we could just move in that direction or its opposite to decrease $f$.
So, what we have just shown is that any unit vector perpendicular to all of the $\nabla g_i(\mathbf{x}^*)$ is also perpendicular to $\nabla f(\mathbf{x}^*)$. But that means $\nabla f(\mathbf{x}^*)$ doesn’t have any component in the null space formed by all of the $\nabla g_i(\mathbf{x}^*)$, and thus it lies in the vector space spanned by the set of $\nabla g_i(\mathbf{x}^*)$. That is to say, $\nabla f, \nabla g_1, \dots, \nabla g_m$ are linearly dependent, which is what we wanted to prove.
This method even works for inequality constraints $g_i(\mathbf{x}) \le 0$, since such a constraint can be converted to an equality by introducing a new variable $s_i$ into the vector and writing $g_i(\mathbf{x}) + s_i^2 = 0$. The resultant problem can be simplified a lot, and that simplified version has a much more intuitive derivation than the one you would get by applying this trick. We’ll discuss it later.
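For a concrete (made-up) example, the stationarity conditions of the Lagrangian form a linear system here, so numpy can solve them directly :

```python
import numpy as np

# Minimise f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0.
# Setting the gradient of L(x, y, lam) = f + lam * g to zero gives:
#   2x     + lam = 0
#   2y     + lam = 0
#   x + y        = 1
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(A, rhs)
print(x, y, lam)   # 0.5, 0.5, -1.0 : the closest point on the line to the origin
```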
Gradient descent under equality constraints.
Notice that we talked about moving under a bunch of constraints in order to decrease the value of $f$. This is a lot like gradient descent, except here we aren’t moving along the gradient directly, but along its projection onto the null space formed by the $\nabla g_i$.
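A small numpy sketch of this projected gradient descent, on a toy problem with a single linear constraint (the function and the constraint are my own example) :

```python
import numpy as np

def project_to_nullspace(vec, A):
    """Project vec onto the null space of the rows of A (the constraint gradients)."""
    P = np.eye(A.shape[1]) - np.linalg.pinv(A) @ A
    return P @ vec

# Minimise f(x) = ||x - t||^2 subject to the linear constraint a^T x = 1,
# by stepping along the projection of the negative gradient.
t = np.array([2.0, 0.0, -1.0])
a = np.array([[1.0, 1.0, 1.0]])     # constraint gradient, as a row
x = np.array([1.0, 0.0, 0.0])       # feasible starting point: a^T x = 1
for _ in range(200):
    g = 2 * (x - t)                 # gradient of f
    x = x - 0.1 * project_to_nullspace(g, a)
print(x, a @ x)                     # stays on the constraint plane throughout
```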
Lagrangian multiplier method for inequality constraints.
Consider the function $f(\mathbf{x})$ to be locally minimised under the constraints $g_i(\mathbf{x}) \le 0$, $i = 1, \dots, m$. For a solution $\mathbf{x}^*$ (assuming $\nabla f(\mathbf{x}^*) \neq \mathbf{0}$), first of all, at least one of the inequalities must become an equality, which we usually call the constraint being tight. This is because if $g_i(\mathbf{x}^*) < 0$ for all $i$, then the constraints keep holding in all directions in the locality of $\mathbf{x}^*$, and thus we can descend along the gradient to reach a point in this locality that follows all constraints and has a smaller value of $f$. So now, WLOG, suppose that $g_1(\mathbf{x}^*) = \dots = g_k(\mathbf{x}^*) = 0$ and $g_{k+1}(\mathbf{x}^*), \dots, g_m(\mathbf{x}^*) < 0$; then moving a little around $\mathbf{x}^*$ doesn’t affect the inequalities $g_{k+1} \le 0, \dots, g_m \le 0$. So we can move following the constraints $g_1 = 0, \dots, g_k = 0$ as if the other constraints didn’t exist. Clearly, for $\mathbf{x}^*$ to be a local minimum, the projection of the gradient $\nabla f(\mathbf{x}^*)$ onto the null space formed by $\nabla g_1, \dots, \nabla g_k$ should be $\mathbf{0}$, and thus it is linearly dependent on these vectors, which means $\nabla f(\mathbf{x}^*) + \sum_{i=1}^{k}\mu_i\,\nabla g_i(\mathbf{x}^*) = \mathbf{0}$. Also, suppose we moved a little along a unit vector $\hat{\mathbf{u}}$ such that $\nabla g_1(\mathbf{x}^*)\,\hat{\mathbf{u}} < 0$ and $\nabla g_i(\mathbf{x}^*)\,\hat{\mathbf{u}} = 0$ for $i = 2, \dots, k$ (yes, such a unit vector always exists for linearly independent vectors (a subset of the $\nabla g_i$ in this case) due to the existence of the reciprocal system of vectors) ; then we have $\nabla f(\mathbf{x}^*)\,\hat{\mathbf{u}} = -\mu_1\,\nabla g_1(\mathbf{x}^*)\,\hat{\mathbf{u}}$. So if $\mu_1 < 0$, then $\nabla f(\mathbf{x}^*)\,\hat{\mathbf{u}} < 0$, and thus we can step along $\hat{\mathbf{u}}$ to reduce $f$ while keeping all the constraints satisfied. This should not happen, and thus we have the additional restriction that $\mu_i \ge 0$.
Thus any critical point of the old problem is also part of a critical point of the function $L(\mathbf{x}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_i \mu_i\,g_i(\mathbf{x})$ constrained by $\mu_i \ge 0$. Here we can use the fact that if $g_i(\mathbf{x}^*) < 0$, then $\mu_i$ has to necessarily be 0 (or else there are points in the locality where $L$ is bigger and points where it is smaller, found by decreasing or increasing $\mu_i$, and thus we are not at the minimum). So, in effect, in our partial-derivative equations, we are still considering the gradients of only those constraint functions which evaluate to 0 at $\mathbf{x}^*$, that is, whose constraint is tight.
Of course, the fact that $g_i(\mathbf{x}^*) < 0$ forces $\mu_i = 0$ also tells us that $\mu_i\,g_i(\mathbf{x}^*) = 0$. Basically, at least one of $\mu_i$ or $g_i(\mathbf{x}^*)$ must be 0. Thus you can divide this problem into cases based on whether $\mu_i = 0$ or $g_i(\mathbf{x}^*) = 0$, and solve each case by setting the gradient to 0. Remember that the case where no constraint is tight is not allowed, and this was established at the very start.
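Here’s the case analysis carried out in Python on a made-up problem with one inequality constraint :

```python
import numpy as np

# Minimise f(x, y) = (x - 2)^2 + (y - 2)^2 subject to g(x, y) = x + y - 2 <= 0.
# Stationarity: grad f + mu * grad g = 0, with the case split mu = 0 or g = 0.

# Case mu = 0: the unconstrained minimum.
x0 = np.array([2.0, 2.0])
print("mu = 0 case feasible?", x0.sum() - 2 <= 0)   # False: g(2, 2) = 2 > 0, reject

# Case g = 0 (constraint tight): solve
#   2(x - 2) + mu = 0
#   2(y - 2) + mu = 0
#   x + y - 2     = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([4.0, 4.0, 2.0])
x, y, mu = np.linalg.solve(A, rhs)
print(x, y, mu, "mu >= 0, so this is the minimiser:", mu >= 0)   # x = y = 1, mu = 2
```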
Gradient descent under general constraints
If you have reached a point $\mathbf{x}_n$ somewhere in the process, then to decide the direction in which you should step, you consider the set $S_0$ which contains the gradients of all the constraint functions for the equality constraints. You add to this set the gradient of any inequality constraint function whose constraint is tight and whose gradient has a negative dot product with the projection of $\nabla f(\mathbf{x}_n)$ onto the null space of $S_0$, to get the updated set $S_1$. Then you add another such tight inequality constraint function’s gradient, which has a negative dot product with the projection of $\nabla f(\mathbf{x}_n)$ onto the null space of $S_1$, and thus get the updated set $S_2$. You keep repeating this process until you can’t find another constraint function that satisfies this rule. The projection of $-\nabla f(\mathbf{x}_n)$ onto the null space of this final set, say $S_k$, gives us the direction to move in.
Notice that here we are quite literally constraining the gradient by not letting its projection, which would give us the optimum direction to move in under only the constraints involved in the set $S_j$, have a negative dot product with the gradient of any other tight constraint function $g_i$. If it were negative, we would break that constraint by descending along this projection, as $g_i$ would then increase on such a step and become positive. It’s basically like we update our optimal direction to move in as more constraints are put on us.
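A rough numpy sketch of this direction-finding rule; the helper name and the toy gradients are mine, just to make the rule concrete :

```python
import numpy as np

def descent_direction(grad_f, eq_grads, tight_ineq_grads):
    """Accumulate the active set as described above, then return
    the projected descent direction at the current point.
    grad_f: gradient of f (1-D array); eq_grads: gradients of equality constraints;
    tight_ineq_grads: gradients of inequality constraints that are tight here."""
    active = list(eq_grads)
    remaining = list(tight_ineq_grads)
    while True:
        if active:
            A = np.vstack(active)
            P = np.eye(len(grad_f)) - np.linalg.pinv(A) @ A
        else:
            P = np.eye(len(grad_f))
        proj = P @ grad_f
        # Add a tight inequality whose gradient has a negative dot product with the projection.
        violators = [g for g in remaining if g @ proj < 0]
        if not violators:
            return -proj                 # step along the negative projected gradient
        active.append(violators[0])
        remaining = [g for g in remaining if g is not violators[0]]

# Toy example in 2-D: no equality constraints, one tight inequality with gradient (0, 1).
print(descent_direction(np.array([1.0, -1.0]), [], [np.array([0.0, 1.0])]))   # (-1, 0)
```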
Convex function
A function $f$ is convex iff $f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y}) \le \lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y})$ for all $\mathbf{x}, \mathbf{y}$ and all $\lambda \in [0, 1]$.
For example, an upward-opening quadratic is a convex function.
This inequality is called Jensen’s inequality, and it is the definition of a convex function, not a property of it. A concave function is one where the inequality is switched to $\ge$.
Consider $\mathbf{x}$ and $\mathbf{x} + \Delta\mathbf{x}$; then for a convex function, we have $f(\mathbf{x} + \lambda\Delta\mathbf{x}) \le (1-\lambda)f(\mathbf{x}) + \lambda f(\mathbf{x} + \Delta\mathbf{x})$. This can be rewritten as $\frac{f(\mathbf{x} + \lambda\Delta\mathbf{x}) - f(\mathbf{x})}{\lambda} \le f(\mathbf{x} + \Delta\mathbf{x}) - f(\mathbf{x})$. From here, it’s easy to see that as $\lambda \to 0$, we get $\nabla f(\mathbf{x})\,\Delta\mathbf{x} \le f(\mathbf{x} + \Delta\mathbf{x}) - f(\mathbf{x})$. Since this is a general property, we can also write $\nabla f(\mathbf{x} + \Delta\mathbf{x})(-\Delta\mathbf{x}) \le f(\mathbf{x}) - f(\mathbf{x} + \Delta\mathbf{x})$, or simply $f(\mathbf{x} + \Delta\mathbf{x}) - f(\mathbf{x}) \le \nabla f(\mathbf{x} + \Delta\mathbf{x})\,\Delta\mathbf{x}$. What this means is that $\nabla f(\mathbf{x})\,\Delta\mathbf{x} \le \nabla f(\mathbf{x} + \Delta\mathbf{x})\,\Delta\mathbf{x}$, or rewritten, $\left(\nabla f(\mathbf{x} + \Delta\mathbf{x}) - \nabla f(\mathbf{x})\right)\Delta\mathbf{x} \ge 0$. Thus it’s also true for $\Delta\mathbf{x} = \epsilon\,\mathbf{v}$ where $\mathbf{v}$ is some vector. Thus $\left(\frac{\nabla f(\mathbf{x} + \epsilon\mathbf{v}) - \nabla f(\mathbf{x})}{\epsilon}\right)\mathbf{v} \ge 0$. It’s easy to see that as $\epsilon \to 0$, the thing inside the bracket is just $\mathbf{v}^T H(\mathbf{x})$. Thus you have $\mathbf{v}^T H(\mathbf{x})\,\mathbf{v} \ge 0$, and thus the Hessian is positive semi-definite.
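A quick numerical sanity check of both facts (the example convex function is my own) :

```python
import numpy as np

# f(x, y) = x^2 + x*y + y^2 is convex; its Hessian should be positive semi-definite.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(np.linalg.eigvalsh(H))        # all eigenvalues >= 0

# Jensen's inequality checked at random points for the same f.
f = lambda v: v[0]**2 + v[0]*v[1] + v[1]**2
rng = np.random.default_rng(0)
x, y, lam = rng.normal(size=2), rng.normal(size=2), rng.uniform()
print(f(lam*x + (1 - lam)*y) <= lam*f(x) + (1 - lam)*f(y))   # True
```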
A constrained problem of minimising $f(\mathbf{x})$ under the constraints $g_i(\mathbf{x}) \le 0$ is called a convex problem if $f$ and all the $g_i$ are convex functions.
Linear Programming
Consider the function $f(\mathbf{x}) = \mathbf{c}^T\mathbf{x}$. In order to minimise it under the constraint $A\mathbf{x} \ge \mathbf{b}$, we consider the Lagrangian $L(\mathbf{x}, \boldsymbol{\lambda}) = \mathbf{c}^T\mathbf{x} - \boldsymbol{\lambda}^T(A\mathbf{x} - \mathbf{b})$ and maximise it under the constraint that $\boldsymbol{\lambda} \ge \mathbf{0}$. Taking the gradient with respect to $\mathbf{x}$ and setting it to 0, we get $A^T\boldsymbol{\lambda} = \mathbf{c}$.
Thus we just have to maximise $\mathbf{b}^T\boldsymbol{\lambda}$ as a function of $\boldsymbol{\lambda}$ under the restriction that $A^T\boldsymbol{\lambda} = \mathbf{c}$ and $\boldsymbol{\lambda} \ge \mathbf{0}$.
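To see the two problems agree numerically, here’s a small check with scipy’s linprog on a made-up problem, using the sign convention above :

```python
import numpy as np
from scipy.optimize import linprog

# Primal: minimise c^T x subject to A x >= b (x unrestricted in sign).
c = np.array([1.0, 1.0])
A = np.array([[1.0, 2.0],
              [3.0, 1.0]])
b = np.array([4.0, 6.0])
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2, method="highs")

# Dual: maximise b^T lambda subject to A^T lambda = c, lambda >= 0
# (linprog minimises, so pass -b as the objective).
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 2, method="highs")

print(primal.fun, -dual.fun)   # both 2.8: the primal and dual optimal values coincide
```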