Vector Calculus

Gradient

For a scalar-valued function $f$ with vector-valued input $\bold{x} \in \mathbb{R}^n$, the derivative is the following row vector of shape $1\times n$:

$$\frac{df}{d\bold{x}} = \nabla_\bold{x}f = \left[\frac{\partial f}{\partial x_1}\;\frac{\partial f}{\partial x_2}\;\dots\;\frac{\partial f}{\partial x_n}\right]$$

This is the vector along which the rate of increase of $f$ (with respect to the norm of the displacement) is highest.
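As a quick sanity check, the gradient can be approximated numerically with central finite differences. Below is a minimal numpy sketch; the function `f` used in the example is an arbitrary illustrative choice, not anything from the text above.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate the 1 x n gradient of a scalar function f at x
    using central finite differences."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad  # the row of partial derivatives df/dx_i

# Example: f(x) = x1^2 + 3*x1*x2, so df/dx = [2*x1 + 3*x2, 3*x1]
f = lambda x: x[0]**2 + 3 * x[0] * x[1]
x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))   # ~ [8., 3.]
```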

Jacobian for vectors

For a vector-valued function $\bold{f}(\bold{x})\in \mathbb{R}^m$ with vector-valued input $\bold{x} \in \mathbb{R}^n$, the derivative is the $m\times n$ matrix called the Jacobian matrix, or simply the Jacobian:

$$\frac{d\bold{f}}{d\bold{x}} = \nabla_\bold{x}\bold{f} = \bold{J} \quad\text{where}\quad J_{ij} = \frac{\partial f_i}{\partial x_j}$$
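The same finite-difference idea gives the Jacobian column by column. A small sketch follows; the example function is again an arbitrary choice for illustration.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian with J[i, j] = d f_i / d x_j."""
    fx = np.asarray(f(x), dtype=float)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[j] = eps
        # column j holds the derivatives of all outputs wrt x_j
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * eps)
    return J

# Example: f(x) = [x1*x2, sin(x1)], so J = [[x2, x1], [cos(x1), 0]]
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))
```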

Jacobian for Matrices

For a matrix-valued function $\bold{f}(\bold{x})\in \mathbb{R}^{p\times q}$ with matrix-valued input $\bold{x} \in \mathbb{R}^{m\times n}$, the derivative is again the Jacobian $\bold{J}$, which is now a tensor of shape $p\times q\times m\times n$ and cannot be written out as a matrix on paper. Instead, $\bold{J}$ is given by its elements:

$$J_{ijkl} = \frac{\partial f_{ij}}{\partial x_{kl}}$$

But this approach is impractical; what is usually done is to flatten the matrices $\bold{f}$ and $\bold{x}$ into vectors of size $pq$ and $mn$, and then calculate the Jacobian matrix with respect to these vectors.
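A rough sketch of that flattening trick, assuming a hypothetical matrix-to-matrix function `F`; the helper and the example are my own illustrative choices.

```python
import numpy as np

def flattened_jacobian(F, X, eps=1e-6):
    """Jacobian of a matrix-valued F at matrix X, computed on the
    flattened (vectorised) versions: result has shape (p*q, m*n)."""
    x = X.ravel().astype(float)

    def f_vec(v):
        # reshape the flat vector back into a matrix, apply F, flatten again
        return np.asarray(F(v.reshape(X.shape))).ravel()

    fx = f_vec(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f_vec(x + e) - f_vec(x - e)) / (2 * eps)
    return J

# Example: F(X) = X @ X for a 2x2 input gives a 4x4 flattened Jacobian.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(flattened_jacobian(lambda X: X @ X, X).shape)  # (4, 4)
```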

Useful identities for calculating Jacobians

Automatic differentiation

Sometimes the derivative of a complex function (a function composed of many small functions) is even more complex than the function itself. Suppose $f$ is such a function composed of $n$ simple functions, as $f = f_n\circ f_{n-1}\circ\dots\circ f_2\circ f_1$, and $f'$ is composed of $m > n$ simple functions. Then, instead of computing the derivative directly, we can keep track of the values of $k$ successive functions applied to the current input as $x_k = f_k\circ f_{k-1}\circ\dots\circ f_2\circ f_1(x)$. Using $\dfrac{dx_k}{dx} = f_k'(x_{k-1})\dfrac{dx_{k-1}}{dx}$, we calculate the derivatives of all the $x_k$ and update them by adding the derivative times the step. Thus, in the last iteration of every (full) update, we arrive at $f'(x) = \dfrac{dx_n}{dx}$. Here we have done only $n$ simple computations rather than the $m$ computations needed when using the explicit form of $f'$. Note that it is important to update each $x_{k-1}$ only after you have used it to calculate $f_k'(x_{k-1})$. Also, in practice, the equation you would use is not $\dfrac{dx_k}{dx} = f_k'(x_{k-1})\dfrac{dx_{k-1}}{dx}$, but just $\Delta x_k = f_k'(x_{k-1})\,\Delta x_{k-1}$.
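A minimal sketch of this forward accumulation for a chain of scalar functions; the particular $f_k$ and their derivatives below are placeholders chosen only to make the example runnable.

```python
import math

# Chain f = f3 ∘ f2 ∘ f1, each given with its hand-written derivative.
fs      = [lambda x: x**2, lambda x: math.sin(x), lambda x: 3 * x + 1]
fprimes = [lambda x: 2*x,  lambda x: math.cos(x), lambda x: 3.0]

def chain_value_and_derivative(x):
    """Forward accumulation: carry x_k and dx_k/dx along together."""
    xk, dxk_dx = x, 1.0
    for fk, fkp in zip(fs, fprimes):
        # use the old x_{k-1} to evaluate f_k'(x_{k-1}) before overwriting it
        dxk_dx = fkp(xk) * dxk_dx
        xk = fk(xk)
    return xk, dxk_dx   # f(x) and f'(x)

print(chain_value_and_derivative(0.5))
# f(x) = 3*sin(x^2) + 1, f'(x) = 6*x*cos(x^2)
```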

This can also be generalised to functions composed of vector-valued simple functions with vector-valued inputs, in which case $f_k'$ is the Jacobian of $f_k$.

Second order derivative wrt a vector

For a scalar-valued function $f(\bold{x})$ with input $\bold{x} \in \mathbb{R}^n$, the first-order derivative is $\frac{df}{d\bold{x}}=\left[\frac{\partial f}{\partial x_i}\right]_i^T$, where the square bracket means "stack one below the other for different $i$", i.e. create a column vector with those entries. Notice that the structure in which the vectors are represented doesn't matter all that much; what matters is what they contain (if you have ever used the numpy library in Python, you know what I'm talking about). So the second derivative is $\frac{d}{d\bold{x}}\left(\left[\frac{\partial f}{\partial x_i}\right]_i\right) = \left[\frac{\partial ^2f}{\partial x_i \partial x_j}\right]_{ij} = \bold{H}$, also known as the Hessian.

You can also think of this as $\vec\nabla^T\vec\nabla f$ where $\vec\nabla = \left[\frac{\partial}{\partial x_1}\;\frac{\partial}{\partial x_2}\;\dots\;\frac{\partial}{\partial x_n}\right]$.
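Numerically, the Hessian can be approximated entry by entry with second-order finite differences of $f$. A small sketch, with an arbitrarily chosen quadratic as the example:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-5):
    """Approximate H[i, j] = d^2 f / (dx_i dx_j) with central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# Example: f(x) = x^T A x has Hessian A + A^T.
A = np.array([[1.0, 2.0], [0.0, 3.0]])
f = lambda x: x @ A @ x
print(numerical_hessian(f, np.array([0.3, -0.7])))  # ~ A + A.T
```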

To go to the third derivative, we need a generalised notion of matrices that doesn't depend on the orientation and other trivialities. Thus we use tensors and the outer product $\otimes$, which behaves like matrix multiplication.

Now, we define $\nabla_{\bold{x}}^k$ to be the $k$-th total derivative with respect to $\bold{x}$, which is given by taking $k$ outer products of the nabla operator. Basically, $\nabla_\bold{x}^k = \nabla\otimes\nabla\otimes\dots\otimes\nabla$ ($k$ times).
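For concreteness, the $k$-fold outer product of an ordinary vector (the same construction applied to $\delta$ in the Taylor series below) can be built with repeated `np.multiply.outer`; a small illustrative sketch with made-up numbers:

```python
import numpy as np
from functools import reduce

def outer_power(v, k):
    """k-fold outer product v ⊗ v ⊗ ... ⊗ v: a tensor with k indices."""
    return reduce(np.multiply.outer, [v] * k)

delta = np.array([1.0, 2.0, 3.0])
T = outer_power(delta, 3)
print(T.shape)       # (3, 3, 3)
print(T[0, 1, 2])    # delta[0] * delta[1] * delta[2] = 6.0
```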

Multivariate Taylor series

For an analytic function $f(\bold{x})$, if $\bold{x}=\bold{x_0}+\bold{\delta}$, then we have the Taylor series:

$$f(\bold{x}) = \sum_{k=0}^\infty\dfrac{\nabla^kf(\bold{x_0})}{k!}\cdot \bold{\delta}^k$$

Here $\bold{\delta}^k=\delta\otimes\delta\otimes\dots\otimes\delta$ is the outer product of $\delta$ taken $k$ times. We take the inner product of $\nabla^kf(\bold{x_0})$ and $\delta^k$ as

$$\nabla^kf(\bold{x_0}) \cdot \delta^k = \sum_{i_1,i_2\dots i_k}(\nabla^kf(\bold{x_0}))[i_1,i_2\dots i_k]\;\delta^k[i_1,i_2\dots i_k] = \sum_{i_1,i_2\dots i_k}(\nabla^kf(\bold{x_0}))[i_1,i_2\dots i_k]\;\delta_{i_1}\,\delta_{i_2}\dots \delta_{i_k} = \sum_{i_1,i_2\dots i_k}\Big(\frac{\partial}{\partial x_{i_1}}\frac{\partial}{\partial x_{i_2}}\dots\frac{\partial}{\partial x_{i_k}}f\Big)(\bold{x_0})\;\delta_{i_1}\,\delta_{i_2}\dots \delta_{i_k}$$
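As an illustration, the series truncated at $k=2$ can be evaluated with `np.einsum` performing the contraction $\nabla^kf\cdot\delta^k$. The concrete $f$, its gradient, and its Hessian below are chosen arbitrarily and written out by hand just to make the sketch self-contained.

```python
import numpy as np

# f(x) = exp(x1) * sin(x2), with hand-written gradient and Hessian.
f    = lambda x: np.exp(x[0]) * np.sin(x[1])
grad = lambda x: np.array([np.exp(x[0]) * np.sin(x[1]),
                           np.exp(x[0]) * np.cos(x[1])])
hess = lambda x: np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                           [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

x0    = np.array([0.2, 0.5])
delta = np.array([0.01, -0.02])

# Taylor series truncated at k = 2; the k = 2 term is the full contraction
# of the Hessian with delta ⊗ delta, written here with einsum.
approx = (f(x0)
          + np.einsum('i,i', grad(x0), delta)
          + 0.5 * np.einsum('ij,i,j', hess(x0), delta, delta))
print(approx, f(x0 + delta))   # the two values should agree closely
```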