For a scalar-valued function $f$ with vector-valued input $x \in \mathbb{R}^n$, the derivative is a row vector of shape $1 \times n$:

$$\frac{df}{dx} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \dots & \frac{\partial f}{\partial x_n} \end{bmatrix}$$
This is the vector along which the rate of increase of $f$ (with respect to the norm of the displacement) is highest.
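As a quick sanity check, here is a minimal numpy sketch that approximates this row vector by central finite differences; the helper name `grad_fd` and the test function are illustrative, not from the text.

```python
import numpy as np

def grad_fd(f, x, eps=1e-6):
    """Finite-difference approximation of df/dx as a (1, n) row vector.
    (Hypothetical helper, for illustration only.)"""
    n = x.size
    g = np.zeros((1, n))
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        g[0, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

# Example: f(x) = x . x, whose derivative is the row vector 2 x^T.
f = lambda x: x @ x
x0 = np.array([1.0, 2.0, 3.0])
print(grad_fd(f, x0))   # ~ [[2., 4., 6.]]
```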
Jacobian for vectors
For a vector-valued function $f(x) \in \mathbb{R}^m$ with vector-valued input $x \in \mathbb{R}^n$, the derivative is an $m \times n$ matrix called the Jacobian matrix, or simply the Jacobian:

$$J = \frac{df}{dx}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}$$
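The same finite-difference idea gives a quick numerical Jacobian; `jacobian_fd` and the example function below are assumptions for illustration, not part of the text.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m, shape (m, n).
    Column j approximates df/dx_j. (Hypothetical helper.)"""
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# f(x) = (x0*x1, sin(x2)) has Jacobian [[x1, x0, 0], [0, 0, cos(x2)]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[2])])
print(jacobian_fd(f, np.array([1.0, 2.0, 0.0])))
```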
Jacobian for matrices
For a matrix-valued function $f(x) \in \mathbb{R}^{p \times q}$ with matrix-valued input $x \in \mathbb{R}^{m \times n}$, the derivative is again the Jacobian $J$, which is now a tensor of shape $p \times q \times m \times n$ and can't be written out on paper. $J$ is instead given by its elements:

$$J_{ijkl} = \frac{\partial f_{ij}}{\partial x_{kl}}$$
But this approach is impractical, so what is usually done is to flatten the matrices $f, x$ into vectors of size $pq$ and $mn$, and then compute the ordinary Jacobian matrix with respect to these vectors.
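A minimal sketch of this flattening trick, again using finite differences; the helper name `flat_jacobian_fd` and the linear test function are illustrative assumptions.

```python
import numpy as np

def flat_jacobian_fd(F, X, eps=1e-6):
    """Finite-difference Jacobian of a matrix-valued F at matrix X,
    returned as a (p*q, m*n) matrix over the flattened output/input.
    (Hypothetical helper, for illustration only.)"""
    x = X.ravel()
    out_size = F(X).size
    J = np.zeros((out_size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        Fp = F((x + dx).reshape(X.shape)).ravel()
        Fm = F((x - dx).reshape(X.shape)).ravel()
        J[:, j] = (Fp - Fm) / (2 * eps)
    return J

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # shape (3, 2)
F = lambda X: A @ X                                   # F: R^{2x2} -> R^{3x2}
print(flat_jacobian_fd(F, np.ones((2, 2))).shape)     # (6, 4) = (pq, mn)
```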
Useful identities for calculating Jacobians
Automatic differentiation
Sometimes the derivative of a complex function (a function composed of many small functions) is even more complex than the function itself. Suppose $f$ is such a function composed of $n$ simple functions, $f = f_n \circ f_{n-1} \circ \dots \circ f_2 \circ f_1$, and $f'$ is composed of $m > n$ simple functions. Then instead of computing the derivative directly, we can keep track of the values of the $k$ successive functions applied to the current input, $x_k = f_k \circ f_{k-1} \circ \dots \circ f_2 \circ f_1(x)$. Then, using $\frac{dx_k}{dx} = f_k'(x_{k-1})\,\frac{dx_{k-1}}{dx}$, we calculate the derivatives of all $x_k$ and update them by adding the derivative times the step. Thus, in the last iteration of every (full) update, we arrive at $f'(x) = \frac{dx_n}{dx}$. Here we've done only $n$ simple computations rather than the $m$ computations needed when using the explicit form of $f'$. Note that it's important to update each $x_{k-1}$ only after you have used it to calculate $f_k'(x_{k-1})$. Also, in practice the equation you would use is not $\frac{dx_k}{dx} = f_k'(x_{k-1})\,\frac{dx_{k-1}}{dx}$, but just $\Delta x_k = f_k'(x_{k-1})\,\Delta x_{k-1}$.
This can also be generalised to functions composed of vector valued simple functions with vector valued inputs.
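Here is a minimal sketch of the scalar version of this forward accumulation, assuming each simple function is supplied together with its derivative; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def forward_diff(funcs_and_derivs, x):
    """funcs_and_derivs: list of (f_k, f_k') pairs applied left to right.
    Returns (f(x), f'(x)) using dx_k/dx = f_k'(x_{k-1}) * dx_{k-1}/dx."""
    value, deriv = x, 1.0            # x_0 = x, dx_0/dx = 1
    for fk, dfk in funcs_and_derivs:
        deriv = dfk(value) * deriv   # use x_{k-1} before overwriting it
        value = fk(value)            # then update to x_k
    return value, deriv

# f(x) = sin(exp(x)) built from two simple functions.
chain = [(np.exp, np.exp), (np.sin, np.cos)]
print(forward_diff(chain, 0.5))      # derivative should equal cos(e^0.5) * e^0.5
```

Note how the derivative is updated before the value, matching the remark above about using $x_{k-1}$ before overwriting it.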
Second order derivative wrt a vector
For a scalar-valued function $f(x)$ with input $x \in \mathbb{R}^n$, the first-order derivative is $\frac{df}{dx} = \left[\frac{\partial f}{\partial x_i}\right]_i^T$, where the square bracket means "stack one below the other for different $i$", i.e. create a column vector with such and such entries. Notice that the structure in which the vectors are represented doesn't matter all that much; what they contain does. (If you have ever used the numpy library in Python, you know what I'm talking about.) So now, the second derivative is

$$\frac{d}{dx}\left(\left[\frac{\partial f}{\partial x_i}\right]_i\right) = \left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right]_{ij} = H,$$

also known as the Hessian.
You can also think of this as $\nabla^T \nabla f$ where $\nabla = \left[\frac{\partial}{\partial x_1}\ \frac{\partial}{\partial x_2}\ \dots\ \frac{\partial}{\partial x_n}\right]$.
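A small numerical check of the Hessian, using central finite differences; `hessian_fd` and the quadratic test function are illustrative assumptions.

```python
import numpy as np

def hessian_fd(f, x, eps=1e-5):
    """Finite-difference Hessian with H[i, j] ~ d^2 f / (dx_i dx_j).
    (Hypothetical helper, for illustration only.)"""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# f(x) = x^T A x has Hessian A + A^T.
A = np.array([[1.0, 2.0], [0.0, 3.0]])
f = lambda x: x @ A @ x
print(hessian_fd(f, np.array([0.3, -0.7])))   # ~ [[2., 2.], [2., 6.]]
```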
To go to the third derivative, we need a generalised notion of matrices that doesn't depend on orientation and other trivialities. Thus we use tensors and the outer product, which behaves like matrix multiplication:

$$(a \otimes b)_{ij} = a_i b_j$$

More generally, the entries of $a \otimes b$ are all products of an entry of $a$ with an entry of $b$, so the outer product of two tensors stacks up into a tensor of higher order.
Now, we define $\nabla_x^k$ to be the $k$-th total derivative with respect to $x$, which is given by taking $k$ outer products of the nabla operator. Basically, $\nabla_x^k = \nabla \otimes \nabla \otimes \dots \otimes \nabla$ ($k$ times).
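In numpy this outer product can be sketched with `np.multiply.outer`, which stacks products of entries into a higher-order tensor; the example values below are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0])

# For vectors the outer product is the "column times row" matrix product:
M = np.multiply.outer(a, b)                       # shape (3, 2), M[i, j] = a[i] * b[j]
print(np.allclose(M, a[:, None] @ b[None, :]))    # True

# Taking another outer product raises the order of the tensor:
T = np.multiply.outer(M, a)                       # shape (3, 2, 3)
print(M.shape, T.shape)
```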
Multivariate Taylor series
For an analytic function $f(x)$, if $x = x_0 + \delta$, then we have the Taylor series:

$$f(x_0 + \delta) = \sum_{k=0}^{\infty} \frac{1}{k!} \left\langle \nabla_x^k f(x_0),\, \delta^k \right\rangle$$
Here $\delta^k = \delta \otimes \delta \otimes \dots \otimes \delta$ is the outer product of $\delta$ taken $k$ times, and we take the inner product of $\nabla_x^k f(x_0)$ and $\delta^k$ elementwise:

$$\left\langle \nabla_x^k f(x_0),\, \delta^k \right\rangle = \sum_{i_1, \dots, i_k} \left[\nabla_x^k f(x_0)\right]_{i_1 \dots i_k} \delta_{i_1} \cdots \delta_{i_k}$$
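A minimal sketch of the second-order version of this expansion for one concrete function, assuming its exact gradient and Hessian are written out by hand; all names and values here are illustrative.

```python
import numpy as np

# Test function f(x) = exp(x0) * sin(x1) with its exact gradient and Hessian.
f    = lambda x: np.exp(x[0]) * np.sin(x[1])
grad = lambda x: np.array([np.exp(x[0]) * np.sin(x[1]),
                           np.exp(x[0]) * np.cos(x[1])])
hess = lambda x: np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                           [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

x0    = np.array([0.2, 0.4])
delta = np.array([0.01, -0.02])

delta2 = np.multiply.outer(delta, delta)          # delta ⊗ delta
taylor2 = (f(x0)
           + np.sum(grad(x0) * delta)             # <grad f(x0), delta> / 1!
           + 0.5 * np.sum(hess(x0) * delta2))     # <hess f(x0), delta^2> / 2!

print(taylor2, f(x0 + delta))                     # the two should nearly agree
```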