Probability

Data as vectors

The usual way you would receive data is as a matrix with rows being observations and columns being features. Every observation can be thought of as a vector, and these vectors are the empirical values of the “random variable”, which is a general vector listing those features. For example, take this matrix $\bold{X}$:

$$\bold{X} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 4 & 5 \\ 5 & 6 \end{bmatrix}$$

The random variable $\bold{x} \in \mathbb{R}^2$ is a list of features (scalar random variables), and its observed values are $[1\;2]^T, [3\;4]^T, [4\;5]^T$ and $[5\;6]^T$.
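As a quick illustrative sketch (using numpy; the variable names are my own), each row of the data matrix is one observed value of the random vector $\bold{x}$:

```python
import numpy as np

# Data matrix: rows are observations, columns are features.
X = np.array([[1, 2],
              [3, 4],
              [4, 5],
              [5, 6]])

# Each row is one observed value of the random vector x
# (stored as a row here, but conceptually a column vector).
for i, x_i in enumerate(X):
    print(f"observation {i}: x = {x_i}")
```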

Note: any random vector $\bold{x}$ is a column vector, unless stated otherwise.

The probability of $\bold{x}$ taking a certain value $\bold{x_1}$ (anywhere in its full domain, not only among the observed values) is denoted $p_\bold{x}(\bold{x_1})$, or, for our convenience, simply $p(\bold{x})$.

In the case that the domain of $\bold{x}$ contains infinitely many elements and the probability of any single value is infinitesimal, $p(\bold{x})$ refers to the probability density function, but I’ll still refer to it as the “probability”, as if we were always talking about a discrete distribution.

For an event described by multiple random variables, the probability is written as $p(\bold{x,y,z,\dots})$. In such a setting, if fewer random variables appear inside the brackets than are required to describe the event, then we mean the probability of any of the events in which the specified random variables take the values in the brackets, regardless of what values the other random variables take. For example, if an event is described by three random variables $\bold{x,y,z}$, then $p(\bold{x,y})=\sum_\bold{z} p(\bold{x,y,z})$.
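A minimal numerical sketch of this marginalisation, assuming a small made-up joint table over three binary variables:

```python
import numpy as np

# Hypothetical joint distribution p(x, y, z) over three binary variables,
# stored as a 2x2x2 array indexed as p_xyz[x, y, z].
p_xyz = np.array([[[0.10, 0.05],
                   [0.15, 0.10]],
                  [[0.20, 0.05],
                   [0.25, 0.10]]])
assert np.isclose(p_xyz.sum(), 1.0)

# Marginalising out z: p(x, y) = sum over z of p(x, y, z).
p_xy = p_xyz.sum(axis=2)
print(p_xy)
```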

Since we can always concatenate the multiple random variables describing an event into a single big vector, we only ever need to talk about two random variables at most.

We use ‘ | ’ to mean “given”; it denotes conditional probability, defined (yes, I’m using this equation to define conditional probability, rather than the usual set-theory-based definition) as

$$p(\bold{x\;|\;y})=\dfrac{p(\bold{x,y})}{p(\bold{y})}$$

This is the probability of an event having a certain $\bold{x}$, given that its $\bold{y}$ is as specified.
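Continuing with the same kind of made-up joint table, conditioning is just dividing the joint by the marginal of the conditioning variable:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

# p(y) by marginalising out x, then p(x | y) = p(x, y) / p(y).
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y           # broadcasts over the x axis
print(p_x_given_y)                 # each column is a distribution over x
print(p_x_given_y.sum(axis=0))     # [1. 1.]
```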

From here it’s easy to derive Bayes’ rule.

Bayes’ rule

Bayes’ rule states that $p(\bold{x\;|\;y}) = \dfrac{p(\bold{y\;|\;x})\,p(\bold{x})}{p(\bold{y})}$, which follows from writing the joint $p(\bold{x,y})$ in terms of either conditional. One good example of where it’s used is the Gaussian Naive Bayes classifier.

A simpler example would be the $\beta$-distribution, which is the probability distribution of the probability $\theta$ of a person winning a match, given that we know they have won $\alpha$ matches and lost $\beta$, assuming that this probability of winning stays the same in every match. Here we are trying to find $p(\theta\;|\;\alpha,\beta)$. We already know that $p(\alpha,\beta \;|\;\theta) =\; ^{\alpha+\beta}C_\alpha\, \theta^\alpha \,(1-\theta)^\beta$, and we assume that $p(\theta)=1 \;\;\forall\; 0\leq\theta\leq1$ and $0$ otherwise. We can then find $p(\alpha,\beta)$ by integrating $p(\alpha,\beta\;|\;\theta)\,p(\theta)$ over $\theta$, which gives us $^{\alpha+\beta}C_\alpha\,(^{\alpha+\beta}C_\beta\,(\alpha+\beta+1))^{-1} = (\alpha + \beta + 1)^{-1}$. This gives us:

$$p(\theta\;|\;\alpha,\beta) = \dfrac{p(\alpha,\beta \;|\;\theta)\;p(\theta)}{p(\alpha,\beta)} = (\alpha + \beta + 1)\;{}^{\alpha + \beta}C_{\alpha}\;\theta^\alpha\,(1-\theta)^\beta$$

This is called the beta distribution.
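As a quick numerical sanity check (a sketch with made-up $\alpha,\beta$), the posterior above coincides with scipy’s Beta$(\alpha+1,\,\beta+1)$ density:

```python
import numpy as np
from math import comb
from scipy.stats import beta as beta_dist

alpha, beta = 7, 3               # hypothetical wins and losses
theta = np.linspace(0.01, 0.99, 5)

# Posterior derived above: (alpha+beta+1) * C(alpha+beta, alpha) * theta^alpha * (1-theta)^beta
posterior = (alpha + beta + 1) * comb(alpha + beta, alpha) \
            * theta**alpha * (1 - theta)**beta

# The same density, via scipy's Beta(alpha+1, beta+1).
print(np.allclose(posterior, beta_dist.pdf(theta, alpha + 1, beta + 1)))  # True
```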

Using Bayes’ rule, if you have some prior pdf for $\bold{x}$, then given more data in the form of vectors $\bold{y}_i$ (used as the row vectors of a matrix $\bold{Y}$), you can update $\bold{x}$. We want to find the posterior pdf for $\bold{x}$, that is $p(\bold{x\;|\;Y})$, which is calculated as $\frac{p(\bold{Y\;|\;x})\,p(\bold{x})}{p(\bold{Y})}$. Since the observations are all independent of each other given $\bold{x}$, we can break the likelihood up as $p(\bold{Y\;|\;x}) = \prod_i p(\bold{y_i\;|\;x})$, so the posterior is proportional to $p(\bold{x})\prod_i p(\bold{y_i\;|\;x})$. This means we first need the single-observation relationship $p(\bold{x\;|\;y})=\frac{p(\bold{y\;|\;x})\,p(\bold{x})}{p(\bold{y})}$. If $\bold{y}$ were in the same domain as $\bold{x}$, then $p(\bold{y\;|\;x})$ would just be the delta function centred around $\bold{x}$, and we would get the same for $p(\bold{x\;|\;y})$, so it makes no sense to take them to be in the same set of features. We can, however, certainly take them to be in a set of dependent features.
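A minimal sketch of this kind of update, assuming a toy 1-D setting (hypothetical numbers, a grid-discretised prior, and Gaussian likelihoods for the observations):

```python
import numpy as np

# Unknown scalar quantity x, discretised on a grid, with a N(0, 1) prior.
grid = np.linspace(-5, 5, 1001)
prior = np.exp(-0.5 * grid**2)          # unnormalised prior
prior /= prior.sum()

# Hypothetical observations y_i, assumed drawn from N(x, sigma^2).
sigma = 2.0
ys = np.array([1.3, 0.7, 2.1, 1.5])

# Posterior is proportional to prior times the product of likelihoods p(y_i | x).
posterior = prior.copy()
for y in ys:
    posterior *= np.exp(-0.5 * ((y - grid) / sigma) ** 2)
posterior /= posterior.sum()

print("posterior mean:", (grid * posterior).sum())
```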

Statistically independent variables

Two random variables $\bold{x,y}$ are statistically independent iff $p(\bold{x,y}) = p(\bold{x})\,p(\bold{y})$.

Expected values

For a function $f(\bold{x})$ of the random variable $\bold{x}$, the expected value is given by:

$$\text{E}[f(\bold{x})] = \sum_{\bold{x}} f(\bold{x})\,p(\bold{x})$$

It’s easy to see that $\text{E}(\bold{x})$ is the mean of $\bold{x}$.

Note: the definition as written is only valid for discrete distributions; for a continuous distribution the sum becomes an integral over the density.
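A tiny sketch of the discrete case, with a made-up distribution:

```python
import numpy as np

# A made-up discrete random variable: its values and their probabilities.
values = np.array([0.0, 1.0, 2.0, 3.0])
probs  = np.array([0.1, 0.4, 0.3, 0.2])

# E[f(x)] = sum over x of f(x) p(x); here f(x) = x**2 as an example.
mean     = (values * probs).sum()
e_f_of_x = (values**2 * probs).sum()
print(mean, e_f_of_x)
```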

Covariance and variance

Suppose you want to know how similar (linearly dependent) two features $x,y$ are, given some data in the form of a matrix $\bold{X}$ in which both features are present in the observations. You can just take the dot product of the column vectors of these two features after subtracting the mean from each entry (this is important to do: if these quantities don’t vary a lot but have high values all the time, the covariance of such quantities should intuitively still be small). Then divide by the length of the column vectors, i.e. the number of observations, since you don’t want this value to depend on the size of the data-set. This quantity is called the covariance. Basically,

$$\text{Cov}(x,y) = \frac{1}{n}\sum_{i=1}^n (x[i]-\text{E}(x))(y[i]-\text{E}(y)) = \text{E}\big((x-\text{E}(x))(y-\text{E}(y))\big)$$

Here $[i]$ means “in the $i^\text{th}$ observation”, where there are $n$ observations.

And the covariance of a feature with itself is its variance.

$$\text{Var}(x) = \text{Cov}(x,x) = \frac{1}{n}\sum_{i=1}^n (x[i]-\text{E}(x))^2 = \text{E}\big((x-\text{E}(x))^2\big) = \text{E}(x^2)-(\text{E}(x))^2$$

The last equality is obtained by expanding the square inside the summation and using the fact that $\text{E}(x)$ is a constant.
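A quick numerical sketch with made-up data (note that np.cov defaults to the $\frac{1}{n-1}$ convention, so bias=True is passed to match the $\frac{1}{n}$ definition used here):

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 6.0])
n = len(x)

# Covariance and variance with the 1/n convention used above.
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / n
var_x  = ((x - x.mean())**2).sum() / n

print(np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1]))  # True
print(np.isclose(var_x,  np.var(x)))                      # True (np.var also uses 1/n by default)
```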

The covariance can be used as a definition of an inner product between two random variables. The induced norm is then the square root of the variance, and the angle $\theta$ between two random variables $x,y$ is given by:

$$\cos(\theta) = \dfrac{\text{Cov}(x,y)}{\sqrt{\text{Var}(x)}\sqrt{\text{Var}(y)}} = \text{corr}(x,y)$$

This value is called the “Correlation”.
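Continuing the sketch above, the same value via numpy’s correlation helper:

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 6.0])

cov_xy = ((x - x.mean()) * (y - y.mean())).mean()
corr   = cov_xy / (x.std() * y.std())             # std also uses the 1/n convention by default

print(np.isclose(corr, np.corrcoef(x, y)[0, 1]))  # True
```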

When the random variable $\bold{x}$ is a vector, we can express the covariances of all pairs of features $x_i,x_j$ in a covariance matrix, which is basically the “Variance” of $\bold{x}$.

$$\text{Var}(\bold{x}) = \frac{1}{n}\bold{X^TX} = \frac{1}{n}\sum_{i=1}^n \bold{x}[i](\bold{x}[i])^T = \frac{1}{n}\sum_{\bold{x}} \bold{x}\bold{x}^T = \begin{bmatrix} \text{Cov}(x_1,x_1) &\text{Cov}(x_1,x_2) &\dots &\text{Cov}(x_1,x_n) \\ \text{Cov}(x_2,x_1) &\text{Cov}(x_2,x_2) &\dots &\text{Cov}(x_2,x_n) \\ \vdots &\vdots &\ddots &\vdots \\ \text{Cov}(x_n,x_1) &\text{Cov}(x_n,x_2) &\dots &\text{Cov}(x_n,x_n) \end{bmatrix}$$

Again, $\bold{X}$ is the matrix representing the data-set, with rows being $\bold{x}[i]^T$; here each feature is assumed to have had its mean subtracted already, otherwise the expression above gives $\text{E}(\bold{xx^T})$ rather than the variance.
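A sketch of the same computation on a small made-up data matrix, centring the columns first as per the caveat above:

```python
import numpy as np

# Data matrix: rows are observations, columns are features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [4.0, 5.0],
              [5.0, 6.0]])
n = X.shape[0]

Xc = X - X.mean(axis=0)            # subtract each feature's mean
cov_matrix = Xc.T @ Xc / n         # (1/n) X^T X on the centred data

print(np.allclose(cov_matrix, np.cov(X, rowvar=False, bias=True)))  # True
```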

Now, to take the covariance of two vector-valued random variables $\bold{x,y}$, we do:

$$\begin{aligned}\text{Cov}(\bold{x,y}) &= \frac{1}{n}\sum_\bold{x,y} (\bold{x}-\text{E}(\bold{x}))(\bold{y}-\text{E}(\bold{y}))^T = \text{E}\big((\bold{x}-\text{E}(\bold{x}))(\bold{y}-\text{E}(\bold{y}))^T\big) \\ &= \text{E}\big((\bold{x}-\text{E}(\bold{x}))(\bold{y}^T-\text{E}(\bold{y})^T)\big) \\ &= \text{E}(\bold{xy^T}) - \text{E}(\bold{x}\,\text{E}(\bold{y})^T) - \text{E}(\text{E}(\bold{x})\,\bold{y}^T) + \text{E}(\text{E}(\bold{x})\,\text{E}(\bold{y})^T) \\ &= \text{E}(\bold{xy^T}) - \text{E}(\bold{x})\,\text{E}(\bold{y})^T\end{aligned}$$

Suppose two such random variables are linearly related as $\bold{y = Ax + b}$; then the covariance is:

$$\begin{aligned}\text{Cov}(\bold{x,y}) &= \text{E}(\bold{x(x^TA^T+b^T)}) - \text{E}(\bold{x})\,\text{E}(\bold{Ax+b})^T \\ &= \text{E}(\bold{xx^T})\bold{A^T} + \text{E}(\bold{x})\bold{b^T} - \text{E}(\bold{x})\text{E}(\bold{x})^T\bold{A^T} - \text{E}(\bold{x})\bold{b^T} \\ &= \big(\text{E}(\bold{xx^T})-\text{E}(\bold{x})\text{E}(\bold{x})^T\big)\bold{A^T} = \text{Var}(\bold{x})\bold{A^T} = \bold{\Sigma A^T}\end{aligned}$$

Here $\text{Var}(\bold{x}) = \bold{\Sigma}$ is the covariance matrix of the features that $\bold{x}$ is composed of.

And thus,

$$\text{Var}(\bold{y}) = \text{Cov}(\bold{Ax+b,y}) = \bold{A}\,\text{Cov}(\bold{x,y}) = \bold{A\Sigma A^T}$$

The second equality is easy to prove because $\text{Cov}(\bold{x,y})$ is linear in $\bold{x}$ for a fixed $\bold{y}$, and adding the constant $\bold{b}$ changes nothing.
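A quick numerical check of this identity, with made-up $\bold{A}$, $\bold{b}$ and a non-Gaussian $\bold{x}$ (the empirical covariance obeys the same identity, exactly, for any data):

```python
import numpy as np

rng = np.random.default_rng(0)

# x: a 3-dimensional random vector with some arbitrary (non-Gaussian) distribution.
X = rng.exponential(scale=2.0, size=(10_000, 3))
A = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]])
b = np.array([1.0, -2.0])

Y = X @ A.T + b                                   # y = A x + b, applied row-wise

Sigma_x = np.cov(X, rowvar=False, bias=True)
Sigma_y = np.cov(Y, rowvar=False, bias=True)

print(np.allclose(Sigma_y, A @ Sigma_x @ A.T))    # True
```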

Sum of variables

For any two random variables $\bold{x,y}$, the variable $\bold{z=x+y}$ is a random variable with $\text{E}(\bold{z}) = \text{E}(\bold{x}) + \text{E}(\bold{y})$ and

$$\text{Var}(\bold{z}) = \text{Var}(\bold{x}) + \text{Var}(\bold{y}) + \text{Cov}(\bold{x,y}) + \text{Cov}(\bold{y,x})$$

Statistically independent variables are linearly independent

For statistically independent variables $\bold{x_i,x_j}$,

$$\begin{aligned}\text{Cov}(\bold{x_i,x_j}) &= \text{E}\big((\bold{x_i}-\text{E}(\bold{x_i}))(\bold{x_j}-\text{E}(\bold{x_j}))^T\big) = \sum_\bold{x_i,x_j}(\bold{x_i}-\text{E}(\bold{x_i}))(\bold{x_j}-\text{E}(\bold{x_j}))^T\,p(\bold{x_i,x_j}) \\ &= \sum_\bold{x_i,x_j}(\bold{x_i}-\text{E}(\bold{x_i}))(\bold{x_j}-\text{E}(\bold{x_j}))^T\,p(\bold{x_i})\,p(\bold{x_j}) \\ &= \sum_\bold{x_i}(\bold{x_i}-\text{E}(\bold{x_i}))\,p(\bold{x_i})\sum_\bold{x_j}(\bold{x_j}-\text{E}(\bold{x_j}))^T\,p(\bold{x_j}) \\ &= \text{E}(\bold{x_i}-\text{E}(\bold{x_i}))\,\text{E}(\bold{x_j}-\text{E}(\bold{x_j}))^T = \bold{0}\,\bold{0}^T = \bold{0}\end{aligned}$$

I.I.D. random variables

Random variables $\bold{x_1,x_2,\dots,x_n}$ are independent and identically distributed (i.i.d.) if they are mutually statistically independent, i.e. $p(\bold{x_1,x_2,\dots,x_n}) = \prod_i p(\bold{x_i})$, and each $\bold{x_i}$ follows the same distribution.

It’s easy to see that in a case like this, the variables are also linearly independent.

This is useful when you take $n$ observations, each observation coming from the same distribution, and want to predict beforehand the properties of these observations. For example, the expected value of the computed mean of the observations $\bold{x_i}$ is the same as the expected value for a single observation, say $\boldsymbol{\mu}$, and the variance of the computed mean is

$$\begin{aligned}\text{E}\Big(\big(\tfrac{1}{n}\textstyle\sum_i\bold{x_i} - \boldsymbol{\mu}\big)\big(\tfrac{1}{n}\textstyle\sum_i\bold{x_i} - \boldsymbol{\mu}\big)^T\Big) &= \frac{1}{n^2}\text{E}\Big(\big(\textstyle\sum_i(\bold{x_i}-\boldsymbol{\mu})\big)\big(\textstyle\sum_i(\bold{x_i}-\boldsymbol{\mu})\big)^T\Big) = \frac{1}{n^2}\sum_i\sum_j \text{E}\big((\bold{x_i}-\boldsymbol{\mu})(\bold{x_j}-\boldsymbol{\mu})^T\big) \\ &= \frac{1}{n^2} \sum_{i,j} \text{Cov}(\bold{x_i,x_j}) = \frac{1}{n^2} \sum_{i} \text{Cov}(\bold{x_i,x_i}) = \frac{1}{n^2} \sum_{i} \text{Var}(\bold{x_i}) = \frac{1}{n^2}(n\bold{\Sigma}) = \frac{1}{n}\boldsymbol\Sigma\end{aligned}$$

The step where the double sum collapses to a single sum holds because, as shown above, the covariance of independent variables is $\bold{0}$.
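A Monte Carlo sketch of the $\frac{1}{n}\boldsymbol\Sigma$ result, using a made-up 2-D Gaussian and many repeated experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

mu    = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
n, n_experiments = 50, 20_000

# Each experiment: n i.i.d. observations, then their sample mean.
samples = rng.multivariate_normal(mu, Sigma, size=(n_experiments, n))
means   = samples.mean(axis=1)                   # shape (n_experiments, 2)

print(np.cov(means, rowvar=False, bias=True))    # approximately Sigma / n
print(Sigma / n)
```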

Gaussian distribution

This distribution is important for what follows. Although it may lack physical meaning, it has immense mathematical significance.

For a random variable $x \in \mathbb{R}$, if it follows the Gaussian distribution

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)$$

then it has mean $\mu$ and variance $\sigma^2$, where $\sigma = \sqrt{\text{Var}(x)}$ is the standard deviation.

To make life simpler, every time we want to say that a variable $x$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, instead of saying it in words we write $x \sim N(\mu,\sigma^2)$.

For a vector-valued random variable $\bold{x}\in \mathbb{R}^n$, made of statistically independent scalar Gaussian random variables $x_1,x_2,\dots,x_n$ with $x_i \sim N(\mu_i,\sigma^2_i)$, we have:

$$p(\bold{x}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\Big(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\Big) = \frac{1}{\sqrt{(2\pi)^n\prod_i\sigma_i^2}}\exp\Big(-\sum_i\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\Big)$$

Then, with $\text{E}(\bold{x}) = \boldsymbol{\mu}$ and $\text{Var}(\bold{x}) = \bold{\Sigma}$ (which is diagonal here), we have $\prod_i \sigma_i^2 = |\bold{\Sigma}|$ and $\sum_i\dfrac{(x_i-{\mu}_i)^2}{\sigma^2_i} = \bold{(x-\boldsymbol{\mu})^T\;\Sigma^{-1}\;(x-\boldsymbol{\mu})}$, giving us:

$$p(\bold{x}) = \frac{1}{\sqrt{(2\pi)^n|\bold{\Sigma}|}}\exp\Big(-\frac{1}{2}\bold{(x-\boldsymbol{\mu})^T\;\Sigma^{-1}\;(x-\boldsymbol{\mu})}\Big)$$

It turns out that this equation is valid even when the $x_i$ are NOT independent, as long as $\bold{x}$ as a whole is Gaussian.
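A small sketch evaluating this density directly and comparing it against scipy’s implementation (made-up $\boldsymbol\mu$ and $\bold\Sigma$, including an off-diagonal covariance):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu    = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])
x     = np.array([0.5, 0.0])
n     = len(mu)

d    = x - mu
quad = d @ np.linalg.inv(Sigma) @ d
pdf  = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi)**n * np.linalg.det(Sigma))

print(np.isclose(pdf, multivariate_normal(mean=mu, cov=Sigma).pdf(x)))  # True
```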

Marginals and conditionals of Gaussians

Suppose $\bold{x,y}$ are two random variables which together (concatenated) form a random variable $\bold{z} = [\bold{x^T\;y^T]^T}$ that we know to be Gaussian, and suppose we want to find out the distribution of $\bold{x}$ given the value of $\bold{y}$ (perhaps you want to predict something about a point with a specific $\bold{y}$). We do that by taking the conditional distribution $p(\bold{x\;|\;y})$. This method isn’t specific to Gaussian distributions, but here we are interested in the result obtained when we consider the Gaussian distribution specifically. Basically, to find $p(\bold{x\;|\;y}) = \frac{p(\bold{x,y})}{p(\bold{y})}$, you need the marginal of the conditioning variable, so we’ll first prove that any marginal of $\bold{z}$ is itself Gaussian; in particular $p(\bold{x})$ is the Gaussian distribution $N(\bold{\boldsymbol\mu_x,\Sigma_{x,x}})$ and, by the same argument, $p(\bold{y})$ is $N(\bold{\boldsymbol\mu_y,\Sigma_{y,y}})$.

(Say that $\bold{\Sigma_{a,b}} = \text{Cov}(\bold{a,b})$ and $\boldsymbol{\mu}_\bold{a} = \text{E}(\bold{a})$ for any two random variables $\bold{a,b}$.)

Rather than brute-forcing our way through complex linear algebra, we’ll prove this inductively: we prove it for a scalar-valued $y$, say the last coordinate of $\bold{z}$, showing that the reduced set of coordinates follows a Gaussian distribution, and then use the fact that we can do this over and over until we eventually delete an arbitrary number of arbitrary coordinates, making the claim true for any general $\bold{x},\bold{y}$.

First let’s write the covariance and mean of $\bold{z}$ in terms of $\bold{x}$ and $y$:

$$\bold{\Sigma_{z,z}} = \begin{bmatrix}\bold{\Sigma_{x,x}} & \bold{\Sigma_{x,y}} \\ \bold{\Sigma_{y,x}} & {\sigma_{y}^2}\end{bmatrix} \;\;\text{ and } \;\;\boldsymbol{\mu}_\bold{z} = \begin{bmatrix}\boldsymbol{\mu}_\bold{x}\\{\mu}_{y} \end{bmatrix}$$

You are free to verify that the inverse of any positive definite symmetric matrix $S = \begin{bmatrix}A & B \\ B^T & D\end{bmatrix}$, where $A,B,D$ are sub-matrices, is also positive definite symmetric, say $\begin{bmatrix}A' & B' \\ B'^T & D'\end{bmatrix}$.

In our case $A = \bold{\Sigma_{x,x}},\; B = \bold{\Sigma_{x,y}},\; B^T = \bold{\Sigma_{y,x}},\; D = {\sigma_{y}^2}$. In favour of my sanity, I’ll use $A,B,D$ for the linear algebra rather than bold Greek symbols with subscripts all the time. Also, let’s call $p=\bold{x-\boldsymbol\mu_x}$ and $q={y-\mu_y}$.

So now we have:

$$p(\bold{x,y}) = \frac{1}{\sqrt{(2\pi)^n|S|}}\exp\Big(-\frac{1}{2}[p^T\;q]\; \begin{bmatrix} A' & B' \\ B'^T & D'\end{bmatrix}\;\begin{bmatrix}{p}\\{q}\end{bmatrix}\Big)$$

It’s easy to see that the stuff inside of the exp function (except the -1/2) evaluates to

$$p^TA'p + 2(p^TB')q + D'q^2 = p^TA'p - \frac{(p^TB')^2}{D'} + D'\Big(q+\frac{p^TB'}{D'}\Big)^2 = p^TA'p - p^T\frac{B'B'^T}{D'}p + D'\Big(q+\frac{p^TB'}{D'}\Big)^2$$

Now, since $\int_{-\infty}^\infty\exp\big(-\frac{1}{2}D'(q+\frac{p^TB'}{D'})^2\big)\,dq = \sqrt{2\pi/D'}$ and the other stuff is constant with respect to $q$, after the integration is done we only have to worry about $p^T(A'-\frac{B'B'^T}{D'})p$ inside the exp function (except the $-\frac{1}{2}$). Namely, $p(\bold{x}) \;\propto\; \exp\big(-\frac{1}{2}(\bold{x}-\boldsymbol\mu_\bold{x})^T\bold{M}(\bold{x}-\boldsymbol\mu_\bold{x})\big)$, where $\bold{M} = A'-\frac{B'B'^T}{D'}$ is some constant symmetric matrix. Now, after normalising the expression to get a pdf, we’ll just end up with a normal distribution $\bold{x} \sim N(\boldsymbol\mu_\bold{x},\bold{M}^{-1})$. What this means is that $\bold{M=\Sigma_{x,x}^{-1}}$, which isn’t surprising, because from the way it was defined it’s basically $A^{-1}$ (the block-matrix identity $A'-B'D'^{-1}B'^T = A^{-1}$). And thus, using the inductive method I proposed, we can say that any random variable formed by deleting some coordinates (one or more) of $\bold{z}$, say $\bold{x}$, has a normal distribution.

Now, moving towards the conditional:

$$p(\bold{y}) \;\propto\; \exp\big(-\tfrac{1}{2}(\bold{y-\boldsymbol\mu_y})^T\bold{\Sigma_{y,y}}^{-1}(\bold{y-\boldsymbol\mu_y})\big) \\ p(\bold{x,y}) \;\propto\; \exp\big(-\tfrac{1}{2}(\bold{z-\boldsymbol\mu_z})^T\bold{\Sigma_{z,z}}^{-1}(\bold{z-\boldsymbol\mu_z})\big) \\ \implies p(\bold{x\;|\;y}) = \frac{p(\bold{x,y})}{p(\bold{y})} \;\propto\; \exp\big(-\tfrac{1}{2}\big((\bold{z-\boldsymbol\mu_z})^T\bold{\Sigma_{z,z}}^{-1}(\bold{z-\boldsymbol\mu_z})\;-\;(\bold{y-\boldsymbol\mu_y})^T\bold{\Sigma_{y,y}}^{-1}(\bold{y-\boldsymbol\mu_y})\big)\big)$$

It’s easy to see that the thing inside the exp function (except the $-\frac{1}{2}$), viewed as a function of $\bold{x}$ for a fixed $\bold{y}$, is a symmetric quadratic form in $\bold{x}$, and can thus be expressed as $\bold{(x-a)^TM(x-a)}$ plus terms independent of $\bold{x}$, for some symmetric matrix $\bold{M}$. Thus $p(\bold{x|y})$ is also a normal distribution. To find the parameters of this distribution, you have to get your hands dirty and do a lot of linear algebra involving block matrices. I am going to skip all of that and just give the results to you:

$$\bold{\boldsymbol\mu_{x|y}=\boldsymbol\mu_x + \Sigma_{x,y}\Sigma_{y,y}^{-1}(y-\boldsymbol\mu_y)} \\ \bold{\Sigma_{x|y}=\Sigma_{x,x}-\Sigma_{x,y}\Sigma_{y,y}^{-1}\Sigma_{y,x}}$$
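A small sketch implementing these formulas and checking them against the known closed form for a bivariate Gaussian (made-up numbers; for scalar $x,y$ the conditional mean is $\mu_x + \rho\frac{\sigma_x}{\sigma_y}(y-\mu_y)$ and the conditional variance is $\sigma_x^2(1-\rho^2)$):

```python
import numpy as np

# Joint Gaussian over z = [x, y] with made-up parameters.
mu_x, mu_y = 1.0, -2.0
sx, sy, rho = 2.0, 1.5, 0.6
Sxx, Syy, Sxy = sx**2, sy**2, rho * sx * sy

y_obs = 0.5

# The block formulas above (every block is 1x1 here).
mu_x_given_y  = mu_x + Sxy / Syy * (y_obs - mu_y)
var_x_given_y = Sxx - Sxy / Syy * Sxy

# Known bivariate closed form.
print(np.isclose(mu_x_given_y,  mu_x + rho * sx / sy * (y_obs - mu_y)))  # True
print(np.isclose(var_x_given_y, sx**2 * (1 - rho**2)))                   # True
```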

Change of variables

For probability density functions $p_\bold{y}(\bold{y})$ and $p_\bold{x}(\bold{x})$ of variables $\bold{x,y}$ related as $\bold{y} = f(\bold{x})$, with $f$ invertible and differentiable, we have $p_\bold{y}(\bold{y})\,dy_1\,dy_2\dots \,dy_n = p_\bold{x}(\bold{x})\,dx_1\,dx_2\,dx_3\dots \,dx_n$.

We know that the absolute value of the determinant of the Jacobian of $\bold{y}$ with respect to $\bold{x}$ gives us the factor by which the volume $dx_1\,dx_2\,dx_3\dots \,dx_n$ scales to become the volume $dy_1\,dy_2\dots \,dy_n$. Thus, we can write $p_\bold{y}(\bold{y})\,|\bold{J}_f(\bold{x})| = p_\bold{x}(\bold{x})$.
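A quick numerical sketch in one dimension, taking $x \sim N(0,1)$ and $y = e^x$, and checking that the transformed density still integrates to (approximately) 1:

```python
import numpy as np

# p_x: standard normal density; y = f(x) = exp(x), so x = log(y) and |dy/dx| = y.
def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

y  = np.linspace(1e-6, 50, 200_001)
py = p_x(np.log(y)) / y            # p_y(y) = p_x(x) / |dy/dx|

dy = y[1] - y[0]
print((py * dy).sum())             # approximately 1
```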

Affine Transform of a Gaussian variable

Consider $\bold{x} \sim N\bold{(\boldsymbol\mu_x,\boldsymbol\Sigma_x)}$. Then, for a variable $\bold{y=Ax+b}$ with $\bold{A}$ invertible, we have $\boldsymbol\mu_\bold{y} = \text{E}(\bold{y}) = \bold{A\boldsymbol\mu_x+b}$ and $\bold{\Sigma_y} = \text{Var}(\bold{y}) = \bold{A\Sigma_xA^T}$, so that $\bold{\Sigma_x = A^{-1}\Sigma_y(A^{-1})^T}$,

and thus, by the change of variables formula (the Jacobian of $\bold{y}$ with respect to $\bold{x}$ is $\bold{A}$),

$$\begin{aligned} p_\bold{y}(\bold{y}) &= |\bold{A}|^{-1}\,p_\bold{x}(\bold{A^{-1}(y-b)}) = \dfrac{|\bold{A^{-1}}|}{\sqrt{(2\pi)^n|\bold{A^{-1}\Sigma_y}(\bold{A^{-1})^T}|}}\exp\Big(-\frac{1}{2} (\bold{A^{-1}(y-\boldsymbol\mu_y))^T}(\bold{A^{-1}\Sigma_y(A^{-1})^T})^{-1}(\bold{A^{-1}(y-\boldsymbol\mu_y}))\Big) \\ &= \dfrac{1}{\sqrt{(2\pi)^n|\bold{\Sigma_y}|}}\exp\Big( -\frac{1}{2}\bold{(y-\boldsymbol\mu_y)^T(A^{-1})^T}\bold{A^T\Sigma_y^{-1}A}\bold{A^{-1}(y-\boldsymbol\mu_y})\Big) \\ &= \dfrac{1}{\sqrt{(2\pi)^n|\bold{\Sigma_y}|}}\exp\Big( -\frac{1}{2}\bold{(y-\boldsymbol\mu_y)^T}\bold{\Sigma_y^{-1}}\bold{(y-\boldsymbol\mu_y})\Big) \end{aligned}$$

Thus $\bold{y}$ also follows the Gaussian distribution,

namely $\bold{y}\sim N(\bold{A\boldsymbol\mu_x+b,\,A\Sigma_xA^T})$.

Standard Gaussian

Consider $\bold{y}\sim N(\bold{\boldsymbol\mu,\Sigma})$. This is the result of an affine transformation of $\bold{x}\sim N(\bold{0,I})$ given by $\bold{y = Ax+\boldsymbol\mu}$ for some $\bold{A}$ such that $\bold{AA^T=\Sigma}$. We know how to find a possible $\bold{A}$ using the Cholesky decomposition.

Thus all Gaussian variables are just affine transformations of the standard Gaussian $\bold{x}\sim N(\bold{0,I})$.
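A short sketch of exactly this construction, sampling from a made-up $N(\boldsymbol\mu,\bold\Sigma)$ by transforming standard Gaussian draws:

```python
import numpy as np

rng = np.random.default_rng(2)

mu    = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# A such that A A^T = Sigma, via the Cholesky decomposition.
A = np.linalg.cholesky(Sigma)

Z = rng.standard_normal(size=(100_000, 2))   # rows are draws of x ~ N(0, I)
Y = Z @ A.T + mu                             # y = A x + mu, applied row-wise

print(Y.mean(axis=0))                        # approximately mu
print(np.cov(Y, rowvar=False))               # approximately Sigma
```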