Linear Regression

Consider a set of error-prone measurements $y_i$ made under different conditions $\bold{x_i}$ that should be related linearly as $y_i = \theta_1 f_1(x_{1i}) + \theta_2 f_2(x_{2i}) + \dots + \theta_n f_n(x_{ni}) = \boldsymbol\theta^T\bold{f(x_i)}$, but are instead related as $y_i = \boldsymbol\theta^T\bold{f(x_i)}+\epsilon_i$. Usually, to minimise this error, you would make the measurement multiple times under the same conditions and then use the average of those measurements as your $y_i$. In that situation, by the central limit theorem, $\epsilon_i$ would follow a Gaussian distribution $N(0,\sigma^2)$ if you had taken an infinite number of such measurements under the same conditions. Notice we are assuming that the error, in the limit, follows the same distribution for each $i$; in other words, the $\epsilon_i$ are i.i.d. variables. Thus, the probability distribution of $y_i$ is $N(\boldsymbol\theta^T\bold{f(x_i)},\sigma^2)$. This is what should happen for the correct value of $\boldsymbol\theta$.

The main problem is that we don't know this correct $\boldsymbol\theta$. So we pretend, for any general value of $\boldsymbol\theta$, that it is correct, in order to find the probability of our data being the way it is in that case, namely $p(\bold{y,X\;|\;\boldsymbol\theta}) = \prod_i p(y_i,\bold{x_i}\;|\;\boldsymbol\theta)$, and then, using this probability and Bayes' rule, we find $p(\boldsymbol\theta\;|\;\bold{y,X})$, which is the probability that our guess for $\boldsymbol\theta$ is correct. Thus, although we don't know the correct value of $\boldsymbol\theta$, we can find the most probable value and just go along with it, since it explains the data best. Note that $p(y_i,\bold{x_i}\;|\;\boldsymbol\theta)$ is the probability of the event that we observed $y_i$ AND the conditions are $\bold{x_i}$, given that $\boldsymbol\theta$ is the true value. This can be written as $p(y_i\;|\;\bold{x_i},\boldsymbol\theta)\,p(\bold{x_i}\;|\;\boldsymbol\theta)$. It is safe to say that the conditions $\bold{x_i}$ under which we make an observation and the way the observations $y_i$ depend on those conditions, which is what $\boldsymbol\theta$ represents, are independent. Thus we can write $p(\bold{x_i}\;|\;\boldsymbol\theta)=p(\bold{x_i})$, and so $p(\bold{y,X\;|\;\boldsymbol\theta}) = \prod_i p(y_i,\bold{x_i}\;|\;\boldsymbol\theta) = \prod_i p(y_i\;|\;\bold{x_i},\boldsymbol\theta)\prod_i p(\bold{x_i})$. After using Bayes' rule, we would have

$$p(\boldsymbol\theta\;|\;\bold{y,X}) = \frac{p(\bold{y,X}\;|\;\boldsymbol\theta)\,p(\boldsymbol\theta)}{p(\bold{y,X})} \propto p(\boldsymbol\theta)\prod_i p(y_i\;|\;\bold{x_i},\boldsymbol\theta)$$

as all other terms are independent of $\boldsymbol\theta$. So to find the most probable $\boldsymbol\theta$, we just have to maximise $p(\boldsymbol\theta)\prod_i p(y_i\;|\;\bold{x_i},\boldsymbol\theta)$.
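In symbols, the estimate we will carry forward (just restating the above to fix notation) is

$$\boldsymbol\theta_{mp} = \arg\max_{\boldsymbol\theta}\; p(\boldsymbol\theta)\prod_i p(y_i\;|\;\bold{x_i},\boldsymbol\theta)$$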

Maximum Likelihood Estimate

A common way to simplify this expression is to set the prior distribution $p(\boldsymbol\theta)=1$. The expression thus obtained is $L(\boldsymbol\theta) = \prod_i p(y_i\;|\;\bold{x_i},\boldsymbol\theta)$, which is the likelihood for $\boldsymbol\theta$. Equivalently, we can minimise the negative log-likelihood, which is given by

$$-\ln L(\boldsymbol\theta) = -\sum_i \ln p(y_i\;|\;\bold{x_i},\boldsymbol\theta)$$
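Since each $p(y_i\;|\;\bold{x_i},\boldsymbol\theta)$ is the Gaussian density $N(\boldsymbol\theta^T\bold{f(x_i)},\sigma^2)$ from above, this works out to

$$-\ln L(\boldsymbol\theta) = \frac{1}{2\sigma^2}\sum_i \big(y_i-\boldsymbol\theta^T\bold{f(x_i)}\big)^2 + \frac{N}{2}\ln(2\pi\sigma^2)$$

where $N$ is the number of observations.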

Discarding a bunch of constants, we just have to minimise:

$$\sum_i (y_i-\bold{f(x_i)}^T\boldsymbol\theta)^2 = ||\bold{y-F\boldsymbol\theta}||^2 = (\bold{y-F\boldsymbol\theta})^T(\bold{y-F\boldsymbol\theta}) = \bold{y^Ty-2y^TF\boldsymbol\theta+\boldsymbol\theta^TF^TF\boldsymbol\theta}$$

where $\bold{F}$ is the matrix whose rows are the vectors $\bold{f(x_i)}^T$, and $\bold{y}$ is the vector of observations $y_i$.

This is the loss function; divided by the number of observations, it is known as the Mean Squared Error.

Differentiating this loss function with respect to $\boldsymbol\theta$ (and dropping an overall factor of 2), we get the gradient

$$\bold{-y^TF+\boldsymbol\theta^TF^TF}$$

This can be used for gradient descent directly. But in this case, we can also find the solution in closed form by setting the gradient to zero:

$$\boldsymbol\theta_{mp}= \bold{(F^TF)^{-1}F^Ty}$$
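As a minimal sketch of the gradient-descent route mentioned above (the data, learning rate and iteration count here are made up for illustration and are not part of the examples below), both approaches should land on essentially the same $\boldsymbol\theta$:

import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 3))                     # design matrix with rows f(x_i)
theta_true = np.array([1.0, -2.0, 0.5])
y = F @ theta_true + 0.1 * rng.normal(size=200)   # noisy linear observations

theta = np.zeros(3)
lr = 1e-3                                         # hand-picked learning rate
for _ in range(5000):
    grad = -F.T @ y + F.T @ F @ theta             # loss gradient (transposed, up to a factor of 2)
    theta -= lr * grad

theta_closed = np.linalg.inv(F.T @ F) @ F.T @ y   # closed-form solution
print(theta, theta_closed)                        # both should be close to theta_true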

Maximum A Posteriori Estimation

Remember how we discarded $p(\boldsymbol\theta)$ by setting it to 1? We did so because we knew nothing about $\boldsymbol\theta$. If we did know something, we could get a better estimate. For example, if $\boldsymbol\theta$ has a Gaussian prior distribution $N(\bold{0},b^2\bold{I})$, then we get a more general estimate:

$$\boldsymbol\theta_{mp}= \bold{(F^TF}+\frac{\sigma^2}{b^2}\bold{I)^{-1}F^Ty}$$

Setting $b^2$ to $\infty$, we get the previous estimate.
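As a minimal sketch of the difference (the data and the values of $\sigma$ and $b$ below are made up for illustration), the MAP estimate is just the MLE formula with the extra $\frac{\sigma^2}{b^2}\bold{I}$ term, which shrinks the coefficients towards zero:

import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(50, 5))                      # design matrix
theta_true = rng.normal(size=5)
y = F @ theta_true + 0.5 * rng.normal(size=50)    # noisy observations, sigma = 0.5

sigma2, b2 = 0.25, 1.0                            # assumed noise variance and prior variance
theta_mle = np.linalg.inv(F.T @ F) @ F.T @ y
theta_map = np.linalg.inv(F.T @ F + (sigma2 / b2) * np.eye(5)) @ F.T @ y
print(theta_mle)
print(theta_map)                                  # slightly shrunk towards zero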

Linear regression in action

Importing libraries

from matplotlib.pyplot import *
from numpy import *
Example 1

Generating Data

x1 = 10+2*random.random((1000,1))          # x1 uniform in [10, 12)
x2 = 1+3*random.random((1000,1))           # x2 uniform in [1, 4)
sig = 0.3
noise = random.randn(1000)*sig             # Gaussian noise with sigma = 0.3
y = sin(2*pi*x1[:,0])+cos(2*pi*x2[:,0])+noise
X = concatenate((x1,x2),axis=1)            # conditions matrix with columns x1, x2
del x1,x2
X,X_test = X[:800,:] , X[800:,:]           # 800 training points, 200 test points
y,y_test = y[:800] , y[800:]

Plotting to get a feel

scatter(X[:,0],y,c=X[:,1])
figure()
scatter(X[:,1],y,c=X[:,0])

Now that we have figured out that the relationship is sinusoidal with period 1 in each condition, we can assume $y = a\sin(2\pi x_1)+b\sin(2\pi x_2) + c\cos(2\pi x_1)+d\cos(2\pi x_2)$ and find the best value of $\boldsymbol\theta = [a\;b\;c\;d]^T$.
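In the notation of the derivation above, the feature map for this model is

$$\bold{f(x_i)} = \big[\sin(2\pi x_{1i}),\ \sin(2\pi x_{2i}),\ \cos(2\pi x_{1i}),\ \cos(2\pi x_{2i})\big]^T$$

and $\bold{F}$ is the $800\times4$ matrix whose $i$-th row is $\bold{f(x_i)}^T$, which is exactly what the function f below builds.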

def f(X):
    # feature map: [sin(2*pi*x1), sin(2*pi*x2), cos(2*pi*x1), cos(2*pi*x2)] for each row of X
    X = 2*pi*X
    return concatenate((sin(X[:,newaxis,0]),sin(X[:,newaxis,1]),cos(X[:,newaxis,0]),cos(X[:,newaxis,1])),axis=1)
F = f(X)                                    # design matrix
theta = dot(linalg.inv(F.T @ F) @ F.T ,y)   # closed-form solution (F^T F)^(-1) F^T y
print(theta)
[ 1.02137996 -0.00515224 -0.05019073  1.02662538]
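As an aside (not part of the original walkthrough), the same least-squares solution can be obtained with numpy's lstsq routine, which avoids explicitly inverting F.T @ F and is generally better behaved numerically:

theta_lstsq = linalg.lstsq(F, y, rcond=None)[0]   # should match theta above
print(theta_lstsq)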

Predicting the values

def predict(X):
    global theta
    F = f(X)
    return dot(F,theta)
print("The error is :")
y_pred = predict(X_test)
print(mean((y_pred-y_test)**2))
The error is :
0.09683797274423349

A 3D plot showing the full picture

%matplotlib
fig = figure()
ax = fig.add_subplot(projection='3d')
X1 = tile(array([linspace(10,12,100)]),(100,1)) 
X2 = tile(array([linspace(1,4,100)]),(100,1)).T
Y = array([[predict(array([[X1[i,j],X2[i,j]]]))[0] for j in range(100)] for i in range(100)])
ax.plot_surface(X1,X2,Y,alpha=1)
ax.scatter3D(X_test[:,0],X_test[:,1],y_test,c='red')
Example 2

Generating Data

x = (2*random.random(100) - 1)*pi       # 100 points uniform in (-pi, pi)
x = x[:,newaxis]
X = ones((100,1))                       # bias column of ones
for n in range(1,4):
    X = concatenate((X,x**n),axis=1)    # append columns x, x^2, x^3
y = sin(x)                              # noiseless observations

We'll consider the values of $x, x^2, x^3$ (together with a constant term) to be the features (conditions of measurement) and $y$ to be the observations.
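As an aside, the same design matrix can be built in one line with numpy's vander (using the wildcard numpy import from earlier); a quick sanity check:

X_alt = vander(x[:,0], 4, increasing=True)   # columns 1, x, x^2, x^3
print(allclose(X, X_alt))                    # should print True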

theta = dot(linalg.inv(X.T @ X) @ X.T ,y)[:,0]
print(theta)
y_pred = [dot(theta,X[i]) for i in range(shape(X)[0])]
plot(x[:,0],y_pred,'o')
plot(x[:,0],y[:,0],'o')
[ 0.01508739  0.85071025 -0.00283029 -0.09259996]
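For comparison, the degree-3 Taylor expansion of $\sin(x)$ around $0$ is

$$\sin(x) \approx x - \frac{x^3}{6}$$

i.e. coefficients $[0,\ 1,\ 0,\ -0.1\overline{6}]$. The least-squares fit above comes out closer to $[0,\ 0.85,\ 0,\ -0.093]$ because it minimises the average squared error over the whole interval $(-\pi,\pi)$ rather than matching $\sin$ and its derivatives at $0$.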