0. Defining the notation
For a sample \(i\), we observe an outcome \(y_i\). \(y\) is a vector of all \(n\) observed outcomes.
\[
\underset{n \times 1} {y} =
\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n
\end{bmatrix}
\]
We also observe \(k\) features for every sample \(i\). \(X\) is the matrix of these observed features. Note the first column is usually all ones, which includes an intercept (or "bias") term in the model; with this column, \(X\) has \(k+1\) columns.
\[
\underset{n \times (k+1)} {X} =
\begin{bmatrix}
1 & x_{1,1} & \cdots & x_{1,k-1} & x_{1,k} \\
1 & x_{2,1} & \cdots & x_{2,k-1} & x_{2,k} \\
1 & x_{3,1} & \cdots & x_{3,k-1} & x_{3,k} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 & x_{n-2,1} & \cdots & x_{n-2,k-1} & x_{n-2,k} \\
1 & x_{n-1,1} & \cdots & x_{n-1,k-1} & x_{n-1,k} \\
1 & x_{n,1} & \cdots & x_{n,k-1} & x_{n,k}
\end{bmatrix}
\]
The contribution of each feature to the prediction is estimated by the coefficients \(\hat{\beta}\).
\[
\underset{(k+1) \times 1} {\hat{\beta}} =
\begin{bmatrix}
\hat{\beta}_0 \\
\hat{\beta}_1 \\
\vdots \\
\hat{\beta}_{k-1} \\
\hat{\beta}_{k}
\end{bmatrix}
\]
We make predictions, \(\hat{y}\), by taking the matrix product of the features \(X\) and the coefficients \(\hat{\beta}\).
\[
\hat{y} = X \hat{\beta}
\]
which is shorthand for this:
\[
\displaylines{
\begin{align}
\begin{bmatrix}
\hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \\ \vdots \\ \hat{y}_{n-2} \\ \hat{y}_{n-1} \\ \hat{y}_n
\end{bmatrix} & =
\begin{bmatrix}
1 & x_{1,1} & \cdots & x_{1,k-1} & x_{1,k} \\
1 & x_{2,1} & \cdots & x_{2,k-1} & x_{2,k} \\
1 & x_{3,1} & \cdots & x_{3,k-1} & x_{3,k} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 & x_{n-2,1} & \cdots & x_{n-2,k-1} & x_{n-2,k} \\
1 & x_{n-1,1} & \cdots & x_{n-1,k-1} & x_{n-1,k} \\
1 & x_{n,1} & \cdots & x_{n,k-1} & x_{n,k}
\end{bmatrix}
\begin{bmatrix}
\hat{\beta}_0 \\
\hat{\beta}_1 \\
\vdots \\
\hat{\beta}_{k-1} \\
\hat{\beta}_{k}
\end{bmatrix}
\\ \\ & =
\begin{bmatrix}
\hat{\beta}_0 + \hat{\beta}_{1}x_{1,1} + \cdots + \hat{\beta}_{k-1}x_{1,k-1} + \hat{\beta}_{k}x_{1,k} \\
\hat{\beta}_0 + \hat{\beta}_{1}x_{2,1} + \cdots + \hat{\beta}_{k-1}x_{2,k-1} + \hat{\beta}_{k}x_{2,k} \\
\hat{\beta}_0 + \hat{\beta}_{1}x_{3,1} + \cdots + \hat{\beta}_{k-1}x_{3,k-1} + \hat{\beta}_{k}x_{3,k} \\
\vdots \\
\hat{\beta}_0 + \hat{\beta}_{1}x_{n-2,1} + \cdots + \hat{\beta}_{k-1}x_{n-2,k-1} + \hat{\beta}_{k}x_{n-2,k} \\
\hat{\beta}_0 + \hat{\beta}_{1}x_{n-1,1} + \cdots + \hat{\beta}_{k-1}x_{n-1,k-1} + \hat{\beta}_{k}x_{n-1,k} \\
\hat{\beta}_0 + \hat{\beta}_{1}x_{n,1} + \cdots + \hat{\beta}_{k-1}x_{n,k-1} + \hat{\beta}_{k}x_{n,k} \\
\end{bmatrix}
\end{align}
}
\]
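In code, this entire stack of weighted sums is a single matrix-vector product. A minimal NumPy sketch with invented data (the variable names are illustrative, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3  # 5 samples, 3 features

# Prepend a column of ones so the first coefficient acts as the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # shape (n, k + 1)
beta_hat = rng.normal(size=k + 1)  # hypothetical coefficients

# Each prediction is the weighted sum of that row's features.
y_hat = X @ beta_hat  # shape (n,)
print(y_hat)
```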
The residuals are the differences between the true outcomes and the model predictions.
\[
\hat{\epsilon} = y - \hat{y}
\]
which is shorthand for this:
\[
\begin{bmatrix}
\hat{\epsilon}_1 \\
\hat{\epsilon}_2 \\
\hat{\epsilon}_3 \\
\vdots \\
\hat{\epsilon}_{n-2} \\
\hat{\epsilon}_{n-1} \\
\hat{\epsilon}_{n}
\end{bmatrix} =
\begin{bmatrix}
y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-2} \\ y_{n-1} \\ y_n
\end{bmatrix} -
\begin{bmatrix}
\hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \\ \vdots \\ \hat{y}_{n-2} \\ \hat{y}_{n-1} \\ \hat{y}_n
\end{bmatrix}
\]
Our aim is to find the optimal vector of coefficients, \(\hat{\beta}^*\), that minimizes the sum of squared residuals (RSS):
\[
\min_{\hat{\beta}} \left( \hat{\epsilon}^{\intercal}\hat{\epsilon} \right)
\]
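As a quick numerical illustration (again a sketch with invented data), the RSS can be computed either as a literal sum of squares or as the dot product \(\hat{\epsilon}^{\intercal}\hat{\epsilon}\); the two agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)
beta_hat = rng.normal(size=k + 1)

resid = y - X @ beta_hat      # epsilon_hat = y - X beta_hat

rss_sum = np.sum(resid ** 2)  # sum over i of epsilon_i squared
rss_dot = resid @ resid       # epsilon^T epsilon
assert np.isclose(rss_sum, rss_dot)
```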
1. Expand the sum of squared residuals
The first step involves substituting \(X \hat{\beta}\) for \(\hat{y}\) and expanding the sum of squared residuals. \[
\displaylines{
\begin{align}
\sum_{i=1}^n{\hat{\epsilon}_i^2} & = \hat{\epsilon}^{\intercal}\hat{\epsilon}
\\& =(y-X\hat{\beta})^{\intercal}(y-X\hat{\beta})
\\& = y^{\intercal}y - y^{\intercal}X\hat{\beta}-
\hat{\beta}^{\intercal} X^{\intercal}y+
\hat{\beta}^{\intercal}X^{\intercal}X\hat{\beta}
\\& = y^{\intercal}y - 2y^{\intercal}X\hat{\beta}
+\hat{\beta}^{\intercal}X^{\intercal}X\hat{\beta}
\end{align}
}
\]
Note we can combine the two middle terms, since both are scalars (a \(1 \times 1\) matrix equals its own transpose):
\[
\displaylines{
y^{\intercal}X\hat{\beta} =
\hat{\beta}^{\intercal} X^{\intercal}y \\
\because \underset{1 \times n}{y^{\intercal}} \times
\underset{n \times (k+1)}{X} \times
\underset{(k+1) \times 1}{\hat{\beta}}
=
\underset{1 \times (k+1)}{\hat{\beta}^{\intercal}} \times
\underset{(k+1) \times n}{X^{\intercal}} \times
\underset{n \times 1}{y}
}
\]
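Both the scalar-transpose identity and the full expansion are easy to sanity-check numerically; a sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)
beta_hat = rng.normal(size=k + 1)

# The two middle terms of the expansion are the same scalar.
assert np.isclose(y @ X @ beta_hat, beta_hat @ X.T @ y)

# The expanded form matches the compact quadratic form.
lhs = (y - X @ beta_hat) @ (y - X @ beta_hat)
rhs = y @ y - 2 * (y @ X @ beta_hat) + beta_hat @ X.T @ X @ beta_hat
assert np.isclose(lhs, rhs)
```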
2. Partially differentiate RSS with respect to beta
The second step involves partially differentiating the cost function with respect to its parameters, to understand how it changes as the coefficients vary.
\[
\displaylines{
\begin{align}
\frac{\partial}{\partial\hat{\beta}}\hat{\epsilon}^{\intercal}\hat{\epsilon} & \equiv
\begin{bmatrix}
\frac{\partial}{\partial\hat{\beta}_0}\hat{\epsilon}^{\intercal}\hat{\epsilon} \\
\frac{\partial}{\partial\hat{\beta}_1}\hat{\epsilon}^{\intercal}\hat{\epsilon} \\
\vdots \\
\frac{\partial}{\partial\hat{\beta}_k}\hat{\epsilon}^{\intercal}\hat{\epsilon}
\end{bmatrix}
\\ & = \frac{\partial}{\partial\hat{\beta}}(
y^{\intercal}y - 2y^{\intercal}X\hat{\beta}
+\hat{\beta}^{\intercal}X^{\intercal}X\hat{\beta}) \\ & = 0 - 2X^{\intercal}y +((X^{\intercal}X)\hat{\beta} + (X^{\intercal}X)^{\intercal}\hat{\beta})
\\ & = -2X^{\intercal}y + 2(X^{\intercal}X)\hat{\beta}
\end{align}
}
\]
Two matrix differentiation rules used here, for reference:
\[
\displaylines{
\frac{\partial}{\partial x}(a^{\intercal}x) = a \\
\frac{\partial}{\partial x}(x^{\intercal}Ax) = Ax + A^{\intercal}x
}
\]
And note \(X^{\intercal}X\) is symmetric, i.e. \((X^{\intercal}X)^{\intercal} = X^{\intercal}X\), so the last two terms can be combined.
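One way to gain confidence in this gradient is to compare it against central finite differences on invented data; a minimal sketch (the rss helper and step size h are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)
beta_hat = rng.normal(size=k + 1)

def rss(b):
    e = y - X @ b
    return e @ e

# Analytic gradient from the derivation above.
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# Central finite differences, one coordinate at a time.
h = 1e-6
grad_fd = np.empty(k + 1)
for j in range(k + 1):
    step = np.zeros(k + 1)
    step[j] = h
    grad_fd[j] = (rss(beta_hat + step) - rss(beta_hat - step)) / (2 * h)

assert np.allclose(grad, grad_fd, rtol=1e-4, atol=1e-4)
```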
3. Find the coefficient values at the stationary point
Now we find the choice of \(\hat{\beta}\) where the gradient is equal to zero. Stationary points of a cost function are in general maxima or minima; for OLS the cost is convex, so the stationary point is a minimum, and when \(X\) has full column rank we actually find one unique solution!
\[
\displaylines{
\begin{align}
-\cancel{2}X^{\intercal}y + \cancel{2}(X^{\intercal}X)\hat{\beta} & = 0 \\
\therefore (X^{\intercal}X)\hat{\beta} & = X^{\intercal}y \\
\therefore \hat{\beta} & = (X^{\intercal}X)^{-1}X^{\intercal}y
\end{align}
}
\]
Note the need to invert \(X^{\intercal}X\). This is only possible when \(X\) has full column rank, i.e. no feature is an exact linear combination of the others.
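Assuming \(X\) has full column rank, the closed-form solution is easy to compute; a sketch on synthetic data (solving the linear system avoids forming the inverse explicitly, which is generally cheaper and better conditioned):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=n)  # outcomes with a little noise

# Solve the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
print(beta_hat)  # close to true_beta
```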
The first factor, \((X^{\intercal}X)^{-1}\), is (proportional to) the inverse variance matrix of \(X\). It normalizes the coefficients with respect to the magnitude of the features.
The second factor, \(X^{\intercal}y\), is (proportional to) the covariance between \(X\) and \(y\). It brings the linear relationship between features and outcome into the coefficients.
Hence each coefficient can be interpreted as the estimated change in \(y\) for a one-unit change in the corresponding feature, holding the other features fixed.
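The normalization is easy to see in action: rescaling a feature rescales its fitted coefficient inversely, leaving the predictions unchanged. A small sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

beta_hat = ols(X, y)

# Multiply the feature column by 10: its coefficient shrinks by the
# same factor, because (X^T X)^{-1} normalizes for the feature's scale.
X_scaled = X.copy()
X_scaled[:, 1] *= 10.0
beta_scaled = ols(X_scaled, y)

assert np.isclose(beta_scaled[1], beta_hat[1] / 10.0)
```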