Sciencing the data - BLUE coefficients: bias and efficiency

What are we exploring?

Proving that the OLS coefficient is the best linear unbiased estimator (under the Gauss-Markov assumptions).

Are the OLS coefficients “BLUE”?

We find a unique solution to the set of coefficients that minimize the sum of squared residuals analytically (see its derivation here):

\[ \hat{\beta}^{OLS}=(X^{\intercal}X)^{-1}X^{\intercal}y \]

However, how do we know if these coefficients are the best ones we can estimate?

For the estimated coefficients to be the Best Linear Unbiased Estimator (i.e. “BLUE”):

The best estimator has to be unbiased: \(E[\hat{\beta}^*] = \beta\)
And among all possible linear, unbiased estimators, it must have the smallest variance: \(V[\hat{\beta}^{*}] < V[\hat{\beta}^{Z}]\)

We want to ensure our OLS estimate is the best, i.e. that \(\hat{\beta}^{OLS} = \hat{\beta}^{*}\). To achieve this, first we need to confirm it is unbiased. Then given this is true, we can check that the coefficient is most efficient vs all other unbiased estimators.

Gauss Markov Assumptions

Along the way, we will outline the Gauss-Markov assumptions utilised that ensure the OLS coefficient is BLUE.

Setting the scene

The true coefficient and model

\(\beta\) is the true (unobserved) relationship between all the relevant explanatory features, \(X\), and their associated observed outcomes, \(y\). In other words, if we observed the entire population of data, it is the relationship we would find.

Concretely, we assume the outcome is a linear function of all its relevant features. This “true model” perfectly predicts the outcome, except for random noise \(\epsilon\) that influences the observed outcome: \(y = X\beta + \epsilon\)

The estimated coefficient

\(\hat{\beta}\) is our estimated coefficient for the true relationship \(\beta\). In reality, we estimate \(\hat{\beta}\) from the small, finite sample of size \(n\) that is collected, not the whole population. Given any random sample could be collected, we can term the coefficient resulting from the optimum estimation proceedure to be \(\hat{\beta}^*\). We want to understand if \(\hat{\beta}^{OLS} = \hat{\beta}^*\).

The expected estimated coefficient

\(E[\hat{\beta}]\) is the “expected” estimated coefficient. Imagine we repeat the action of estimating the coefficient \(\hat{\beta}\) many times, each time collecting a new sample (where each observation is sampled i.i.d), and recording the value for the estimated coefficient. \(E[\hat{\beta}]\) would then be the average of all of those estimated coefficients. If the OLS coefficient is unbiased, then the expected coefficient estimate should be equal to the true one, \(E[\hat{\beta}^{OLS}]=\beta\).

The variance of the estimated coefficient

\(V[\hat{\beta}]\) is the variance of the estimated coefficient. It determines how much we might expect our estimate \(\hat{\beta}\) to differ from the true \(\beta\) for any sample drawn. Given the OLS coefficient has been shown to be unbiased, if it is BLUE we expect its variance to be lower than another other unbiased choice \(\hat{\beta}^{Z}\). Concretely, we want to find \(V[\hat{\beta}^{OLS}] < V[\hat{\beta}^{Z}]\).

Bias

Often our small finite samples of size \(n\) are not a perfect reflection of the population they are drawn from. This “sampling error” means we might estimate a different relationship between \(X\) and \(y\) than the true relationship of the population, i.e. \(\hat{\beta} \neq \beta\).

However, we should expect our estimated coefficient to be equal to the true value on average. This means we do not want to have a bias towards the estimate being systematically too small or too large, for example. In other words, if we repeated the whole proceedure thousands of times (each time taking new samples, and estimating a coefficient from the new sample) then the average of all the estimated coefficients values should be equal to the true value, i.e. \(E[\hat{\beta}] = \beta\).

Recall that we believe there is a true model that follows the form:

\[ y = X\beta + \epsilon \]

GM1: Linearity

The formula above relies on the first Gauss-Markov assumption - that the dependent variable \(y\) is assumed to be a linear function of the variables \(X\). Note that implies that the proper functional form has been selected (i.e. the relationship is linear) and there are no omitted variables - a huge assumption!

If we substitute this into our estimated coefficient:

\[ \displaylines{ \begin{align} \hat{\beta}^{OLS} & = (X^{\intercal}X)^{-1}X^{\intercal}y \\ & = (X^{\intercal}X)^{-1}X^{\intercal}(X\beta+\epsilon) \\ & = (X^{\intercal}X)^{-1}X^{\intercal}X\beta+(X^{\intercal}X)^{-1}X^{\intercal}\epsilon \\ & = \beta+(X^{\intercal}X)^{-1}X^{\intercal}\epsilon \end{align} } \]

we show that the estimated coefficient \(\hat{\beta}^{OLS}\) will differ from the true value depending on the random error \(\epsilon\) associated with the particular finite sample collected.

Now let’s now take the expectation, to determine when the coefficient is unbiased. In other words, what is the “average” coefficient if we took the sample many times: \[ \displaylines{ \begin{align} E[\hat{\beta}^{OLS}] & = \beta +(X^{\intercal}X)^{-1}X^{\intercal}E[\epsilon] \\ & = \beta & \iff E[\epsilon] & = 0 \end{align} \\ } \]

We find that the coefficient is unbiased as long as the expected error is also zero.

GM2: Strict Exogeneity

The second Gauss-Markov assumption is strict exogeneity, where the expected error is zero for all feature values: \(E[\epsilon|X] = 0\). By definition, the weaker exogeneity statement of \(E[\epsilon] = 0\) is implied by having the expected error conditional being equal to zero.

Efficiency

To be the “best”, the OLS estimator also needs to be efficient. This means that it has the lowest variance of all unbiased estimators. This section looks to prove this.

Coefficient variance for OLS

First, let’s derive the variance from the coefficients estimated using OLS, termed \(V[\hat{\beta}^{OLS}]\). As before, we substitute the true model \(y = X\beta + \epsilon\) into the coefficient estimated through OLS:

\[ \displaylines{ \begin{align} \hat{\beta}^{OLS}-\beta & = \beta + ((X^{\intercal}X)^{-1}X^{\intercal}\epsilon) - \beta \\ & = (X^{\intercal}X)^{-1}X^{\intercal}\epsilon \\ \\ \therefore (\hat{\beta}^{OLS}-\beta)(\hat{\beta}^{OLS}-\beta)^{\intercal} & = ((X^{\intercal}X)^{-1}X^{\intercal}\epsilon)((X^{\intercal}X)^{-1}X^{\intercal}\epsilon)^{\intercal} \\ & = (X^{\intercal}X)^{-1}X^{\intercal}\epsilon\epsilon^{\intercal}X(X^{\intercal}X)^{-1} \\ \\ \therefore V(\hat{\beta}^{OLS}) & = E[(\hat{\beta}^{OLS}-\beta)(\hat{\beta}^{OLS}-\beta)^{\intercal}] \\ & = (X^{\intercal}X)^{-1}X^{\intercal}E[\epsilon\epsilon^{\intercal}]X(X^{\intercal}X)^{-1} \end{align} } \]

This form is sometimes called the sandwich estimator, where \((X^{\intercal}X)^{-1}\) is the “bread”, and X^{}E[^{}]X is the “meat”: see this page for a deep dive into sandwich estimators

Coefficient variance assuming “spherical errors”

We can simplify this further by appling some assumptions to the estimated error variance \(E[\epsilon\epsilon^{\intercal}]\):

\[ \displaylines{ \begin{align} E[\epsilon \epsilon^{\intercal}] & = \begin{bmatrix} E[\epsilon_1^2] & \cdots & E[\epsilon_1\epsilon_n] \\ \vdots & \ddots & \vdots \\ E[\epsilon_n\epsilon_1] & \cdots & E[\epsilon_n^2] \end{bmatrix} \\ \\ & = \begin{bmatrix} \frac{1}{n}\sum_{i=1}^{n}{\epsilon_i^2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{n}\sum_{i=1}^{n}{\epsilon_i^2} \end{bmatrix} \\ \\ & = \begin{bmatrix} \hat{\sigma}^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \hat{\sigma}^2 \end{bmatrix} \\ \\ & = \hat{\sigma}^2I \end{align} } \]

How can we jump to this result? Well we are making two assumptions:

No serial correlation: \(\rho_{\epsilon_{i},\epsilon_{i \neq j}} = 0\). No correlation between sample errors means that \(E[\epsilon_i \epsilon_{j \neq i}] = 0\), and hence the off-diagonals of the error covariance matrix are zero.
Homoskedasticity: the assumption of uniform error variance for all samples means that \(V[\epsilon_i^2] = V[\epsilon_{j \neq i}^2] = \hat{\sigma}^2\). And our best approximation for \(\hat{\sigma}^2\) is simply taking the average squared error: \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}{\epsilon_i^2}\)

GM3: Spherical errors

The third Gauss-Markov assumption is spherical errors, \(E[\epsilon\epsilon^{\intercal}|X] = 0\). This means that the outer product of the expected errors is a scalar matrix, which implies no serial correlation and homoskedasticity.

It is especially important to make the right assumptions about \(E[\epsilon\epsilon^{\intercal}|X]\) as it impacts where our estimate of the standard errors is correct! We will dive into what happens to SE if we violate these assumptions in another post.

Since we now see that \(\hat{\sigma}^2\) is a scalar matrix, we can thus simplfy the variance formula further:

\[ \displaylines{ \begin{align} V(\hat{\beta}^{OLS}) & = (X^{\intercal}X)^{-1}X^{\intercal}E[\epsilon\epsilon^{\intercal}]X(X^{\intercal}X)^{-1} \\ & = (X^{\intercal}X)^{-1}X^{\intercal} \hat{\sigma}^2I X(X^{\intercal}X)^{-1} \\ & = \hat{\sigma}^2 \cancel{(X^{\intercal}X)^{-1}} \cancel{X^{\intercal} X} (X^{\intercal}X)^{-1} \\ & = \hat{\sigma}^2(X^{\intercal}X)^{-1} \end{align} } \]

Formulating an alternative unbiased coefficient

Next step - lets formulate another estimator, \(\hat{\beta}^{z}\), which differs from \(\hat{\beta}^{OLS}\) by a non-zero matrix \(A\). See how they both differ below:

\[ \displaylines{ \begin{align} \hat{\beta}^{OLS} & =(X^{\intercal}X)^{-1}X^{\intercal}y \\ \hat{\beta}^{Z} & =\left((X^{\intercal}X)^{-1}X^{\intercal}+A\right)y \end{align} } \]

Now we need to ensure this new estimator is not biased. So by taking the expectation in the same was as for OLS…

\[ \displaylines{ \begin{align} E[\hat{\beta}^{Z}] & = E\left[ \left((X^{\intercal}X)^{-1}X^{\intercal}+A\right)y \right] \\ & = \left((X^{\intercal}X)^{-1}X^{\intercal}+A\right)(X\beta+ \cancel{E\left[\epsilon \right]}) & \because E[\epsilon] = 0 \\ & = (X^{\intercal}X)^{-1}X^{\intercal}X\beta+AX\beta \\ & = \beta+AX\beta \end{align} } \]

So the estimator is only unbiased iff \(AX=0\). This is important to note when comparing the variance between unbiased coefficients - see below!

Variance of the alternative unbiased coefficient

Just like before, we calculate the variance:

\[ \displaylines{ \begin{align} V[\hat{\beta}^{Z}] & = V\left[ \left((X^{\intercal}X)^{-1}X^{\intercal}+A\right)y \right] \\ & = \left((X^{\intercal}X)^{-1}X^{\intercal}+A\right) V[y] \left((X^{\intercal}X)^{-1}X^{\intercal}+A\right)^{\intercal} \\ & = \hat{\sigma}^2 \left((X^{\intercal}X)^{-1}X^{\intercal}+A\right) \left((X^{\intercal}X)^{-1}X^{\intercal}+A\right)^{\intercal} & \because E[\epsilon \epsilon^{\intercal}|X] = 0 \\ & = \hat{\sigma}^2\left((X^{\intercal}X)^{-1}X^{\intercal}+A\right) \left(X(X^{\intercal}X)^{-1}+A^{\intercal}\right) \\ & = \hat{\sigma}^2 \left( (X^{\intercal}X)^{-1}X^{\intercal} X(X^{\intercal}X)^{-1} + AX(X^{\intercal}X)^{-1} + (X^{\intercal}X)^{-1}X^{\intercal}A^{\intercal} + AA^{\intercal} \right) \\ & = \hat{\sigma}^2 (X^{\intercal}X)^{-1} + \hat{\sigma}^2AA^{\intercal} & \because AX = 0 \\ & = V[\beta^{OLS}] + \hat{\sigma}^2AA^{\intercal} \end{align} } \]

Now since \(AA^{\intercal}\) is surely a positive semi-definite matrix, then we know that \(V[\hat{\beta}^{Z}] \ge V[\hat{\beta}^{OLS}]\).

We have shown that \(\hat{\beta}^{OLS}\) has the smallest variance among all unbiased estimators!

Summarising the Gauss-markov assumptions

Along the way, we showed where assumptions were needed to ensure the OLS coefficient estimation is BLUE.

We actually missed one out, but it is actually quite trivial to see from the OLS coefficient formula:

\[ \hat{\beta}^{OLS}=(X^{\intercal}X)^{-1}X^{\intercal}y \]

GM4: Full rank

The matrix \(X\) must be of full rank \(k\), so that it is possible to invert the matrix \(X^{\intercal}X\). This is equivalent to having no perfect multi-collinearity.

We have now collected our full set of Gauss-Markov assumptions required for the OLS coefficient to be BLUE:

Linearity
Strict Exogeneity
Spherical Errors
Full rank

Coding it up from scratch

Now we will code up the above in Python to show how we can derive the OLS coefficient, show that it is unbiased, and that the confidence intervals are calculated correctly.

First we write an OLS class to calculate the OLS coefficients and their standard errors:

OLS class (source code):

import numpy as np, pandas as pd
from typing import Optional, Tuple
from scipy.linalg import qr, solve_triangular
from scipy.stats import norm, t

class OLS:

    def _convert_types(self, z) -> np.ndarray:
        """ Re-shape and convert y to numpy array to work nicely with rest of functions """
        if type(z) in [pd.DataFrame, pd.Series]:
            z2 = z.to_numpy()
        if type(z) == list:
            z2 = np.array(z)
        if type(z) == np.ndarray:
            z2 = z
        else:
            raise TypeError('Array must be a pandas series/dataframe, numpy array or list')
        return z2
             
    def _get_y(self, y: Optional = None) -> np.ndarray:
        """Re-shape and convert y to numpy array to work nicely with rest of functions"""
        if y is None:
            y = self.y
        return self._convert_types(y).reshape(-1)

    def _get_X(self, X: Optional = None) -> Tuple[np.ndarray, np.ndarray]:
        """Re-shape and convert X to numpy array to work nicely with rest of functions
        Also return names for summarising in coefficient table"""
        if X is None:
            X = self.X
        X2 = self._convert_types(X)
        if type(X) == pd.DataFrame:
            exog_names = np.array(X.columns)
        elif type(X) == pd.Series:
            exog_names = np.array(['Unamed Exog Feature'])
        else:
            exog_names = np.arange(X2.shape[1])
        X2 = X2.reshape(-1,len(exog_names))
        return (X2, exog_names)

    def _clear_fitted_attributes(self):
        """Clears fitted attributes of the class"""
        self.beta, self.RSS, self.beta_se, self.conf_int, self.test_stat, self.p_val = None, None, None, None, None, None
        
    def __init__(
        self, 
        y: Optional[np.ndarray] = None, 
        X: Optional[np.ndarray] = None
        ) -> None:
        """Initializes the OLS class to run an least-squares regression"""
        if y is not None:
            self.y = self._get_y(y)
            self.n = len(self.y)
        if X is not None:
            self.X, self.exog_names = self._get_X(X)
            self.k = self.X.shape[1]
        if y is not None and X is not None and len(self.y) != self.X.shape[0]:
            raise ValueError("y and X must be the same size.")
        self._clear_fitted_attributes()

    def _quick_matrix_invert(self, X: np.ndarray) -> np.ndarray:
        """ Find the inverse of a matrix, using QR factorization """
        Q, R = qr(X)
        X_inv = solve_triangular(R, np.identity(X.shape[1])).dot(Q.transpose())
        return X_inv

    def _check_if_fitted(self):
        """Quick helper function that raises an error if the model has not been fitted already"""
        if self.beta is None:
            raise ValueError('Need to fit the model first - run fit()')
        else:
            return True

    def _estimate_ols_coefs(
        self,
        y: Optional[np.ndarray] = None,
        X: Optional[np.ndarray] = None
    ):
        """Estimates the OLS coefficients given a vector y and matrix X"""
        XTX = X.T.dot(X)
        XTY = X.T.dot(y)
        XTX_inv = self._quick_matrix_invert(XTX)
        coefs = XTX_inv.dot(XTY)
        return coefs, XTX_inv

    def fit(
        self,
        y: Optional[np.ndarray] = None,
        X: Optional[np.ndarray] = None,
    ):
        """Estimates the OLS coefficients given a vector y and matrix X"""
        self._clear_fitted_attributes()
        y = self._get_y(y)
        X, exog_names = self._get_X(X)
        if y is None or X is None:
            raise ValueError('X and y is required for fitting')
        if len(y) != X.shape[0]:
            raise ValueError("y and X must be the same size.")
        self.y, self.X, self.exog_names = y, X, exog_names
        self.n, self.k = X.shape
        self.DoF = self.n - self.k
        self.beta, self.var_X_inv = self._estimate_ols_coefs(y,X)

    def _assess_fit(
        self,
        y: Optional[np.ndarray] = None,
        X: Optional[np.ndarray] = None,
    ) -> float:
        """Returns the unadjusted R^2"""
        self._check_if_fitted()
        if (y is None and X is not None) or (y is not None and X is None):
            raise ValueError('Need to either provide both X and y, (or provide neither and R^2 is based on the X and y used for fitting)')
        else:
            y, (X, exog_names) = self._get_y(y), self._get_X(X)
        y_hat = self.predict(self.X)
        residuals = (y - y_hat).reshape(-1, 1)
        RSS = residuals.T.dot(residuals)
        TSS = (y - y.mean()).T.dot(y - y.mean())
        unadj_r_squared = 1 - RSS/TSS
        
        if (y == self.y).all() and (X == self.X).all():
            self.residuals = residuals
            self.RSS = RSS
            self.TSS = TSS
            self.unadj_r_squared = unadj_r_squared
        return unadj_r_squared

    def _standard_error(self,) -> np.ndarray:
        """Returns the standard errors for the coefficients from the fitted model"""
        if self.RSS is None:
            self._check_if_fitted()
            self._assess_fit()
        sigma_sq = self.RSS / float(self.DoF) * np.identity(len(self.beta))
        var_b = sigma_sq.dot(self.var_X_inv)
        self.beta_se = np.sqrt(np.diag(var_b))
        return self.beta_se

    def _confidence_intervals(self, size = 0.95):
        """Returns the confidence intervals for the coefficients from the fitted model"""
        if self.beta_se is None:
            self._check_if_fitted()
            self._standard_error()
        alpha = 1-(1-size)/2
        self.conf_int = np.array([
            self.beta - t.ppf(alpha, self.DoF) * self.beta_se,
            self.beta + t.ppf(alpha, self.DoF) * self.beta_se
        ])
        return self.conf_int

    def _test_statistic(self) -> np.ndarray:
        """Returns the test statistics for the coefficients from the fitted model"""
        if self.conf_int is None:
            self._check_if_fitted()
            self.conf_int = self._confidence_intervals()
        self.test_stat = self.beta.flatten() / self.beta_se
        return self.test_stat

    def _p_value(self, z_dist: bool = False) -> np.ndarray:
        """Returns the p-values for the coefficients from the fitted model."""
        if self.test_stat is None:
            self._check_if_fitted()
            self.test_stat = self._test_statistic()
        if z_dist:
            self.p_val = [norm.cdf(-abs(z)) + 1 - norm.cdf(abs(z)) for z in self.test_stat]
        else:
            self.p_val = [2 * t.sf(abs(x), self.DoF) for x in self.test_stat]
        return self.p_val    

    def predict(
        self,
        X: Optional[np.ndarray] = None,
    ) -> np.ndarray:
        """Predict values for y. Returns fitted values if X not provided."""
        self._check_if_fitted()
        X2, exog_names = self._get_X(X)
        y_hat = X2.dot(self.beta)
        if X is None:
            self.y_hat = y_hat
        return y_hat

    def summary(self, z_dist: bool = False) -> pd.DataFrame:
        """Returns the coefficients, standard errors, test statistics and p-values in a Pandas DataFrame."""
        if self.p_val is None:
            self._check_if_fitted()
            self._p_value(z_dist)
        summary = pd.DataFrame(
            data={
                'Coefficient': self.beta.flatten(),
                'Standard Error': self.beta_se,
                'Lower bound': self.conf_int[0],
                'Upper bound': self.conf_int[1],
                'test-statistic': self.test_stat,
                'p-value': self.p_val,
            },
            index=self.exog_names,
        )
        return summary

And now let’s use that code, and run 1000 simulations i.e. taking 1000 different samples from the population and running a regression each time:

Code

import numpy as np

np.random.seed(42)
n, k = 50, 2
sigma_sq = 1
beta = np.random.normal(size=(k,1))

bootstraps = 1000
coefs, CI_0, CI_1 = [], np.zeros(bootstraps), np.zeros(bootstraps)
for i in range(bootstraps):
    X = np.hstack([ 
      np.ones(n).reshape(n,1),
      np.random.normal(size=(n,k-1)) 
      ])
    y = X.dot(beta) + np.random.normal(loc=0,scale=sigma_sq,size=(n,1))
    OLS_model = OLS(y,X)
    OLS_model.fit()
    sum = OLS_model.summary()
    coefs.append(sum['Coefficient'].to_list())
    if (beta[0] >= sum.loc[0,'Lower bound'] and beta[0] <= sum.loc[0,'Upper bound']).all():
        CI_0[i] = 1
    if (beta[1] >= sum.loc[1,'Lower bound'] and beta[1] <= sum.loc[1,'Upper bound']).all():
        CI_1[i] = 1

By plotting a histogram of the estimated coefficients, we see that the estimated OLS coefficients center around the true coefficient, so empirically the coefficient appears unbiased:

Code

import plotly.graph_objects as go
fig = go.Figure()

c0 = [c[0] for c in coefs]
c1 = [c[1] for c in coefs]
fig.add_trace(go.Histogram(x=c0, name='Estimated Intercept: β0'))
fig.add_trace(go.Histogram(x=c1, name='Estimated Slope: β1'))
fig.update_layout(
  barmode='overlay', 
  legend=dict(orientation="h"),
  shapes=[dict(
    type='line',
    line=dict(dash='solid',width=1),
    x0=np.mean(c0),x1=np.mean(c0),
    y0=0,y1=0.9,
    xref='x',yref='paper',
    # text="True coefficient = {:,.2f}".format(beta.tolist()[0][0]),
  ), dict(
    type='line',
    line=dict(dash='solid',width=1),
    x0=np.mean(c1),x1=np.mean(c1),
    y0=0,y1=0.9,
    xref='x',yref='paper',
    # text="True coefficient = {:,.2f}".format(beta.tolist()[1][0]),
    )
  ],
   annotations=[dict(
      x = beta.tolist()[0][0],
      y = 70,
      text="True coefficient = {:,.2f}".format(beta.tolist()[0][0]),
   ), dict(
      x = beta.tolist()[1][0],
      y = 70,
      text="True coefficient = {:,.2f}".format(beta.tolist()[1][0]),
   )],
)
fig.update_traces(opacity=0.75)

fig.show()

Further, we see that the true coefficient falls within the confidence interval 95% of the time, so empirically the standard errors appear correctly specified:

Code

fig = go.Figure()

fig.add_trace(
  go.Scatter(
    x=np.arange(bootstraps),
    y=np.cumsum(CI_0)/(np.arange(bootstraps)+1),
    name='Intercept Estimation',
    mode='lines',
    )
  )

fig.add_trace(
  go.Scatter(
    x=np.arange(bootstraps),
    y=np.cumsum(CI_1)/(np.arange(bootstraps)+1),
    name='Slope estimation',
    mode='lines',
    )
  )

fig.update_xaxes(title="number of bootstraps")
fig.update_yaxes(title="% time true coefficient falls within CI")

fig.show()

Fin.