Sandwiches: robust covariance error estimation

Linear Models
Gauss-Markov
Standard errors
Author

Chris Kelly

Published

February 22, 2024

What are we exploring?

Estimating the correct coefficient variance when relaxing homoskedastic error assumptions

Introducing sandwiches

The variance for the OLS coefficient estimator is equal to the following:

\[ \displaylines{ V(\hat{\beta}) = (X^{\intercal}X)^{-1}X^{\intercal}E[\epsilon\epsilon^{\intercal}]X(X^{\intercal}X)^{-1} } \]

This can be thought of as a sandwich:

  • The “bread” either side: \((X^{\intercal}X)^{-1}X^{\intercal}\) on the left and its transpose \(X(X^{\intercal}X)^{-1}\) on the right
  • The “meat” in the middle: what we assume for \(E[\epsilon\epsilon^{\intercal}]\)
    • Note that this is the same as the error variance, since \(V[\epsilon]=E[\epsilon\epsilon^{\intercal}]-E[\epsilon]E[\epsilon^{\intercal}]\) and \(E[\epsilon] = 0\)
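
Where does this sandwich expression come from? Substituting \(y = X\beta + \epsilon\) into \(\hat{\beta}=(X^{\intercal}X)^{-1}X^{\intercal}y\) and treating \(X\) as fixed:

\[ \displaylines{ \begin{align} \hat{\beta} & = \beta + (X^{\intercal}X)^{-1}X^{\intercal}\epsilon \\ V[\hat{\beta}] & = E\left[(\hat{\beta}-\beta)(\hat{\beta}-\beta)^{\intercal}\right] = (X^{\intercal}X)^{-1}X^{\intercal}E[\epsilon\epsilon^{\intercal}]X(X^{\intercal}X)^{-1} \end{align} } \]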

Our estimate of the coefficient variance will only be correct if these assumptions about the expected error are right! We will explore what happens when the errors are assumed to be homoskedastic, heteroskedastic or clustered.

Salmon bagel: Spherical Errors 🐟

Usual OLS is efficient if the true model has “spherical errors”. What does this mean in practice?

  • Errors are homoskedastic: \(V(\epsilon_i)=\sigma^2\) for all observations
  • Errors are serially uncorrelated: \(cov(\epsilon_i,\epsilon_{j\neq i})=0\)

What does this look like for \(E[\epsilon\epsilon^{\intercal}]\)?

  • The diagonal of the matrix is a constant value (scalar), \(\sigma^2\)
  • The off-diagonals are all zero

\[ \hat{\sigma}^2\underset{n\times n}{I} = \begin{bmatrix} \hat{\sigma}^2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & \hat{\sigma}^2 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & \hat{\sigma}^2 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & \hat{\sigma}^2 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & \hat{\sigma}^2 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & \hat{\sigma}^2 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & \hat{\sigma}^2 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & \hat{\sigma}^2 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \hat{\sigma}^2 \end{bmatrix} \]

A good estimate of the constant error variance \(\sigma^2\) comes from applying the standard formula to the residuals (i.e. method of moments):

\[ \hat{\sigma}^2=\frac{1}{n-k}\sum_i{\hat{\epsilon}_i^2} \equiv \frac{\hat{\epsilon}^{\intercal}\hat{\epsilon}}{n-k} \]

Thus our “salmon sandwich” is:

\[ \underset{k \times k}{(X^{\intercal}X)}^{-1} \underset{k \times n} {X^{\intercal}} \left(\frac{1}{n-k} \times \underset{1 \times 1}{(\epsilon^{\intercal}\epsilon)} \times\underset{n \times n}{I} \right) \underset{n \times k}{X} \underset{k \times k}{(X^{\intercal}X)}^{-1} \]

Note that under spherical errors, the sandwich can be simplified: \[ \displaylines{ \begin{align} V[\hat{\beta}] & = (X^{\intercal}X)^{-1}X^{\intercal}E[\epsilon\epsilon^{\intercal}]X(X^{\intercal}X)^{-1} \\ & = (X^{\intercal}X)^{-1}X^{\intercal}\sigma^2IX(X^{\intercal}X)^{-1} \\ & = \sigma^2(X^{\intercal}X)^{-1}\cancel{X^{\intercal}X}\cancel{(X^{\intercal}X)^{-1}} \\ & = \sigma^2(X^{\intercal}X)^{-1} \end{align} } \]

In this scenario, the only things that impact the standard error of the coefficient \(\beta_k\) are:

  • The variance of all the residuals, \(\sigma^2\)
  • The variance of the feature \(V(X_k)\)
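
As a quick numerical sanity check of that simplification, here is a minimal numpy sketch on some simulated data (the variable names are purely illustrative and separate from the classes we build later):

Code
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept + 2 features
y = X @ rng.normal(size=k) + rng.normal(size=n)                 # simulated outcome

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS coefficients
e = y - X @ beta_hat                           # residuals
sigma2_hat = e @ e / (n - k)                   # method-of-moments estimate of sigma^2

XtX_inv = np.linalg.inv(X.T @ X)
meat = sigma2_hat * np.eye(n)                          # spherical-error "meat"
full_sandwich = XtX_inv @ X.T @ meat @ X @ XtX_inv     # bread-meat-bread
simplified = sigma2_hat * XtX_inv                      # the simplified form above

assert np.allclose(full_sandwich, simplified)
print(np.sqrt(np.diag(full_sandwich)))                 # homoskedastic standard errors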

Ham sarnie: Heteroskedastic errors 🍖

Heteroskedastic correction is needed if:

  • Errors vary for every individual: \(V(\epsilon_i)=\sigma_i^2\) for all observations
  • But they are still independent aka serially uncorrelated: \(cov(\epsilon_i,\epsilon_{j\neq i})=0\)

What does this look like for \(E[\epsilon\epsilon^{\intercal}]\)?

  • The diagonal of the matrix is the estimate of variance which is unique for each observation, \(\sigma_i^2\)

  • The off-diagonals are all zero

    \[ \underset{n \times n}{\sigma^2} = \begin{bmatrix} \sigma_1^2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & \sigma_2^2 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & \sigma_3^2 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & \sigma_4^2 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & \sigma_5^2 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & \sigma_6^2 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & \sigma_7^2 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sigma_8^2 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sigma_{..}^2\\ \end{bmatrix} \]

  • A good estimate of each individual error variance \(\sigma_i^2\) is simply that observation's squared residual, optionally scaled by a finite-sample correction:

\[ \hat{\sigma}_i^2=\hat{\epsilon}_i^2 \;\;\text{(HC0)} \qquad \text{or} \qquad \hat{\sigma}_i^2=\frac{n}{n-k}\,\hat{\epsilon}_i^2 \;\;\text{(HC1)} \]

Thus our “ham sandwich” is:

\[ \underset{k \times k}{(X^{\intercal}X)}^{-1} \underset{k \times n}{X^{\intercal}} \left(\frac{n}{n-k} \times \underset{n \times n}{\mathrm{diag}(\hat{\epsilon}\odot \hat{\epsilon})} \right) \underset{n \times k}{X} \underset{k \times k}{(X^{\intercal}X)}^{-1} \]

Similar to homoskedastic errors, the things that impact the standard error of coefficient \(k\) are:

  • The variance of all the individual errors, \(\sigma_i^2\)
  • The variance of the feature \(V(X_k)\)
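
In code, the heteroskedastic meat is just the squared residuals placed on the diagonal (HC0), optionally scaled by \(n/(n-k)\) for the HC1 finite-sample correction. A minimal sketch, reusing the simulated X, e, n, k and XtX_inv from the snippet above:

Code
meat_hc0 = np.diag(e**2)              # each observation keeps its own squared residual
meat_hc1 = (n / (n - k)) * meat_hc0   # HC1 finite-sample correction

V_hc1 = XtX_inv @ X.T @ meat_hc1 @ X @ XtX_inv   # bread-meat-bread
print(np.sqrt(np.diag(V_hc1)))                   # heteroskedasticity-robust standard errors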

Cheese roll: Clustered Errors 🧀

Cluster-robust errors are needed if:

  • Errors vary for every individual: \(V(\epsilon_i)=\sigma_i^2\) for all observations i.e. still heteroskedastic
  • AND errors within the same cluster \(C_l\) are serially correlated: \(cov(\epsilon_i,\epsilon_{j}) \neq 0 \text{ if } \epsilon_i,\epsilon_j \in C_l\)
    • Note - errors between clusters are assumed not to be serially correlated though i.e. \(cov(\epsilon_i,\epsilon_{j}) =0 \text{ if } \epsilon_i \in C_l,\epsilon_j \in C_{m \neq l}\)

What does this look like for \(E[\epsilon\epsilon^{\intercal}]\)?

  • The diagonal of the matrix is the estimate of variance which is unique for each observation, \(\sigma_i^2\)
  • The off-diagonals are also populated with the covariance - but only when both observations are in the same cluster

Here is an example where observations 1, 2 and 3 are in cluster A, observations 4 and 5 are in cluster B, observations 6, 7 and 8 are in cluster C etc.

\[ \underset{n \times n}{\epsilon\epsilon^{\intercal}} \sim \begin{bmatrix} \epsilon_1^2 & \epsilon_1\epsilon_2 & \epsilon_1\epsilon_3 & 0 & 0 & 0 & 0 & 0 & 0\\ \epsilon_2\epsilon_1 & \epsilon_2^2 & \epsilon_2\epsilon_3 & 0 & 0 & 0 & 0 & 0 & 0\\ \epsilon_3\epsilon_1 & \epsilon_3\epsilon_2 & \epsilon_3^2 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & \epsilon_4^2 & \epsilon_4\epsilon_5 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & \epsilon_5\epsilon_4 & \epsilon_5^2 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & \epsilon_6^2 & \epsilon_6\epsilon_7 & \epsilon_6\epsilon_8 & 0\\ 0 & 0 & 0 & 0 & 0 & \epsilon_7\epsilon_6 & \epsilon_7^2 & \epsilon_7\epsilon_8 & 0\\ 0 & 0 & 0 & 0 & 0 & \epsilon_8\epsilon_6 & \epsilon_8\epsilon_7 & \epsilon_8^2 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \epsilon_{..}^2\\ \end{bmatrix} \]

Additionally, we have to apply a finite-sample correction to the degrees of freedom, based on the number of clusters \(n_c\).

Thus our cheese sandwich is:

\[ \underset{k \times k}{(X^{\intercal}X)}^{-1} \underset{k \times n}{X^{\intercal}} \left( \underbrace{\frac{n-1}{n-k}\frac{n_c}{n_c-1}}_{\text{Finite correction}} \times \underset{n \times n}{(\epsilon \epsilon^{\intercal})} \right) \underset{n \times k}{X} \underset{k \times k}{(X^{\intercal}X)}^{-1} \]

In this scenario, there are a few additional things that impact the standard error of coefficient \(k\) (a short numerical sketch follows this list):

  • If errors are correlated within clusters, this will increase the standard error.
  • If features are correlated within clusters, this will also increase the standard error (due to the off-diagonals in the error variance matrix).
  • If the within-cluster error and feature correlations share the same sign, the standard error increases further.
  • And, as before, the variance of the individual errors, \(\sigma_i^2\), and the variance of each feature, \(V(X_k)\), matter too.
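
Here is a minimal sketch of the cluster-robust calculation, again reusing the simulated data from above and assigning observations to some hypothetical clusters. The meat is built by summing \(X_c^{\intercal}\hat{\epsilon}_c\hat{\epsilon}_c^{\intercal}X_c\) over clusters \(c\), then applying the finite-sample correction:

Code
clusters = np.repeat(np.arange(10), n // 10)   # hypothetical labels: 10 equal-sized clusters
n_c = len(np.unique(clusters))

meat_cl = np.zeros((k, k))
for c in np.unique(clusters):
    mask = clusters == c
    Xc, ec = X[mask], e[mask]
    score = Xc.T @ ec                  # X'e within this cluster (a k-vector)
    meat_cl += np.outer(score, score)  # equals Xc' ec ec' Xc for this cluster

correction = ((n - 1) / (n - k)) * (n_c / (n_c - 1))   # finite-sample correction
V_cl = correction * (XtX_inv @ meat_cl @ XtX_inv)
print(np.sqrt(np.diag(V_cl)))                          # cluster-robust standard errors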

Coding it from scratch in Python

First we use the OLS parent class we defined previously in the BLUE OLS post:

OLS class (source code):
import numpy as np, pandas as pd
from typing import Optional, Tuple
from scipy.linalg import qr, solve_triangular
from scipy.stats import norm, t

class OLS:

    def _convert_types(self, z) -> np.ndarray:
        """Convert the input to a numpy array so it works nicely with the rest of the functions"""
        if type(z) in [pd.DataFrame, pd.Series]:
            z2 = z.to_numpy()
        elif type(z) == list:
            z2 = np.array(z)
        elif type(z) == np.ndarray:
            z2 = z
        else:
            raise TypeError('Array must be a pandas series/dataframe, numpy array or list')
        return z2
             
    def _get_y(self, y: Optional = None) -> np.ndarray:
        """Re-shape and convert y to numpy array to work nicely with rest of functions"""
        if y is None:
            y = self.y
        return self._convert_types(y).reshape(-1)

    def _get_X(self, X: Optional = None) -> Tuple[np.ndarray, np.ndarray]:
        """Re-shape and convert X to numpy array to work nicely with rest of functions
        Also return names for summarising in coefficient table"""
        if X is None:
            X = self.X
        X2 = self._convert_types(X)
        if type(X) == pd.DataFrame:
            exog_names = np.array(X.columns)
        elif type(X) == pd.Series:
            exog_names = np.array(['Unnamed Exog Feature'])
        else:
            exog_names = np.arange(X2.shape[1])
        X2 = X2.reshape(-1,len(exog_names))
        return (X2, exog_names)

    def _clear_fitted_attributes(self):
        """Clears fitted attributes of the class"""
        self.beta, self.RSS, self.beta_se, self.conf_int, self.test_stat, self.p_val = None, None, None, None, None, None
        
    def __init__(
        self, 
        y: Optional[np.ndarray] = None, 
        X: Optional[np.ndarray] = None
        ) -> None:
        """Initializes the OLS class to run an least-squares regression"""
        if y is not None:
            self.y = self._get_y(y)
            self.n = len(self.y)
        if X is not None:
            self.X, self.exog_names = self._get_X(X)
            self.k = self.X.shape[1]
        if y is not None and X is not None and len(self.y) != self.X.shape[0]:
            raise ValueError("y and X must be the same size.")
        self._clear_fitted_attributes()

    def _quick_matrix_invert(self, X: np.ndarray) -> np.ndarray:
        """ Find the inverse of a matrix, using QR factorization """
        Q, R = qr(X)
        X_inv = solve_triangular(R, np.identity(X.shape[1])).dot(Q.transpose())
        return X_inv

    def _check_if_fitted(self):
        """Quick helper function that raises an error if the model has not been fitted already"""
        if self.beta is None:
            raise ValueError('Need to fit the model first - run fit()')
        else:
            return True

    def _estimate_ols_coefs(
        self,
        y: Optional[np.ndarray] = None,
        X: Optional[np.ndarray] = None
    ):
        """Estimates the OLS coefficients given a vector y and matrix X"""
        XTX = X.T.dot(X)
        XTY = X.T.dot(y)
        XTX_inv = self._quick_matrix_invert(XTX)
        coefs = XTX_inv.dot(XTY)
        return coefs, XTX_inv

    def fit(
        self,
        y: Optional[np.ndarray] = None,
        X: Optional[np.ndarray] = None,
    ):
        """Estimates the OLS coefficients given a vector y and matrix X"""
        self._clear_fitted_attributes()
        y = self._get_y(y)
        X, exog_names = self._get_X(X)
        if y is None or X is None:
            raise ValueError('X and y is required for fitting')
        if len(y) != X.shape[0]:
            raise ValueError("y and X must be the same size.")
        self.y, self.X, self.exog_names = y, X, exog_names
        self.n, self.k = X.shape
        self.DoF = self.n - self.k
        self.beta, self.var_X_inv = self._estimate_ols_coefs(y,X)

    def _assess_fit(
        self,
        y: Optional[np.ndarray] = None,
        X: Optional[np.ndarray] = None,
    ) -> float:
        """Returns the unadjusted R^2"""
        self._check_if_fitted()
        if (y is None and X is not None) or (y is not None and X is None):
            raise ValueError('Need to either provide both X and y, (or provide neither and R^2 is based on the X and y used for fitting)')
        else:
            y, (X, exog_names) = self._get_y(y), self._get_X(X)
        y_hat = self.predict(X)
        residuals = (y - y_hat).reshape(-1, 1)
        RSS = residuals.T.dot(residuals)
        TSS = (y - y.mean()).T.dot(y - y.mean())
        unadj_r_squared = 1 - RSS/TSS
        
        if (y == self.y).all() and (X == self.X).all():
            self.residuals = residuals
            self.RSS = RSS
            self.TSS = TSS
            self.unadj_r_squared = unadj_r_squared
        return unadj_r_squared

    def _standard_error(self,) -> np.ndarray:
        """Returns the standard errors for the coefficients from the fitted model"""
        if self.RSS is None:
            self._check_if_fitted()
            self._assess_fit()
        sigma_sq = self.RSS / float(self.DoF) * np.identity(len(self.beta))
        var_b = sigma_sq.dot(self.var_X_inv)
        self.beta_se = np.sqrt(np.diag(var_b))
        return self.beta_se

    def _confidence_intervals(self, size = 0.95):
        """Returns the confidence intervals for the coefficients from the fitted model"""
        if self.beta_se is None:
            self._check_if_fitted()
            self._standard_error()
        alpha = 1-(1-size)/2
        self.conf_int = np.array([
            self.beta - t.ppf(alpha, self.DoF) * self.beta_se,
            self.beta + t.ppf(alpha, self.DoF) * self.beta_se
        ])
        return self.conf_int

    def _test_statistic(self) -> np.ndarray:
        """Returns the test statistics for the coefficients from the fitted model"""
        if self.conf_int is None:
            self._check_if_fitted()
            self.conf_int = self._confidence_intervals()
        self.test_stat = self.beta.flatten() / self.beta_se
        return self.test_stat

    def _p_value(self, z_dist: bool = False) -> np.ndarray:
        """Returns the p-values for the coefficients from the fitted model."""
        if self.test_stat is None:
            self._check_if_fitted()
            self.test_stat = self._test_statistic()
        if z_dist:
            self.p_val = [norm.cdf(-abs(z)) + 1 - norm.cdf(abs(z)) for z in self.test_stat]
        else:
            self.p_val = [2 * t.sf(abs(x), self.DoF) for x in self.test_stat]
        return self.p_val    

    def predict(
        self,
        X: Optional[np.ndarray] = None,
    ) -> np.ndarray:
        """Predict values for y. Returns fitted values if X not provided."""
        self._check_if_fitted()
        X2, exog_names = self._get_X(X)
        y_hat = X2.dot(self.beta)
        if X is None:
            self.y_hat = y_hat
        return y_hat

    def summary(self, z_dist: bool = False) -> pd.DataFrame:
        """Returns the coefficients, standard errors, test statistics and p-values in a Pandas DataFrame."""
        if self.p_val is None:
            self._check_if_fitted()
            self._p_value(z_dist)
        summary = pd.DataFrame(
            data={
                'Coefficient': self.beta.flatten(),
                'Standard Error': self.beta_se,
                'Lower bound': self.conf_int[0],
                'Upper bound': self.conf_int[1],
                'test-statistic': self.test_stat,
                'p-value': self.p_val,
            },
            index=self.exog_names,
        )
        return summary

Now we can define two new classes:

  • A new error_covariance class that returns an \(n \times n\) matrix of the estimated errors
  • A child ols_sandwich class that adapts the standard errors of the existing OLS class:

Sandwich class (source code):
from ols_blue import OLS
from scipy.sparse import bsr_array
import pandas as pd, numpy as np
from typing import Optional, Tuple

class error_covariance():

    def _homoskedastic(self) -> np.ndarray:
        """Return an error covariance matrix assuming homoskedasticity"""
        e = self.residuals
        ee = np.identity(self.n) * e.T.dot(e)
        ee /= float(self.DoF)
        return ee

    def _heteroskedastic(self, type: str='HC0') -> np.ndarray:
        """Return an error covariance matrix assuming heteroskedasticity (HC0/1)"""
        ee = np.diagflat(self.residuals**2)
        if type == 'HC1':
            ee *= self.n/self.DoF
        return ee

    def _clustered(self, clusters: np.ndarray, finite_sample_correction=True) -> np.ndarray:
        """Return an error covariance matrix assuming clustered errors"""
        ee = bsr_array((self.n, self.n)).toarray()
        for i in range(self.n):
            for j in range(self.n):
                if clusters[i] == clusters[j]:
                    ee[i, j] = self.residuals[i] * self.residuals[j]
        n_cl = len(np.unique(clusters))
        n, k = self.n, self.k
        if finite_sample_correction:
            ee *= ((n - 1) / (n - k)) * (n_cl / (n_cl - 1))
        return ee

class ols_sandwich(OLS, error_covariance):

    def __init__(self, y=None,X=None,residuals=None) -> None:
        """The sandwich class adapts the standard errors of existing OLS regressions"""
        super().__init__(y=y,X=X)

    def _XeeX(self, ee: np.ndarray):
        """Return the meat of the sandwich estimator"""
        return self.X.T.dot(ee).dot(self.X)

    def _sandwich(self, meat: np.ndarray, bread: Optional[np.ndarray] = None) -> np.ndarray:
        """Helper function to return a 'sandwich' from bread and meat"""
        if bread is None:
            bread = self.var_X_inv
        sandwich = bread.dot(meat).dot(bread)
        beta_se = np.sqrt(np.diag(sandwich))
        return beta_se
    
    def _cluster_robust_fast(self, clusters: np.ndarray):
        """Rather than specifying the full n x n error matrix before calculating the meat, we can instead calculate a "mini" meat per cluster and sum them up at the end"""
        def cluster_XeeX(cluster_index):
            j = clusters == cluster_index
            _X, _e = self.X[j, :], self.residuals[j]
            _eX = _e.T.dot(_X)
            return _eX.T.dot(_eX)
        clusters = clusters.flatten()
        unique_clusters = np.unique(clusters)
        XeeX_per_cluster = [cluster_XeeX(c) for c in unique_clusters]
        n_cl = len(unique_clusters)
        n, k = self.n, self.k
        # sum the per-cluster XeeX terms and apply the finite-sample correction factor
        XeeX = ((n - 1) / (n - k)) * (n_cl / (n_cl - 1)) * np.sum(XeeX_per_cluster, axis=0)
        # consistency relies on averaging across many clusters, see
        # https://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf (p.7-9)
        return XeeX
    
    def _beta_se(self, se_correction: Optional[str | np.ndarray] = None) -> np.ndarray:
        """Return the standard errors of the coefficients under the requested error structure"""
        self.beta_se, self.conf_int, self.test_stat, self.p_val = None, None, None, None
        if type(se_correction) == np.ndarray:
            ee = self._clustered(se_correction)
        elif se_correction == "homoskedastic":
            ee = self._homoskedastic()
        elif se_correction == "heteroskedastic":
            ee = self._heteroskedastic(type='HC1')
        else:
            raise ValueError(f"""Correction type {se_correction} not supported. Please choose from 'homoskedastic' or 'heteroskedastic', or supply an array for cluster groups""")
        XeeX = self._XeeX(ee)
        beta_se = self._sandwich(XeeX)
        return beta_se
    
    def summary(self, z_dist: bool = False, se_correction: Optional[str | np.ndarray] = None) -> pd.DataFrame:
        """Returns the coefficients, standard errors, test statistics and p-values in a Pandas DataFrame."""
        self._check_if_fitted()
        self._assess_fit()
        if se_correction is not None:
            self.beta_se = self._beta_se(se_correction)
        self._p_value(z_dist)
        summary = pd.DataFrame(
            data={
                'Coefficient': self.beta.flatten(),
                'Standard Error': self.beta_se,
                'Lower bound': self.conf_int[0],
                'Upper bound': self.conf_int[1],
                'test-statistic': self.test_stat,
                'p-value': self.p_val,
            },
            index=self.exog_names,
        )
        return summary

Now let’s test it! We will generate some data, run the OLS model, and compare our sandwich estimators to statsmodels:

Code
np.random.seed(42)
n, k = 50, 2
sigma_sq = 1
beta = np.random.normal(size=(k,1))
X = np.hstack([ 
    np.ones(n).reshape(n,1),
    np.random.normal(size=(n,k-1)) 
    ])
y = X.dot(beta) + np.random.normal(loc=0,scale=sigma_sq,size=(n,1))
cl = np.repeat(np.arange(10), 5)

our_mod = ols_sandwich(y,X)
our_mod.fit()
# our_mod.summary()

import statsmodels.api as sm

stats_mod = sm.OLS(y, X)

First let’s try homoskedastic errors:

Code
pd.DataFrame(
  np.hstack([
    our_mod.summary(se_correction="homoskedastic")['Standard Error'].to_numpy().reshape(-1,1),
    stats_mod.fit().bse.reshape(-1,1)
    ]),
  columns=['Our Standard Errors','StatsModel Standard Errors'],
)
   Our Standard Errors  StatsModel Standard Errors
0             0.131679                    0.131679
1             0.138426                    0.138426

And now heteroskedastic:

Code
pd.DataFrame(
  np.hstack([
    our_mod.summary(se_correction='heteroskedastic')['Standard Error'].to_numpy().reshape(-1,1),
    stats_mod.fit(cov_type='HC1').bse.reshape(-1,1)
    ]),
  columns=['Our Standard Errors','StatsModel Standard Errors'],
)
   Our Standard Errors  StatsModel Standard Errors
0             0.126831                    0.126831
1             0.108043                    0.108043

And finally, cluster-robust:

Code
pd.DataFrame(
  np.hstack([
    our_mod.summary(se_correction=cl)['Standard Error'].to_numpy().reshape(-1,1),
    stats_mod.fit(cov_type='cluster', cov_kwds={'groups': cl}).bse.reshape(-1,1)
    ]),
  columns=['Our Standard Errors','StatsModel Standard Errors'],
)
   Our Standard Errors  StatsModel Standard Errors
0             0.106352                    0.106352
1             0.067777                    0.067777

Some final reflections

Sandwich estimators correct the coefficient variance by adjusting the standard errors after the OLS coefficients have been estimated. However, another approach is to correct for the variance before estimation - by first applying a weight matrix to the data before fitting. You can read more about this in the GLS post.
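
For reference, that pre-weighting (generalised least squares) approach uses an assumed error covariance \(\Omega\) directly in estimation:

\[ \hat{\beta}_{GLS} = (X^{\intercal}\Omega^{-1}X)^{-1}X^{\intercal}\Omega^{-1}y, \qquad V[\hat{\beta}_{GLS}] = (X^{\intercal}\Omega^{-1}X)^{-1} \]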

Fin.