Sciencing the data - Generalized Least Squares

What are we exploring?

Applying a weight matrix to correct for non-homoskedastic error variance can be more efficient than OLS with sandwich errors, aka GLS.

Introduction

As seen when exploring sandwich estimators, the assumption of homoskedasticity is often violated in real-world data. This can lead to inefficient estimates and incorrect inference too.

Sandwich estimators correct for the variance by adjusting the standard error after OLS coefficient estimation. However, another approach is to correct for the variance before estimation - by first applying a weight matrix to the data before fitting. This is known as Generalized Least Squares (GLS). In fact, weighted least squares (WLS) is a special case of GLS.

The Motivation

First - let’s state that:

\(E(\epsilon|X)=0\)
\(V(\epsilon \epsilon^{\intercal}|X)=\sigma^2\underset{n \times n}{\Omega}\)

This means that although the assumption of endogeneity is not violated, the assumption of homoskedasticity is. And more specifically, that the variance of the error term can be decomposed between into a constant variance \(\sigma^2\) and the error covariance matrix \(\Omega\).

Now according to Cholesky decomposition, if \(\Omega\) is symmetric positive definite, then there exists a lower triangular matrix \(\mathrm{P}\) such that:

\[ \displaylines{ \Omega=(\mathrm{P}^{\intercal}\mathrm{P})^{-1} = \mathrm{P}^{-1}(\mathrm{P}^{\intercal})^{-1} \\ \therefore \mathrm{P} \Omega \mathrm{P}^{\intercal} = I } \]

So, if we transform all variables by \(\mathrm{P}\):

We get the following form: \(\mathrm{P} y = \mathrm{P} X\beta + \mathrm{P} \epsilon\)
Then the expected error is still zero (i.e. consistency): \(E[\mathrm{P} \epsilon] = \mathrm{P} E[\epsilon] = 0\)
But the variance is now homoskedastic: \(V[\mathrm{P} \epsilon] = \mathrm{P} V[\epsilon] \mathrm{P}^{\intercal} = \sigma^2 \mathrm{P} \Omega \mathrm{P}^{\intercal} = \sigma^2 I\)

It might be apparent now that Weighted Least Squares (WLS) is a special case of GLS, where \(\mathrm{P}\) is an error covariance matrix has zero off-digonal elements.

This is the motivation behind GLS. We can transform the data by \(\mathrm{P}\) to make the error variance homoskedastic, and then apply OLS to the transformed data.

Checking the estimator is unbiased and more efficient

From the above, we can use this to jump straight to the solution for the GLS estimator! \[ \displaylines{ \hat{\beta}_{GLS} = [(\mathrm{P} X)^{\intercal}(\mathrm{P} X)]^{-1}[(\mathrm{P} X)^{\intercal}(\mathrm{P} y)] \\ = [X^{\intercal}\mathrm{P}^{\intercal}\mathrm{P} X]^{-1} [X^{\intercal}\mathrm{P}^{\intercal}\mathrm{P} y] \\ = [X^{\intercal}\Omega^{-1} X]^{-1}[X^{\intercal}\Omega^{-1} y] } \]

We can confirm this is unbiased: \[ \displaylines{ \begin{align} \hat{\beta}_{GLS} & = [X^{\intercal}\Omega^{-1} X]^{-1}[X^{\intercal}\Omega^{-1} y] \\ & = [X^{\intercal}\Omega^{-1} X]^{-1}\left[X^{\intercal}\Omega^{-1}(X\beta+\epsilon)\right] \\ & = \left([X^{\intercal}\Omega^{-1} X]^{-1}\left[X^{\intercal}\Omega^{-1}X\right]\right)\beta + \left([X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1}\right)\epsilon \\ & = \cancel{\left([X^{\intercal}\Omega^{-1} X]^{-1}\left[X^{\intercal}\Omega^{-1}X\right]\right)}\beta + \left([X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1}\right)\epsilon \\ & = \beta + [X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1}\epsilon \\ \\ \therefore \mathbb{E}[\hat{\beta}_{GLS}|X] & = \mathbb{E} \left[ \left(\beta + [X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1}\epsilon\right) |X\right] \\ & = \beta + \mathbb{E} \left[ \left([X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1}\epsilon\right) | X\right] \\ & \equiv \beta + [X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1} \mathbb{E} \left[ \epsilon | X\right] \\ & = \beta & \iff \mathbb{E}[\epsilon|X] = 0 \end{align} } \]

We now show that the GLS estimator is more efficient than OLS. Firstly, we derive the variance of the GLS coefficient:

\[ \displaylines{ \begin{align} \mathbb{V}[\hat{\beta}_{GLS}] & = \mathbb{E}[(\hat{\beta}_{GLS}-\beta)(\hat{\beta}_{GLS}-\beta)^{\intercal}] \\ & = \mathbb{E}[ ([X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1}\epsilon) ([X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1}\epsilon)^{\intercal} ] \\ & \equiv \mathbb{E}[ ([X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1})\epsilon \epsilon^{\intercal} (\Omega^{-1} X [X^{\intercal}\Omega^{-1} X]^{-1}) ] \\ & = ([X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1}) \underbrace{\mathbb{E}[\epsilon \epsilon^{\intercal}]}_{\text{meat} = \sigma^2\Omega} (\Omega^{-1} X [X^{\intercal}\Omega^{-1} X]^{-1}) \\ & = \sigma^2[X^{\intercal}\Omega^{-1} X]^{-1}X^{\intercal}\Omega^{-1} \cancel{\Omega} \cancel{\Omega^{-1}} X [X^{\intercal}\Omega^{-1} X]^{-1} \\ & = \sigma^2\cancel{[X^{\intercal}\Omega^{-1} X]^{-1}}\cancel{X^{\intercal}\Omega^{-1} X} [X^{\intercal}\Omega^{-1} X]^{-1} \\ & = \sigma^2[X^{\intercal}\Omega^{-1} X]^{-1} \end{align} } \]

So we can now compare that to the variance of the usual OLS estimator:

\[ \displaylines{ \begin{align} \mathbb{V}[\hat{\beta}_{OLS}] & = (X^{\intercal}X)^{-1}X^{\intercal} \underbrace{E[\epsilon\epsilon^{\intercal}]}_{\text{meat} = \sigma^2\Omega} X(X^{\intercal}X)^{-1} \\ & = \sigma^2(X^{\intercal}X)^{-1}X^{\intercal} \Omega X(X^{\intercal}X)^{-1} \\ \\ \therefore \mathbb{V}[\hat{\beta}_{OLS}] - \mathbb{V}[\hat{\beta}_{GLS}] & = \sigma^2 \left[ (X^{\intercal}X)^{-1}X^{\intercal} \Omega X(X^{\intercal}X)^{-1} - [X^{\intercal}\Omega^{-1} X]^{-1} \right] \\ & = \sigma^2 \left[ A \Omega A^{\intercal} \right] > 0 \\ \\ \text{where } A & = (X^′X)^{−1}X^′− (X^′\Omega^{−1}X)^{-1} X^′\Omega^{−1} \end{align} } \]

And since \(\Omega\) is Positive Semi Definite, then \(A \Omega A^{\intercal}\) is also PSD, so the GLS estimator is more efficient than the OLS estimator. For the proof of the last step, see the footnotes¹.

However, this isn’t feasible unless we know what \(\Omega\) is! We usually have to estimate this.

Feasible Generalized Least Squares

The feasible GLS estimator is a two-step process:

Run a normal OLS regression, \(y \sim X\hat{\beta} + \hat{\epsilon}\). Now derive an error covariance matrix using the squared residuals from this OLS regression, \(\hat{\Omega} \sim \mathrm{V}[\hat{\epsilon}]\), .
Estimate \(\mathrm{P}\) using the Cholesky decomposition of \(\Omega\), and transform \(y\) and \(X\) by \(\mathrm{P}\). Then estimate another OLS model on the transformed data: \(\mathrm{P}y \sim [\mathrm{P}X]\beta' + \mathrm{P}\epsilon\)

Whilst GLS is more efficient, FGLS is only asymptotically more efficient - when the error covariance matrix is consistently estimated. In fact, for a small sample size, FGLS can be actually less efficient than OLS - and often it is even biased! It is only for large samples that FGLS would be preferred, as it is consistent.

Thus some authors prefer OLS, and use a sandwich estimator instead. Finally - its worth noting that we can still apply sandwich estimator to FGLS coefficients as well!

Final notes

It is worth noting here that the random effects econometric model is a special case of GLS. The random effects model is a panel data model that assumes that the unobserved heterogeneity within groups is both constant and uncorrelated with the other predictors. is a special case of GLS.

Coding it up from scratch

We can inherit from the OLS class, as this does a lot of the leg work! We need to make two changes though:

Add an extra method to estimate the covariance matrix and apply the Cholesky decomposition.
Change the _estimate_ls_coefs method to utilise this covariance matrix.

First we take the sandwich parent class we defined previously in the sandwich estimator post (note this also inherits from the OLS class - see that post here):

Sandwich class (source code):

from ols_blue import OLS
from scipy.sparse import bsr_array
import pandas as pd, numpy as np
from typing import Optional, Tuple

class error_covariance():

    def _homoskedastic(self) -> np.ndarray:
        """Return an error covariance matrix assuming homoskedasticity"""
        e = self.residuals
        ee = np.identity(self.n) * e.T.dot(e)
        ee /= float(self.DoF)
        return ee

    def _heteroskedastic(self, type: str='HC0') -> np.ndarray:
        """Return an error covariance matrix assuming heteroskedasticity (HC0/1)"""
        ee = np.diagflat(self.residuals**2)
        if type == 'HC1':
            ee *= self.n/self.DoF
        return ee

    def _clustered(self, clusters: np.ndarray, finite_sample_correction=True) -> np.ndarray:
        """Return an error covariance matrix assuming clustered errors"""
        ee = bsr_array((self.n, self.n)).toarray()
        for i in range(self.n):
            for j in range(self.n):
                if clusters[i] == clusters[j]:
                    ee[i, j] = self.residuals[i] * self.residuals[j]
        n_cl = len(np.unique(clusters))
        n, k = self.n, self.k
        if finite_sample_correction:
            ee *= ((n - 1) / (n - k)) * (n_cl / (n_cl - 1))
        return ee

class ols_sandwich(OLS, error_covariance):

    def __init__(self, y=None,X=None,residuals=None) -> None:
        """The sandwich class adapts the standard errors of existing OLS regressions"""
        super().__init__(y=y,X=X)

    def _XeeX(self, ee: np.ndarray):
        """Return the meat of the sandwich estimator"""
        return self.X.T.dot(ee).dot(self.X)

    def _sandwich(self, meat: np.ndarray, bread: Optional[np.ndarray] = None) -> None:
        """Helper function to return a 'sandwich' from bread and meat"""
        if bread is None:
            bread = self.var_X_inv
        sandwich = bread.dot(meat).dot(bread)
        beta_se = np.sqrt(np.diag(sandwich))
        return beta_se
    
    def _cluster_robust_fast(self, clusters: np.ndarray):
        """Rather than specifiying the individual errors before calculating the meat, we can instead calculate "mini" meats and sum them up at the end"""
        def cluster_XeeX(cluster_index):
            j = clusters == cluster_index
            _X, _e = self.X[j, :], self.residuals[j]
            _eX = _e.T.dot(_X)
            _XeeX = _eX.T.dot(_eX)
            return _XeeX
        clusters = clusters.flatten()
        cluster_XeeX = [cluster_XeeX(i) for i in np.unique(clusters)]
        n_cl = len(np.unique(clusters))
        n, k = self.n, self.k
        # finite-sample correction factor.    # sum XeeX across all clusters
        XeeX = ((n - 1) / (n - k)) * (n_cl / (n_cl - 1)) * np.sum(cluster_XeeX, axis=0)
        # summed - i.e. requires averaging, so needs many clusters to ensure V is consistent
        # https://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf (p.7-9)
        return XeeX
    
    def _beta_se(self, se_correction: Optional[str | np.ndarray] = None) -> None:
        """Return the standard errors of the coefficients"""
        self.beta_se, self.conf_int, self.test_stat, self.p_val= None, None, None, None
        if type(se_correction) == np.ndarray:
            ee = self._clustered(se_correction)
        elif type(se_correction) == str:
            if se_correction == "homoskedastic":
                ee = self._homoskedastic()
            elif se_correction == "heteroskedastic":
                ee = self._heteroskedastic(type='HC1')
        else:
            raise ValueError(f"""Correction type {se_correction} not supported. Please choose from 'homoskedastic' or 'heteroskedastic', or supply an array for cluster groups""")        
        XeeX = self._XeeX(ee)
        beta_se = self._sandwich(XeeX)
        return beta_se
    
    def summary(self, z_dist: bool = False, se_correction: Optional[str] = None) -> pd.DataFrame:
        """Returns the coefficients, standard errors, test statistics and p-values in a Pandas DataFrame."""
        self._check_if_fitted()
        self._assess_fit()
        if se_correction is not None:
            self.beta_se = self._beta_se(se_correction)
        self._p_value(z_dist)
        summary = pd.DataFrame(
            data={
                'Coefficient': self.beta.flatten(),
                'Standard Error': self.beta_se,
                'Lower bound': self.conf_int[0],
                'Upper bound': self.conf_int[1],
                'test-statistic': self.test_stat,
                'p-value': self.p_val,
            },
            index=self.exog_names,
        )
        return summary

Now we can define a child GLS class, which utilises the parent classes in two key ways:

Uses the error covariance estimation methods from the sandwich class to estimate \(\Omega\)
Uses the OLS class functionality to estimate the coefficients (and then other methods from OLS and sandwich for the standard errors etc)

Code

class LS(ols_sandwich):

    def __init__(
        self, 
        y: Optional[np.ndarray] = None, 
        X: Optional[np.ndarray] = None,
        omega: Optional[np.ndarray] = None
        ) -> None:
        """Initializes the LS class to run an least-squares regression"""
        super().__init__(y, X)
        self.omega = omega
        self.P = None

    def _estimate_gls_coefs(self, y: np.ndarray, X: np.ndarray, omega: np.ndarray):
        """Estimates the GLS coefficients given a vector y and matrix X"""
        try:
            P = np.linalg.cholesky(omega)
            PX = P.dot(X)
            Py = P.dot(y)
            coefs, XTOX_inv = self._estimate_ols_coefs(Py,PX)
        except:
            omega_inv = np.linalg.inv(omega)
            XTO = X.T.dot(omega_inv)
            XTOX = XTO.dot(X)
            XTOX_inv = self._quick_matrix_invert(XTOX)
            XTOY = XTO.dot(y)
            coefs = XTOX_inv.dot(XTOY)
        return coefs, XTOX_inv
        
    def fit(
        self,
        y: Optional[np.ndarray] = None,
        X: Optional[np.ndarray] = None,
        omega: Optional[np.ndarray] = None,
        fgls = None,
    ):
        self._clear_fitted_attributes()
        y = self._get_y(y)
        X, exog_names = self._get_X(X)
        if y is None or X is None:
            raise ValueError('X and y is required for fitting')
        if len(y) != X.shape[0]:
            raise ValueError("y and X must be the same size.")
        self.y, self.X, self.exog_names = y, X, exog_names
        self.n, self.k = X.shape
        self.DoF = self.n - self.k
        if omega is not None:
            self.omega = omega
        if self.omega is None:
            self.beta, self.var_X_inv = self._estimate_ols_coefs(y,X)
        if self.omega is not None or fgls is not None:
            if self.omega is not None and fgls is not None:
                raise ValueError('Cannot specify both omega and fgls')
            elif fgls is not None:
                self._assess_fit()
                if type(fgls) == str:
                      if fgls == "homoskedastic":
                          self.omega = self._homoskedastic()
                      elif fgls == "heteroskedastic":
                          self.omega = self._heteroskedastic()
                elif type(fgls) == np.ndarray:
                    self.omega = self._clustered(fgls)
            self._clear_fitted_attributes()
            self.beta, self.var_X_inv = self._estimate_gls_coefs(y,X,self.omega)

Footnotes

To show that GLS is more efficient than the OLS estimator, we show that we can rewrite the difference in the variance of the two estimators as a positive semi-definite matrix \(\sigma^2 \left[ A \Omega A^{\intercal} \right]\): \[ \displaylines{ \begin{align} A & = (X^{\intercal}X)^{−1}X^{\intercal}− (X^{\intercal}\Omega^{−1}X)^{-1} X^{\intercal}\Omega^{−1} \\ \therefore A^{\intercal} & = X(X^{\intercal}X)^{−1} − \Omega^{−1} X (X^{\intercal}\Omega^{−1}X)^{-1} & \because \left[\Omega^{-1}\right]^{\intercal} = \Omega^{-1} \\ \\ \Rightarrow A\Omega & = (X^{\intercal}X)^{−1}X^{\intercal}\Omega − (X^{\intercal}\Omega^{−1}X)^{-1} X^{\intercal} \\ \\ \Rightarrow A\Omega A^{\intercal} & = (X^{\intercal}X)^{−1}X^{\intercal}\Omega X(X^{\intercal}X)^{−1} \\ & - (X^{\intercal}\Omega^{−1}X)^{-1} \underbrace{X^{\intercal} X(X^{\intercal}X)^{−1}}_{=I} \\ & - \underbrace{(X^{\intercal}X)^{−1}X^{\intercal}\Omega \Omega^{−1} X}_{=I} (X^{\intercal}\Omega^{−1}X)^{-1} \\ & + (X^{\intercal}\Omega^{−1}X)^{-1}X^{\intercal} \underbrace{\Omega^{−1} X (X^{\intercal}\Omega^{−1}X)^{-1}}_{=I} \\ \\ \Rightarrow \sigma^2 \left[ A \Omega A^{\intercal} \right] & = \underbrace{\sigma^2(X^{\intercal}X)^{−1}X^{\intercal}\Omega X(X^{\intercal}X)^{−1}}_{V[\beta^{OLS}]} - \underbrace{\sigma^2(X^{\intercal}\Omega^{−1}X)^{-1}}_{V[\beta^{GLS}]} \\ & \cancel{-\sigma^2(X^{\intercal}\Omega^{−1}X)^{-1}X^{\intercal} + \sigma^2(X^{\intercal}\Omega^{−1}X)^{-1}X^{\intercal}} \end{align} } \]↩︎