Applying a weight matrix to correct for non-homoskedastic error variance can be more efficient than OLS with sandwich errors, aka GLS.
Introduction
As seen when exploring sandwich estimators, the assumption of homoskedasticity is often violated in real-world data. This can lead to inefficient estimates and incorrect inference too.
Sandwich estimators correct for the variance by adjusting the standard error after OLS coefficient estimation. However, another approach is to correct for the variance before estimation - by first applying a weight matrix to the data before fitting. This is known as Generalized Least Squares (GLS). In fact, weighted least squares (WLS) is a special case of GLS.
This means that although the assumption of endogeneity is not violated, the assumption of homoskedasticity is. And more specifically, that the variance of the error term can be decomposed between into a constant variance \(\sigma^2\) and the error covariance matrix \(\Omega\).
Now according to Cholesky decomposition, if \(\Omega\) is symmetric positive definite, then there exists a lower triangular matrix \(\mathrm{P}\) such that:
So, if we transform all variables by \(\mathrm{P}\):
We get the following form: \(\mathrm{P} y = \mathrm{P} X\beta + \mathrm{P} \epsilon\)
Then the expected error is still zero (i.e. consistency): \(E[\mathrm{P} \epsilon] = \mathrm{P} E[\epsilon] = 0\)
But the variance is now homoskedastic: \(V[\mathrm{P} \epsilon] = \mathrm{P} V[\epsilon] \mathrm{P}^{\intercal} = \sigma^2 \mathrm{P} \Omega \mathrm{P}^{\intercal} = \sigma^2 I\)
It might be apparent now that Weighted Least Squares (WLS) is a special case of GLS, where \(\mathrm{P}\) is an error covariance matrix has zero off-digonal elements.
This is the motivation behind GLS. We can transform the data by \(\mathrm{P}\) to make the error variance homoskedastic, and then apply OLS to the transformed data.
Checking the estimator is unbiased and more efficient
From the above, we can use this to jump straight to the solution for the GLS estimator! \[
\displaylines{
\hat{\beta}_{GLS} =
[(\mathrm{P} X)^{\intercal}(\mathrm{P} X)]^{-1}[(\mathrm{P} X)^{\intercal}(\mathrm{P} y)] \\
=
[X^{\intercal}\mathrm{P}^{\intercal}\mathrm{P} X]^{-1}
[X^{\intercal}\mathrm{P}^{\intercal}\mathrm{P} y] \\
= [X^{\intercal}\Omega^{-1} X]^{-1}[X^{\intercal}\Omega^{-1} y]
}
\]
And since \(\Omega\) is Positive Semi Definite, then \(A \Omega A^{\intercal}\) is also PSD, so the GLS estimator is more efficient than the OLS estimator. For the proof of the last step, see the footnotes1.
However, this isn’t feasible unless we know what \(\Omega\) is! We usually have to estimate this.
Feasible Generalized Least Squares
The feasible GLS estimator is a two-step process:
Run a normal OLS regression, \(y \sim X\hat{\beta} + \hat{\epsilon}\). Now derive an error covariance matrix using the squared residuals from this OLS regression, \(\hat{\Omega} \sim \mathrm{V}[\hat{\epsilon}]\), .
Estimate \(\mathrm{P}\) using the Cholesky decomposition of \(\Omega\), and transform \(y\) and \(X\) by \(\mathrm{P}\). Then estimate another OLS model on the transformed data: \(\mathrm{P}y \sim [\mathrm{P}X]\beta' + \mathrm{P}\epsilon\)
Whilst GLS is more efficient, FGLS is only asymptotically more efficient - when the error covariance matrix is consistently estimated. In fact, for a small sample size, FGLS can be actually less efficient than OLS - and often it is even biased! It is only for large samples that FGLS would be preferred, as it is consistent.
Thus some authors prefer OLS, and use a sandwich estimator instead. Finally - its worth noting that we can still apply sandwich estimator to FGLS coefficients as well!
Final notes
It is worth noting here that the random effects econometric model is a special case of GLS. The random effects model is a panel data model that assumes that the unobserved heterogeneity within groups is both constant and uncorrelated with the other predictors. is a special case of GLS.
Coding it up from scratch
We can inherit from the OLS class, as this does a lot of the leg work! We need to make two changes though:
Add an extra method to estimate the covariance matrix and apply the Cholesky decomposition.
Change the _estimate_ls_coefs method to utilise this covariance matrix.
First we take the sandwich parent class we defined previously in the sandwich estimator post (note this also inherits from the OLS class - see that post here):
Sandwich class (source code):
from ols_blue import OLSfrom scipy.sparse import bsr_arrayimport pandas as pd, numpy as npfrom typing import Optional, Tupleclass error_covariance():def _homoskedastic(self) -> np.ndarray:"""Return an error covariance matrix assuming homoskedasticity""" e =self.residuals ee = np.identity(self.n) * e.T.dot(e) ee /=float(self.DoF)return eedef _heteroskedastic(self, type: str='HC0') -> np.ndarray:"""Return an error covariance matrix assuming heteroskedasticity (HC0/1)""" ee = np.diagflat(self.residuals**2)iftype=='HC1': ee *=self.n/self.DoFreturn eedef _clustered(self, clusters: np.ndarray, finite_sample_correction=True) -> np.ndarray:"""Return an error covariance matrix assuming clustered errors""" ee = bsr_array((self.n, self.n)).toarray()for i inrange(self.n):for j inrange(self.n):if clusters[i] == clusters[j]: ee[i, j] =self.residuals[i] *self.residuals[j] n_cl =len(np.unique(clusters)) n, k =self.n, self.kif finite_sample_correction: ee *= ((n -1) / (n - k)) * (n_cl / (n_cl -1))return eeclass ols_sandwich(OLS, error_covariance):def__init__(self, y=None,X=None,residuals=None) ->None:"""The sandwich class adapts the standard errors of existing OLS regressions"""super().__init__(y=y,X=X)def _XeeX(self, ee: np.ndarray):"""Return the meat of the sandwich estimator"""returnself.X.T.dot(ee).dot(self.X)def _sandwich(self, meat: np.ndarray, bread: Optional[np.ndarray] =None) ->None:"""Helper function to return a 'sandwich' from bread and meat"""if bread isNone: bread =self.var_X_inv sandwich = bread.dot(meat).dot(bread) beta_se = np.sqrt(np.diag(sandwich))return beta_sedef _cluster_robust_fast(self, clusters: np.ndarray):"""Rather than specifiying the individual errors before calculating the meat, we can instead calculate "mini" meats and sum them up at the end"""def cluster_XeeX(cluster_index): j = clusters == cluster_index _X, _e =self.X[j, :], self.residuals[j] _eX = _e.T.dot(_X) _XeeX = _eX.T.dot(_eX)return _XeeX clusters = clusters.flatten() cluster_XeeX = [cluster_XeeX(i) for i in np.unique(clusters)] n_cl =len(np.unique(clusters)) n, k =self.n, self.k# finite-sample correction factor. # sum XeeX across all clusters XeeX = ((n -1) / (n - k)) * (n_cl / (n_cl -1)) * np.sum(cluster_XeeX, axis=0)# summed - i.e. requires averaging, so needs many clusters to ensure V is consistent# https://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf (p.7-9)return XeeXdef _beta_se(self, se_correction: Optional[str| np.ndarray] =None) ->None:"""Return the standard errors of the coefficients"""self.beta_se, self.conf_int, self.test_stat, self.p_val=None, None, None, Noneiftype(se_correction) == np.ndarray: ee =self._clustered(se_correction)eliftype(se_correction) ==str:if se_correction =="homoskedastic": ee =self._homoskedastic()elif se_correction =="heteroskedastic": ee =self._heteroskedastic(type='HC1')else:raiseValueError(f"""Correction type {se_correction} not supported. Please choose from 'homoskedastic' or 'heteroskedastic', or supply an array for cluster groups""") XeeX =self._XeeX(ee) beta_se =self._sandwich(XeeX)return beta_sedef summary(self, z_dist: bool=False, se_correction: Optional[str] =None) -> pd.DataFrame:"""Returns the coefficients, standard errors, test statistics and p-values in a Pandas DataFrame."""self._check_if_fitted()self._assess_fit()if se_correction isnotNone:self.beta_se =self._beta_se(se_correction)self._p_value(z_dist) summary = pd.DataFrame( data={'Coefficient': self.beta.flatten(),'Standard Error': self.beta_se,'Lower bound': self.conf_int[0],'Upper bound': self.conf_int[1],'test-statistic': self.test_stat,'p-value': self.p_val, }, index=self.exog_names, )return summary
Now we can define a child GLS class, which utilises the parent classes in two key ways:
Uses the error covariance estimation methods from the sandwich class to estimate \(\Omega\)
Uses the OLS class functionality to estimate the coefficients (and then other methods from OLS and sandwich for the standard errors etc)
Code
class LS(ols_sandwich):def__init__(self, y: Optional[np.ndarray] =None, X: Optional[np.ndarray] =None, omega: Optional[np.ndarray] =None ) ->None:"""Initializes the LS class to run an least-squares regression"""super().__init__(y, X)self.omega = omegaself.P =Nonedef _estimate_gls_coefs(self, y: np.ndarray, X: np.ndarray, omega: np.ndarray):"""Estimates the GLS coefficients given a vector y and matrix X"""try: P = np.linalg.cholesky(omega) PX = P.dot(X) Py = P.dot(y) coefs, XTOX_inv =self._estimate_ols_coefs(Py,PX)except: omega_inv = np.linalg.inv(omega) XTO = X.T.dot(omega_inv) XTOX = XTO.dot(X) XTOX_inv =self._quick_matrix_invert(XTOX) XTOY = XTO.dot(y) coefs = XTOX_inv.dot(XTOY)return coefs, XTOX_invdef fit(self, y: Optional[np.ndarray] =None, X: Optional[np.ndarray] =None, omega: Optional[np.ndarray] =None, fgls =None, ):self._clear_fitted_attributes() y =self._get_y(y) X, exog_names =self._get_X(X)if y isNoneor X isNone:raiseValueError('X and y is required for fitting')iflen(y) != X.shape[0]:raiseValueError("y and X must be the same size.")self.y, self.X, self.exog_names = y, X, exog_namesself.n, self.k = X.shapeself.DoF =self.n -self.kif omega isnotNone:self.omega = omegaifself.omega isNone:self.beta, self.var_X_inv =self._estimate_ols_coefs(y,X)ifself.omega isnotNoneor fgls isnotNone:ifself.omega isnotNoneand fgls isnotNone:raiseValueError('Cannot specify both omega and fgls')elif fgls isnotNone:self._assess_fit()iftype(fgls) ==str:if fgls =="homoskedastic":self.omega =self._homoskedastic()elif fgls =="heteroskedastic":self.omega =self._heteroskedastic()eliftype(fgls) == np.ndarray:self.omega =self._clustered(fgls)self._clear_fitted_attributes()self.beta, self.var_X_inv =self._estimate_gls_coefs(y,X,self.omega)
Footnotes
To show that GLS is more efficient than the OLS estimator, we show that we can rewrite the difference in the variance of the two estimators as a positive semi-definite matrix \(\sigma^2 \left[ A \Omega A^{\intercal} \right]\): \[
\displaylines{
\begin{align}
A & =
(X^{\intercal}X)^{−1}X^{\intercal}− (X^{\intercal}\Omega^{−1}X)^{-1} X^{\intercal}\Omega^{−1}
\\
\therefore A^{\intercal} & = X(X^{\intercal}X)^{−1} − \Omega^{−1} X (X^{\intercal}\Omega^{−1}X)^{-1}
& \because \left[\Omega^{-1}\right]^{\intercal} = \Omega^{-1}
\\ \\
\Rightarrow A\Omega & =
(X^{\intercal}X)^{−1}X^{\intercal}\Omega − (X^{\intercal}\Omega^{−1}X)^{-1} X^{\intercal}
\\ \\
\Rightarrow A\Omega A^{\intercal}
& = (X^{\intercal}X)^{−1}X^{\intercal}\Omega X(X^{\intercal}X)^{−1} \\
& - (X^{\intercal}\Omega^{−1}X)^{-1} \underbrace{X^{\intercal} X(X^{\intercal}X)^{−1}}_{=I} \\
& - \underbrace{(X^{\intercal}X)^{−1}X^{\intercal}\Omega \Omega^{−1} X}_{=I} (X^{\intercal}\Omega^{−1}X)^{-1} \\
& + (X^{\intercal}\Omega^{−1}X)^{-1}X^{\intercal} \underbrace{\Omega^{−1} X (X^{\intercal}\Omega^{−1}X)^{-1}}_{=I} \\
\\
\Rightarrow \sigma^2 \left[ A \Omega A^{\intercal} \right]
& =
\underbrace{\sigma^2(X^{\intercal}X)^{−1}X^{\intercal}\Omega X(X^{\intercal}X)^{−1}}_{V[\beta^{OLS}]} - \underbrace{\sigma^2(X^{\intercal}\Omega^{−1}X)^{-1}}_{V[\beta^{GLS}]} \\
& \cancel{-\sigma^2(X^{\intercal}\Omega^{−1}X)^{-1}X^{\intercal} + \sigma^2(X^{\intercal}\Omega^{−1}X)^{-1}X^{\intercal}}
\end{align}
}
\]↩︎