However, how do we know if these coefficients are the best ones we can estimate?
For the estimated coefficients to be the Best Linear Unbiased Estimator (i.e. “BLUE”):
The best estimator has to be unbiased: \(E[\hat{\beta}^*] = \beta\)
And among all possible linear, unbiased estimators, it must have the smallest variance: \(V[\hat{\beta}^{*}] < V[\hat{\beta}^{Z}]\)
We want to ensure our OLS estimate is the best, i.e. that \(\hat{\beta}^{OLS} = \hat{\beta}^{*}\). To achieve this, first we need to confirm it is unbiased. Then given this is true, we can check that the coefficient is most efficient vs all other unbiased estimators.
Gauss Markov Assumptions
Along the way, we will outline the Gauss-Markov assumptions utilised that ensure the OLS coefficient is BLUE.
Setting the scene
The true coefficient and model
\(\beta\) is the true (unobserved) relationship between all the relevant explanatory features, \(X\), and their associated observed outcomes, \(y\). In other words, if we observed the entire population of data, it is the relationship we would find.
Concretely, we assume the outcome is a linear function of all its relevant features. This “true model” perfectly predicts the outcome, except for random noise \(\epsilon\) that influences the observed outcome: \(y = X\beta + \epsilon\)
The estimated coefficient
\(\hat{\beta}\) is our estimated coefficient for the true relationship \(\beta\). In reality, we estimate \(\hat{\beta}\) from the small, finite sample of size \(n\) that is collected, not the whole population. Given any random sample could be collected, we can term the coefficient resulting from the optimum estimation proceedure to be \(\hat{\beta}^*\). We want to understand if \(\hat{\beta}^{OLS} = \hat{\beta}^*\).
The expected estimated coefficient
\(E[\hat{\beta}]\) is the “expected” estimated coefficient. Imagine we repeat the action of estimating the coefficient \(\hat{\beta}\) many times, each time collecting a new sample (where each observation is sampled i.i.d), and recording the value for the estimated coefficient. \(E[\hat{\beta}]\) would then be the average of all of those estimated coefficients. If the OLS coefficient is unbiased, then the expected coefficient estimate should be equal to the true one, \(E[\hat{\beta}^{OLS}]=\beta\).
The variance of the estimated coefficient
\(V[\hat{\beta}]\) is the variance of the estimated coefficient. It determines how much we might expect our estimate \(\hat{\beta}\) to differ from the true \(\beta\) for any sample drawn. Given the OLS coefficient has been shown to be unbiased, if it is BLUE we expect its variance to be lower than another other unbiased choice \(\hat{\beta}^{Z}\). Concretely, we want to find \(V[\hat{\beta}^{OLS}] < V[\hat{\beta}^{Z}]\).
Bias
Often our small finite samples of size \(n\) are not a perfect reflection of the population they are drawn from. This “sampling error” means we might estimate a different relationship between \(X\) and \(y\) than the true relationship of the population, i.e. \(\hat{\beta} \neq \beta\).
However, we should expect our estimated coefficient to be equal to the true value on average. This means we do not want to have a bias towards the estimate being systematically too small or too large, for example. In other words, if we repeated the whole proceedure thousands of times (each time taking new samples, and estimating a coefficient from the new sample) then the average of all the estimated coefficients values should be equal to the true value, i.e. \(E[\hat{\beta}] = \beta\).
Recall that we believe there is a true model that follows the form:
\[
y = X\beta + \epsilon
\]
GM1: Linearity
The formula above relies on the first Gauss-Markov assumption - that the dependent variable \(y\) is assumed to be a linear function of the variables \(X\). Note that implies that the proper functional form has been selected (i.e. the relationship is linear) and there are no omitted variables - a huge assumption!
If we substitute this into our estimated coefficient:
we show that the estimated coefficient \(\hat{\beta}^{OLS}\) will differ from the true value depending on the random error \(\epsilon\) associated with the particular finite sample collected.
Now let’s now take the expectation, to determine when the coefficient is unbiased. In other words, what is the “average” coefficient if we took the sample many times: \[
\displaylines{
\begin{align}
E[\hat{\beta}^{OLS}]
& = \beta +(X^{\intercal}X)^{-1}X^{\intercal}E[\epsilon]
\\ & = \beta & \iff E[\epsilon] & = 0
\end{align}
\\
}
\]
We find that the coefficient is unbiased as long as the expected error is also zero.
GM2: Strict Exogeneity
The second Gauss-Markov assumption is strict exogeneity, where the expected error is zero for all feature values: \(E[\epsilon|X] = 0\). By definition, the weaker exogeneity statement of \(E[\epsilon] = 0\) is implied by having the expected error conditional being equal to zero.
Efficiency
To be the “best”, the OLS estimator also needs to be efficient. This means that it has the lowest variance of all unbiased estimators. This section looks to prove this.
Coefficient variance for OLS
First, let’s derive the variance from the coefficients estimated using OLS, termed \(V[\hat{\beta}^{OLS}]\). As before, we substitute the true model \(y = X\beta + \epsilon\) into the coefficient estimated through OLS:
How can we jump to this result? Well we are making two assumptions:
No serial correlation: \(\rho_{\epsilon_{i},\epsilon_{i \neq j}} = 0\). No correlation between sample errors means that \(E[\epsilon_i \epsilon_{j \neq i}] = 0\), and hence the off-diagonals of the error covariance matrix are zero.
Homoskedasticity: the assumption of uniform error variance for all samples means that \(V[\epsilon_i^2] = V[\epsilon_{j \neq i}^2] = \hat{\sigma}^2\). And our best approximation for \(\hat{\sigma}^2\) is simply taking the average squared error: \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}{\epsilon_i^2}\)
GM3: Spherical errors
The third Gauss-Markov assumption is spherical errors, \(E[\epsilon\epsilon^{\intercal}|X] = 0\). This means that the outer product of the expected errors is a scalar matrix, which implies no serial correlation and homoskedasticity.
It is especially important to make the right assumptions about \(E[\epsilon\epsilon^{\intercal}|X]\) as it impacts where our estimate of the standard errors is correct! We will dive into what happens to SE if we violate these assumptions in another post.
Since we now see that \(\hat{\sigma}^2\) is a scalar matrix, we can thus simplfy the variance formula further:
Next step - lets formulate another estimator, \(\hat{\beta}^{z}\), which differs from \(\hat{\beta}^{OLS}\) by a non-zero matrix \(A\). See how they both differ below:
The matrix \(X\) must be of full rank \(k\), so that it is possible to invert the matrix \(X^{\intercal}X\). This is equivalent to having no perfect multi-collinearity.
We have now collected our full set of Gauss-Markov assumptions required for the OLS coefficient to be BLUE:
Now we will code up the above in Python to show how we can derive the OLS coefficient, show that it is unbiased, and that the confidence intervals are calculated correctly.
First we write an OLS class to calculate the OLS coefficients and their standard errors:
OLS class (source code):
import numpy as np, pandas as pdfrom typing import Optional, Tuplefrom scipy.linalg import qr, solve_triangularfrom scipy.stats import norm, tclass OLS:def _convert_types(self, z) -> np.ndarray:""" Re-shape and convert y to numpy array to work nicely with rest of functions """iftype(z) in [pd.DataFrame, pd.Series]: z2 = z.to_numpy()iftype(z) ==list: z2 = np.array(z)iftype(z) == np.ndarray: z2 = zelse:raiseTypeError('Array must be a pandas series/dataframe, numpy array or list')return z2def _get_y(self, y: Optional =None) -> np.ndarray:"""Re-shape and convert y to numpy array to work nicely with rest of functions"""if y isNone: y =self.yreturnself._convert_types(y).reshape(-1)def _get_X(self, X: Optional =None) -> Tuple[np.ndarray, np.ndarray]:"""Re-shape and convert X to numpy array to work nicely with rest of functions Also return names for summarising in coefficient table"""if X isNone: X =self.X X2 =self._convert_types(X)iftype(X) == pd.DataFrame: exog_names = np.array(X.columns)eliftype(X) == pd.Series: exog_names = np.array(['Unamed Exog Feature'])else: exog_names = np.arange(X2.shape[1]) X2 = X2.reshape(-1,len(exog_names))return (X2, exog_names)def _clear_fitted_attributes(self):"""Clears fitted attributes of the class"""self.beta, self.RSS, self.beta_se, self.conf_int, self.test_stat, self.p_val =None, None, None, None, None, Nonedef__init__(self, y: Optional[np.ndarray] =None, X: Optional[np.ndarray] =None ) ->None:"""Initializes the OLS class to run an least-squares regression"""if y isnotNone:self.y =self._get_y(y)self.n =len(self.y)if X isnotNone:self.X, self.exog_names =self._get_X(X)self.k =self.X.shape[1]if y isnotNoneand X isnotNoneandlen(self.y) !=self.X.shape[0]:raiseValueError("y and X must be the same size.")self._clear_fitted_attributes()def _quick_matrix_invert(self, X: np.ndarray) -> np.ndarray:""" Find the inverse of a matrix, using QR factorization """ Q, R = qr(X) X_inv = solve_triangular(R, np.identity(X.shape[1])).dot(Q.transpose())return X_invdef _check_if_fitted(self):"""Quick helper function that raises an error if the model has not been fitted already"""ifself.beta isNone:raiseValueError('Need to fit the model first - run fit()')else:returnTruedef _estimate_ols_coefs(self, y: Optional[np.ndarray] =None, X: Optional[np.ndarray] =None ):"""Estimates the OLS coefficients given a vector y and matrix X""" XTX = X.T.dot(X) XTY = X.T.dot(y) XTX_inv =self._quick_matrix_invert(XTX) coefs = XTX_inv.dot(XTY)return coefs, XTX_invdef fit(self, y: Optional[np.ndarray] =None, X: Optional[np.ndarray] =None, ):"""Estimates the OLS coefficients given a vector y and matrix X"""self._clear_fitted_attributes() y =self._get_y(y) X, exog_names =self._get_X(X)if y isNoneor X isNone:raiseValueError('X and y is required for fitting')iflen(y) != X.shape[0]:raiseValueError("y and X must be the same size.")self.y, self.X, self.exog_names = y, X, exog_namesself.n, self.k = X.shapeself.DoF =self.n -self.kself.beta, self.var_X_inv =self._estimate_ols_coefs(y,X)def _assess_fit(self, y: Optional[np.ndarray] =None, X: Optional[np.ndarray] =None, ) ->float:"""Returns the unadjusted R^2"""self._check_if_fitted()if (y isNoneand X isnotNone) or (y isnotNoneand X isNone):raiseValueError('Need to either provide both X and y, (or provide neither and R^2 is based on the X and y used for fitting)')else: y, (X, exog_names) =self._get_y(y), self._get_X(X) y_hat =self.predict(self.X) residuals = (y - y_hat).reshape(-1, 1) RSS = residuals.T.dot(residuals) TSS = (y - y.mean()).T.dot(y - y.mean()) unadj_r_squared =1- RSS/TSSif (y ==self.y).all() and (X ==self.X).all():self.residuals = residualsself.RSS = RSSself.TSS = TSSself.unadj_r_squared = unadj_r_squaredreturn unadj_r_squareddef _standard_error(self,) -> np.ndarray:"""Returns the standard errors for the coefficients from the fitted model"""ifself.RSS isNone:self._check_if_fitted()self._assess_fit() sigma_sq =self.RSS /float(self.DoF) * np.identity(len(self.beta)) var_b = sigma_sq.dot(self.var_X_inv)self.beta_se = np.sqrt(np.diag(var_b))returnself.beta_sedef _confidence_intervals(self, size =0.95):"""Returns the confidence intervals for the coefficients from the fitted model"""ifself.beta_se isNone:self._check_if_fitted()self._standard_error() alpha =1-(1-size)/2self.conf_int = np.array([self.beta - t.ppf(alpha, self.DoF) *self.beta_se,self.beta + t.ppf(alpha, self.DoF) *self.beta_se ])returnself.conf_intdef _test_statistic(self) -> np.ndarray:"""Returns the test statistics for the coefficients from the fitted model"""ifself.conf_int isNone:self._check_if_fitted()self.conf_int =self._confidence_intervals()self.test_stat =self.beta.flatten() /self.beta_sereturnself.test_statdef _p_value(self, z_dist: bool=False) -> np.ndarray:"""Returns the p-values for the coefficients from the fitted model."""ifself.test_stat isNone:self._check_if_fitted()self.test_stat =self._test_statistic()if z_dist:self.p_val = [norm.cdf(-abs(z)) +1- norm.cdf(abs(z)) for z inself.test_stat]else:self.p_val = [2* t.sf(abs(x), self.DoF) for x inself.test_stat]returnself.p_val def predict(self, X: Optional[np.ndarray] =None, ) -> np.ndarray:"""Predict values for y. Returns fitted values if X not provided."""self._check_if_fitted() X2, exog_names =self._get_X(X) y_hat = X2.dot(self.beta)if X isNone:self.y_hat = y_hatreturn y_hatdef summary(self, z_dist: bool=False) -> pd.DataFrame:"""Returns the coefficients, standard errors, test statistics and p-values in a Pandas DataFrame."""ifself.p_val isNone:self._check_if_fitted()self._p_value(z_dist) summary = pd.DataFrame( data={'Coefficient': self.beta.flatten(),'Standard Error': self.beta_se,'Lower bound': self.conf_int[0],'Upper bound': self.conf_int[1],'test-statistic': self.test_stat,'p-value': self.p_val, }, index=self.exog_names, )return summary
And now let’s use that code, and run 1000 simulations i.e. taking 1000 different samples from the population and running a regression each time:
Code
import numpy as npnp.random.seed(42)n, k =50, 2sigma_sq =1beta = np.random.normal(size=(k,1))bootstraps =1000coefs, CI_0, CI_1 = [], np.zeros(bootstraps), np.zeros(bootstraps)for i inrange(bootstraps): X = np.hstack([ np.ones(n).reshape(n,1), np.random.normal(size=(n,k-1)) ]) y = X.dot(beta) + np.random.normal(loc=0,scale=sigma_sq,size=(n,1)) OLS_model = OLS(y,X) OLS_model.fit()sum= OLS_model.summary() coefs.append(sum['Coefficient'].to_list())if (beta[0] >=sum.loc[0,'Lower bound'] and beta[0] <=sum.loc[0,'Upper bound']).all(): CI_0[i] =1if (beta[1] >=sum.loc[1,'Lower bound'] and beta[1] <=sum.loc[1,'Upper bound']).all(): CI_1[i] =1
By plotting a histogram of the estimated coefficients, we see that the estimated OLS coefficients center around the true coefficient, so empirically the coefficient appears unbiased:
Code
import plotly.graph_objects as gofig = go.Figure()c0 = [c[0] for c in coefs]c1 = [c[1] for c in coefs]fig.add_trace(go.Histogram(x=c0, name='Estimated Intercept: β0'))fig.add_trace(go.Histogram(x=c1, name='Estimated Slope: β1'))fig.update_layout( barmode='overlay', legend=dict(orientation="h"), shapes=[dict(type='line', line=dict(dash='solid',width=1), x0=np.mean(c0),x1=np.mean(c0), y0=0,y1=0.9, xref='x',yref='paper',# text="True coefficient = {:,.2f}".format(beta.tolist()[0][0]), ), dict(type='line', line=dict(dash='solid',width=1), x0=np.mean(c1),x1=np.mean(c1), y0=0,y1=0.9, xref='x',yref='paper',# text="True coefficient = {:,.2f}".format(beta.tolist()[1][0]), ) ], annotations=[dict( x = beta.tolist()[0][0], y =70, text="True coefficient = {:,.2f}".format(beta.tolist()[0][0]), ), dict( x = beta.tolist()[1][0], y =70, text="True coefficient = {:,.2f}".format(beta.tolist()[1][0]), )],)fig.update_traces(opacity=0.75)fig.show()
Further, we see that the true coefficient falls within the confidence interval 95% of the time, so empirically the standard errors appear correctly specified:
Code
fig = go.Figure()fig.add_trace( go.Scatter( x=np.arange(bootstraps), y=np.cumsum(CI_0)/(np.arange(bootstraps)+1), name='Intercept Estimation', mode='lines', ) )fig.add_trace( go.Scatter( x=np.arange(bootstraps), y=np.cumsum(CI_1)/(np.arange(bootstraps)+1), name='Slope estimation', mode='lines', ) )fig.update_xaxes(title="number of bootstraps")fig.update_yaxes(title="% time true coefficient falls within CI")fig.show()