Sciencing the data - Generalized Linear Models from scratch

What we are exploring

Deriving (and implementing in python) a generic way to optimize GLMs, by using the exponential family form for Poisson, Bernoulli and Gaussian distributions.

Introduction

There many types of generalized linear regression models: such as linear regression, logistic regression, Poisson regression etc. Every one of these models is made up of a “random component” and a “systematic component”. Each also has a “link function” that combines the random and systematic parts.

To make it easier to contrast and compare, first let’s take a normal (Gaussian) linear regression:

\[ \displaylines{ \mu_i = X_i^{\intercal}\beta \\ y_i \sim N(\mu_i,\sigma^2) } \]

In other words, for every observation \(i\):

The expected value \(\mathbb{E}[y_i|X_i]=\mu_i\) can be perfectly explained: by taking the dot product of the features and the true coefficients, so \(\mu_i = X_i^{\intercal}\beta\).
However, in reality the actual observation \(y_i\) varies around this expected value: its variation follows a normal distribution with uniform variance \(\mathbb{V}[y_i]=\sigma^2\), i.e. \(y_i \sim N(\mu_i,\sigma^2)\)
- This idiosyncratic error, \(y_i - \mu_i\), is irreducible (can never be predicted - it is random)
When regressing, we try to find the best coefficients \(\hat{\beta}\) to predict \(\hat{\mu}_i=X_i^{\intercal}\hat{\beta}\) by minimizing the residuals equally across all observations (so that, hopefully, only the idiosyncratic error remains).

Code to generate Gaussian graph

import numpy as np, pandas as pd
# from math import pi, factorial
from scipy.stats import norm, poisson, gamma
import statsmodels.api as sm
import plotly.graph_objects as go
from plotly.subplots import make_subplots

np.random.seed(42)

# Generate some X
n = 300
X = np.random.rand(n,3).astype('float64') + 2

# Generate some y using true coefficients β
true_beta = np.random.normal(0.5,0.1,3).reshape(-1,1)
eta = X.dot(true_beta)

# Sort data by eta
eta = eta.ravel()
idxs = np.argsort(eta)
X, eta = X[idxs], eta[idxs]

# Estimate coefficients to the data
sigma_sq=0.5
y = np.random.normal(eta,sigma_sq).reshape(-1,1)

fig = make_subplots(
  rows=2,cols=1,
  subplot_titles=(
    "Relationship is identity: E[y|X,β] = μ = X'β",
    "Variance is constant: y ~ N(μ,σ^2)",
    ),
  specs=[
    [{"type": "scatter", "l": 0.2, "r": 0.2, "b":0}],
    [{"type": "scatter3d", "b": 0.1, "t": 0}]
  ],
  row_heights = [2,5],
  horizontal_spacing = 0,
)

fig.add_trace(
  go.Scatter(
    x=eta, 
    y=y.flatten(), 
    mode='markers', name='Observations',
    marker=dict(size=2, color="blue"),
  ),
  row=1, col=1
)

fig.update_xaxes(title=dict(text="μ = X'β", font_size=16),row=1, col=1)
fig.update_yaxes(title=dict(text="y", font_size=16),row=1, col=1)

fig.add_trace(
  go.Scatter(
    x=eta, 
    y=eta,
    line_shape='spline',
    mode='lines',
  ),
  row=1, col=1
)

# 3d plot of data
fig.add_trace(
  go.Scatter3d(
    x=eta, 
    y=y.flatten(),
    z=norm.pdf(y.flatten(), eta, sigma_sq),
    marker=dict(size=1, color="blue"),
    mode='markers', name='Observations'
  ),
  row=2, col=1
)

# 3d line of data
fig.add_trace(
  go.Scatter3d(
    x=eta, 
    y=eta,
    z=norm.pdf(x=eta,loc=eta,scale=sigma_sq),
    line=dict(color="red"),
    mode='lines',
  ),
  row=2, col=1
)

gaus_x = np.arange(3,4.5,0.05)
gaus_y = np.arange(2,6,0.05)
xv, yv = np.meshgrid(gaus_x, gaus_y)
gaus_pdf = norm.pdf(yv, xv, sigma_sq)

fig.add_trace(
  go.Surface(
    x=xv, y=yv, z=gaus_pdf, 
    opacity=0.2,
    colorscale=[(0, 'gray'), (1, 'gray')],
    showscale=False,
    contours = dict(x= {"show": True})
  ),
  row=2, col=1
)

fig.update_layout(
    scene = dict(
      aspectratio=dict(x=1.5,y=1.8,z=1),
      xaxis_title="μ = X'β",
      yaxis_title="y",
      zaxis_title="N(μ,σ^2)",
    ),

    scene_camera = dict(
        eye = dict(x=1.8, y=-1.5, z=0.3),
        center = dict(x=0,y=0,z=-0.35),
    ),
    showlegend=False,
  )

fig.show()

The top plot shows the linear (identity) relationship linking the expected value of \(y_i\) to the prediction \(\mu_i = X_i^{\intercal}\beta\). We see \(y\) is equally dispersed for all values of \(\mu\) (equal to \(\sigma^2\)).

The bottom plot is identical to the first, but adds a third axis with the probability density of the Gaussian distribution. The peak of the probability density is at its expected value \(\mu_i\), the density is symmetric around \(\mu_i\) and the variance is constant \(\sigma^2\).

With this in mind, let’s now go through Poisson regression, to compare and contrast:

Since \(\ln{(\lambda_i)} = X_i^{\intercal}\beta\), then by raising both sides to the exponential it follows that \(\lambda_i = e^{X_i^{\intercal}\beta}\)

\[ \displaylines{ \ln{(\lambda_i)} = X_i^{\intercal}\beta \\ y_i \sim Pois(\lambda_i) \\ } \]

There’s lots of similarities, but lots of differences too:

The expected value \(\mathbb{E}[y_i|X_i]=\lambda_i\) can be perfectly explained: also by taking the dot product of the features and the true coefficients, but then raising it to the exponential as well: \(\lambda_i=e^{X_i^{\intercal}\beta}\)
In reality, the actual observation \(y_i\) varies around the expected value: but this time its variation follows a Poisson distribution. The Poisson variance is equal to its expected value \(\mathbb{V}[y_i] = \lambda_i\): thus we expect higher variance at higher counts.
A Poisson regression permits larger residuals at higher counts vs gaussian regression. Consequently, this impacts how we aim to minimize residuals (i.e. by not shrinking all residuals with equal weight) and hence this also impacts how we estimate our ideal coefficients \(\beta^*\).

Code to generate Poisson graph

# Poisson:
y3 = np.random.poisson(np.exp(eta))

fig = make_subplots(
  rows=2,cols=1,
  subplot_titles=(
    "Relationship is exponential: E[y|X,β] = λ = exp(X'β)",
    "Variance increases with λ: y ~ Pois(λ)",
    ),
  specs=[
    [{"type": "scatter", "l": 0.2, "r": 0.2, "b":0}],
    [{"type": "scatter3d", "b": 0.1, "t": 0}]
  ],
  row_heights = [2,5],
  horizontal_spacing = 0,
)

fig.add_trace(
  go.Scatter(
    x=eta, 
    y=y3.flatten(), 
    mode='markers', name='Observations',
    marker=dict(size=2, color="blue"),
  ),
  row=1, col=1
)

fig.update_xaxes(title=dict(text="ln(λ) = X'β", font_size=16),row=1, col=1)
fig.update_yaxes(title=dict(text="y", font_size=16),row=1, col=1)

fig.add_trace(
  go.Scatter(
    x=eta, 
    y= np.exp(eta),
    line_shape='spline',
    mode='lines', name='Poisson fit',
  ),
  row=1, col=1
)

# 3d plot of data
fig.add_trace(
  go.Scatter3d(
    x=eta, 
    y=y3.flatten(),
    z=poisson.pmf(y3.flatten(), np.exp(eta)),
    marker=dict(size=1, color="blue"),
    mode='markers', name='Observations'
  ),
  row=2, col=1
)

# 3d line of data
fig.add_trace(
  go.Scatter3d(
    x=eta, 
    y=np.exp(eta),
    z=gamma.pdf(x=np.exp(eta),a=np.exp(eta)+1),
    line=dict(color="red"),
    mode='lines', name='Poisson fit',
  ),
  row=2, col=1
)

pois_mu = np.arange(3,4.5,0.05)
pois_z = np.arange(0,100,1)
xv, yv = np.meshgrid(pois_mu, pois_z)
pois_pmf = poisson.pmf(yv, np.exp(xv))

fig.add_trace(
  go.Surface(
    x=xv, y=yv, z=pois_pmf, 
    opacity=0.2,
    colorscale=[(0, 'gray'), (1, 'gray')],
    showscale=False,
    contours = dict(x= {"show": True})
  ),
  row=2, col=1
)

fig.update_layout(
    scene = dict(
      aspectratio=dict(x=1.5,y=1.8,z=1),
      xaxis_title="ln(λ) ~ X'β",
      yaxis_title="y",
      zaxis_title="Pois(λ)",
    ),
    
    scene_camera = dict(
        eye = dict(x=1.8, y=-1.5, z=0.3),
        center = dict(x=0,y=0,z=-0.35),
    ),
    showlegend=False,
  )

fig.show()

The top plot shows the exponential relationship linking the systematic component and the observed count \(y\), where \(y\) is more dispersed at higher values of \(\lambda\) (higher variance).

The bottom plot is identical to the first, but adds a third axis with the probability mass of the Poisson distribution. Again, the higher variance makes the peak of the Poisson probability mass function lower, and its probability density is more spread.

It can now help to start to denote this into its random and systematic components, and their link function: \[ \displaylines{ \underbrace{y_i \sim Pois(\lambda_i)}_{\text{random}} \\ \underbrace{\ln{(\lambda_i)}}_{\text{link function}} = \underbrace{X_i^{\intercal}\beta}_{\text{systematic}} } \]

Let’s dive into this in a bit more detail:

Random component

The random component determines how we want to model the distribution of \(y\). For example, if \(y_i\) is a count outcome, then it could be well suited to a Poisson distribution:

\[ y_i \sim Pois(\lambda_i) \]

The expected rate of the count is \(\lambda_i\). This means we expect \(y_i\) to be around \(\lambda_i\), i.e. \(\mathbb{E}[y_i|X_i]=\lambda_i\). However, for any individual observation \(i\), the actual observed \(y_i\) will vary above and below this expected rate. By using Poisson, we assume the variance is equal to the mean rate: \(\mathbb{V}[y_i|X_i] = \lambda_i\). Thus we permit the dispersion of observations to increase if the expected rate is higher.

This is why it is called the “random component”: since \(y_i\) varies randomly around \(\lambda_i\), following the Poisson distribution, it is a random variable.

But how do we find a good estimation for \(\mathbb{E}[y_i|X_i]=\lambda_i\)? Concretely, how do we best map our independent variables \(X_i\) to \(\lambda_i\)? This is down to our systematic component and link function.

Systematic component and link function

In most cases, the systematic component \(\eta(X)\) is usually just a linear transformation of \(X\), most often the result of multiplying each value by some good fitted coefficients \(\beta\). It is deterministic (non-random), so it is systematic.

In constrast, Generalized Additive Models use a sum of smooth functions of the features instead: \[ \eta(X_i) = \beta_0 + f_1(X_{i1}) + f_2(X_{i2}) + \ldots \] The functions can be parametric (e.g. polynomial) or non-parametric (e.g. smoothing splines).

\[ \eta(X_i) = X_i^{\intercal}\beta \equiv \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \ldots \]

The link function is a way of choosing how to map the systematic component to the natural parameter of the random component. For example, in Poisson regression, we use a log link function, which means the systematic component predicts the natural log of the mean rate of the count.

\[ \ln{(\lambda_i)} = \eta(X) = X^{\intercal}\beta \]

Okay, so now we want to find the best values for \(\beta\). These coefficients will transform our features \(X_i\) to make the best predictions for \(\ln{(\lambda_i)}\), given we want to predict \(y_i\) as accurately as possible (but permit larger residuals when \(\lambda_i\) is larger, following the Poisson distribution).

To achieve this: we can try some initial coefficients, calculate the cost function and its first derivative, update the coefficients, and continue to minimize the cost function through gradient descent.

But how cumbersome and error prone would it be to write bespoke code for every type of distribution we want to model! Wouldn’t it be nice if we can derive a generic representation for the cost function and its first derivative, so that we can re-use the same code for every type of regression?

Exponential Dispersion Family of Distributions

Notation for Generalized linear models

It can be shown that many common distributions can be reformulated into the “Exponential Dispersion Family of Distributions”. This generic representation makes it easier to re-use the same code to run a regression.

First let’s define some generic notation for generalized linear models:

\[ \displaylines{ \underbrace{g{(\xi_i)}}_{\text{link function}} = \underbrace{\eta(X_i)}_{\text{systematic}} \\ \underbrace{y_i \sim f_y(\xi_i,\phi_i)}_{\text{random}} } \]

\(\xi_i\) is a shape parameter, governing the shape of the distribution (e.g. in Poisson, the expected value is the mean \(\xi_i=\lambda_i\)). Or for Bernoulli, \(\xi_i=p_i\). In fact for many distributions, \(\xi_i\) is the expected value of \(y_i\) i.e. \(\xi_i = \mu_i\).

\(\phi\) is a dispersion parameter, governing the spread of the data (e.g. Gaussian has the variance \(\phi = \sigma^2\)). It is not always necessary if the dispersion is determined by the shape parameter already (e.g. in Poisson the variance is already equated to the expected rate \(\lambda_i\)).

Here are some examples of the common forms of Poisson, Bernoulli and Gaussian probability distribution functions \(y \sim f_{\theta_i}(\mu_i)\), along with some choices for link functions \(g(\mu)\) to use in regression:

Type	Probability Density Function:	Link Function for Regression:
Count \(\xi_i = \mu_i \equiv \lambda_i\)	\(y_i \sim Pois(\mu_i)\) \(y_i \sim \frac{\mu_i^{y} \times e^{-\mu_i}}{y!}\)	\(g(\mu) = \ln{[\mu_i]}\) (log-link)
Binary \(\xi_i = \mu_i \equiv p_i\)	\(y_i \sim Bern(\mu_i)\) \(y_i \sim \mu_i^y \times (1-\mu_i)^{(1-y)}\)	\(g(\mu) = \ln{\left[\frac{\mu_i}{1-\mu_i}\right]}\) (logit-link)
Binary \(\xi_i = \mu_i \equiv p_i\)	\(y_i \sim Bern(\mu_i)\) \(y_i \sim \mu_i^y \times (1-\mu_i)^{(1-y)}\)	\(g(\mu) = \Phi^{-1}[\mu_i]\) (probit-link)
Normal \(\xi_i = \mu_i\)	\(y_i \sim N(\mu_i,\sigma^2)\) \(y_i \sim \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\left(\frac{y_i-\mu_i}{\sigma^2}\right)^2}\)	\(g(\mu) = \mu\) (identity-link)

Formulating Gaussian, Bernoulli and Poisson distributions into the exponential family

All of the probability density functions above - Poisson, Gaussian, Bernoulli - can be rewritten into a the exponential family form!

See the footnotes that dive into the derivations for the exponential forms for each of the Poisson¹, Gaussian² and Bernoulli³ distributions.

\[ \displaylines{ \begin{align} f(y;\xi,\phi) & = \exp{ \left\{ \frac{T(y) r(\xi) - b(\xi)}{a(\phi)} + c(y,\phi) \right\} } & \tag{1.1} \\ \\ Pois(y;\xi,\phi) & = \frac{\mu_i^{y_i}e^{-\mu_i}}{y_i!} \\ & \equiv \exp{ \left\{ \frac{ \underbrace{y_i}_{T(y_i)} \underbrace{\ln{ \left[ \xi_i \right]}}_{r(\xi_i)} - \underbrace{\mu_i}_{b(\xi_i)} }{ \underbrace{1}_{a(\phi)} } - \underbrace{\ln{ \left[ y_i! \right] }}_{c(y,\phi)} \right\}} \tag{1.2} \\ \\ Bern(y;\xi,\phi) & = \xi_i^{y_i}(1-\xi_i)^{1-y_i} \\ & \equiv \exp{ \left\{ \frac{ \underbrace{y_i}_{T(y)} \underbrace{ \ln{ \left[\frac{\xi_i}{1-\xi_i} \right] } }_{r(\xi_i)} + \underbrace{ \ln{[1-\xi_i]} }_{b(\xi_i)} }{ \underbrace{1}_{a(\phi)} } + \underbrace{0}_{c(y_i,\phi)} \right\}} \tag{1.3} \\ \\ N(y;\xi,\phi) & = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{ \left\{-\frac{(y_i-\xi_i)^2}{2\sigma^2} \right\} } \\ & \equiv \exp{\left\{ \frac{ \underbrace{y_i}_{T(y_i)} \underbrace{\xi_i}_{r(\xi_i)} - \underbrace{\frac{\mu_i^2}{2}}_{b(\xi_i)} }{ \underbrace{\sigma^2}_{a(\phi)} } - \underbrace{ \frac{1}{2} \left( \frac{y_i^2}{\sigma^2} + \ln{ \left[ 2 \pi \sigma^2 \right]} \right) }_{c(y_i,\phi)} \right\}} \tag{1.4} \end{align} } \]

The “canonical” link function

You may have noticed that \(r(\xi_i)\) looks very familiar… in fact, it is the link functions we commonly use for that regression! For example in logistic regression, we use the logit link function \(g(\xi_i) = \ln{\left[\frac{\xi_i}{1-\xi_i}\right]} = \eta(X)\).

Canonical link function

For any exponential distribution, \(r(\xi_i)\) is the default choice for link function: the so-called “canonical link function”.

In fact, if the canonical link function is used, so \(g(\xi_i)=r(\xi_i)\), and no transformation means \(T(y_i) = y_i\), then we can simplify the formula to its “canonical form”, where \(\theta_i = g(\xi_i)\):

\[ \displaylines{ \begin{align} f(y;\xi,\phi) & = \exp{ \left\{ \frac{T(y) r(\xi) - b(\xi)}{a(\phi)} + c(y,\phi) \right\} } \\ \\ \equiv f(y; \theta_i, \phi) & = \exp{ \left\{ \frac{y \theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\} } & \tag{2.1} \\ \\ & \iff T(y) = y; \text{ } g(\xi) = r(\xi) = \theta \end{align} } \]

Where this falls down is when you don’t use the canonical link function, such as in probit regression. While the canonical link function for Bernoulli is Logit, the Probit link is the inverse cumulative distribution function of the standard normal \(g(\xi_i) = \Phi^{-1}[\xi_i] \neq r(\xi_i)\). So because it is not canonical, it’s a bit more complicated to deal with (which we explore in a later post).

Some intutition into the terms:

We go into detail into the intution behind the other terms in the footnotes⁴, but for now, here is a quick summary:

\(a(\phi)\) normalizes the pdf using the dispersion parameter \(\phi\). E.g. for Gaussian regression, it is the variance of the residuals, \(a(\phi) = \sigma^2\).
\(c(y,\phi)\) is an adjustment function that ensures the pdf sums to one. For example, the exponential form of the Poisson distribution would sum to more than one if it wasn’t included.
\(b(\theta)\) is the integral of the inverse of the link function. This might seem unintuitive right now, but it is easier to see when we derive the generic cost function and minimize it (and you can find a detailed breakdown here⁵).

Applying the exponential families to regression

So far we have reformulated Poisson, Binomial and Gaussian regressions into the exponential family form. So now we want to derive a “generic-use formula” to estimate the best coefficients via maximum likelihood estimation for any GLM.

Cost function

First let’s define our generic cost function in negative log-likelihood form, assuming the canonical link function is used (so in terms of \(\theta_i\)):

Maximising the likelihood is hard, because it involves calculating the total product across every observation. Instead, taking the log likelihood makes everything a sum, far easier to calculate. Also, taking the negative ensures we are looking to minimize the cost.

\[ \displaylines{ \begin{align} L(\theta) & = \prod_{i=1}^n{ f(y_i;\theta_i,\phi) } \\ & = \prod_{i=1}^n{ \exp{ \left\{ \frac{y \theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\} } } \\ \therefore \mathcal{L}(\theta) & = -\ln \left\{ \prod_{i=1}^n{ \exp{ \left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right\} } } \right\} \\ & = -\sum_{i=1}^n{ \left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right\} } \\ & = - \frac{1}{a(\phi)} \sum_{i=1}^n{ \bigg\{ y_i \theta_i - b(\theta_i) \bigg\} } - \sum_{i=1}^n{ \bigg\{ c(y_i,\phi) \bigg\}} \end{align} } \]

Optimization

Common methods for optimization include “Newton-Raphson”, “Fisher-Scoring”, “Iteratively-reweighted Least Squares” or “Gradient Descent”. For all of these we need to derive the “score” (or “informant”): the first derivative of the negative log likelihood with respect to \(\theta\). Second order optimization methods, like Newton’s method, also need the second differential: using the inverse hessian as the learning rate instead helps the regression converge faster than gradient descent.

\[ \displaylines{ \begin{align} \beta_{\text{new}} & := \beta_{\text{old}} - \underbrace{\frac{\partial^2 \mathcal{L}(\theta)}{\partial\theta\theta^{\intercal}}^{-1}}_{\text{Learning rate}} \times \underbrace{\frac{\partial \mathcal{L}(\theta)}{\partial\theta}}_{\text{Score}} \\ \\ & \equiv \beta_{\text{old}} - H^{-1} \nabla J \end{align} } \]

Informant (score function)

To run conduct any of these optimizations, we need to derive the “score” (or “informant”): the first derivative of the negative log likelihood. So first we want to minimize this generic cost function with respect to \(\theta\):

It is trival to connect \(\theta\) to the coefficients \(\beta\), since

\[ \displaylines{ \frac{\partial\theta}{\partial\beta_j}=\frac{\partial} {\partial\beta_j}[\eta(X)]=X_j \\ \iff \eta(X)=X^{\intercal}\beta } \]

\[ \displaylines{ \begin{align} \arg \min_\beta \left[ \mathcal{L}(\theta) \right] & = \arg \min_\beta \left[ -\frac{1}{a(\phi)} \sum_{i=1}^n{ \bigg\{ y_i \theta_i - b(\theta_i) \bigg\} } \underbrace{\cancel{ -\sum_{i=1}^n{ \bigg\{ c(y_i,\phi) \bigg\}} }}_{ \because c() \text{ is constant with respect to } \theta } \right] \\ \\ \therefore \frac{\partial \mathcal{L}(\theta)}{\partial\beta_i} & = - \frac{1}{a(\phi)} \sum_{i=1}^n{ \bigg\{ y_i \underbrace{\frac{\partial \theta}{\partial \beta_i}}_{=X_i} - \frac{\partial b(\theta)}{\partial \theta} \underbrace{\frac{\partial \theta}{\partial \beta_i}}_{=X_i} \bigg\} } \\ & \equiv \underbrace{\frac{1}{a(\phi)}}_{\text{Constant}} \sum_{i=1}^n{ \bigg\{ \left( b'(\theta) - y_i \right) X_i \bigg\} } \\ \\ \end{align} } \]

The cost is minimized by differentiating with respect to \(\theta\). Since the derivative of \(c(.)\) is zero, it drops out.

So hopefully this now gives a bit more intuition behind why \(b(\theta)\) is the integral of the inverse canonical link. Since \(b'(\theta)\) is the activation function, we can see that the score function is just the difference between the predicted value and the actual value. This is the same as the gradient of the cost function used for gradient descent!

This also uncovers an interesting property of GLMs - that the average prediction \(b'(\theta)\) must be equal to the average value of \(\bar{y}\) too:

\[ \displaylines{ \begin{align} \frac{\partial \mathcal{L}(\theta)}{\partial\theta} & = 0 \text{ (cost is minimized at stationary point)} \\ & = \cancel{\frac{1}{a(\phi)}} \times \sum_{i=1}^n{ \bigg\{ \left( b'(\theta) - y_i \right) \bigg\} } \\ \therefore \sum_{i=1}^n{ y_i } & = \sum_{i=1}^n{ b'(\theta) } \\ \\ \therefore \frac{\sum_{i=1}^n{ y_i }}{n} & = \mathbb{E}[y] = \frac{\sum_{i=1}^n{ b'(\theta)}}{n} = b'(\theta) \\ \\ \Rightarrow \mathbb{E}[y] & = b'(\theta) \end{align} } \]

The cost is minimized at the stationary point. So by equating to zero, \(a(.)\) drops out as it is a constant.

Hessian

Next, we can derive the Hessian matrix, which is the second derivative of the cost function with respect to \(\beta\).

\[ \displaylines{ \begin{align} \frac{\partial^2 \mathcal{L}(\theta)}{\partial^2\beta_i} & = \frac{\partial}{\partial\beta_i}\left( \frac{1}{a(\phi)} \sum_{i=1}^n{ \bigg\{ \left( b'(\theta) - y_i \right) X_i \bigg\} } \right) \\ & \equiv \frac{\partial}{\partial\beta_i}\left( \frac{1}{a(\phi)} \sum_{i=1}^n{ \bigg\{ b'(\theta) X_i \bigg\} } \right) \\ & = \frac{1}{a(\phi)} \sum_{i=1}^n{ \bigg\{ X_i^{\intercal} \bigg( b''(\theta) \bigg) X_i \bigg\} } & \because \frac{\partial b'(\theta)}{\partial \beta_i} & = \frac{\partial b'(\theta)}{\partial \theta} \times \frac{\partial \theta}{\partial \beta_i} \\ &&& = b''(\theta) \times \frac{\partial}{\partial \beta_i}\left(X_i^{\intercal}\beta\right) \\ &&& = b''(\theta) \times X_i^{\intercal} \end{align} } \]

The cost is minimized by differentiating with respect to \(\beta\). Since the derivative of \(y\) is zero, it drops out.

It’s also worth noting that the variance \(\mathbb{V}[y_i]\) can be shown to be equal to the second derivative of the activation function \(b''(\theta)\), divided by the dispersion function (and this can sometimes be used instead for more stable optimization):

\[ b''(\theta) \equiv \frac{\mathbb{V}[y]}{a(\phi)} \]

Now we have all the parts we need to optimize the coefficients for any GLM using maximum likelihood estimation.

Coding it up from scratch

Well done on getting this far! We can now start to write out some python to create this ourselves. We will start by creating a parent class for exponential dispersion families, and then create child classes for the Poisson, Bernoulli and Gaussian distributions specifically.

Parent class for the canonical form:

First, let’s create a generic parent class then that we can use for any canonical exponential dispersion family. Note that this contains the MLE proceedure to minimize the cost function for regression (we will explain this in a future post - just know for now that this uses maximum likelihood estimation to optimize the coefficients).

Exponential Family Regression Parent Class (source code):

import numpy as np

class exponential_regression_parent():

    def __init__(self,seed=0,y=None,X=None,beta=None):
      """
      seed: Scalar sets seed for random coefficient initialization
      y:    Dependent variable. A 1d vector of numeric values. Optional.
      X:    Independent variables. A 2d matrix of numeric values. Optional.
      beta: Initial coefficients. A 1d vector of numeric values. Optional.
      """
      self.seed = seed
      self.y = y
      self.X = X
      self.beta = beta

    def _initialize(self,y=None,X=None):
        if y is None:
            if self.y is None:
                raise ValueError('Please provide y')
            else:
                y = self.y
        if X is None:
            if self.X is None:
                raise ValueError('Please provide X')
            else:
                X = np.array(self.X).reshape(y.shape[0],-1)
        self.y = np.array(y).reshape(-1,1)
        self.X = np.array(X).reshape(y.shape[0],-1)
        self.n, self.k = X.shape
        np.random.seed(self.seed)
        if self.beta is None:
            self.beta = np.random.normal(0, 0.5, self.k).reshape((self.k,1))

    def _theta(self, theta=None):
        """ helper function """
        if theta is None:
            theta = self.theta()
        return theta

    def _phi(self,phi=None):
        """ helper function """
        if phi is None:
            phi = self.phi()
        return phi

    def negative_log_likelihood(self):
        """ Negative log likelihood i.e. the current cost"""
        y = self.y
        theta, phi = self._theta(), self._phi()
        a, b, c = self.a, self.b, self.c
        log_likelihood = ( y * theta - b(theta) ) / a(phi) + c(y,phi)
        L = -1 * np.sum(log_likelihood)
        return L

    def informant(self, theta=None, phi=None):
        """ First derivative of the cost function with respect to theta """
        y, X, a, b, db = self.y, self.X, self.a, self.b, self.db
        theta, phi = self._theta(theta), self._phi(phi)
        dJ = (1/a(phi)) * X.T.dot( db( theta ) - y )
        return dJ
    
    def hessian(self, theta=None, phi=None, use_stable=True):
        """ Second derivative of the cost function with respect to theta """
        X, y = self.X, self.y
        a, d2b = self.a, self.d2b
        theta, phi = self._theta(theta), self._phi(phi)
        if use_stable:
            meat = a(phi)**-2 * np.diagflat((self.db(theta)-y)**2) # using residual variance
        else:
            meat = a(phi)**-1 * np.diagflat(self.d2b(theta)) # using second differential directly
        d2L = X.T.dot(meat).dot(X)
        return d2L

    def update_beta(self, use_stable_hessian=True):
        """ A single step towards optimizing beta """
        X, beta = self.X, self.beta.copy()
        theta, phi = self._theta(), self._phi()
        learning_rate = np.linalg.inv(self.hessian(theta, use_stable=True))
        dL = self.informant(theta, phi)
        beta -= learning_rate.dot(dL)
        return beta

    def fit(self, y=None, X=None, max_iter=100, epsilon=1e-8, use_stable_hessian=True):
        """ Fit the model using Newton Raphson Optimization """
        self._initialize(y,X)
        for i in range(max_iter):
            old_beta = self.beta.copy()
            new_beta = self.update_beta(use_stable_hessian)
            self.beta = new_beta
            if (np.abs(new_beta - old_beta)/(0.1 + np.abs(new_beta)) <= epsilon).all():
                print("Fully converged by iteration " + str(i))
                break
            if (i == max_iter):
                print("Warning - coefficients did not fully converge within " + str(max_iter) + " iterations.")

    def predict(self, X=None):
        """ Predicted value of y uses the activation function, b'(θ) """
        if X is None:
            X = self.X
        y_hat = self.db( self.theta(X=X) )
        return y_hat

And now we have our base class, we can utilise it for specific classes of the Poisson, Bernoulli and Gaussian distributions:

Gaussian (normal)

Now we can write our Gaussian class:

Exponential Regression Gaussian Class (source code):

from glm_exponential_dispersion_parent import exponential_regression_parent

import numpy as np
from math import pi

class gaussian(exponential_regression_parent):

  def __init__(self,seed = 0,**kwargs):
      super().__init__(seed, **kwargs)

  def theta(self, X=None):
        """ As canonical form, theta == link function. 
        So we simply equate θ = g(µ) = η(X)= X'β """
        beta = self.beta
        if X is None:
            X = self.X
        return X.dot(beta)

  def b(self,theta):
      """ b(θ) = 0.5*θ^2: Integral of the activation function """
      return 0.5*theta**2
  
  def db(self,theta):
      """ b'(θ) = θ: Activation function is the identity """
      return theta

  def d2b(self,theta):
      """ b''(θ) = 1: Differential of the activation function is just one """
      return np.ones(theta.shape)

  def phi(self):
      """ φ = σ^2: Dispersion measure is variance
      It is estimated from the residuals, i.e. s = RSS / DoF
      """
      y, n, k = self.y, self.n, self.k
      y_hat = self.predict(self.X)
      var = np.sum( (y - y_hat)**2 ) / n # / ( n - k )
      return var
      
  def a(self,phi):
      """ a(φ) = σ^2: The variance of y """
      return phi

  def c(self,y,phi):
      """ Adjustment needed so pdf sums to one """
      return -0.5 * ( y**2/phi + np.log(2*pi*phi))

Now we can test the gaussian regression on some dummy data. We also compare this with OLS derived coefficients (and check against statsmodels) to see if we are correct:

Testing with gaussian regression

# Fit the model and predict y
model_gaus = gaussian()
model_gaus.fit(y,X)

# Compare with OLS
XtX = X.transpose().dot(X)
Xty = X.transpose().dot(y).flatten()
ls_beta = (np.linalg.pinv(XtX).dot(Xty)).reshape(-1,1)

# Compare with statsmodels
glm_stats_model = sm.GLM(y, X, family=sm.families.Gaussian())
results = glm_stats_model.fit()

# Compare coefficients
pd.DataFrame(
  np.hstack([model_gaus.beta,results.params.reshape(-1,1),ls_beta]),
  columns=['True coefficients','StatsModel coefficients','OLS coefficients']
)

Fully converged by iteration 11

	True coefficients	StatsModel coefficients	OLS coefficients
0	0.704655	0.704655	0.704655
1	0.302300	0.302300	0.302300
2	0.507925	0.507925	0.507925

We find identical coefficients! For more information on why least-squares and MLE derive the same result for gaussian regression, see this page.

We similarly find that the log likelihoods are the same:

Code

# Compare log likelihoods:
pd.DataFrame(
  data = [[-1*model_gaus.negative_log_likelihood(),results.llf]],
  columns=['Our Log Likelihood','StatsModel Log Likelihood']
)

	Our Log Likelihood	StatsModel Log Likelihood
0	-203.441508	-203.441508

Logistic

Now let’s try the same for a logistic regression:

Exponential Regression Bernoulli Class (source code):

from glm_exponential_dispersion_parent import exponential_regression_parent

import numpy as np

class bernoulli(exponential_regression_parent):

  def __init__(self,seed = 0, ols_beta_init = False, **kwargs):
      """
      If using a pure Netwon method, this can fail to converge if initialized coefficients are too far from the neighbourhood of the global minima.
      One option is to initialize coefficients using OLS (i.e. linear probabiity model, lpm) and optimize from there.
      Otherwise, by default, it uses a more stable calculation for the variance of y.
      """
      if ols_beta_init:
        XtX = X.transpose().dot(X)
        Xty = X.transpose().dot(y2).flatten()
        lpm_beta = np.linalg.pinv(XtX).dot(Xty)
        super().__init__(seed, **kwargs, beta=lpm_beta.reshape(-1,1))
      else:
        super().__init__(seed, **kwargs)

  def theta(self, X=None):
        """ As canonical form, theta == link function. 
        So we simply equate θ = g(µ) = η(X)= X'β """
        beta = self.beta
        if X is None:
            X = self.X
        return X.dot(beta)

  def b(self,theta):
      """ b(θ) = ln[1+exp{θ}]: integral of the activation function """
      return np.log(1 + np.exp(theta))
  
  def db(self,theta):
      """ b'(θ) = Λ(θ) = 1/(1+exp{-θ}): Activation function is logistic (inverse of the logit-link function)"""
      return (1+np.exp(-theta))**-1
      # return np.exp(theta) * (1+np.exp(theta))**-1
      
  def d2b(self,theta):
      """ b''(θ) = Λ(θ)*(1-Λ(θ)): Differential of the activation function is the logistic distribution """
      _db = self.db(theta)
      return _db*(1-_db)
      # return np.exp(theta) * (1 + np.exp(theta))**-2

  def phi(self):
      """ Not needed - one parameter distribution """
      return 1
  
  def a(self,phi):
      """ Not needed - one parameter distribution """
      return 1
  
  def c(self,y,phi):
      """ No adjustment needed (pdf already sums to one) """
      return 0

Testing the Bernoulli regression

y2 = (y > y.mean()).astype(int)

glm2 = sm.GLM(y2, X, family=sm.families.Binomial())
results2 = glm2.fit(  method='newton')

model_bern = bernoulli(X=X,y=y2)
model_bern.fit()

pd.DataFrame(
  np.hstack([model_bern.beta,results2.params.reshape(-1,1)]),
  columns=['Bernoulli coefficients','StatsModel coefficients']
)

Fully converged by iteration 10

	Bernoulli coefficients	StatsModel coefficients
0	0.371638	0.371638
1	-0.709617	-0.709617
2	0.345754	0.345754

Again, we find identical coefficients! We similarly find that the log likelihoods are the same:

Code

# Compare log likelihoods:
pd.DataFrame(
  data = [[-1*model_bern.negative_log_likelihood(),results2.llf]],
  columns=['Our Log Likelihood','StatsModel Log Likelihood']
)

	Our Log Likelihood	StatsModel Log Likelihood
0	-205.580169	-205.580169

For a deep-dive into Logistic regression, and its formulation as a latent-variable model, see this page.

Poisson

Now let’s try the same for a Poisson regression:

Exponential Regression Poisson Class (source code):

from glm_exponential_dispersion_parent import exponential_regression_parent

import numpy as np
from math import factorial

class poisson(exponential_regression_parent):

  def __init__(self,seed = 0, **kwargs):
      super().__init__(seed, **kwargs)

  def theta(self, X=None):
        """ As canonical form, theta == link function. 
        So we simply equate θ = g(µ) = η(X)= X'β """
        beta = self.beta
        if X is None:
            X = self.X
        return X.dot(beta)

  def b(self,theta):
      """ b(θ) = exp{θ}: integral of the activation function """
      return np.exp(theta)
  
  def db(self,theta):
      """ b'(θ) = exp{θ}: Activation function is exponential (inverse of the log-link function)"""
      return np.exp(theta)

  def d2b(self,theta):
      """ b''(θ) = exp{θ}: Differential of the activation function """
      return np.exp(theta)
  
  def phi(self):
      """ Not needed - one parameter distribution """
      return 1
  
  def a(self,phi):
      """ Not needed - one parameter distribution """
      return 1
  
  def c(self,y,phi):
      """ Adjustment needed so pdf sums to one. """
      # # Optionally use Stirling's approximation to deal easily with large factorials
      # def approx_ln_factorial(n):
      #     return n * np.log(n) - n
      y = y.flatten()
      return -np.log(np.array([factorial(v) for v in y.tolist()]).astype(float)).reshape(-1,1)

Testing the Poisson regression

model_pois = poisson()
model_pois.fit(y3,X)

glm3 = sm.GLM(y3, X, family=sm.families.Poisson())
results3 = glm3.fit()

pd.DataFrame(
  np.hstack([model_pois.beta,results3.params.reshape(-1,1),]),
  columns=['Poisson coefficients','StatsModel coefficients']
)

Fully converged by iteration 16

	Poisson coefficients	StatsModel coefficients
0	0.530279	0.530279
1	0.340200	0.340200
2	0.628620	0.628620

Again, we find identical coefficients! We similarly find that the log likelihoods are the same:

Code

# Compare log likelihoods:
pd.DataFrame(
  data = [[-1*model_pois.negative_log_likelihood(),results3.llf]],
  columns=['Our Log Likelihood','StatsModel Log Likelihood']
)

	Our Log Likelihood	StatsModel Log Likelihood
0	-999.666329	-999.666329

Fin.

Footnotes

Rearranging Poisson into exponential form, with mean rate of \(\lambda_i = \mu_i = \xi_i\): \[ \displaylines{ \begin{align} f_y(\xi_i,\phi) & = \frac{\mu_i^{y_i}e^{-\mu_i}}{y_i!} \\ & = \exp{ \left\{ \ln{ \left[ \frac{\mu_i^{y_i} \times e^{-\mu_i}}{y_i!} \right]} \right\}} \\ & = \exp{ \left\{ \ln{ \left[ \mu_i^{y_i} \right]} + \ln{ \left[ e^{-\mu_i} \right] } - \ln{ \left[ y_i! \right] } \right\}} \\ & = \exp{ \left\{ y_i \ln{ \left[ \mu_i \right]} -\mu_i - \ln{ \left[ y_i! \right] } \right\}} \\ & \equiv \exp{ \left\{ \frac{ \underbrace{y_i}_{T(y_i)} \underbrace{\ln{ \left[ \mu_i \right]}}_{r(\xi_i)} - \underbrace{\mu_i}_{b(\xi_i)} }{ \underbrace{1}_{a(\phi)} } - \underbrace{\ln{ \left[ y_i! \right] }}_{c(y,\phi)} \right\}} \\ & = \exp{ \left\{ \frac{T(y_i) r(\xi_i) - b(\xi_i)}{a(\phi)} + c(y_i,\phi) \right\} } \end{align} } \]↩︎
Rearranging Gaussian into exponential form, with expected value \(\mu_i=\xi_i\): \[ \displaylines{ \begin{align} f(y;\theta,\phi) & = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{ \left\{-\frac{(y_i-\mu_i)^2}{2\sigma^2} \right\} } \\ & = \exp{\left\{ \ln{ \left[ \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{ \left\{ -\frac{(y_i-\mu_i)^2}{2\sigma^2} \right\} } \right]} \right\}} \\ & = \exp{\left\{ \ln{ \left[ \frac{1}{\sqrt{2 \pi \sigma^2}} \right]} - \frac{(y_i-\mu_i)^2}{2\sigma^2} \right\}} \\ & = \exp{\left\{ \cancel{\ln{ \left[ 1 \right]}} + \ln{ \left[ 2 \pi \sigma^2 \right]}^{-1/2} - \frac{y_i^2+\mu_i^2-2y\mu_i}{2\sigma^2} \right\}} \\ & = \exp{\left\{ - \frac{1}{2}\ln{ \left[ 2 \pi \sigma^2 \right]} - \frac{y_i^2}{2\sigma^2} - \frac{\frac{\mu_i^2}{2}}{\sigma^2} + \frac{\cancel{2}y_i\mu_i}{\cancel{2}\sigma^2} \right\}} \\ & = \exp{\left\{ \frac{ y \underbrace{\mu_i}_{\theta_i} - \underbrace{\frac{\mu_i^2}{2}}_{b(\theta_i)} }{ \underbrace{\sigma^2}_{a(\phi)} } - \underbrace{ \frac{1}{2} \left( \frac{y_i^2}{\sigma^2} + \ln{ \left[ 2 \pi \sigma^2 \right]} \right) }_{c(y_i,\phi)} \right\}} \\ & = \exp{ \left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right\} } \end{align} } \]↩︎
Rearranging Bernoulli into exponential form, with expected success probability \(p_i=\mu_i=\xi_i\): \[ \displaylines{ \begin{align} f(y;\theta,\phi) & = \mu_i^{y_i}(1-\mu_i)^{1-y_i} \\ & = \exp{ \left\{\ln{\left[\mu_i^{y_i}(1-\mu_i)^{(1-y_i)} \\\right]} \right\}} \\ & = \exp{ \left\{ \ln{\left[\mu_i^{y_i}\right]} + \ln{\left[(1-\mu_i)^{(1-y_i)}\right]} \right\}} \\ & = \exp{ \left\{ y_i \ln{[\mu_i]} + (1-y_i)\ln{[1-\mu_i]} \right\}} \\ & = \exp{ \left\{ y_i \ln{ \left[\frac{\mu_i}{1-\mu_i} \right]} + \ln{[1-\mu_i]} \right\}} \\ & = \exp{ \left\{ \frac{ \underbrace{y_i}_{T(y)} \underbrace{ \ln{ \left[\frac{\mu_i}{1-\mu_i} \right] } }_{r(\xi_i)} + \underbrace{ \ln{[1-\mu_i]} }_{b(\xi_i)} }{ \underbrace{1}_{a(\phi)} } + \underbrace{0}_{c(y_i,\phi)} \right\}} \\ & = \exp{ \left\{ \frac{y_i \underbrace{ \ln{ \left[\frac{\mu_i}{1-\mu_i} \right] } }_{\theta_i} + \underbrace{ \ln{[1-\mu_i]} }_{b(\theta_i)} }{ \underbrace{1}_{a(\phi)} } + \underbrace{0}_{c(y_i,\phi)} \right\}} \\ & = \exp{ \left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right\} } \end{align} } \] And showing how \(\ln[1-p] = -\ln[1+e^{\theta}]\) \[ \displaylines{ \begin{align} \theta & = \ln{\left[\frac{p}{1-p} \right]} \\ \text{ (1) Put } p \text{ in terms of } \theta \text{:} \\ \therefore e^{\theta} & = \frac{p}{1-p} & \text{raise by }e \\ \therefore e^{\theta}-pe^{\theta} & = p & \times (1-p) \\ \therefore e^{\theta} & = p(1+e^\theta) & + pe^\theta \\ \therefore p & = \frac{e^{\theta}}{1+e^\theta} & \div (1+e^\theta) \\\\ \\ \text{ (2) Substitute in } p = \frac{e^{\theta}}{1+e^\theta} \\ \Rightarrow \ln[1-p] & = \ln\left[1-\frac{e^{\theta}}{1+e^\theta} \right] \\ & \equiv \ln\left[\frac{1+e^\theta}{1+e^\theta}-\frac{e^{\theta}}{1+e^\theta} \right] \\ & = \ln\left[\frac{1}{1+e^\theta} \right] \\ & \equiv \ln\left[(1+e^\theta \right)^{-1}] \\ & \equiv -\ln\left[1+e^\theta \right] \end{align} } \]↩︎

Parameter / Function	intuition
\(T(y)\) is a transformation of \(y\)	It is the identity for all the distributions we are looking at, so can be ignored for now.
\(\xi_i\) is the “shape parameter”	Governs the shape of the distribution (e.g. in Poisson, it is the expected value \(\xi_i=\lambda_i\)). Or for Bernoulli, \(\xi_i=p_i\). In fact for many distributions, \(\xi_i\) is the expected value of \(y_i\).
\(r(\xi_i)\) is the “canonical link function”	The function that maps the expected value of \(y_i\) to the linear combination of the features. E.g. for Poisson Regression, \(r(\xi_i) = \ln[\xi_i] = \eta(X)\).
\(b(\xi_i)\) is integral of the “activation function”	And if using the canonical form, this is the same as th integral of the inverse link function. It is easier to see this by running through the examples in footnote 5.
\(\phi\) is the “dispersion parameter”,	A parameter for the expected dispersion of \(y_i\) around \(\mu_i\). For example, Gaussian regression has \(\phi=\sigma\), the standard deviation of all the residuals (assuming homoskedasticity). It is not needed for one parameter distributions, like Poisson, where we already assume the variance \(\mathbb{V}[y_i] = \lambda_i\) (i.e. already determined by \(\xi_i\)).
\(a(.)\) is a normalizing function	A function that normalizes the pdf using the dispersion parameter \(\phi\). E.g. for Gaussian regression, it is the variance of the residuals, \(a(\phi) = \sigma^2\). Again this isn’t needed for one parameter distributions.
\(c(.)\) an adjustment function	A function that adjusts the normalized pdf so that it sums to one. For example, the exponential form of the Poisson distribution would sum to more than one if it wasn’t included.

↩︎

Here is a break down of the integral of the canonical inverse-link function for each distribution:
Poisson:
- The canonical link is log: \(r(\xi_i) = \ln[\xi_i] = \eta(X)\)
- So the inverse-link is exponential: \(\xi_i = b'(\eta(X)) = r^{-1}(\eta(X))=e^{\eta(X)}\)
- So \(r^{-1}(\xi_i) \equiv b'(\xi_i) = e^{\xi_i}\)
- So the integral is: \(b(\xi_i)=\int{b'(\xi_i)\,d\xi_i}=e^{\xi_i}\)
Gaussian:
- The canonical link is the identity: \(r(\xi_i) = \\xi_i = \eta(X)\)
- So the inverse-link is the identity: \(\xi_i = b'(\eta(X)) = r^{-1}(\eta(X))=\eta(X)\)
- So \(r^{-1}(\xi_i) \equiv b'(\xi_i) = \xi_i\)
- So the integral is: \(b(\xi_i)=\int{b'(\xi_i)\,d\xi_i}= \frac{\xi_i^2}{2}\)
Bernoulli:
- The canonical link is logit: \(r(\xi_i) = \ln\left[ \frac{\xi_i}{1-\xi_i} \right] = \eta(X)\)
- So the inverse-link is logistic: \(\xi_i = b'(\eta(X)) = r^{-1}(\eta(X))=\frac{e^{\eta(X)}}{1+e^{\eta(X)}}\)
- So \(r^{-1}(\xi_i) \equiv b'(\xi_i) = \frac{e^{\xi_i}}{1+e^{\xi_i}}\)
- So the integral is: \(b(\xi_i)=\int{b'(\xi_i)\,d\xi_i}=\ln{(1+e^{\xi_i})}\)
Another way to think about this is to notice that \(b'(\xi_i)\) is the “activation function” in the outer layer of a neural network i.e. is the function that maps the output of the network to the final prediction. Have a look at the section about the score function for further intuition for why this is the integral ↩︎