Deriving (and implementing in Python) a generic way to optimize GLMs, using the exponential family form for the Poisson, Bernoulli and Gaussian distributions.
Introduction
There are many types of generalized linear models, such as linear regression, logistic regression and Poisson regression. Every one of these models is made up of a “random component” and a “systematic component”. Each also has a “link function” that connects the random and systematic parts.
To make it easier to contrast and compare, first let’s take a normal (Gaussian) linear regression:
The expected value \(\mathbb{E}[y_i|X_i]=\mu_i\) can be perfectly explained by taking the dot product of the features and the true coefficients, so \(\mu_i = X_i^{\intercal}\beta\).
However, in reality the actual observation \(y_i\) varies around this expected value: its variation follows a normal distribution with uniform variance \(\mathbb{V}[y_i]=\sigma^2\), i.e. \(y_i \sim N(\mu_i,\sigma^2)\).
This idiosyncratic error, \(y_i - \mu_i\), is irreducible: it can never be predicted, since it is random.
When regressing, we try to find the best coefficients \(\hat{\beta}\) to predict \(\hat{\mu}_i=X_i^{\intercal}\hat{\beta}\) by minimizing the residuals equally across all observations (so that, hopefully, only the idiosyncratic error remains).
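For the Gaussian case, this minimization has a familiar closed form, the normal equations \(\hat{\beta}=(X^{\intercal}X)^{-1}X^{\intercal}y\). Here is a minimal sketch on simulated data (the coefficients and noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([[0.5], [-1.0]])
mu = X @ true_beta        # expected value: identity link
y = rng.normal(mu, 0.3)   # observations vary around mu with constant spread

# Normal equations: beta_hat = (X'X)^-1 X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat.ravel())   # close to [0.5, -1.0]
```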
Code to generate Gaussian graph
```python
import numpy as np, pandas as pd
# from math import pi, factorial
from scipy.stats import norm, poisson, gamma
import statsmodels.api as sm
import plotly.graph_objects as go
from plotly.subplots import make_subplots

np.random.seed(42)

# Generate some X
n = 300
X = np.random.rand(n, 3).astype('float64') + 2

# Generate some y using true coefficients β
true_beta = np.random.normal(0.5, 0.1, 3).reshape(-1, 1)
eta = X.dot(true_beta)

# Sort data by eta
eta = eta.ravel()
idxs = np.argsort(eta)
X, eta = X[idxs], eta[idxs]

# Generate y around eta with gaussian noise
# (np.random.normal's second argument is the scale, used as 0.5 throughout)
sigma_sq = 0.5
y = np.random.normal(eta, sigma_sq).reshape(-1, 1)

fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=(
        "Relationship is identity: E[y|X,β] = μ = X'β",
        "Variance is constant: y ~ N(μ,σ^2)",
    ),
    specs=[
        [{"type": "scatter", "l": 0.2, "r": 0.2, "b": 0}],
        [{"type": "scatter3d", "b": 0.1, "t": 0}]
    ],
    row_heights=[2, 5],
    horizontal_spacing=0,
)
fig.add_trace(
    go.Scatter(
        x=eta, y=y.flatten(),
        mode='markers',
        name='Observations',
        marker=dict(size=2, color="blue"),
    ), row=1, col=1
)
fig.update_xaxes(title=dict(text="μ = X'β", font_size=16), row=1, col=1)
fig.update_yaxes(title=dict(text="y", font_size=16), row=1, col=1)
fig.add_trace(
    go.Scatter(
        x=eta, y=eta,
        line_shape='spline',
        mode='lines',
    ), row=1, col=1
)

# 3d plot of data
fig.add_trace(
    go.Scatter3d(
        x=eta, y=y.flatten(),
        z=norm.pdf(y.flatten(), eta, sigma_sq),
        marker=dict(size=1, color="blue"),
        mode='markers',
        name='Observations'
    ), row=2, col=1
)

# 3d line of data
fig.add_trace(
    go.Scatter3d(
        x=eta, y=eta,
        z=norm.pdf(x=eta, loc=eta, scale=sigma_sq),
        line=dict(color="red"),
        mode='lines',
    ), row=2, col=1
)

gaus_x = np.arange(3, 4.5, 0.05)
gaus_y = np.arange(2, 6, 0.05)
xv, yv = np.meshgrid(gaus_x, gaus_y)
gaus_pdf = norm.pdf(yv, xv, sigma_sq)
fig.add_trace(
    go.Surface(
        x=xv, y=yv, z=gaus_pdf,
        opacity=0.2,
        colorscale=[(0, 'gray'), (1, 'gray')],
        showscale=False,
        contours=dict(x={"show": True})
    ), row=2, col=1
)
fig.update_layout(
    scene=dict(
        aspectratio=dict(x=1.5, y=1.8, z=1),
        xaxis_title="μ = X'β",
        yaxis_title="y",
        zaxis_title="N(μ,σ^2)",
    ),
    scene_camera=dict(
        eye=dict(x=1.8, y=-1.5, z=0.3),
        center=dict(x=0, y=0, z=-0.35),
    ),
    showlegend=False,
)
fig.show()
```
The top plot shows the linear (identity) relationship linking the expected value of \(y_i\) to the prediction \(\mu_i = X_i^{\intercal}\beta\). We see \(y\) is equally dispersed (with variance \(\sigma^2\)) for all values of \(\mu\).
The bottom plot is identical to the first, but adds a third axis with the probability density of the Gaussian distribution. The peak of the probability density is at its expected value \(\mu_i\), the density is symmetric around \(\mu_i\) and the variance is constant \(\sigma^2\).
With this in mind, let’s now go through Poisson regression, to compare and contrast:
Since \(\ln{(\lambda_i)} = X_i^{\intercal}\beta\), exponentiating both sides gives \(\lambda_i = e^{X_i^{\intercal}\beta}\)
There are lots of similarities, but lots of differences too:
The expected value \(\mathbb{E}[y_i|X_i]=\lambda_i\) can be perfectly explained, also by taking the dot product of the features and the true coefficients, but then exponentiating the result: \(\lambda_i=e^{X_i^{\intercal}\beta}\)
In reality, the actual observation \(y_i\) varies around the expected value: but this time its variation follows a Poisson distribution. The Poisson variance is equal to its expected value \(\mathbb{V}[y_i] = \lambda_i\): thus we expect higher variance at higher counts.
A Poisson regression permits larger residuals at higher counts than a Gaussian regression. Consequently, this changes how we aim to minimize residuals (we no longer shrink all residuals with equal weight), and hence how we estimate the ideal coefficients \(\beta^*\).
The top plot shows the exponential relationship linking the systematic component and the observed count \(y\), where \(y\) is more dispersed at higher values of \(\lambda\) (higher variance).
The bottom plot is identical to the first, but adds a third axis with the probability mass of the Poisson distribution. Again, the higher variance makes the peak of the Poisson probability mass function lower, and its probability mass more spread out.
It can help now to write this out in terms of its random and systematic components, and their link function: \[
\displaylines{
\underbrace{y_i \sim Pois(\lambda_i)}_{\text{random}} \\
\underbrace{\ln{(\lambda_i)}}_{\text{link function}} = \underbrace{X_i^{\intercal}\beta}_{\text{systematic}}
}
\]
Let’s dive into this in a bit more detail:
Random component
The random component determines how we want to model the distribution of \(y\). For example, if \(y_i\) is a count outcome, then it could be well suited to a Poisson distribution:
\[
y_i \sim Pois(\lambda_i)
\]
The expected rate of the count is \(\lambda_i\). This means we expect \(y_i\) to be around \(\lambda_i\), i.e. \(\mathbb{E}[y_i|X_i]=\lambda_i\). However, for any individual observation \(i\), the actual observed \(y_i\) will vary above and below this expected rate. By using Poisson, we assume the variance is equal to the mean rate: \(\mathbb{V}[y_i|X_i] = \lambda_i\). Thus we permit the dispersion of observations to increase if the expected rate is higher.
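This mean-variance relationship is easy to check by simulation (the rate \(\lambda=4\) here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0
draws = rng.poisson(lam, size=100_000)

# For a Poisson random variable, the variance equals the mean rate
print(draws.mean())  # ≈ 4.0
print(draws.var())   # ≈ 4.0
```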
This is why it is called the “random component”: since \(y_i\) varies randomly around \(\lambda_i\), following the Poisson distribution, it is a random variable.
But how do we find a good estimation for \(\mathbb{E}[y_i|X_i]=\lambda_i\)? Concretely, how do we best map our independent variables \(X_i\) to \(\lambda_i\)? This is down to our systematic component and link function.
Systematic component and link function
The systematic component \(\eta(X)\) is usually just a linear transformation of \(X\): the result of multiplying each feature by some well-fitted coefficients \(\beta\). It is deterministic (non-random), hence “systematic”.
In contrast, Generalized Additive Models use a sum of smooth functions of the features instead: \[
\eta(X_i) = \beta_0 + f_1(X_{i1}) + f_2(X_{i2}) + \ldots
\] The functions can be parametric (e.g. polynomial) or non-parametric (e.g. smoothing splines).
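As a rough sketch of the difference, here is a GLM-style linear predictor next to a GAM-style one built from a hand-rolled polynomial basis (real GAMs would typically use penalized smoothing splines; the coefficients here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)

# GLM systematic component: linear in the raw feature
eta_glm = 0.5 + 1.2 * x

# GAM systematic component: a smooth function of the feature,
# here a simple polynomial basis f(x) = b1*x + b2*x^2 + b3*x^3
basis = np.column_stack([x, x**2, x**3])
eta_gam = 0.5 + basis @ np.array([1.2, -0.4, 0.1])
print(eta_gam.shape)  # (200,)
```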
The link function is a way of choosing how to map the systematic component to the natural parameter of the random component. For example, in Poisson regression, we use a log link function, which means the systematic component predicts the natural log of the mean rate of the count.
Okay, so now we want to find the best values for \(\beta\). These coefficients will transform our features \(X_i\) to make the best predictions for \(\ln{(\lambda_i)}\), given we want to predict \(y_i\) as accurately as possible (but permit larger residuals when \(\lambda_i\) is larger, following the Poisson distribution).
To achieve this: we can try some initial coefficients, calculate the cost function and its first derivative, update the coefficients, and continue to minimize the cost function through gradient descent.
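For instance, a bespoke gradient-descent loop for the Poisson case might look like the following sketch (simulated data with a log link; the gradient \(X^{\intercal}(e^{X\beta}-y)\) is the score of the Poisson negative log likelihood, derived later in the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.hstack([np.ones((n, 1)), rng.uniform(size=(n, 1))])  # intercept + one feature
true_beta = np.array([[0.2], [0.8]])
y = rng.poisson(np.exp(X @ true_beta))   # Poisson counts, log link

beta = np.zeros((2, 1))                  # initial coefficients
lr = 0.1
for _ in range(5000):
    grad = X.T @ (np.exp(X @ beta) - y) / n  # average gradient of the negative log likelihood
    beta -= lr * grad                        # gradient descent update
print(beta.ravel())  # approaches the maximum likelihood estimate, near (0.2, 0.8)
```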
But how cumbersome and error-prone it would be to write bespoke code for every type of distribution we want to model! Wouldn’t it be nice if we could derive a generic representation for the cost function and its first derivative, so that we could re-use the same code for every type of regression?
Exponential Dispersion Family of Distributions
Notation for Generalized linear models
It can be shown that many common distributions can be reformulated into the “Exponential Dispersion Family of Distributions”. This generic representation makes it easier to re-use the same code to run a regression.
First let’s define some generic notation for generalized linear models:
\(\xi_i\) is a shape parameter, governing the shape of the distribution (e.g. for Poisson, \(\xi_i=\lambda_i\); for Bernoulli, \(\xi_i=p_i\)). In fact, for many distributions \(\xi_i\) is the expected value of \(y_i\), i.e. \(\xi_i = \mu_i\).
\(\phi\) is a dispersion parameter, governing the spread of the data (e.g. Gaussian has the variance \(\phi = \sigma^2\)). It is not always necessary if the dispersion is determined by the shape parameter already (e.g. in Poisson the variance is already equated to the expected rate \(\lambda_i\)).
Here are some examples of the common forms of Poisson, Bernoulli and Gaussian probability distribution functions \(y \sim f_{\theta_i}(\mu_i)\), along with some choices for link functions \(g(\mu)\) to use in regression:
You may have noticed that \(r(\xi_i)\) looks very familiar… in fact, it is the link function we commonly use for that regression! For example, in logistic regression we use the logit link function \(g(\xi_i) = \ln{\left[\frac{\xi_i}{1-\xi_i}\right]} = \eta(X)\).
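For example, the logit and its inverse (the logistic function) round-trip a probability through the real line:

```python
import numpy as np

p = np.array([0.1, 0.5, 0.9])
eta = np.log(p / (1 - p))        # logit link: probabilities -> real line
p_back = 1 / (1 + np.exp(-eta))  # logistic (inverse logit): real line -> probabilities
print(eta)     # ≈ [-2.197  0.     2.197]
print(p_back)  # ≈ [0.1  0.5  0.9]
```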
Canonical link function
For any exponential distribution, \(r(\xi_i)\) is the default choice for link function: the so-called “canonical link function”.
In fact, if the canonical link function is used, so \(g(\xi_i)=r(\xi_i)\), and no transformation is applied, so \(T(y_i) = y_i\), then we can simplify the formula to its “canonical form”, where \(\theta_i = g(\xi_i)\):
Where this falls down is when you don’t use the canonical link function, such as in probit regression. While the canonical link function for Bernoulli is Logit, the Probit link is the inverse cumulative distribution function of the standard normal \(g(\xi_i) = \Phi^{-1}[\xi_i] \neq r(\xi_i)\). So because it is not canonical, it’s a bit more complicated to deal with (which we explore in a later post).
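The two links give genuinely different mappings (scipy’s `norm.ppf` is \(\Phi^{-1}\)):

```python
import numpy as np
from scipy.stats import norm

p = np.array([0.1, 0.5, 0.9])
logit = np.log(p / (1 - p))  # canonical (logit) link
probit = norm.ppf(p)         # probit link: inverse CDF of the standard normal
print(logit)   # ≈ [-2.197  0.     2.197]
print(probit)  # ≈ [-1.282  0.     1.282]
```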
Some intuition into the terms:
We go into detail on the intuition behind the other terms in footnote 4, but for now, here is a quick summary:
\(a(\phi)\) normalizes the pdf using the dispersion parameter \(\phi\). E.g. for Gaussian regression, it is the variance of the residuals, \(a(\phi) = \sigma^2\).
\(c(y,\phi)\) is an adjustment function that ensures the pdf sums to one. For example, the exponential form of the Poisson distribution would sum to more than one if it wasn’t included.
\(b(\theta)\) is the integral of the inverse of the link function. This might seem unintuitive right now, but it is easier to see when we derive the generic cost function and minimize it (and you can find a detailed breakdown in footnote 5).
Applying the exponential families to regression
So far we have reformulated Poisson, Bernoulli and Gaussian regressions into the exponential family form. Now we want to derive a “generic-use formula” to estimate the best coefficients via maximum likelihood estimation for any GLM.
Cost function
First let’s define our generic cost function in negative log-likelihood form, assuming the canonical link function is used (so in terms of \(\theta_i\)):
Maximizing the likelihood directly is hard, because it involves taking a product across every observation. Taking the log likelihood instead turns the product into a sum, which is far easier to work with. Taking the negative then ensures we are minimizing a cost.
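A quick illustration of why this matters numerically: with a few thousand observations, the raw likelihood underflows to zero in float64, while the log likelihood stays well-behaved (the rate \(\lambda=2\) is an arbitrary choice):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
y = rng.poisson(2.0, size=2000)

# The product of thousands of probabilities underflows float64 to exactly 0.0 ...
likelihood = np.prod(poisson.pmf(y, 2.0))
print(likelihood)  # 0.0

# ... but the sum of log-probabilities is perfectly representable
log_likelihood = np.sum(poisson.logpmf(y, 2.0))
print(log_likelihood)  # a large (finite) negative number
```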
Common methods for optimization include “Newton-Raphson”, “Fisher scoring”, “iteratively-reweighted least squares” and “gradient descent”. For all of these we need to derive the “score” (or “informant”): the first derivative of the negative log likelihood. Second-order methods, like Newton’s method, also need the second derivative: using the inverse Hessian as the learning rate helps the regression converge in far fewer iterations than gradient descent. So first we want to minimize the generic cost function with respect to \(\theta\):
It is trivial to connect \(\theta\) to the coefficients \(\beta\), since
The cost is minimized by differentiating with respect to \(\theta\). Since the derivative of \(c(.)\) is zero, it drops out.
So hopefully this now gives a bit more intuition behind why \(b(\theta)\) is the integral of the inverse canonical link. Since \(b'(\theta)\) is the activation function, we can see that the score function is just the difference between the predicted value and the actual value. This is the same as the gradient of the cost function used for gradient descent!
This also uncovers an interesting property of GLMs - that the average prediction \(b'(\theta)\) must be equal to the average value of \(\bar{y}\) too:
The cost is minimized by differentiating with respect to \(\beta\). Since the derivative of \(y\) is zero, it drops out.
It’s also worth noting that the variance \(\mathbb{V}[y_i]\) can be shown to be equal to the second derivative of the activation function \(b''(\theta)\), divided by the dispersion function (and this can sometimes be used instead for more stable optimization):
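For Poisson, \(b(\theta)=e^{\theta}\) so \(b''(\theta)=e^{\theta}=\lambda\), matching \(\mathbb{V}[y_i]=\lambda_i\) (with \(a(\phi)=1\)). A quick numerical second-difference check at an arbitrary point:

```python
import numpy as np

theta = 1.3   # an arbitrary point
h = 1e-5
b = np.exp    # Poisson: b(θ) = exp(θ)

# Central second difference approximates b''(θ)
d2b_numeric = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2
print(d2b_numeric, np.exp(theta))  # both ≈ 3.669
```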
Now we have all the parts we need to optimize the coefficients for any GLM using maximum likelihood estimation.
Coding it up from scratch
Well done on getting this far! We can now start to write out some Python to create this ourselves. We will start by creating a parent class for exponential dispersion families, and then create child classes for the Poisson, Bernoulli and Gaussian distributions specifically.
Parent class for the canonical form:
First, let’s create a generic parent class that we can use for any canonical exponential dispersion family. Note that this contains the MLE procedure to minimize the cost function for regression (we will explain this in a future post - just know for now that it uses maximum likelihood estimation to optimize the coefficients).
Exponential Family Regression Parent Class (source code):
```python
import numpy as np

class exponential_regression_parent():
    def __init__(self, seed=0, y=None, X=None, beta=None):
        """
        seed: Scalar sets seed for random coefficient initialization
        y:    Dependent variable. A 1d vector of numeric values. Optional.
        X:    Independent variables. A 2d matrix of numeric values. Optional.
        beta: Initial coefficients. A 1d vector of numeric values. Optional.
        """
        self.seed = seed
        self.y = y
        self.X = X
        self.beta = beta

    def _initialize(self, y=None, X=None):
        if y is None:
            if self.y is None:
                raise ValueError('Please provide y')
            else:
                y = self.y
        if X is None:
            if self.X is None:
                raise ValueError('Please provide X')
            else:
                X = np.array(self.X).reshape(y.shape[0], -1)
        self.y = np.array(y).reshape(-1, 1)
        self.X = np.array(X).reshape(y.shape[0], -1)
        self.n, self.k = X.shape
        np.random.seed(self.seed)
        if self.beta is None:
            self.beta = np.random.normal(0, 0.5, self.k).reshape((self.k, 1))

    def _theta(self, theta=None):
        """ helper function """
        if theta is None:
            theta = self.theta()
        return theta

    def _phi(self, phi=None):
        """ helper function """
        if phi is None:
            phi = self.phi()
        return phi

    def negative_log_likelihood(self):
        """ Negative log likelihood i.e. the current cost """
        y = self.y
        theta, phi = self._theta(), self._phi()
        a, b, c = self.a, self.b, self.c
        log_likelihood = (y * theta - b(theta)) / a(phi) + c(y, phi)
        L = -1 * np.sum(log_likelihood)
        return L

    def informant(self, theta=None, phi=None):
        """ First derivative of the cost function with respect to beta """
        y, X, a, b, db = self.y, self.X, self.a, self.b, self.db
        theta, phi = self._theta(theta), self._phi(phi)
        dJ = (1 / a(phi)) * X.T.dot(db(theta) - y)
        return dJ

    def hessian(self, theta=None, phi=None, use_stable=True):
        """ Second derivative of the cost function with respect to beta """
        X, y = self.X, self.y
        a, d2b = self.a, self.d2b
        theta, phi = self._theta(theta), self._phi(phi)
        if use_stable:
            # using residual variance
            meat = a(phi)**-2 * np.diagflat((self.db(theta) - y)**2)
        else:
            # using second differential directly
            meat = a(phi)**-1 * np.diagflat(self.d2b(theta))
        d2L = X.T.dot(meat).dot(X)
        return d2L

    def update_beta(self, use_stable_hessian=True):
        """ A single step towards optimizing beta """
        X, beta = self.X, self.beta.copy()
        theta, phi = self._theta(), self._phi()
        learning_rate = np.linalg.inv(self.hessian(theta, use_stable=use_stable_hessian))
        dL = self.informant(theta, phi)
        beta -= learning_rate.dot(dL)
        return beta

    def fit(self, y=None, X=None, max_iter=100, epsilon=1e-8, use_stable_hessian=True):
        """ Fit the model using Newton-Raphson optimization """
        self._initialize(y, X)
        for i in range(max_iter):
            old_beta = self.beta.copy()
            new_beta = self.update_beta(use_stable_hessian)
            self.beta = new_beta
            if (np.abs(new_beta - old_beta) / (0.1 + np.abs(new_beta)) <= epsilon).all():
                print("Fully converged by iteration " + str(i))
                break
        else:
            print("Warning - coefficients did not fully converge within " + str(max_iter) + " iterations.")

    def predict(self, X=None):
        """ Predicted value of y uses the activation function, b'(θ) """
        if X is None:
            X = self.X
        y_hat = self.db(self.theta(X=X))
        return y_hat
```
And now we have our base class, we can utilise it for specific classes of the Poisson, Bernoulli and Gaussian distributions:
Gaussian (normal)
Now we can write our Gaussian class:
Exponential Regression Gaussian Class (source code):
```python
from glm_exponential_dispersion_parent import exponential_regression_parent
import numpy as np
from math import pi

class gaussian(exponential_regression_parent):
    def __init__(self, seed=0, **kwargs):
        super().__init__(seed, **kwargs)

    def theta(self, X=None):
        """ As canonical form, theta == link function.
        So we simply equate θ = g(µ) = η(X) = X'β """
        beta = self.beta
        if X is None:
            X = self.X
        return X.dot(beta)

    def b(self, theta):
        """ b(θ) = 0.5*θ^2: Integral of the activation function """
        return 0.5 * theta**2

    def db(self, theta):
        """ b'(θ) = θ: Activation function is the identity """
        return theta

    def d2b(self, theta):
        """ b''(θ) = 1: Differential of the activation function is just one """
        return np.ones(theta.shape)

    def phi(self):
        """ φ = σ^2: Dispersion measure is variance.
        It is estimated from the residuals, i.e. s = RSS / DoF """
        y, n, k = self.y, self.n, self.k
        y_hat = self.predict(self.X)
        var = np.sum((y - y_hat)**2) / n  # / ( n - k )
        return var

    def a(self, phi):
        """ a(φ) = σ^2: The variance of y """
        return phi

    def c(self, y, phi):
        """ Adjustment needed so pdf sums to one """
        return -0.5 * (y**2 / phi + np.log(2 * pi * phi))
```
Now we can test the gaussian regression on some dummy data. We also compare this with OLS derived coefficients (and check against statsmodels) to see if we are correct:
Testing with gaussian regression
```python
# Fit the model and predict y
model_gaus = gaussian()
model_gaus.fit(y, X)

# Compare with OLS
XtX = X.transpose().dot(X)
Xty = X.transpose().dot(y).flatten()
ls_beta = (np.linalg.pinv(XtX).dot(Xty)).reshape(-1, 1)

# Compare with statsmodels
glm_stats_model = sm.GLM(y, X, family=sm.families.Gaussian())
results = glm_stats_model.fit()

# Compare coefficients
pd.DataFrame(
    np.hstack([model_gaus.beta, results.params.reshape(-1, 1), ls_beta]),
    columns=['Our coefficients', 'StatsModel coefficients', 'OLS coefficients']
)
```
Fully converged by iteration 11
|   | Our coefficients | StatsModel coefficients | OLS coefficients |
|---|------------------|-------------------------|------------------|
| 0 | 0.704655 | 0.704655 | 0.704655 |
| 1 | 0.302300 | 0.302300 | 0.302300 |
| 2 | 0.507925 | 0.507925 | 0.507925 |
We find identical coefficients! For more information on why least-squares and MLE derive the same result for gaussian regression, see this page.
We similarly find that the log likelihoods are the same:
Bernoulli
Now let’s write our Bernoulli class:
Exponential Regression Bernoulli Class (source code):
```python
from glm_exponential_dispersion_parent import exponential_regression_parent
import numpy as np

class bernoulli(exponential_regression_parent):
    def __init__(self, seed=0, ols_beta_init=False, **kwargs):
        """
        If using a pure Newton method, this can fail to converge if the
        initialized coefficients are too far from the neighbourhood of the
        global minimum. One option is to initialize coefficients using OLS
        (i.e. the linear probability model, lpm) and optimize from there.
        Otherwise, by default, it uses a more stable calculation for the
        variance of y.
        """
        if ols_beta_init:
            X, y = kwargs['X'], kwargs['y']
            XtX = X.transpose().dot(X)
            Xty = X.transpose().dot(y).flatten()
            lpm_beta = np.linalg.pinv(XtX).dot(Xty)
            super().__init__(seed, **kwargs, beta=lpm_beta.reshape(-1, 1))
        else:
            super().__init__(seed, **kwargs)

    def theta(self, X=None):
        """ As canonical form, theta == link function.
        So we simply equate θ = g(µ) = η(X) = X'β """
        beta = self.beta
        if X is None:
            X = self.X
        return X.dot(beta)

    def b(self, theta):
        """ b(θ) = ln[1+exp{θ}]: integral of the activation function """
        return np.log(1 + np.exp(theta))

    def db(self, theta):
        """ b'(θ) = Λ(θ) = 1/(1+exp{-θ}): Activation function is logistic
        (inverse of the logit-link function) """
        return (1 + np.exp(-theta))**-1
        # return np.exp(theta) * (1+np.exp(theta))**-1

    def d2b(self, theta):
        """ b''(θ) = Λ(θ)*(1-Λ(θ)): Differential of the activation function
        is the logistic distribution """
        _db = self.db(theta)
        return _db * (1 - _db)
        # return np.exp(theta) * (1 + np.exp(theta))**-2

    def phi(self):
        """ Not needed - one parameter distribution """
        return 1

    def a(self, phi):
        """ Not needed - one parameter distribution """
        return 1

    def c(self, y, phi):
        """ No adjustment needed (pdf already sums to one) """
        return 0
```
For a deep-dive into Logistic regression, and its formulation as a latent-variable model, see this page.
Poisson
Now let’s try the same for a Poisson regression:
Exponential Regression Poisson Class (source code):
```python
from glm_exponential_dispersion_parent import exponential_regression_parent
import numpy as np
from math import factorial

class poisson(exponential_regression_parent):
    def __init__(self, seed=0, **kwargs):
        super().__init__(seed, **kwargs)

    def theta(self, X=None):
        """ As canonical form, theta == link function.
        So we simply equate θ = g(µ) = η(X) = X'β """
        beta = self.beta
        if X is None:
            X = self.X
        return X.dot(beta)

    def b(self, theta):
        """ b(θ) = exp{θ}: integral of the activation function """
        return np.exp(theta)

    def db(self, theta):
        """ b'(θ) = exp{θ}: Activation function is exponential
        (inverse of the log-link function) """
        return np.exp(theta)

    def d2b(self, theta):
        """ b''(θ) = exp{θ}: Differential of the activation function """
        return np.exp(theta)

    def phi(self):
        """ Not needed - one parameter distribution """
        return 1

    def a(self, phi):
        """ Not needed - one parameter distribution """
        return 1

    def c(self, y, phi):
        """ Adjustment needed so pdf sums to one. """
        # # Optionally use Stirling's approximation to deal easily with large factorials
        # def approx_ln_factorial(n):
        #     return n * np.log(n) - n
        y = y.flatten()
        return -np.log(np.array([factorial(int(v)) for v in y.tolist()]).astype(float)).reshape(-1, 1)
```
\(T(y_i)\) is a transformation of \(y_i\). It is the identity for all the distributions we are looking at, so can be ignored for now.
\(\xi_i\) is the “shape parameter”
Governs the shape of the distribution (e.g. in Poisson, it is the expected value \(\xi_i=\lambda_i\)). Or for Bernoulli, \(\xi_i=p_i\). In fact for many distributions, \(\xi_i\) is the expected value of \(y_i\).
\(r(\xi_i)\) is the “canonical link function”
The function that maps the expected value of \(y_i\) to the linear combination of the features. E.g. for Poisson Regression, \(r(\xi_i) = \ln[\xi_i] = \eta(X)\).
\(b(\xi_i)\) is integral of the “activation function”
And if using the canonical form, this is the same as the integral of the inverse link function. It is easier to see this by running through the examples in footnote 5.
\(\phi\) is the “dispersion parameter”,
A parameter for the expected dispersion of \(y_i\) around \(\mu_i\). For example, Gaussian regression has \(\phi=\sigma^2\), the variance of the residuals (assuming homoskedasticity). It is not needed for one-parameter distributions, like Poisson, where we already assume the variance \(\mathbb{V}[y_i] = \lambda_i\) (i.e. it is already determined by \(\xi_i\)).
\(a(.)\) is a normalizing function
A function that normalizes the pdf using the dispersion parameter \(\phi\). E.g. for Gaussian regression, it is the variance of the residuals, \(a(\phi) = \sigma^2\). Again this isn’t needed for one parameter distributions.
\(c(.)\) an adjustment function
A function that adjusts the normalized pdf so that it sums to one. For example, the exponential form of the Poisson distribution would sum to more than one if it wasn’t included.
Here is a breakdown of the integral of the canonical inverse-link function for each distribution:
Poisson:
The canonical link is log: \(r(\xi_i) = \ln[\xi_i] = \eta(X)\)
So the inverse-link is exponential: \(\xi_i = b'(\eta(X)) = r^{-1}(\eta(X))=e^{\eta(X)}\)
So \(r^{-1}(\xi_i) \equiv b'(\xi_i) = e^{\xi_i}\)
So the integral is: \(b(\xi_i)=\int{b'(\xi_i)\,d\xi_i}=e^{\xi_i}\)
Gaussian:
The canonical link is the identity: \(r(\xi_i) = \xi_i = \eta(X)\)
So the inverse-link is the identity: \(\xi_i = b'(\eta(X)) = r^{-1}(\eta(X))=\eta(X)\)
So \(r^{-1}(\xi_i) \equiv b'(\xi_i) = \xi_i\)
So the integral is: \(b(\xi_i)=\int{b'(\xi_i)\,d\xi_i}= \frac{\xi_i^2}{2}\)
Bernoulli:
The canonical link is logit: \(r(\xi_i) = \ln\left[ \frac{\xi_i}{1-\xi_i} \right] = \eta(X)\)
So the inverse-link is logistic: \(\xi_i = b'(\eta(X)) = r^{-1}(\eta(X))=\frac{e^{\eta(X)}}{1+e^{\eta(X)}}\)
So \(r^{-1}(\xi_i) \equiv b'(\xi_i) = \frac{e^{\xi_i}}{1+e^{\xi_i}}\)
So the integral is: \(b(\xi_i)=\int{b'(\xi_i)\,d\xi_i}=\ln{(1+e^{\xi_i})}\)
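These three pairs can be sanity-checked numerically: differentiating each \(b(\theta)\) with a central difference should recover the corresponding activation function:

```python
import numpy as np

theta = np.linspace(-3, 3, 13)
h = 1e-6

def num_deriv(b, t):
    """ central-difference approximation of b'(t) """
    return (b(t + h) - b(t - h)) / (2 * h)

# Poisson: b(θ) = exp(θ), so b'(θ) = exp(θ)
poisson_ok = np.allclose(num_deriv(np.exp, theta), np.exp(theta))
# Gaussian: b(θ) = θ²/2, so b'(θ) = θ
gaussian_ok = np.allclose(num_deriv(lambda t: 0.5 * t**2, theta), theta)
# Bernoulli: b(θ) = ln(1+exp(θ)), so b'(θ) = logistic(θ)
bernoulli_ok = np.allclose(num_deriv(lambda t: np.log1p(np.exp(t)), theta),
                           1 / (1 + np.exp(-theta)))
print(poisson_ok, gaussian_ok, bernoulli_ok)  # True True True
```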
Another way to think about this is to notice that \(b'(\xi_i)\) is the “activation function” in the outer layer of a neural network, i.e. the function that maps the output of the network to the final prediction. Have a look at the section about the score function for further intuition on why this is the integral↩︎