Where does the Gaussian distribution come from? We derive it from first principles, i.e. by proving the Herschel-Maxwell theorem.
Setting the scene
Imagine we are throwing darts, aiming at a circular target. The closer you get to the centre, the more points you get.
Imagine you miss the centre by 5cm. It doesn’t matter whether you were 5cm too high or too low, 5cm too far left or too far right, or a combination of the two (e.g. 3cm too far left and 4cm too high, which by Pythagoras is also \(\sqrt{3^2+4^2}=5\text{cm}\) away from the centre) - hitting any of those points is equally likely.
Code
```python
import numpy as np
import plotly.graph_objects as go

fig = go.Figure()

# Define the points
points = [(5, 0), (0, 5), (0, -5), (3, 4), (-4, -3)]

# Add the points to the plot
fig.add_trace(
    go.Scatter(
        x=[p[0] for p in points],
        y=[p[1] for p in points],
        mode='markers',
        marker=dict(symbol='cross', color='black', size=20)
    )
)

# Add lines to the plot
for point in points:
    fig.add_trace(
        go.Scatter(
            x=[0, point[0] / 2],
            y=[0, point[1] / 2],
            mode='lines',
            line=dict(color='#636363'),
        )
    )

# Add arrows to the plot
for point in points:
    fig.add_annotation(
        x=point[0], y=point[1], ax=0, ay=0,
        xref='x', yref='y', axref='x', ayref='y',
        showarrow=True, arrowhead=3, arrowsize=1,
        arrowwidth=2, arrowcolor='#636363'
    )

# Define the circle
r = np.sqrt(5**2)
theta = np.linspace(0, 2 * np.pi, 100)
x = r * np.cos(theta)
y = r * np.sin(theta)

# Add the circle to the plot
fig.add_trace(
    go.Scatter(x=x, y=y, mode='lines', line=dict(dash='dot', color='#636363'))
)

# Update axes
layout_options = dict(range=[-6, 6], dtick=1, gridwidth=1)
fig.update_xaxes(**layout_options)
fig.update_yaxes(**layout_options)
fig.update_layout(
    showlegend=False,
    title={
        'text': 'All shots 5cm from the centre are equally likely to happen',
        'x': 0.5,
        'xanchor': 'center',
    },
    scene=dict(aspectratio=dict(x=1, y=1), aspectmode="manual"),
    autosize=False, width=600, height=600,
)
fig.show()
```
At the same time, the distribution of horizontal distances from the centre is completely independent of the distribution of vertical distances. We are just aiming for the middle: becoming more or less accurate horizontally doesn’t affect how accurate we are vertically.
Let’s denote the horizontal and vertical distances from the centre \(X\) and \(Y\). Now here is the magic - after taking many shots, we would find that the distribution of vertical distances from the centre of the target would follow a normal distribution!
We would also find the horizontal distances from the centre of the target would follow a normal distribution too!
Code
```python
# (uses np, go and the circle coordinates x, y from the previous code block)
from math import pi, sqrt, exp

sigma = 3

def _normal_pdf(x, mu=0.0, sigma=1.0):
    result = (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma)**2)
    return result

def normal_pdf(x, mu=0.0, sigma=1.0):
    result = [_normal_pdf(i, mu, sigma) for i in x]
    return np.array(result)

# Define the range of values for x and y
X = np.linspace(-10, 10, 100)
zeros = np.zeros(len(X))
ones = np.ones(len(X))
P_X = normal_pdf(X, sigma=sigma)
X2 = np.linspace(-8, 8, 100)
Y2 = np.linspace(-6, 6, 100)
P_X5 = _normal_pdf(5, sigma=sigma)
P_X05 = np.linspace(0, P_X5, 100)

fig = go.Figure()
line_style = dict(color='black', dash='dot')
fig.add_trace(go.Scatter3d(x=X, y=zeros, z=P_X, mode='lines', line=line_style))
fig.add_trace(go.Scatter3d(x=zeros, y=X, z=P_X, mode='lines', line=line_style))
fig.add_trace(go.Scatter3d(x=X2, y=Y2, z=P_X, mode='lines', line=line_style))
fig.add_trace(go.Scatter3d(x=Y2, y=X2, z=P_X, mode='lines', line=line_style))
marker_style = dict(symbol='cross', color='black')
fig.add_trace(go.Scatter3d(x=ones*5, y=zeros, z=zeros, marker=marker_style))
fig.add_trace(go.Scatter3d(x=zeros, y=ones*5, z=zeros, marker=marker_style))
fig.add_trace(go.Scatter3d(x=zeros, y=ones*-5, z=zeros, marker=marker_style))
fig.add_trace(go.Scatter3d(x=ones*3, y=ones*4, z=zeros, marker=marker_style))
fig.add_trace(go.Scatter3d(x=ones*-4, y=ones*-3, z=zeros, marker=marker_style))
fig.add_trace(go.Scatter3d(x=ones*5, y=zeros, z=P_X05, mode='lines', line=line_style))
fig.add_trace(go.Scatter3d(x=zeros, y=ones*5, z=P_X05, mode='lines', line=line_style))
fig.add_trace(go.Scatter3d(x=zeros, y=ones*-5, z=P_X05, mode='lines', line=line_style))
fig.add_trace(go.Scatter3d(x=ones*3, y=ones*4, z=P_X05, mode='lines', line=line_style))
fig.add_trace(go.Scatter3d(x=ones*-4, y=ones*-3, z=P_X05, mode='lines', line=line_style))
fig.add_trace(go.Scatter3d(x=x, y=y, z=ones*P_X5, mode='lines', line=line_style))

camera = dict(
    up=dict(x=0, y=0, z=1),
    center=dict(x=0, y=0, z=0),
    eye=dict(x=0.5, y=-1.2, z=1)
)
fig.update_layout(
    showlegend=False,
    scene_camera=camera,
    scene=dict(
        xaxis=dict(nticks=20),
        yaxis=dict(nticks=20),
        aspectratio=dict(x=1, y=1, z=1),
        aspectmode="manual",
    ),
)
fig.show()
```
Does it seem like too few assumptions to guarantee that the horizontal and vertical distances follow a normal distribution? It’s true nonetheless. This is the Herschel-Maxwell theorem.
Herschel-Maxwell theorem
Given two random variables, \(X\) and \(Y\), with a joint distribution function of \(P_{X,Y}(dX,dY)\), if:

1. \(P_{X,Y}\) is invariant to real rotations
2. \(X, Y\) are independent

then both \(X\) and \(Y\) are normally distributed.
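Before proving it, we can sketch the theorem’s converse direction numerically (a sketch using numpy, with an arbitrary seed and angle): drawing independent normal samples for \(X\) and \(Y\) and rotating every point by some angle leaves the distribution unchanged - the rotated coordinates have the same mean and variance, and remain uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Independent normal horizontal/vertical errors
x = rng.normal(0.0, 1.0, n)
y = rng.normal(0.0, 1.0, n)

# Rotate every (x, y) pair by an arbitrary angle
theta = 0.7
xr = np.cos(theta) * x - np.sin(theta) * y
yr = np.sin(theta) * x + np.cos(theta) * y

# Rotation invariance: the rotated coordinates keep the same mean and variance
print(round(np.mean(xr), 2), round(np.var(xr), 2))  # close to 0 and 1
# And they remain uncorrelated
print(round(np.corrcoef(xr, yr)[0, 1], 2))          # close to 0
```

This is only an illustration of the rotational symmetry, not the proof itself - the theorem goes the other way, from symmetry and independence to normality.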
And this is the theorem that we will prove in this post.
First step - combining the two assumptions
The horizontal and vertical distances from the centre are two independent random variables, \(X\) and \(Y\), each drawn from a distribution with the same population parameters: a mean of zero and a finite, constant variance.
We do not expect a bias in aiming too high or too low. Hence the expected values of \(X\) and \(Y\) (their means) are zero: \(\mathbb{E}[X]=0\text{; }\mathbb{E}[Y]=0\)
Also, we want to hit closer to the centre, so the probability density must diminish for large values of \(|x|\): shots are more likely to land near the centre, and the variance of these distances is finite. By making the variance constant, we also assume that we do not get better over time. \(\mathbb{V}[X]=c_X\text{; }\mathbb{V}[Y]=c_Y\)
Let’s now derive the normal distribution from first principles. We rewrite the Herschel-Maxwell theorem above as our starting point:
1. The probability of hitting a point on the target is radially symmetric
2. The horizontal and vertical distances from the centre are independent (unrelated)
What does this really mean? Well, (1) means there is “radial symmetry”. In other words, the likelihood of a shot landing in some region depends only on how close that region is to the centre: it doesn’t matter how far it is along the horizontal or vertical axis (\(X\) or \(Y\)) individually, just the combined distance (i.e. the radius from the centre).
Concretely - the joint probability of \(X\) and \(Y\) for a shot landing in any region equidistant from the origin is the same. It depends only on the distance from the centre, not the direction from the centre. So the joint probability density can be written as a single function \(\omega\) of the radius alone. And since the radius is the hypotenuse of a right-angled triangle, we know from Pythagoras that \(r = \sqrt{x^2+y^2}\):

\[
P_{X,Y}(x, y) = \omega(r) = \omega(\sqrt{x^2+y^2})
\]
The second bullet point (2) says there is no correlation between whether you shoot too high/low and whether you shoot too far left/right. This independence is a crucial and commonly used assumption in statistics.
Now you might be thinking at this point - why start with two distributions? It would seem easier to derive the normal distribution from just one variable, \(X\). However, introducing a second independent variable \(Y\) (which we will shortly eliminate) provides a shortcut to deriving the normal distribution for \(X\), so it is a useful step.
Since \(Y\) is independent of \(X\), the probability density at \(Y=0\) is the same constant regardless of \(x\) (i.e. \(P_Y(0) = \gamma\) where \(\gamma\) is some constant). We can similarly denote the constant \(P_X(0) = \lambda\).
Setting \(y=0\) in the radial symmetry relation gives \(\omega(|x|) = \gamma P_X(x)\), i.e. \(\omega(t) = \gamma P_X(t)\); setting \(x=0\) likewise gives \(\lambda P_Y(y) = \gamma P_X(y)\), i.e. \(P_Y(y) = \frac{\gamma}{\lambda} P_X(y)\). In other words, we can represent the joint pdf of \(X\) and \(Y\) solely in terms of \(P_X\): \[
\displaylines{
\begin{align}
\omega (\sqrt{x^2+y^2}) & = P_X(x) \times P_Y(y) \\
\equiv
\gamma P_X(\sqrt{x^2+y^2}) & = P_X(x) \times
\frac{\gamma}{\lambda}P_X(y)
\end{align}
}
\]
And we can simplify this further, using a helper function \(h(z)\): \[
\displaylines{
\begin{align}
\gamma P_X(\sqrt{x^2+y^2}) & = P_X(x) \times
\frac{\gamma}{\lambda}P_X(y) \\
\Rightarrow P_X(\sqrt{x^2+y^2}) & = P_X(x) \times
\frac{P_X(y)}{\lambda} & \div \gamma \\
\Rightarrow \frac{P_X(\sqrt{x^2+y^2})}{\lambda} & = \frac{P_X(x)}{\lambda} \times
\frac{P_X(y)}{\lambda} & \div \lambda \\
\equiv h(x^2+y^2) & = h(x^2) \times h(y^2) & \text{where } h(z) = \frac{P_X(\sqrt{z})}{\lambda} \\
\end{align}
}
\]
We now have everything in terms of a single function, \(h(z)\). Next, we look for a form of \(h(z)\) that satisfies this functional equation.
Satisfying the equality using exponentials
An exponential function is a natural candidate, since multiplying exponentials together is the same as raising the base to the sum of their exponents:

\[
\displaylines{
\begin{align}
h(z) & = e^{kz} \\
\Rightarrow h(x^2) \times h(y^2) & = e^{kx^2} \times e^{ky^2} \\
& = e^{k(x^2+y^2)} \\
& = h(x^2+y^2) \\
\end{align}
}
\]
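As a quick numerical sketch (with an arbitrary, hypothetical value of \(k\)), any \(h(z) = e^{kz}\) satisfies the multiplicative property we need:

```python
import math

k = -0.5  # an arbitrary (hypothetical) negative constant

def h(z):
    # Candidate solution to h(x^2 + y^2) = h(x^2) * h(y^2)
    return math.exp(k * z)

x, y = 3.0, 4.0
lhs = h(x**2 + y**2)
rhs = h(x**2) * h(y**2)
print(math.isclose(lhs, rhs))  # True
```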
Note that for any base \(b\), we can reformulate it in terms of the natural base \(e\)1, i.e. setting \(h(z^2) = e^{kz^2}\)
Furthermore - we know that the integral of the density between \(-\infty\) and \(+\infty\) must equal one, since this covers the full range of values that \(x\) could take, and the total probability must be one. This also forces \(k\) to be negative, so that the density decays for large \(|x|\).
The integral \(\int_{-\infty}^{\infty}{e^{-z^2}}{dz} = \sqrt{\pi}\) is a special case of the Gaussian integral. See this page if you are interested in its derivation.
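Combining the normalisation requirement with \(P_X(x) = \lambda h(x^2) = \lambda e^{kx^2}\) pins down \(k\) in terms of \(\lambda\) (a sketch, substituting \(u = x\sqrt{-k}\), which requires \(k < 0\)): \[
\displaylines{
\begin{align}
\int_{-\infty}^{\infty} \lambda e^{kx^2} \, dx & = 1 \\
\Rightarrow \frac{\lambda}{\sqrt{-k}} \int_{-\infty}^{\infty} e^{-u^2} \, du & = 1 & \text{where } u = x\sqrt{-k} \\
\Rightarrow \lambda \sqrt{\frac{\pi}{-k}} & = 1 \\
\Rightarrow k & = -\pi \lambda^2 \\
\end{align}
}
\]
So the density so far is \(P_X(x) = \lambda e^{-\pi\lambda^2 x^2}\).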
Reformulating to include a standard deviation term
So we are close to our normal probability density function, but we have the constant \(\lambda\) instead of the constant variance \(\sigma^2\). It might not be surprising, then, that the variance turns out to be a function of \(\lambda\).
To help see this, consider the point \(x=0\), where the height of the pdf is \(\lambda\) (i.e. \(P_X(0) = \lambda e^{-\pi \lambda^2 \cdot 0^2} = \lambda\)). So the higher the value of \(\lambda\), the more probability mass is concentrated near the centre (since the density must still integrate to one), and hence the smaller the variance. In other words, the variance has to be a function of \(\lambda\).
To understand the relationship between \(\lambda\) and \(\sigma^2\), let’s plug \(P_X(x) = \lambda e^{-\pi \lambda^2x^2}\) into the standard calculation for the variance of any pdf:
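Using the standard result \(\int_{-\infty}^{\infty} x^2 e^{-ax^2} \, dx = \frac{1}{2}\sqrt{\frac{\pi}{a^3}}\) with \(a = \pi\lambda^2\), a sketch of that calculation: \[
\displaylines{
\begin{align}
\sigma^2 & = \int_{-\infty}^{\infty} x^2 \, \lambda e^{-\pi\lambda^2 x^2} \, dx \\
& = \lambda \cdot \frac{1}{2}\sqrt{\frac{\pi}{(\pi\lambda^2)^3}} \\
& = \frac{1}{2\pi\lambda^2} \\
\Rightarrow \lambda & = \frac{1}{\sigma\sqrt{2\pi}} \\
\end{align}
}
\]
Substituting back, \(-\pi\lambda^2 x^2 = -\frac{x^2}{2\sigma^2}\), so \(P_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-x^2/(2\sigma^2)}\).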
Which gives us the normal distribution (where \(\mu=0\)).
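We can sketch a quick numerical sanity check of this relationship (using numpy, with an arbitrary choice of \(\sigma = 3\)): the density \(\lambda e^{-\pi\lambda^2 x^2}\) with \(\lambda = \frac{1}{\sigma\sqrt{2\pi}}\) should integrate to one, have variance \(\sigma^2\), and coincide with the usual normal pdf.

```python
import numpy as np

sigma = 3.0                               # arbitrary choice for illustration
lam = 1 / (sigma * np.sqrt(2 * np.pi))    # lambda = 1 / (sigma * sqrt(2*pi))

# Dense grid covering +/- 10 standard deviations
x = np.linspace(-10 * sigma, 10 * sigma, 200_001)
dx = x[1] - x[0]

pdf = lam * np.exp(-np.pi * lam**2 * x**2)
reference = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

total = pdf.sum() * dx                    # numerical integral of the density
variance = (x**2 * pdf).sum() * dx        # numerical variance of the density

print(round(total, 4))                    # ~1.0
print(round(variance, 4))                 # ~9.0 (= sigma^2)
print(np.allclose(pdf, reference))        # True: same curve as the normal pdf
```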
Fin.
Footnotes
For any base \(b\), we can reformulate it in terms of the natural base \(e\): \[
\displaylines{
\begin{align}
b^{x^2} & = e^{kx^2} \\
\therefore \ln{b^{x^2}} & = kx^2 \\
\therefore x^2 \ln{b} & = kx^2 \\
\therefore \ln{b} & = k \\
\end{align}
}
\]↩︎