Sciencing the data - Deriving the Central Limit Theorem

What are we exploring?

Showing why the sample mean of independent and identically distributed random variables converges to a normal distribution as the sample size tends to infinity.

Intro

The central limit theorem states that, under the right conditions, the distribution of the sample mean \(\bar{X}_n\) converges to a normal distribution as the sample size \(n \rightarrow \infty\).

This is a fundamental result in statistics, and is the reason why the normal distribution is so widely used in hypothesis testing and confidence intervals.

But why does the sample mean converge to a normal distribution? In this post, we’ll derive the classical Central Limit Theorem from first principles.

Defining terms

For the classical CLT, we state that if we sample a large number \(n\) of independent observations from the same (read identical) distribution of the random variable \(X\), and calculate the mean \(\bar{X}_n\) of this sample, then not only would the expected mean tend to the true mean \(\lim_{n \rightarrow \infty}\mathbb{E}[\bar{X}_n] \rightarrow \mu\) as per weak law of large numbers, but the this sample mean will also vary around the true mean following to the normal distribution: \(\lim_{n \rightarrow \infty}\bar{X}_n \sim N(\mu,\frac{\sigma^2}{n})\)

First, we define a random variable \(X\), that is independently and identically distributed with mean \(\mu\) and variance \(\sigma^2\):

\[ X_i \overset{iid}{\sim}(\mu,\sigma^2) \]

Note that we make no assumptions about its distribution at all: for example, it could be bernoulli distributed, with \(\mu = p\) and \(\sigma^2 = p(1-p)\). But according to the CLT, for any underlying distribution, we will always find that its sample mean will follow a normal distribution.

Now let’s define the sample mean of \(n\) observations of \(X\) as \(\bar{X}_n\):

\[ \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n}X_i \]

So we want to determine if the distribution of \(\bar{X}_n\) converges to a normal distribution as \(n \rightarrow \infty\).

Characteristic functions

Characteristic functions are the Fourier transformation of the probability density functions of random variables. Characteristic functions can be used to fully describe the probability distributions they transform, and can be reverse-transformed perfectly to the original probability distribution too.

So why is this relevant? Well they can be used as an alternative (read: easier) route to derive analytical results, such as the central limit theorem, rather than through the probability density functions directly. And if we can derive the central limit theorem using characteristic functions of random variables, then we can safely infer that the same applies if perfectly reverse-transformed to the probability distributions of those same random variables.

For any random variable \(X\), the characteristic function is a transformation defined as:

\[ \phi_X(t) = \mathbb{E}[e^{itX}] \]

where \(i=\sqrt{-1}\) i.e. the imaginary unit, and \(t\) is a real number.

As an example, ket’s derive the characteristic function of the bernoulli distribution:

\[ \displaylines{ \begin{align} \phi_X(t) & = \mathbb{E}[e^{itX}] \\ \\ & = \int_{-\infty}^{\infty} e^{itx} \times \mathbb{P}(X=x) \\ & \equiv \left[ e^{itx} \times \mathbb{P}(X=0) \right] + \left[ e^{itx} \times \mathbb{P}(X=1) \right] \\ \\ & = \left( e^{it(0)} \times (1-p) \right) + \left( e^{it(1)} \times p \right) \\ & = (1-p) + p e^{it} \end{align} } \]

And more usefully for our case, here is the characteristic function of the normal distribution: \[ \displaylines{ \begin{align} \phi_X(t) & = \mathbb{E}[e^{itX}] \\ & = \int_{-\infty}^{\infty} e^{itx} \times \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx \\ & = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp{\left\{itx -\frac{(x-\mu)^2}{2\sigma^2} \right\} } \, dx \\ & = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp{\left\{ -\frac{1}{2\sigma^2} \left(-2\sigma^2itx + \left( x^2 + \mu^2 - 2x\mu \right) \right) \right\} } \, dx \\ & = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp{\left\{ -\frac{1}{2\sigma^2} \left( \underbrace{(x - \mu - \sigma^2 it)^2}_{ x^2 + \mu^2 - \sigma^4t^2 - 2\sigma^2itx - 2x\mu + 2\mu i\sigma^2t } + \sigma^4t^2 + 2\mu i\sigma^2t \right) \right\} } \, dx \\ & = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp{\left\{ -\frac{1}{2\sigma^2} \left(x - \mu + i\sigma^2t \right)^2 - \frac{1}{2}\sigma^2t^2 + i\mu t \right\} } \, dx \\ & = \frac{1}{\sqrt{2\pi}} \exp\left\{ i\mu t - \frac{\sigma^2t^2}{2} \right\} \underbrace{ \int_{-\infty}^{\infty} \exp{\left\{ -\frac{1}{2\sigma^2} \left(x - \mu + i\sigma^2t \right)^2 \right\} } \, dx }_{\text{Gaussian integral} \Rightarrow\sqrt{2\pi}} \\ \\ & = \exp\left\{i\mu t - \frac{\sigma^2t^2}{2} \right\} \end{align} } \]

See this post on the gaussian integral for more details for why \(\int_{-\infty}^{+\infty}{e^{-z^2}\,dx}=\sqrt{\pi}\)

Tip

So we are looking to show that the characteristic function of \(\bar{X}_n\) converges to this same characteristic function for the normal distribution: \(\lim_{n \rightarrow \infty}\mathbb{E}[e^{it\bar{X}}] = \exp\left\{i\mu t - \frac{\sigma^2t^2}{2} \right\}\)

Characteristic function of the sample means:

Motivation to use the scaled, centered sample mean

It might seem easiest to use the sample mean directly. But let’s look at its variance:

\[ \displaylines{ \begin{align} \mathbb{V}[\bar{X}_n] & = \mathbb{V}\left[\frac{1}{n}\sum_{i=1}^{n}X_i\right] \\ & = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{V}[X_i] \\ & = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 \\ & = \frac{\sigma^2}{n} \\ \\ \therefore \mathbb{V}[\bar{X}_n] & \rightarrow 0 \quad \text{as} \quad n \rightarrow \infty \end{align} } \]

This reveals that the variance of the distribution decreases as the sample size \(n\) increases. This is a nice (and intuitive) result: the larger the sample, the closer the sample mean is likely to be to the true population.

Howwever, it does make it more tricky to prove that the sample mean converges to a normal distribution: since as we increase \(n\), the variance shinks, so the variance is not constant. Instead, if we scale and center the sample mean by \(\sqrt{n}\), then we ensure its variance is a constant \(\sigma^2\) for a given sample size \(n\):

\[ \displaylines{ \begin{align} Z_n & = \sqrt{n}\left(\bar{X}_n - \mu\right) \\ \therefore \mathbb{E}[Z_n] & = \sqrt{n}\left(\mathbb{E}[\bar{X}_n] - \mu\right) \\ & = \sqrt{n}\left(\mu - \mu\right) \\ & = 0 \\ \therefore \mathbb{V}[Z_n] & = \mathbb{V}\left[\sqrt{n}\left(\bar{X}_n - \mu\right)\right] \\ & = n\mathbb{V}[\bar{X}_n] & \because \mu \text{ is a constant} \\ & = n\frac{\sigma^2}{n} \\ & = \sigma^2 \end{align} } \]

And since this is just a scalar transformation, we know it applies to the underlying distribution too.

Characteristic function of the scaled, centered sample mean

We derive the characteristic function as follows:

\[ \displaylines{ \begin{align} \phi_{\bar{Z}_n}(t) & = \mathbb{E}[e^{it\bar{Z}_n}] \\ & = \mathbb{E}\left[e^{it\sqrt{n}\left(\bar{X}_n - \mu\right)}\right] \\ & = \mathbb{E}\left[e^{it\sqrt{n}\left(\sum_{i=0}^n\frac{1}{n} \left(X_i - \mu \right) \right)}\right] \\ & = \mathbb{E}\left[e^{it\left(\sum_{i=0}^n\frac{1}{\sqrt{n}} \left(X_i - \mu \right) \right)}\right] \\ & = \mathbb{E}\left[\prod_{i=1}^{n}e^{it\frac{1}{\sqrt{n}}(X_i-\mu)}\right] & \because e^{(a+b)}=e^a\times e^b \\ & = \prod_{i=1}^{n}\mathbb{E}\left[e^{it\frac{1}{\sqrt{n}}(X_i-\mu)}\right] & \text{as independent} \\ & = \left(\mathbb{E}\left[e^{it\frac{1}{\sqrt{n}}(X_i-\mu)}\right]\right)^n & \text{as identically distributed} \end{align} } \]

So we see that the characteristic function of the scaled centered sample mean is the characteristic function of the random variable \(X_i\) minus \(\mu\), divided by \(\sqrt{n}\), and raised to the \(n\)th power.

If we just focus on the bit inside the brackets for now, this can be reformulated as a taylor series (if you want a refresher on Taylor series¹, see the appendix):

\[ \displaylines{ \begin{align} & \mathbb{E}\left[\e^{\left\{\frac{it}{\sqrt{n}}(X_i-\mu)\right\}}\right] \\ = & \mathbb{E}\left[\sum_{k=0}^{\infty} \frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^k}{k!}\right] & \because e^x = \sum_{k=0}^{\infty}{\frac{x^k}{k!}} \\ = & \mathbb{E}\left[\sum_{k=0}^{\infty} \frac{ \left( (it)^k(X_i-\mu)\right)^k}{k! \times n^{k/2}}\right] \\ = & \mathbb{E}\left[ \frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^0}{0!} + \frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^1}{1!} + \frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^2}{2!} + \frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^3}{3!} + \dots \right] \\ = & \mathbb{E}\left[ 1 + \frac{it}{n^{1/2}}(X_i-\mu) - \frac{t^2}{2n}(X_i-\mu)^2 + \frac{it^3}{6n^{3/2}}(X_i-\mu)^3 + \dots \right] \\ = & \mathbb{E}[1] + \frac{it}{n^{1/2}}\underbrace{\mathbb{E}[(X_i-\mu)]}_{\mathbb{E}[X_i] = \mu \quad \therefore \mathbb{E}[X_i] - \mu = 0} - \frac{t^2}{2n}\underbrace{\mathbb{E}[(X_i-\mu)^2]}_{\mathbb{V}[X_i] = \sigma^2} + \frac{it^3}{6n^{3/2}}\mathbb{E}[(X_i-\mu)^2] + \ldots \\ = & 1 - \frac{1}{n} \left( \frac{t^2\sigma^2}{2} - \frac{it^3}{6n^{1/2}}\mathbb{E}[(X_i-\mu)^3] - \ldots \right) \end{align} } \]

And we can now plug that into the characteristic function for \(Z_n\), and take it to the limit as we increase the sample size \(n\) to infinity:

\[ \displaylines{ \begin{align} \phi_{\bar{Z}_n}(t) & = \left(\mathbb{E}\left[e^{it\frac{1}{\sqrt{n}}(X_i-\mu)}\right]\right)^n \\ \\ \lim_{n \rightarrow \infty} \phi_{\bar{Z}_n}(t) & = \lim_{n \rightarrow \infty} \left( 1 - \frac{1}{n} \left( \frac{t^2\sigma^2}{2} - \underbrace{\frac{it^3}{6n^{1/2}}\mathbb{E}[(X_i-\mu)^3] - \ldots}_{ \rightarrow 0 \text{ as } n \rightarrow \infty } \right) \right)^n \\ \\ & = \exp{\left\{ \frac{-t^2\sigma^2}{2} \right\}} & \because e^x = \lim_{n \rightarrow \infty} \left(1 + \frac{x}{n}\right)^n \end{align} } \]

And we now see this is identical to the characteristic function of the normal distribution where \(\mu = 0\).

Fin.

Final remarks

This CLT derivation relied on the independence and identical distribution of the samples \(X_i\). This is why these are the most common assumptions used for normal approximation, such as confidence intervals in an experiment.

However, there are other versions of the CLT that do not need the same assumptions, which we will explore in another post.

Footnotes

Taylor Series

The Taylor series is a way to approximate a function as an infinite sum of terms. The \(n\)th term of the Taylor series is given by: \[ \displaylines{ \begin{align} f(x) & = f(a) + \frac{f^1(a)}{1!}(x-a) + \frac{f^2(a)}{2!}(x-a)^2 + \ldots + \\ & = \sum_{n=0}^{\infty}{ \frac{f^n(a)}{n!}(x-a)^n } \end{align} } \]

Here is a plot of the Taylor series approximation of a function \(f(x) = 20 - 17x - 6x^2 + 8x^3 + 3x^4 + 4x^5 + x^6\) around the point \(a=-2.5\):
Code
```
import plotly.graph_objects as go
from math import factorial
import numpy as np

x = np.linspace(-4.5,2.5,100)
c = [20, -17, -6, 8, 3, 4, 1]
y = [np.sum([c*x**n for n,c in enumerate(c)]) for x in x]
a = -2.5

true_plot = go.Scatter(x=x,y=y,mode='lines',name="True function")

dc = c.copy()
y_hat = np.zeros(len(x))
for i in range(len(c)):    
    coef = np.sum([c*a**n for n,c in enumerate(dc)])/factorial(i)
    y_hat += coef*(x-a)**i
    if i == 0:
        fig = go.Figure(data = [
            go.Scatter(mode='markers',x=x,y=y_hat,name="Taylor approximation"),
            true_plot
            ])
        frames = []
    frames.append(go.Frame(
        data=[go.Scatter(x=x,y=y_hat,name="Taylor approximation"), true_plot], name= f'frame{i}'
    ))
    dc = [n*c for n,c in enumerate(dc)][1:]

fig.frames = frames

updatemenus = [dict(
        buttons = [dict(args = [None, {
            "frame": {"duration": 800, "redraw": False}, "fromcurrent": True, "transition": {"duration": 300}
            }],
            label = "Play", method = "animate"),
            dict(args = [[None], {
                "frame": {"duration": 0, "redraw": False},"mode": "immediate","transition": {"duration": 0}}],
                label = "Pause",method = "animate")
                ],
        direction = "left", pad = {"r": 10, "t": 87}, showactive = False, type = "buttons",
        x = 0.1, xanchor = "right", y = 0, yanchor = "top"
    )]  

sliders = [dict(steps = [
    dict(method= 'animate',args= [
        [f'frame{i}'],
        dict(mode= 'immediate',frame= dict(duration=400, redraw=False),transition=dict(duration= 0))
        ],
        label=f'{i}') for i in range(len(frames))], 
        active=0,transition= dict(duration= 0),y=0, x=0,
        currentvalue=dict(font=dict(size=12), prefix='polynomial: ', visible=True, xanchor= 'center'),len=1.0)
        ]

fig.update_layout(updatemenus=updatemenus,sliders=sliders)
```
So how does this relate to the exponential function? Well the exponential function can be written as a Taylor series:

\[ e^{x} = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots \]

Still not convinced? Well remember that \(\frac{d(e^x)}{dx}=e^x\) by definition. So if we take the derivative of the Taylor series of \(e^x\), we should get back to the original function:

\[ \displaylines{ \begin{align} \frac{d(e^x)}{dx} & = \frac{d}{dx} \left(1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots \right) \\ & = 0 + 1 + \frac{2x}{2!} + \frac{3x^2}{3!} + \ldots \\ & = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots \\ & = e^x \end{align} } \]↩︎

Intro

Defining terms

Characteristic functions

Characteristic function of the sample means:

Motivation to use the scaled, centered sample mean

Characteristic function of the scaled, centered sample mean

Final remarks

Footnotes

Taylor Series