Showing why the sample mean of independent and identically distributed random variables converges to a normal distribution as the sample size tends to infinity.
Intro
The central limit theorem states that, under the right conditions, the distribution of the sample mean \(\bar{X}_n\) converges to a normal distribution as the sample size \(n \rightarrow \infty\).
This is a fundamental result in statistics, and is the reason why the normal distribution is so widely used in hypothesis testing and confidence intervals.
But why does the sample mean converge to a normal distribution? In this post, we’ll derive the classical Central Limit Theorem from first principles.
Defining terms
For the classical CLT, we state that if we sample a large number \(n\) of independent observations from the same (read: identical) distribution of the random variable \(X\), and calculate the mean \(\bar{X}_n\) of this sample, then not only does the sample mean tend to the true mean, \(\bar{X}_n \rightarrow \mu\) as \(n \rightarrow \infty\) (in probability, as per the weak law of large numbers), but for large \(n\) the sample mean also varies around the true mean approximately according to a normal distribution: \(\bar{X}_n \overset{\text{approx.}}{\sim} N(\mu,\frac{\sigma^2}{n})\).
First, we define random variables \(X_1, X_2, \dots\) that are independent and identically distributed (iid) with mean \(\mu\) and finite variance \(\sigma^2\):
\[
X_i \overset{iid}{\sim}(\mu,\sigma^2)
\]
Note that, beyond a finite mean and variance, we make no assumptions about its distribution at all: for example, it could be Bernoulli distributed, with \(\mu = p\) and \(\sigma^2 = p(1-p)\). According to the CLT, for any such underlying distribution, the sample mean will be approximately normally distributed once \(n\) is large.
Now let’s define the sample mean of \(n\) observations of \(X\) as \(\bar{X}_n\):
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n}X_i
\]
So we want to determine if the distribution of \(\bar{X}_n\) converges to a normal distribution as \(n \rightarrow \infty\).
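Before deriving anything, the claim can be previewed empirically. The following is a minimal simulation sketch, not part of the derivation: the Bernoulli parameter, sample size, repetition count and seed are all illustrative assumptions. It draws many sample means and checks how often they land within 1.96 standard deviations of \(\mu\), which would be roughly 95% of the time if the normal approximation is reasonable.

```python
import math
import random
import statistics

def sample_mean(n, p, rng):
    """Mean of n independent Bernoulli(p) draws."""
    return sum(1 for _ in range(n) if rng.random() < p) / n

rng = random.Random(0)            # fixed seed so the run is repeatable
p, n, reps = 0.3, 400, 2000       # illustrative values, not taken from the post
means = [sample_mean(n, p, rng) for _ in range(reps)]

sd = math.sqrt(p * (1 - p) / n)   # sigma / sqrt(n) for a Bernoulli(p) variable
inside = sum(abs(m - p) <= 1.96 * sd for m in means) / reps

print(statistics.fmean(means))    # should sit near p
print(inside)                     # should sit near 0.95 if the normal approximation holds
```

Because the Bernoulli sample mean only takes values on a grid of width \(1/n\), the observed fraction will typically be close to, but not exactly, 0.95.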
Characteristic functions
The characteristic function of a random variable is the Fourier transform of its probability distribution (of its probability density function, when one exists). A characteristic function fully describes the distribution it transforms, and can be inverted to recover the original distribution exactly.
So why is this relevant? Well they can be used as an alternative (read: easier) route to derive analytical results, such as the central limit theorem, rather than through the probability density functions directly. And if we can derive the central limit theorem using characteristic functions of random variables, then we can safely infer that the same applies if perfectly reverse-transformed to the probability distributions of those same random variables.
For any random variable \(X\), the characteristic function is a transformation defined as:
\[
\phi_X(t) = \mathbb{E}[e^{itX}]
\]
where \(i=\sqrt{-1}\) i.e. the imaginary unit, and \(t\) is a real number.
As an example, let’s derive the characteristic function of the Bernoulli distribution:
\[
\displaylines{
\begin{align}
\phi_X(t)
& = \mathbb{E}[e^{itX}] \\ \\
& = \sum_{x \in \{0,1\}} e^{itx} \times \mathbb{P}(X=x) \\
& = \left[ e^{it(0)} \times \mathbb{P}(X=0) \right]
+ \left[ e^{it(1)} \times \mathbb{P}(X=1) \right] \\ \\
& = \left( e^{it(0)} \times (1-p) \right)
+ \left( e^{it(1)} \times p \right) \\
& = (1-p) + p e^{it}
\end{align}
}
\]
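As a sanity check on this closed form, here is a small sketch that compares it with a Monte Carlo estimate of \(\mathbb{E}[e^{itX}]\). The function names, the values of \(p\) and \(t\), and the seed are assumptions made for illustration only.

```python
import cmath
import random

def bernoulli_cf(t, p):
    """Closed form derived above: (1 - p) + p * e^{it}."""
    return (1 - p) + p * cmath.exp(1j * t)

def empirical_cf(t, draws):
    """Monte Carlo estimate of E[e^{itX}] from a list of draws."""
    return sum(cmath.exp(1j * t * x) for x in draws) / len(draws)

rng = random.Random(1)   # fixed seed so the run is repeatable
p, t = 0.3, 1.5          # illustrative values
draws = [1 if rng.random() < p else 0 for _ in range(100_000)]

gap = abs(empirical_cf(t, draws) - bernoulli_cf(t, p))
print(gap)               # should be small, and shrink as the number of draws grows
```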
And more usefully for our case, here is the characteristic function of the normal distribution: \[
\displaylines{
\begin{align}
\phi_X(t)
& = \mathbb{E}[e^{itX}] \\
& = \int_{-\infty}^{\infty} e^{itx} \times \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx \\
& = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{\infty} \exp{\left\{itx -\frac{(x-\mu)^2}{2\sigma^2} \right\} } \, dx \\
& = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{\infty} \exp{\left\{
-\frac{1}{2\sigma^2} \left(-2\sigma^2itx + \left( x^2 + \mu^2 - 2x\mu \right) \right)
\right\} } \, dx \\
& = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{\infty} \exp{\left\{
-\frac{1}{2\sigma^2} \left(
\underbrace{(x - \mu - \sigma^2 it)^2}_{
x^2 + \mu^2 - \sigma^4t^2 - 2\sigma^2itx - 2x\mu + 2\mu i\sigma^2t
}
+ \sigma^4t^2 - 2\mu i\sigma^2t
\right)
\right\} } \, dx \\
& = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{\infty} \exp{\left\{
-\frac{1}{2\sigma^2}
\left(x - \mu - i\sigma^2t \right)^2
- \frac{1}{2}\sigma^2t^2 + i\mu t
\right\} } \, dx \\
& = \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ i\mu t - \frac{\sigma^2t^2}{2} \right\} \underbrace{ \int_{-\infty}^{\infty} \exp{\left\{
-\frac{1}{2\sigma^2}
\left(x - \mu - i\sigma^2t \right)^2
\right\} } \, dx }_{\text{Gaussian integral} \Rightarrow\sqrt{2\pi}\sigma} \\ \\
& = \exp\left\{i\mu t - \frac{\sigma^2t^2}{2} \right\}
\end{align}
}
\]
See this post on the Gaussian integral for more details on why \(\int_{-\infty}^{+\infty}{e^{-z^2}\,dz}=\sqrt{\pi}\).
So we are looking to show that the characteristic function of the sample mean (once suitably centered and scaled, as motivated in the next section) converges to a characteristic function of this same form, \(\exp\left\{i\mu t - \frac{\sigma^2t^2}{2} \right\}\), for some mean and variance.
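The final line of the normal derivation, \(\exp\{i\mu t - \sigma^2t^2/2\}\), can also be checked numerically. The sketch below approximates the defining integral with a simple midpoint rule and compares it to that closed form; the integration range, step count and parameter values are arbitrary choices made for illustration.

```python
import cmath
import math

def normal_cf_closed_form(t, mu, sigma):
    """exp(i*mu*t - sigma^2 * t^2 / 2), the result of the derivation."""
    return cmath.exp(1j * mu * t - 0.5 * sigma**2 * t**2)

def normal_cf_numeric(t, mu, sigma, half_width=10.0, steps=20_000):
    """Midpoint-rule approximation of the integral of e^{itx} f(x) over mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    dx = 2 * half_width * sigma / steps
    total = 0j
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        density = math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)
        total += cmath.exp(1j * t * x) * density * dx
    return total

mu, sigma, t = 1.0, 2.0, 0.7   # illustrative values
gap = abs(normal_cf_numeric(t, mu, sigma) - normal_cf_closed_form(t, mu, sigma))
print(gap)                     # should be very close to zero
```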
Characteristic function of the sample means:
Motivation to use the scaled, centered sample mean
It might seem easiest to use the sample mean directly. But let’s look at its variance:
\[
\displaylines{
\begin{align}
\mathbb{V}[\bar{X}_n] & = \mathbb{V}\left[\frac{1}{n}\sum_{i=1}^{n}X_i\right] \\
& = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{V}[X_i] \\
& = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 \\
& = \frac{\sigma^2}{n}
\\
\\
\therefore \mathbb{V}[\bar{X}_n] & \rightarrow 0 \quad \text{as} \quad n \rightarrow \infty
\end{align}
}
\]
This reveals that the variance of the sample mean decreases as the sample size \(n\) increases. This is a nice (and intuitive) result: the larger the sample, the closer the sample mean is likely to be to the true population mean.
However, it does make it trickier to prove that the sample mean converges to a normal distribution: as we increase \(n\), the variance shrinks towards zero, so the distribution collapses onto the single point \(\mu\) rather than settling into a fixed shape. Instead, if we center the sample mean at \(\mu\) and scale it by \(\sqrt{n}\), we get a quantity whose variance is the constant \(\sigma^2\) for every sample size \(n\):
\[
\displaylines{
\begin{align}
Z_n & = \sqrt{n}\left(\bar{X}_n - \mu\right)
\\
\therefore \mathbb{E}[Z_n]
& = \sqrt{n}\left(\mathbb{E}[\bar{X}_n] - \mu\right)
\\ & = \sqrt{n}\left(\mu - \mu\right)
\\ & = 0
\\
\therefore \mathbb{V}[Z_n]
& = \mathbb{V}\left[\sqrt{n}\left(\bar{X}_n - \mu\right)\right]
\\ & = n\mathbb{V}[\bar{X}_n] & \because \mu \text{ is a constant}
\\ & = n\frac{\sigma^2}{n}
\\ & = \sigma^2
\end{align}
}
\]
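And since \(Z_n\) is just a linear transformation of \(\bar{X}_n\), any result about the distribution of \(Z_n\) translates directly back to the sample mean: if \(Z_n\) is approximately \(N(0, \sigma^2)\), then \(\bar{X}_n = \mu + Z_n/\sqrt{n}\) is approximately \(N(\mu, \frac{\sigma^2}{n})\).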
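The two variance results can be seen side by side in a short simulation. This is only a sketch under assumed settings: it uses an Exponential(1) variable (so \(\mu = 1\) and \(\sigma^2 = 1\)), and the sample sizes, repetition count and seed are illustrative.

```python
import random
import statistics

rng = random.Random(2)   # fixed seed so the run is repeatable
mu, sigma2 = 1.0, 1.0    # mean and variance of an Exponential(rate=1) variable
reps = 3000              # number of simulated samples per sample size

results = {}
for n in (10, 40, 160):
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]
    z = [n ** 0.5 * (m - mu) for m in means]   # centered and scaled sample means
    results[n] = (statistics.pvariance(means), statistics.pvariance(z))
    print(n, results[n])  # first entry shrinks like sigma2 / n, second stays near sigma2
```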
Characteristic function of the scaled, centered sample mean
We derive the characteristic function as follows:
\[
\displaylines{
\begin{align}
\phi_{Z_n}(t)
& = \mathbb{E}[e^{itZ_n}] \\
& = \mathbb{E}\left[e^{it\sqrt{n}\left(\bar{X}_n - \mu\right)}\right] \\
& = \mathbb{E}\left[e^{it\sqrt{n}\left(\sum_{i=1}^n\frac{1}{n} \left(X_i - \mu \right) \right)}\right] \\
& = \mathbb{E}\left[e^{it\left(\sum_{i=1}^n\frac{1}{\sqrt{n}} \left(X_i - \mu \right) \right)}\right] \\
& = \mathbb{E}\left[\prod_{i=1}^{n}e^{it\frac{1}{\sqrt{n}}(X_i-\mu)}\right] & \because e^{(a+b)}=e^a\times e^b \\
& = \prod_{i=1}^{n}\mathbb{E}\left[e^{it\frac{1}{\sqrt{n}}(X_i-\mu)}\right] & \text{as independent} \\
& = \left(\mathbb{E}\left[e^{it\frac{1}{\sqrt{n}}(X_i-\mu)}\right]\right)^n & \text{as identically distributed}
\end{align}
}
\]
So we see that the characteristic function of the scaled, centered sample mean is the characteristic function of the centered variable \(X_i - \mu\), evaluated at \(t/\sqrt{n}\) and raised to the \(n\)th power.
If we just focus on the bit inside the brackets for now, this can be reformulated as a Taylor series (if you want a refresher on Taylor series, see the appendix):
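For a Bernoulli variable this factorisation can be verified exactly, because the sum \(X_1 + \dots + X_n\) is Binomial\((n, p)\). The sketch below computes \(\mathbb{E}[e^{itZ_n}]\) directly from the binomial probabilities and compares it with the \(n\)th power form. The helper names and parameter values are assumptions for illustration.

```python
import cmath
import math

def cf_zn_direct(t, n, p):
    """E[exp(i t Z_n)] summed over the Binomial(n, p) distribution of X_1 + ... + X_n."""
    total = 0j
    for k in range(n + 1):
        prob = math.comb(n, k) * p**k * (1 - p) ** (n - k)
        z = math.sqrt(n) * (k / n - p)   # value of Z_n when the sum equals k
        total += prob * cmath.exp(1j * t * z)
    return total

def cf_zn_product(t, n, p):
    """(E[exp(i t (X_i - mu) / sqrt(n))])^n, the factorised form derived above."""
    s = t / math.sqrt(n)
    single = cmath.exp(-1j * s * p) * ((1 - p) + p * cmath.exp(1j * s))
    return single ** n

t, n, p = 1.5, 50, 0.3   # illustrative values
gap = abs(cf_zn_direct(t, n, p) - cf_zn_product(t, n, p))
print(gap)               # expected to be at floating-point rounding level
```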
\[
\displaylines{
\begin{align}
& \mathbb{E}\left[e^{\left\{\frac{it}{\sqrt{n}}(X_i-\mu)\right\}}\right] \\
= & \mathbb{E}\left[\sum_{k=0}^{\infty} \frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^k}{k!}\right]
& \because e^x = \sum_{k=0}^{\infty}{\frac{x^k}{k!}} \\
= & \mathbb{E}\left[\sum_{k=0}^{\infty} \frac{ (it)^k(X_i-\mu)^k}{k! \times n^{k/2}}\right] \\
= & \mathbb{E}\left[
\frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^0}{0!} +
\frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^1}{1!} +
\frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^2}{2!} +
\frac{ \left( \frac{it}{\sqrt{n}}(X_i-\mu)\right)^3}{3!} + \dots
\right] \\
= & \mathbb{E}\left[
1
+ \frac{it}{n^{1/2}}(X_i-\mu)
- \frac{t^2}{2n}(X_i-\mu)^2
- \frac{it^3}{6n^{3/2}}(X_i-\mu)^3
+ \dots
\right] \\
= &
\mathbb{E}[1]
+ \frac{it}{n^{1/2}}\underbrace{\mathbb{E}[(X_i-\mu)]}_{\mathbb{E}[X_i] = \mu \quad \therefore \mathbb{E}[X_i] - \mu = 0}
- \frac{t^2}{2n}\underbrace{\mathbb{E}[(X_i-\mu)^2]}_{\mathbb{V}[X_i] = \sigma^2}
- \frac{it^3}{6n^{3/2}}\mathbb{E}[(X_i-\mu)^3]
+ \ldots \\
= & 1 - \frac{1}{n} \left(
\frac{t^2\sigma^2}{2}
+ \frac{it^3}{6n^{1/2}}\mathbb{E}[(X_i-\mu)^3]
+ \ldots
\right)
\end{align}
}
\]
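To get a feel for how good the second-order truncation \(1 - \frac{t^2\sigma^2}{2n}\) is, the sketch below compares it with the exact characteristic function of a centered Bernoulli variable evaluated at \(t/\sqrt{n}\). The error is multiplied by \(n\), since what matters for the limit is that the leftover terms vanish even after that scaling. The parameter values are illustrative assumptions.

```python
import cmath
import math

def centered_cf(s, p):
    """Characteristic function of X - mu for X ~ Bernoulli(p), evaluated at s."""
    return cmath.exp(-1j * s * p) * ((1 - p) + p * cmath.exp(1j * s))

p, t = 0.3, 1.5          # illustrative values
sigma2 = p * (1 - p)     # variance of a Bernoulli(p) variable

scaled_errors = []
for n in (10, 100, 1000, 10000):
    exact = centered_cf(t / math.sqrt(n), p)
    second_order = 1 - t**2 * sigma2 / (2 * n)
    scaled_errors.append(n * abs(exact - second_order))

print(scaled_errors)     # n times the truncation error; the expansion suggests it shrinks like 1/sqrt(n)
```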
And we can now plug that into the characteristic function for \(Z_n\), and take it to the limit as we increase the sample size \(n\) to infinity:
\[
\displaylines{
\begin{align}
\phi_{Z_n}(t)
& = \left(\mathbb{E}\left[e^{it\frac{1}{\sqrt{n}}(X_i-\mu)}\right]\right)^n
\\ \\
\lim_{n \rightarrow \infty} \phi_{Z_n}(t)
& = \lim_{n \rightarrow \infty} \left(
1 - \frac{1}{n} \left(
\frac{t^2\sigma^2}{2}
+ \underbrace{\frac{it^3}{6n^{1/2}}\mathbb{E}[(X_i-\mu)^3] + \ldots}_{
\rightarrow 0 \text{ as } n \rightarrow \infty
}
\right)
\right)^n
\\ \\
& = \exp{\left\{ \frac{-t^2\sigma^2}{2} \right\}}
& \because e^x = \lim_{n \rightarrow \infty} \left(1 + \frac{x}{n}\right)^n
\end{align}
}
\]
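And we now see this is identical to the characteristic function of the normal distribution with \(\mu = 0\) and variance \(\sigma^2\). Since a characteristic function uniquely determines its distribution, and pointwise convergence of characteristic functions implies convergence in distribution (Lévy's continuity theorem), \(Z_n = \sqrt{n}(\bar{X}_n - \mu)\) converges in distribution to \(N(0, \sigma^2)\), which is the classical CLT.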
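The limit can be watched numerically as well. As a rough illustration (again assuming a Bernoulli variable, with arbitrary values of \(p\) and \(t\)), the sketch below evaluates the \(n\)th power form of the characteristic function of \(Z_n\) for growing \(n\) and measures its distance from \(\exp\{-t^2\sigma^2/2\}\).

```python
import cmath
import math

def cf_zn(t, n, p):
    """Characteristic function of Z_n for Bernoulli(p) draws, via the n-th power form."""
    s = t / math.sqrt(n)
    single = cmath.exp(-1j * s * p) * ((1 - p) + p * cmath.exp(1j * s))
    return single ** n

p, t = 0.3, 1.5                          # illustrative values
sigma2 = p * (1 - p)
limit = math.exp(-t**2 * sigma2 / 2)     # characteristic function of N(0, sigma2) at t

gaps = [abs(cf_zn(t, n, p) - limit) for n in (10, 100, 1000, 10000, 100000)]
print(gaps)                              # should shrink towards zero as n grows
```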
Fin.