Sunday, Feb 20, 2011

The central limit theorem

Another in my occasional series of more technical blog posts, the following is a review of the most important result in statistics, the central limit theorem.

The central limit theorem

Strictly speaking the central limit theorem is not a theorem about means at all: it's more general than that. It says that the distribution of the sum of n independent observations from the same distribution (with finite variance) will be approximately Normal, and that this approximation gets better as n increases. In other words, ∑X~N, approximately.

You can try this out for yourself by running this Java applet. It will allow you to roll one die a thousand times and it will draw a corresponding histogram of the results. That histogram will look pretty much rectangular. This is because you have only taken one observation each time: n=1. If you roll the die twice and add up the total score, and then do that a thousand times, the histogram will look triangular. This is because the middle values (6,7,8) are more likely than the low values (2,3,4) and the high values (10,11,12). So this time you've taken two observations (n=2) and added them together and the result is a little bit Normal. But if you roll the die five times (n=5) and add up the total score, and then do this a thousand times, the resulting histogram looks like a Normal distribution, and that's the central limit theorem kicking in even though you're only making five observations each time and adding them up. (Note that the fact we repeat each set of die rolls a thousand times is arbitrary. The value of n is the number of die rolls we make each time, not how many times we repeat this. The large number of repetitions is so that we can build up a clear picture of the distribution of the total score.)
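If the applet isn't to hand, the same experiment can be sketched in a few lines of Python. This is only an illustration of the idea, not the applet's code: the function names, the seed and the text-based histogram are my own choices.

```python
import random
from collections import Counter


def simulate_totals(n, repetitions=1000, seed=0):
    """Roll a fair die n times, record the total score, and repeat."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [sum(rng.randint(1, 6) for _ in range(n)) for _ in range(repetitions)]


def text_histogram(totals, width=50):
    """Draw one row of '#' characters per possible total score."""
    counts = Counter(totals)
    peak = max(counts.values())
    lines = []
    for value in range(min(counts), max(counts) + 1):
        bar = "#" * round(width * counts[value] / peak)
        lines.append(f"{value:3d} {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    # n is the number of rolls added together; 1000 is just the number of repeats.
    for n in (1, 2, 5):
        print(f"n = {n}")
        print(text_histogram(simulate_totals(n)))
        print()
```

Running it should show a roughly flat histogram for n=1, a triangle for n=2 and something recognisably bell-shaped for n=5.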

Approximating distributions

The central limit theorem can be used in distributional approximations. For example, a binomial B(50,0.5) can be thought of as the sum of fifty independent Bernoulli variables, each B(1,0.5). So B(50,0.5)=∑B(1,0.5). Since we are adding up observations, the central limit theorem applies and tells us that B(50,0.5)~N(25,12.5), approximately, where the mean is np=25 and the variance is np(1-p)=12.5.

In general B(n,p)~N(np,np(1-p)), approximately, since B(n,p)=∑B(1,p). The closeness of the approximation depends, of course, on the values of n and p. If p is close to 0.5, n can be quite small: 10, say. As p moves away from 0.5 towards 0 or 1, the distribution becomes more skewed and n has to get larger for the approximation to be reasonable.
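To get a feel for how n and p affect the approximation, here is a rough sketch that compares the exact binomial cumulative probabilities with the Normal ones and reports the largest discrepancy. The function names are mine, and I have used a continuity correction (adding 0.5), which is a common refinement but is not something the argument above depends on.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Approximate P(X <= k) using N(np, np(1-p)) with a continuity correction."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return approx.cdf(k + 0.5)


def max_error(n, p):
    """Largest gap between the exact and approximate cdf over k = 0..n."""
    return max(
        abs(binomial_cdf(k, n, p) - normal_approx_cdf(k, n, p))
        for k in range(n + 1)
    )


if __name__ == "__main__":
    for n, p in [(10, 0.5), (50, 0.5), (10, 0.05), (200, 0.05)]:
        print(f"B({n},{p}): max cdf error = {max_error(n, p):.4f}")
```

The error for p=0.05 with n=10 should come out far larger than for p=0.5 with the same n, and it shrinks again once n is increased.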

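Similarly, a Poisson Po(50) can be considered to be a sum of fifty independent Po(1) variables. So Po(50)=∑Po(1) and again the central limit theorem applies and gives us Po(50)~N(50,50), approximately, since a Poisson distribution's mean and variance are both equal to its parameter.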
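The Poisson case can be checked in the same spirit. The sketch below compares an exact Po(50) cumulative probability with the Normal approximation using mean 50 and variance 50; the helper names and the continuity correction are my own assumptions rather than part of the argument.

```python
from math import exp, sqrt
from statistics import NormalDist


def poisson_cdf(k, lam):
    """Exact P(X <= k) for X ~ Po(lam), summing the probabilities term by term."""
    term = exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) obtained from P(X = i - 1)
        total += term
    return total


def normal_approx_cdf(k, lam):
    """Approximate P(X <= k) using N(lam, lam) with a continuity correction."""
    return NormalDist(mu=lam, sigma=sqrt(lam)).cdf(k + 0.5)


if __name__ == "__main__":
    for k in (40, 50, 60):
        exact = poisson_cdf(k, 50)
        approx = normal_approx_cdf(k, 50)
        print(f"P(X <= {k}): exact {exact:.4f}, Normal approx {approx:.4f}")
```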

The underlying distribution and the sampling distribution of the mean

Before I discuss hypothesis tests we need to be clear about two things. The population we take our sample from is called the underlying population. It will have a distribution which we may or may not know and it may or may not be Normal. Having taken our sample we will calculate its mean. If we repeated this procedure many times we could build up a whole new distribution: the sampling distribution of the mean. The central limit theorem tells us about that sampling distribution: it says that, whatever the underlying distribution, the sampling distribution of the mean will be approximately Normal, and this approximation will be good for means based on large samples.
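The distinction between the two distributions can be illustrated with a small simulation. In this sketch I have assumed an exponential underlying population (chosen only because it is clearly not Normal) and used skewness as a crude measure of how far the sampling distribution of the mean is from being symmetric; the function names and seed are mine.

```python
import random
from statistics import fmean, pstdev


def sample_means(n, repetitions=5000, seed=1):
    """Take `repetitions` samples of size n from an exponential population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(repetitions)
    ]


def skewness(values):
    """Average cubed standardised deviation; close to 0 for a symmetric shape."""
    centre = fmean(values)
    spread = pstdev(values)
    return fmean(((v - centre) / spread) ** 3 for v in values)


if __name__ == "__main__":
    for n in (1, 5, 50):
        means = sample_means(n)
        print(f"n = {n:2d}: skewness of the sample means = {skewness(means):.2f}")
```

With n=1 the "means" are just the raw observations, so the skewness reflects the underlying population; as n grows it should fall towards zero.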

Hypothesis testing for the mean

So far as testing means is concerned there are two distinct situations: where we know the population variance and where we don't.

If we don't know the population variance then the central limit theorem doesn't come into it at all, strictly speaking. The appropriate tests (t-tests) are based on the t-distribution and that is a completely different thing. Because we don't know the population variance we have to give the underlying mathematics something to work with and the requirement is that the underlying population from which we're sampling is Normally distributed. That being said, the t-tests are quite robust to deviations from this requirement.

If we do know the population variance we'll be using a z-test and we will be using the central limit theorem, but only if the underlying distribution is not Normal (or is not known). If the underlying distribution is Normal then the distribution of the mean is exactly Normal no matter what the size of the sample and the central limit theorem is unnecessary. If the underlying distribution is not Normal, then the central limit theorem tells us that the mean will nonetheless be approximately Normal and that this approximation will be good for "large" sample sizes. How large is large? It depends on how far from Normal the underlying distribution is. If the underlying distribution is a chi-squared with 20 degrees of freedom then the sample size need not be that large. Why? Because a chi-squared distribution with 20 degrees of freedom looks a lot like a Normal distribution. (Why? Because it's the sum of twenty independent chi-squared variables, each with one degree of freedom, and so the central limit theorem tells us it will be approximately Normal! Note that the central limit theorem is being used here to make an inference about the underlying distribution, not the sampling distribution of the mean.)
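For concreteness, a two-sided z-test with known population standard deviation might be sketched as follows. The function name, its arguments and the example data are all made up for illustration; the calculation relies on the sample mean being (at least approximately) Normal with standard deviation sigma/√n.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean = mu0, with sigma known.

    Returns the z statistic and the two-sided p-value.
    """
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Made-up measurements, testing H0: mean = 5.0 with known sigma = 0.2.
    z, p = z_test([5.1, 4.9, 5.3, 5.0], mu0=5.0, sigma=0.2)
    print(f"z = {z:.2f}, two-sided p-value = {p:.3f}")
```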

Finally, if the sample size is large but we don't know the population variance, we may still use a z-test rather than a t-test because we may consider our unbiased estimate of the population variance (based on our large sample) as being very reliable so that really we "know" the population variance. And the critical values from the t-distribution are very similar to those from the Normal distribution if the sample size is large.
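The claim about critical values is easy to check numerically. Python's standard library has no t-distribution, so this sketch integrates the t density directly (Simpson's rule) and finds the critical value by bisection; it is a rough illustration with my own function names, and a statistics package would be the sensible choice in practice.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))


def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0: one half minus the Simpson's-rule integral on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_critical(df, alpha=0.05):
    """Two-sided critical value: the x with P(T > x) = alpha / 2, by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha / 2:
            lo = mid  # tail still too heavy, so the critical value is larger
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    z_crit = NormalDist().inv_cdf(0.975)
    print(f"Normal critical value: {z_crit:.3f}")
    for df in (5, 30, 100, 1000):
        print(f"t critical value, {df:4d} degrees of freedom: {t_critical(df):.3f}")
```

The t critical values should start noticeably above the Normal one for small degrees of freedom and close in on it as the degrees of freedom increase.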
