Explanation
The Mean is Not Enough
The mean of a sample tells you something about the sample. In practice, however, we almost always want to say something about the mean of the population from which the sample was drawn. It would be nice if we could assume that the population mean was the same as the sample mean, but we can't, since a sample is only a part of the population. If you took two samples, you would probably get two different sample means. They can't both be the correct population mean, as that value is fixed. The truth is that we cannot know the true value of the population mean without measuring every member of the population, and that is often impossible.
The best we can say about a population mean based on a sample mean is that there is a high chance that it falls somewhere in a given range. This range is the confidence interval of the population mean.
For example, instead of saying, "The mean age of students is 20 years" we can say, "It is highly likely that the mean age of students is between 18 and 22 years". This might seem like a step towards vagueness as the single mean seems more accurate, but it is actually a step towards honesty. We are saying "from the sample we took, the best we can say is...", which tells you a lot about the quality of the sample.
How Confident are You?
When we report confidence intervals, we must pick a confidence level, that is, how confident we are that what we say is true. So we should actually say, "With 95% confidence, the mean age of students is between 18 and 22 years".
Here are the key points to remember:
- The confidence interval is a range from somewhere below the sample mean to the same distance above it. This is the range that we believe the population mean falls into.
- The confidence level specifies how confident we are that the population mean falls into the confidence interval. It is usually set at 95% or 99%.
As the confidence level grows, the confidence interval becomes wider. So the range into which you are 99% confident that the population mean will fall is larger than the range for the 95% level.
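The link between confidence level and interval width can be sketched with the multiplier that scales the interval. This is a minimal illustration that assumes the normal approximation (the source does not say which method it uses), and the function name `z_multiplier` is hypothetical.

```python
from statistics import NormalDist


def z_multiplier(confidence_level):
    """Two-sided standard normal critical value for a level such as 0.95."""
    tail = (1 - confidence_level) / 2  # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)


# A higher confidence level gives a larger multiplier, hence a wider interval.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} level -> multiplier {z_multiplier(level):.3f}")
```

The multiplier is about 1.96 at the 95% level and about 2.58 at the 99% level, which is why the 99% interval is the wider of the two.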
Here is an example. Imagine you are in a class of 500 students who have taken an exam. You want to know the average mark gained by the students, but the results are not published and you cannot ask all 500! You can ask 15 of your friends, however, and they tell you how they did. The class of 500 is the population and your 15 friends make up the sample. If you found that your sample mean was 76, you might be tempted to conclude that the population mean (the average mark for the whole class) must also be 76. But you might have got a different sample average if you had asked a different 15 people, so you cannot conclude that the population average is exactly 76.
This is where you use confidence intervals. Using your sample, you could calculate the 95% confidence interval for the population mean, and that would tell you the range into which the population mean probably falls. Imagine you find it to be from 68 to 84 (8 marks either side of your sample average). Your 95% confidence interval is 68 to 84. You can be 95% confident that the average mark for the population of 500 students is somewhere in that range, but you cannot say where. If you wanted to be 99% confident, you could calculate the interval at the 99% confidence level. The range would be wider, as you want to be more sure of capturing the population mean. It might extend from 64 to 88 at the 99% level. You might feel that this tells you very little about the population mean, as it could be anywhere in a large range, but it tells you a lot. It tells you that, given the variability in your sample, you do not have sufficient data to estimate accurately what the population mean really is.
Look at the two pictures below. They show the 95% and 99% confidence intervals for our example plotted as error bars. You will see error bars showing confidence intervals in the other two sections on this page and they are a common tool, so it is a good idea to get to understand them now. The notation in red is there to show you how to read the plot; it is not usually included on an error bar plot.
Notice how the 99% confidence interval is wider than that at the 95% level.
Be careful not to interpret confidence intervals incorrectly. Take the example above:
- It DOES NOT mean that 95% of the sample (or the population) data is between 68 and 84.
- It DOES NOT mean that the population mean will fall into this range 95 times out of 100. There is only one population mean and it is fixed; we just don't know what it is.
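One way to see what the 95% does refer to is a small simulation: the population mean stays fixed while the interval changes from sample to sample, and roughly 95% of the intervals end up containing the fixed mean. The sketch below uses an invented population of 500 marks and a t-based interval for samples of 15. Both are assumptions for illustration, and `coverage` is a hypothetical name.

```python
import random
from math import sqrt
from statistics import mean, stdev

T_95_DF14 = 2.145  # two-sided 95% t critical value for samples of 15


def coverage(trials=2000, seed=1):
    """Fraction of 95% intervals that contain the fixed population mean."""
    rng = random.Random(seed)
    # Hypothetical population of 500 marks; its mean is fixed from here on.
    population = [rng.gauss(70, 12) for _ in range(500)]
    true_mean = mean(population)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, 15)  # a fresh sample each time
        centre = mean(sample)
        margin = T_95_DF14 * stdev(sample) / sqrt(15)
        if centre - margin <= true_mean <= centre + margin:
            hits += 1
    return hits / trials


print("share of intervals containing the population mean:", coverage())
```

The population mean never moves in this simulation. What varies is the interval, which is why the confidence level describes the method rather than the behaviour of the population mean.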
Note that the wider the confidence interval gets, the less useful the mean is, as the population mean could be anywhere in a large interval. Confidence intervals are calculated from a sample of data. They are based on the size and variation of the sample. Level two shows you how to calculate confidence intervals, but for now you should be able to see that the larger the sample and the less variation there is in the sample, the more specific you can be about the population mean.
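The effect of sample size and variation can be sketched with the common large-sample approximation, in which the distance from the sample mean to each end of the interval is a multiplier times the sample standard deviation divided by the square root of the sample size. This is an assumption made here for illustration, not necessarily the method taught in level two, and `margin_of_error` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sample_sd, sample_size, confidence_level=0.95):
    """Approximate half-width of the interval (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return z * sample_sd / sqrt(sample_size)


# Same variation, larger sample: the interval narrows.
print(margin_of_error(sample_sd=10, sample_size=25))
print(margin_of_error(sample_sd=10, sample_size=100))

# Same sample size, less variation: the interval also narrows.
print(margin_of_error(sample_sd=5, sample_size=25))
```

Under this approximation, quadrupling the sample size or halving the standard deviation each cuts the half-width in half.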