What's the difference between a census – a full count – and a sample survey? The census will always be superior, right? Not really.
With talk that the Bureau of Statistics and the government are considering changing our census of population and housing from five-yearly to 10-yearly and making up for this with regular sample surveys, the difference between the two has suddenly become a question of interest to more people than just students of statistical theory.
Since all of us have to fill in the census form, many people have opinions on the topic. And it seems from the feedback backbenchers are getting that some of us quite enjoy census night. There's a feeling of togetherness as families across the land sit up answering a seemingly unending questionnaire for each person in the family.
In principle, a census provides a true measure of the population because, by definition, it doesn't involve any risk of sampling error. But if you think that makes censuses foolproof, you're mistaken.
For a start, in practice you fall well short of a 100 per cent "enumeration". When you've got to get forms from everyone, no matter how elusive or remotely located, you're bound to come up short. So you have to adjust the figures for this undercount, which you do by (get this) conducting a "post-enumeration survey".
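To make the idea concrete, here's a minimal sketch of how such an undercount adjustment works in principle. Every figure below is made up for illustration, and the simple scaling shown is not the bureau's actual numbers or its full method:

```python
# A purely illustrative undercount adjustment (hypothetical figures, not the
# bureau's actual numbers or its complete post-enumeration method).
census_count = 9_500_000        # dwellings the census actually counted (hypothetical)
pes_sample = 50_000             # dwellings re-checked in the post-enumeration survey (hypothetical)
pes_found_in_census = 48_500    # of those, the number the census had managed to count

coverage_rate = pes_found_in_census / pes_sample   # 0.97 -> the census missed about 3 per cent
adjusted_count = census_count / coverage_rate      # scale the raw count up to compensate
print(f"{adjusted_count:,.0f}")                    # roughly 9,793,814
```

The point of the sketch is simply that the "true" census figure we end up using already leans on a sample survey to correct it.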
For another thing, the answers you collect may be wrong, because people misunderstand the question or are less co-operative than they should be. Errors in the census are both expensive and difficult to reduce.
Censuses are conducted on a particular day, which may or may not be representative of other times during the year. From that day on, the counts become ever more outdated. The things we're measuring are often too important for us to wait five years for the next count, but anything you do to update the figures in the meantime won't have the same certainty as a census.
Attempting to question every person in the country is such a huge exercise that it's enormously expensive. The census in 2011 cost taxpayers about $440 million. And because there's so much data on so many subjects, it takes ages to process. The figures can be maybe two years old before we get to see them.
It's such a major exercise that the bureau begins planning the next census two years before the latest one has been conducted.
A big part of the reason some people have been dismayed by news that a move from five- to 10-yearly censuses is being considered is that they've heard only half the story. They know what they'd lose, but not what would be put in its place. Researchers and interest groups who make great use of a particular part of the census have visions of going 10 years between drinks.
But another part of the problem, I suspect, is that a lot of people don't know much about the wonders of the science of statistics, a branch of mathematics that draws particularly on probability theory.
One way of thinking of statistical science is that it's the study of ways to be sure you're drawing accurate conclusions from a bunch of puzzling data. Another way is that statistics is the search for ways to draw accurate conclusions about a "population" (of people, things or events, such as all the road accidents in Victoria in 2014) as quickly, easily and cheaply as possible.
Get it? Statistics is the discovery of mathematical tricks that allow us to avoid all the hassle, delay and cost involved in always having to do censuses of this, that and the other.
The truth is that, as an article in the Christmas issue of The Economist interestingly recounts, we've made great strides in this just since World War II.
In which case, why shouldn't we take advantage of this technological advance, just as we unhesitatingly take advantage of advances in computer science? Why run to the great expense of frequent censuses when we can get results that are almost as reliable, and in other respects better, much more easily, quickly and cheaply by using sample surveys?
That, after all, is why we've developed sampling theory – the ability to take just a small sample of a population and draw accurate conclusions about the characteristics of that population.
The trick to sampling is to ensure the sample has been drawn at random from the population – to be sure it's representative of that population – and to ensure the sample is large enough to make conclusions reliable.
Sampling theory tells us how big a random sample needs to be, given the size of the population. It does so using probability theory. In the case of the population and housing census, we get information about innumerable, quite small sub-populations – such as the proportion of dwellings in the Sydney CBD that are owned outright by owner-occupiers. The smaller the sub-group, the bigger the sample needs to be to maintain accuracy.
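To see how the arithmetic works, here's a minimal sketch of the standard 95 per cent margin-of-error calculation for a proportion estimated from a simple random sample. The sample sizes and the 30 per cent figure are made-up illustrations, not anything from the bureau:

```python
# A minimal sketch of the usual 95 per cent margin of error for an estimated
# proportion from a simple random sample (all figures are made up for illustration).
import math

def margin_of_error(p, n):
    """Approximate 95% margin of error for a proportion p estimated from n responses."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Say 30 per cent of a million surveyed people rent: the estimate is very tight.
print(margin_of_error(0.30, 1_000_000))   # ~0.0009, i.e. about 0.09 percentage points

# But for a small sub-group - say only 2,000 respondents live in one CBD -
# the same calculation gives a much wider margin.
print(margin_of_error(0.30, 2_000))       # ~0.02, i.e. about 2 percentage points
```

That, in rough terms, is why the smaller the sub-group you want to say something about, the bigger the overall sample has to be.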
The Americans conduct their census only every 10 years and keep it very short. But they make up for this by having an annual survey of the population covering a host of questions, with a sample size of (get this) three million households, representing 1 per cent of the population.
It seems that if we decide to go to 10-yearly censuses, we'll introduce a similar, detailed annual survey, with a sample size covering about a million people. (Our present monthly household survey – from which we get our estimates of unemployment – covers about 55,000 people.)
This would leave us with a 10-yearly census to "benchmark" all our surveys against, while giving us much more frequent, more up-to-date and more flexible information about a host of census topics.
It would do so quickly, easily and much more cheaply, enabling us to spend the saving on replacing the bureau's ancient computer systems.