RANDOM SAMPLING IS HARD

Although random sampling is the gold standard, it is often not used. There are a number of reasons for this.

The first step in selecting a random sample is finding a sampling frame — a list of every member of the population. After all, each member of the population cannot have an equal chance of being selected if you do not know who they all are. But this introduces a problem straight away: do you in fact know who they all are — can you construct a sampling frame? If you can’t, you can’t do random sampling.

Assuming that you can, you must then select your sample using random numbers. In the previous section I discussed why human beings are terrible at generating random numbers. In the olden days, printed tables of random numbers were used. Today we use computers.
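As a rough illustration of letting a computer do the selecting, here is a minimal sketch in Python. The sampling frame, the student IDs and the function name simple_random_sample are all made up for the example; the only real library call is random.sample from Python's standard library, which draws without replacement so that nobody can be picked twice.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members from the sampling frame, each equally likely."""
    rng = random.Random(seed)      # a seed makes the draw repeatable
    return rng.sample(frame, n)    # sampling without replacement

# Hypothetical sampling frame: a list of 1,500 student IDs (invented for illustration)
frame = [f"student_{i:04d}" for i in range(1500)]

sample = simple_random_sample(frame, 100, seed=42)
print(len(sample), len(set(sample)))  # prints "100 100": 100 members, no repeats
```

Note that the code only works because frame exists: without a complete list of the population there is nothing to hand to random.sample.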

Having got your list of sample members, you then have to collect data from them all. This can be hugely difficult and time-consuming. They may be all over the place. It could take forever to find them and actually get the information you want from them. Imagine taking a random sample of size 100 from all A level students in Harrogate. They’ll go to a range of different schools and colleges. Many of them may not even live in Harrogate: they just commute to Harrogate for their education. Some will go to boarding schools and may live very far away.

One way of dealing with this is to use cluster sampling. This is appropriate when your population is geographically dispersed but tends to group into clusters: for example, schools. Suppose we wish to take a sample of 200 A level students from the whole of North Yorkshire. We could attempt a random sample from all of the A level students in the county. But they will live all over the place. It will take a long time to get to them all and collect the data we want. So, rather than selecting our sample from all of the students, we first select, say, 10 schools at random. We then select 20 students from each school, again at random in each case. Now, we only have to visit 10 schools. Of course this method violates both of the conditions for a (simple) random sample. Not every student has an equal chance of being selected: every school is equally likely to be chosen, but because 20 students are taken from each chosen school whatever its size, students in larger schools will have a smaller chance of being selected. And not every sample has an equal chance of being the sample: we have forced the sample to come from 10 schools, so a sample comprising students from only one school is impossible.
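To put numbers on those unequal chances, here is a minimal sketch of the two-stage selection in Python. Everything about the county is an assumption made for the example: 40 schools, with invented names and sizes between 50 and 400 students. The functions cluster_sample and inclusion_probability are hypothetical helpers, not part of any library.

```python
import random

def cluster_sample(schools, n_schools, n_per_school, seed=None):
    """Two-stage sample: pick schools at random, then students within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(schools), n_schools)          # stage 1: schools
    return {name: rng.sample(schools[name], n_per_school)    # stage 2: students
            for name in chosen}

def inclusion_probability(n_schools, total_schools, n_per_school, school_size):
    """Chance that one particular student ends up in the sample."""
    return (n_schools / total_schools) * (n_per_school / school_size)

# Hypothetical county: 40 schools, each with 50 to 400 A level students
size_rng = random.Random(1)
schools = {
    f"school_{i:02d}": [f"s{i:02d}_{j:03d}" for j in range(size_rng.randint(50, 400))]
    for i in range(40)
}

sample = cluster_sample(schools, n_schools=10, n_per_school=20, seed=2)
total = sum(len(students) for students in sample.values())   # 10 schools x 20 students

p_small = inclusion_probability(10, 40, 20, 50)    # student in a school of 50
p_large = inclusion_probability(10, 40, 20, 400)   # student in a school of 400
```

Under these assumed figures, a student in a school of 50 has about a 1 in 10 chance of being selected, while a student in a school of 400 has about a 1 in 80 chance. The sample still contains 200 students, but it is not a simple random sample.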

The problems with simple random sampling don’t stop there. Suppose you’ve constructed a sampling frame and you’ve selected a sample at random. And you’ve actually managed to contact one of your sample members. You ask them the question that is the centrepiece of your research: “Do you take illegal drugs?”

There are three possible answers to this question: “yes”, “no” and “mind your own business.” What do you do if they say the last one?

Maybe they’re refusing to answer because they take drugs and don’t want to admit it. Maybe they’re refusing to answer because they believe strongly in individual privacy. There’s no way to know.

Such an answer is called a non-response and it’s fatal to a random sample. A single non-respondent invalidates your sample. But it’s worse than that. There is no way to recover. You cannot replace the non-respondent because then not every member of the population would have an equal chance of being selected. You cannot start all over again and draw a new sample because then not every sample would have an equal chance of being the sample.

Your entire research is dead in the water because one person refused to cooperate. You can try to avoid this happening by using the technique described in “Are you gay?”

Random sampling is really only appropriate if you have a “well-behaved” population. In other words, one that gives you the data you want without arguing about it. Which means animals and plants and rocks. Things you can control. Random sampling of human beings is a nightmare.

And it’s not just non-response that causes difficulties. Another problem with humans is that they lie. Including to themselves. Opinion polls are notoriously unreliable. There is a huge difference between being asked the theoretical question, “If there were a general election tomorrow, who would you vote for?” and actually standing in a polling booth with a little pencil and a ballot slip. The answer to the opinion poll doesn’t really matter so there’s no motivation to take it seriously. But there’s something very serious about marking a real ballot paper with a cross. And when you are marking that cross you do so in private. With an opinion poll you are (often) telling another person how you’d vote, and you may not want to admit who you really support. This was given as one reason why the opinion polls before the 1992 UK general election predicted the wrong winner: the so-called “shy Tories”.
