# MathCS.org - Statistics

## 1.3 What is a "Random Sample"

In the previous section we discussed the following example:

Example: What's the average income of people living in NYC?

The most accurate approach apparently would be to ask everyone living in NYC about their income, add it up, and divide by the total number of people asked (which will give the precise average).

However, that is not only impractical (very time consuming and expensive), it could not even theoretically work. For example, by the time we ask the last people in NYC about their income, the ones we asked first may have moved out of NYC, their income may have changed, or new people who were not on our original list may have moved into NYC.

Therefore, instead of finding the exact average, we can try to estimate it. So, our first problem will be to randomly select a small sample, say of size 1000, find the average income of that sample (which is perfectly within our capabilities), and then draw conclusions from that sample about the whole population.
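To see why the average of a sample can stand in for the average of the whole population, here is a minimal Python sketch. The population is entirely made up: 100,000 synthetic incomes drawn from a log-normal distribution, an assumption chosen only because incomes tend to be skewed. It is not NYC data, and the names `population` and `sample` are illustrative.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 synthetic incomes (NOT real NYC data).
population = [rng.lognormvariate(10.8, 0.7) for _ in range(100_000)]

# Simple random sample of size 1000, drawn without replacement.
sample = rng.sample(population, 1000)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:,.0f}")
print(f"sample mean:     {sample_mean:,.0f}")
```

In this sketch we can compute the population mean directly only because the population is simulated; in practice the sample mean is all we would have.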

We might try to use the following procedure to select our random sample:

1. Open the latest New York City phone book
2. Select one page "at random"
3. Select the first 1000 people starting from that page.

Call them and ask them for their income. Compute the average of that group, and say that this average is approximately representative of the average income of all people in NYC.

But this is not at all a "legal" procedure to obtain a "random sample": all people selected will most likely be from one borough, or all may have a name starting with "Mc" (and are likely to be of Irish ancestry, which introduces bias).

Even if we somehow managed to select people from the phone book without any bias, it's still not good enough: we will be missing people with unlisted numbers (usually high income), as well as people with no phone (usually low income), and some of the people selected may choose to not answer our questions.
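The clustering problem with the phone-book procedure can be illustrated with a small simulation. This is only a sketch under invented assumptions: a hypothetical listing in which similar people are stored next to each other (five groups with made-up typical incomes), and a helper `contiguous_block` that mimics "take the next 1000 entries from a random starting point".

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical listing: five groups of 20,000 entries each, stored one
# group after another, with made-up typical incomes per group.
group_means = [30_000, 45_000, 60_000, 80_000, 120_000]
listing = [rng.gauss(m, 5_000) for m in group_means for _ in range(20_000)]


def contiguous_block(entries, start, n):
    """Take n consecutive entries beginning at index start."""
    return entries[start:start + n]


true_mean = statistics.mean(listing)

# Phone-book style: pick a random starting point, then take the next 1000.
start = rng.randrange(len(listing) - 1000 + 1)
block = contiguous_block(listing, start, 1000)

# Simple random sample of the same size, for comparison.
srs = rng.sample(listing, 1000)

print(f"true mean:        {true_mean:,.0f}")
print(f"contiguous block: {statistics.mean(block):,.0f}")
print(f"random sample:    {statistics.mean(srs):,.0f}")
```

For most starting points the contiguous block falls inside a single group, so its average reflects that group rather than the whole listing, while the random sample mixes all groups.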

Before we continue, we need to define clearly what we mean by "random sample":

A random sample of size n is a sample that is selected by a process such that every possible sample of size n has the same chance of being selected.

In other words, a random sample is a sample where the selection has taken place without any bias of any sort.
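To make the definition concrete, the following sketch uses a tiny hypothetical population of four labeled items and samples of size 2. It lists every possible sample and then checks by simulation that each one is drawn about equally often. The labels, the number of repetitions, and the use of Python's `random.sample` are assumptions made for illustration.

```python
import itertools
import math
import random
from collections import Counter

population = ["A", "B", "C", "D"]  # hypothetical labels
n = 2

# Every possible sample of size n (order does not matter).
all_samples = list(itertools.combinations(population, n))
print(len(all_samples), math.comb(len(population), n))

# Draw many samples and record how often each one appears.
rng = random.Random(1)
draws = 60_000
counts = Counter(
    tuple(sorted(rng.sample(population, n))) for _ in range(draws)
)

# Each relative frequency should be close to 1 / (number of samples).
for s in all_samples:
    print(s, round(counts[s] / draws, 3))
```

With four items and samples of size 2 there are six possible samples, so a process that satisfies the definition should select each of them with probability 1/6.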

In the real world, selecting a 'random sample' is difficult and often impossible. However, we will next learn a procedure for doing that in special cases. In general, though, we will avoid that problem and simply assume that a random sample somehow has been selected.

Discussion Topic: Immediately after an election, TV channel A forecasts that candidate X will receive 52% of the vote, with a margin of error of 2%. At the same time channel B predicts that the same candidate will receive 47% of the vote, also with a margin of error of 2%. Discuss what's wrong with this picture, and how this could happen. Do you recall any actual occasion where the winner of a major election has been incorrectly predicted based on statistical analysis?

Here is a simple example that illustrates how a random sample can be selected.

Random Sample Selection Procedure

To select, for example, a random sample of size n = 5 from a population of 2000 measurements we proceed as follows:

1. Label all measurements from 0 to 1999, in any order
2. Start a computer program that can generate random numbers
3. Use that computer program to generate 5 distinct random whole numbers between 0 and 1999 (inclusive)
4. Select the 5 measurements from the total population that correspond to those random numbers
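The four steps above can be sketched in Python as follows. This is one possible illustration, not the method used later in the course: the measurement values are placeholders, list positions play the role of the labels 0 to 1999, and `random.sample` is used so that the 5 labels are drawn without repetition.

```python
import random

# Step 1: a hypothetical population of 2000 measurements (placeholder
# values); the list index 0..1999 serves as the label of each measurement.
measurements = [round(0.5 * i, 1) for i in range(2000)]

# Step 2: start a random number generator (seeded so the run is repeatable).
rng = random.Random(2024)

# Step 3: generate 5 distinct random labels between 0 and 1999.
labels = rng.sample(range(len(measurements)), 5)

# Step 4: select the measurements that correspond to those labels.
sample = [measurements[i] for i in labels]

print("labels:", labels)
print("sample:", sample)
```

Dropping the seed (`random.Random()`) would give a different sample on every run.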

This procedure will give a random sample (assuming the computer's random number generator is working correctly), but it is not always applicable. The biggest problem is that we must be able to label all measurements (or outcomes) in the population.

For example, if you want to find the average pollution of a certain river, you cannot label all possible measurements, because the river could be sampled at any location and at any time.

In addition, we need to know more about how to "start a computer program that can generate random numbers"; we will learn that in later sections.

From now on, we will take a very simple approach: we will ignore that problem and assume that somehow a random sample has been selected (possibly by the above method). We will, however, learn how to use Excel to select a random sample in cases where we can label all outcomes of our population.