1.2 What is Statistics?
Everybody relies on data in one way or another:
- corporate presidents decide company policy based on quarterly sales figures
- politicians decide on campaign strategy based on polls
- teachers decide on grading curves based on how test scores are distributed (often a bell curve)
- you and I decide whether to smoke or not based on health records of other people
Therefore, we need a comprehensive and understandable way to deal with data:
Statistics is the study of making sense of data.
There are four components:
- collecting data
- summarizing data
- analyzing data
- presenting data
The basic objective of statistics is to get information about a larger group just by looking at a small part of that group. We will define
the term population to stand for the set of all measurements of interest.
A population could be (1) the set of all photographs of Mars, or (2) the set of heights of people in the US Army, or (3) the set of all measurements of water quality taken from the Hudson River, or (4) the set of all problems that can be solved using statistics. On the other hand,
the term sample will denote any subset of measurements selected from the population.
For example, a sample could be (a) the photographs in population (1) that show a specific region of Mars, or (b) the heights of people in a particular division of the US Army, or (c) the set of water quality measurements of the Hudson River taken on 7/24/2003, or (d) the statistical problems we are solving in this class. In addition,
the term statistical inference stands for an estimate, prediction, or other generalization about a population based on information contained in a sample.
We can now rephrase the basic objective of statistics as follows:
With Statistics we want to make inferences about a population based on information collected from a sample.
To be more precise, we will approach a "generic statistical problem" using these four steps:
- Problem definition
  - what is the population of interest, and what are the variables that are to be investigated
- Data collection
  - describe and select the sample from the population
- Data analysis
  - make some statistical inferences from the sample about the population
- Analysis reporting
  - report the inference together with a measure of reliability for the inference
Here we use the term variable to mean a characteristic or property of an individual population unit, where the observations can vary from one unit to another.
Example: A tax auditor is responsible for 25,000 accounts. How many accounts are in error (resulting in a loss of revenue)?
The steps involved in trying to find a suitable answer to this question are:
Defining the problem: The entire population consists of all 25,000 accounts. Our goal is to obtain a reasonable estimate for the number of accounts that are, in all likelihood, in error. Our variable x records, for each account, whether that account is in error.
Data collection and summary: The auditor decides to select 2000 accounts at random (somehow), tests each of these, and finds that 84 of them are in error.
Data analysis: Some statistical theory is applied to allow drawing a conclusion from the sample of 2000 accounts and applying it to all 25,000 accounts. In this case, the likely theory involves computing 84/2000 = 4.2%, but possibly other formulas are necessary as well.
Analysis reporting: Based on our data analysis we infer that approximately 4.2% of the accounts, or about 1,050 of the 25,000, will be in error. That estimate has an error of +/- 0.9%.
Note that we have not clarified how to obtain the error for our guess, but will describe that in a later module. The analysis of the data is usually done by using a calculator, or - more frequently these days - with the help of a software package. In our course we will use Microsoft Excel to help us perform statistical analysis.
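As a rough illustration of the arithmetic in the auditor example, here is a minimal Python sketch. Python is used only for illustration; the course itself works in Excel. The error formula shown is an assumption: it is the common normal approximation for a proportion at roughly 95% confidence, which happens to give a value close to the +/- 0.9% quoted above. It is not necessarily the method the later module will use.

```python
import math

population_size = 25_000   # all accounts the auditor is responsible for
sample_size = 2_000        # accounts selected at random
errors_in_sample = 84      # sampled accounts found to be in error

# Proportion of sampled accounts that are in error
p_hat = errors_in_sample / sample_size

# Scale the sample proportion up to the whole population
estimated_errors = p_hat * population_size

# Assumption: normal approximation with z = 1.96 (about 95% confidence)
z = 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)

print(f"Sample proportion: {p_hat:.1%}")
print(f"Estimated accounts in error: {estimated_errors:.0f}")
print(f"Margin of error: +/- {margin:.1%}")
```

The three printed values correspond to the 4.2% estimate, the implied count of accounts out of 25,000, and the reliability measure attached to the estimate.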
A second detail not mentioned in the above example is how the 2000 accounts were selected at random. Of course, one would like to select that sample as 'unbiased' or as 'randomly' as possible. It turns out that selecting such a random sample is not easy. In fact, it is frequently the most difficult process in applying statistics in the real world!
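When a complete list of the population is available, as it plausibly is for the auditor's accounts, software can draw the sample. The sketch below assumes the accounts are simply numbered 1 to 25,000 (hypothetical identifiers, not from the example) and uses Python's standard random module to pick 2000 of them without replacement. It illustrates the idea only; the real difficulty described above arises when no such complete list exists.

```python
import random

population_size = 25_000
sample_size = 2_000

# Hypothetical account IDs 1..25000; real accounts would have their own identifiers
account_ids = range(1, population_size + 1)

# Fixed seed (arbitrary value) so the same selection can be reproduced
rng = random.Random(2003)

# Draw 2000 distinct accounts; every account is equally likely to be chosen
selected = rng.sample(account_ids, sample_size)

print(len(selected), "accounts selected,", len(set(selected)), "of them distinct")
```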
Example: What's the average income of people living in NYC?
The most accurate approach apparently would be to ask everyone living in NYC about their income, add it up, and divide by the total number of people asked (which will give the precise average).
However, that is not only impractical, it could not even theoretically work.
Discussion Topic: Discuss the difficulties involved when trying to question everyone living in New York City about their income to determine the average income of NYC residents.
Therefore, instead of finding the exact average, we can try to estimate it. So, our first problem will be to randomly select a small sample, say of size 1000, find the average income of that sample (which is perfectly within our capabilities), and then draw conclusions from that sample about the whole population.
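The idea of estimating an average from a sample can be tried out on made-up data. The sketch below builds a hypothetical population of 200,000 simulated incomes (the numbers and the lognormal shape are assumptions for illustration, not real NYC data), draws a random sample of 1000, and compares the sample average with the population average, which is known here only because the data is simulated.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed (arbitrary) for reproducibility

# Hypothetical population: 200,000 simulated incomes, NOT real NYC data
population = [rng.lognormvariate(10.8, 0.7) for _ in range(200_000)]

# Random sample of size 1000, drawn without replacement
sample = rng.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"Population average (normally unknown): {population_mean:,.0f}")
print(f"Sample average (our estimate):         {sample_mean:,.0f}")
```

Running the sketch with different seeds gives slightly different sample averages, which is why an estimate is reported together with a measure of its reliability.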
We might try to use the following procedure to select our random sample:
- Open the latest New York City phone book
- Select one page "at random"
- Select the first 1000 people starting from that page
Call them and ask them for their income. Compute the average of that group, and say that this average is approximately representative of the average income of all people in NYC.
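The selection part of this procedure can be written out as a short Python sketch. The phone book here is hypothetical: 500 pages with 400 listings each, and placeholder names. The sketch only mirrors the steps as stated and makes no claim about whether the result is a good random sample.

```python
import random

rng = random.Random(7)  # fixed seed (arbitrary) for reproducibility

# Hypothetical phone book: 500 pages with 400 listings each (made-up sizes)
num_pages = 500
entries_per_page = 400
phone_book = [f"listing-{i}" for i in range(num_pages * entries_per_page)]

# Select one page "at random"; the last two pages are excluded so that
# 1000 listings always remain after the starting point
page = rng.randrange(num_pages - 2)

# Select the first 1000 listings starting from that page
start = page * entries_per_page
selected = phone_book[start:start + 1_000]

print(f"Starting on page {page + 1}, selected {len(selected)} listings")
```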
Discussion Topic: Discuss whether the above method of randomly selecting 1000 people in New York City will indeed yield a "random sample", using your intuitive understanding of what a "random sample" ought to be. If you find flaws in the above procedure, try to come up with a better one.